Stellar Mergers with HPX-Kokkos and SYCL: Methods of using an Asynchronous Many-Task Runtime System with SYCL

by Gregor Daiß, et al.

Ranging from NVIDIA GPUs to AMD GPUs and Intel GPUs: Given the heterogeneity of available accelerator cards within current supercomputers, portability is a key aspect for modern HPC applications. In Octo-Tiger, we rely on Kokkos and its various execution spaces for portable compute kernels. In turn, we use HPX to coordinate kernel launches, CPU tasks, and communication. This combination allows us to have a fine interleaving between portable CPU/GPU computations and communication, enabling scalability on various supercomputers. However, for HPX and Kokkos to work together optimally, we need to be able to treat Kokkos kernels as HPX tasks. Otherwise, instead of integrating asynchronous Kokkos kernel launches into HPX's task graph, we would have to actively wait for them with fence commands, which wastes CPU time better spent otherwise. Using an integration layer called HPX-Kokkos, treating Kokkos kernels as tasks already works for some Kokkos execution spaces (like the CUDA one), but not for others (like the SYCL one). In this work, we started making Octo-Tiger and HPX itself compatible with SYCL. To do so, we introduce numerous software changes, most notably an HPX-SYCL integration. This integration allows us to treat SYCL events as HPX tasks, which in turn allows us to better integrate Kokkos by extending the support of HPX-Kokkos to also fully support Kokkos' SYCL execution space. We show two ways to implement this HPX-SYCL integration and test them using Octo-Tiger and its Kokkos kernels, on both an NVIDIA A100 and an AMD MI100. We find modest, yet noticeable, speedups by enabling this integration, even when just running simple single-node scenarios with Octo-Tiger where communication and CPU utilization are not yet an issue.
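The difference between waiting on a fence and treating a kernel as a task can be illustrated with a minimal sketch in standard C++ alone. It does not use the HPX, Kokkos, or SYCL APIs: `launch_kernel`, `run_with_fence`, `run_with_polling`, and `poll_result` are hypothetical names introduced here, and a `std::future` stands in for a device event. It is also not meant to show either of the two integration variants from the paper, only the general idea of keeping the CPU busy while a kernel is in flight.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <future>
#include <thread>

// Hypothetical stand-in for an asynchronous kernel launch: the returned
// future becomes ready once the simulated "device" work has finished.
inline std::future<int> launch_kernel(int input) {
    return std::async(std::launch::async, [input] {
        std::this_thread::sleep_for(std::chrono::milliseconds(20));
        return input * 2;
    });
}

// Fence-style: the calling thread blocks until the kernel is done and can
// do nothing else in the meantime.
inline int run_with_fence(int input) {
    return launch_kernel(input).get();
}

struct poll_result {
    int kernel_value;    // result produced by the simulated kernel
    int cpu_tasks_done;  // CPU tasks executed while the kernel was running
};

// Task-style: query the event without blocking and run other CPU work
// between queries, so the waiting time is not wasted.
inline poll_result run_with_polling(int input,
                                    const std::function<void()>& cpu_task) {
    std::future<int> event = launch_kernel(input);
    assert(event.valid());
    int cpu_tasks_done = 0;
    while (event.wait_for(std::chrono::seconds(0)) !=
           std::future_status::ready) {
        cpu_task();  // useful work instead of idling at a fence
        ++cpu_tasks_done;
    }
    return {event.get(), cpu_tasks_done};
}
```

In a real task-based runtime the polling loop would be part of the scheduler rather than user code, and the continuation would be attached to the event as a task; the sketch only makes the wasted waiting time in the fence-style variant visible.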


