JACC: An OpenACC Runtime Framework with Kernel-Level and Multi-GPU Parallelization

by   Kazuaki Matsumura, et al.

The rapid development in computing technology has paved the way for directive-based programming models towards a principal role in maintaining software portability of performance-critical applications. Efforts on such models involve a least engineering cost for enabling computational acceleration on multiple architectures while programmers are only required to add meta information upon sequential code. Optimizations for obtaining the best possible efficiency, however, are often challenging. The insertions of directives by the programmer can lead to side-effects that limit the available compiler optimization possible, which could result in performance degradation. This is exacerbated when targeting multi-GPU systems, as pragmas do not automatically adapt to such systems, and require expensive and time consuming code adjustment by programmers. This paper introduces JACC, an OpenACC runtime framework which enables the dynamic extension of OpenACC programs by serving as a transparent layer between the program and the compiler. We add a versatile code-translation method for multi-device utilization by which manually-optimized applications can be distributed automatically while keeping original code structure and parallelism. We show in some cases nearly linear scaling on the part of kernel execution with the NVIDIA V100 GPUs. While adaptively using multi-GPUs, the resulting performance improvements amortize the latency of GPU-to-GPU communications.


page 2

page 6

page 8

page 10


HSTREAM: A directive-based language extension for heterogeneous stream computing

Big data streaming applications require utilization of heterogeneous par...

Experience Report: Writing A Portable GPU Runtime with OpenMP 5.1

GPU runtimes are historically implemented in CUDA or other vendor specif...

Effective GPU Sharing Under Compiler Guidance

Modern computing platforms tend to deploy multiple GPUs (2, 4, or more) ...

Automatic Parallelization of Python Programs for Distributed Heterogeneous Computing

This paper introduces a novel approach to automatic ahead-of-time (AOT) ...

DAG-based Scheduling with Resource Sharing for Multi-task Applications in a Polyglot GPU Runtime

GPUs are readily available in cloud computing and personal devices, but ...

The OoO VLIW JIT Compiler for GPU Inference

Current trends in Machine Learning (ML) inference on hardware accelerate...

Demystifying the MLPerf Benchmark Suite

MLPerf, an emerging machine learning benchmark suite strives to cover a ...

Please sign up or login with your details

Forgot password? Click here to reset