Automatic acceleration of Numpy applications on GPUs and multicore CPUs

by Mahesh Ravishankar, et al.

Frameworks like Numpy are a popular choice for application developers in fields ranging from image processing to bioinformatics to machine learning. Numpy is often used to develop prototypes or for deployment, since it provides efficient implementations of operations involving arrays. Such an approach requires every operation to be executed eagerly. The result of each operation needs to be stored in memory, which increases the memory footprint of the application. It also increases the bandwidth requirements, since all uses must read from this memory. We propose an approach that records the sequence of Numpy operations for deferred execution. When the values of an array are needed, for example when they are stored to disk or displayed on screen, the sequence of operations required to compute these values is compiled into a function and executed. This removes the need to store and load intermediates in slow memory, resulting in better performance. In cases where the library implementation is more efficient (as with matrix-matrix multiply), the library implementation is used instead. The approach also allows us to seamlessly target both multicore CPUs and NVIDIA GPUs, thereby porting the Numpy application to these architectures without changing the user program. The benefit of the approach is evaluated on computation samples from various domains, and on average an order of magnitude performance improvement over Numpy is observed.
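The record-then-compile idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the `LazyArray` class, its `materialize` method, and the generated `fused` function are hypothetical names chosen for illustration, the sketch handles only one-dimensional arrays with elementwise add and multiply, and it generates a plain Python loop where a real system would presumably emit native CPU or GPU code.

```python
import numpy as np


class LazyArray:
    """Hypothetical wrapper that records elementwise operations instead of running them."""

    def __init__(self, op, args):
        self.op = op        # "leaf", "add", or "mul"
        self.args = args    # operand nodes, or the wrapped ndarray for a leaf

    @staticmethod
    def from_numpy(values):
        # Wrap concrete data as a leaf of the recorded expression.
        return LazyArray("leaf", (np.asarray(values, dtype=np.float64),))

    def __add__(self, other):
        # Record the addition; nothing is computed here.
        return LazyArray("add", (self, other))

    def __mul__(self, other):
        # Record the multiplication; nothing is computed here.
        return LazyArray("mul", (self, other))

    def _emit(self, leaves):
        # Build the source text of one scalar expression for the whole recorded
        # sequence, collecting the input arrays it reads from.
        if self.op == "leaf":
            leaves.append(self.args[0])
            return f"a{len(leaves) - 1}[i]"
        lhs = self.args[0]._emit(leaves)
        rhs = self.args[1]._emit(leaves)
        symbol = "+" if self.op == "add" else "*"
        return f"({lhs} {symbol} {rhs})"

    def materialize(self):
        # Called when values are actually needed: compile the recorded
        # operations into a single loop, so no intermediate array is allocated.
        leaves = []
        expr = self._emit(leaves)
        params = ", ".join(f"a{k}" for k in range(len(leaves)))
        source = (
            f"def fused(out, {params}):\n"
            f"    for i in range(out.shape[0]):\n"
            f"        out[i] = {expr}\n"
        )
        namespace = {}
        exec(source, namespace)
        out = np.empty_like(leaves[0])
        namespace["fused"](out, *leaves)
        return out


x = LazyArray.from_numpy([1.0, 2.0, 3.0])
y = LazyArray.from_numpy([4.0, 5.0, 6.0])
z = (x + y) * x            # only recorded; x + y is never stored as an array
result = z.materialize()   # evaluated in one pass over the inputs
```

In eager Numpy, `(x + y) * x` would first write the temporary `x + y` to memory and then read it back for the multiply. In the sketch, the temporary exists only as a scalar inside the generated loop, which is the source of the memory and bandwidth savings the abstract describes.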




