LearningGroup: A Real-Time Sparse Training on FPGA via Learnable Weight Grouping for Multi-Agent Reinforcement Learning

by Je Yang, et al.

Multi-agent reinforcement learning (MARL) is a powerful technology for building interactive artificial intelligence systems in applications such as multi-robot control and self-driving cars. Supervised models and single-agent reinforcement learning actively exploit network pruning, but it remains unclear how pruning will work in MARL, given its cooperative and interactive characteristics. In this paper, we present a real-time sparse training acceleration system named LearningGroup, which, for the first time, applies network pruning to the training of MARL through an algorithm/architecture co-design approach. We create sparsity using a weight grouping algorithm and propose an on-chip sparse data encoding loop (OSEL) that enables fast encoding with an efficient implementation. Based on OSEL's encoding format, LearningGroup performs efficient weight compression and allocates the computation workload to multiple cores, where each core processes multiple sparse rows of the weight matrix simultaneously with vector processing units. As a result, the LearningGroup system reduces the cycle time and memory footprint for sparse data generation by up to 5.72x and 6.81x, respectively. Its FPGA accelerator achieves 257.40-3629.48 GFLOPS throughput and 7.10-100.12 GFLOPS/W energy efficiency across various MARL conditions, which is 7.13x higher throughput and 12.43x higher energy efficiency than an Nvidia Titan RTX GPU, thanks to fully on-chip training and the highly optimized dataflow and data format enabled by the FPGA. Most importantly, the accelerator achieves a speedup of up to 12.52x when processing sparse data compared with the dense case, the highest among state-of-the-art sparse training accelerators.
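The abstract describes a pipeline of group-based sparsification, sparse row encoding, and row allocation across cores, but gives no code. The following is a minimal software sketch of that general flow under stated assumptions; it is not LearningGroup's actual OSEL format or hardware dataflow. It assumes each output row of a weight matrix belongs to a group and each group keeps a fixed subset of input columns (a simplified stand-in for the paper's *learnable* weight grouping). Surviving weights are encoded per row as (column indices, values) in a CSR-like style, and rows are greedily assigned to a hypothetical number of cores so that nonzero counts stay balanced. The names `group_mask`, `encode_rows`, `allocate_rows`, and `sparse_matvec` are illustrative only.

```python
import heapq
from typing import List, Tuple

Matrix = List[List[float]]
SparseRow = Tuple[List[int], List[float]]  # (column indices, nonzero values)


def group_mask(weights: Matrix, row_groups: List[int], group_cols: List[List[int]]) -> Matrix:
    """Zero every weight whose column is not kept by its row's group.

    Assumption: row_groups[r] is the group id of output row r and
    group_cols[g] lists the input columns group g keeps. Here they are
    fixed inputs; in the paper the grouping is learned during training.
    """
    masked = []
    for r, row in enumerate(weights):
        keep = set(group_cols[row_groups[r]])
        masked.append([w if c in keep else 0.0 for c, w in enumerate(row)])
    return masked


def encode_rows(weights: Matrix) -> List[SparseRow]:
    """Encode each row as the (indices, values) of its nonzeros, CSR-style."""
    encoded = []
    for row in weights:
        idx = [c for c, w in enumerate(row) if w != 0.0]
        encoded.append((idx, [row[c] for c in idx]))
    return encoded


def allocate_rows(encoded: List[SparseRow], num_cores: int) -> List[List[int]]:
    """Greedy balancing: give the heaviest remaining row to the least-loaded core."""
    heap = [(0, core) for core in range(num_cores)]  # (nonzeros assigned, core id)
    assignment: List[List[int]] = [[] for _ in range(num_cores)]
    # Sort rows by nonzero count, largest first (ties keep original row order).
    order = sorted(range(len(encoded)), key=lambda r: len(encoded[r][0]), reverse=True)
    for r in order:
        load, core = heapq.heappop(heap)
        assignment[core].append(r)
        heapq.heappush(heap, (load + len(encoded[r][0]), core))
    return assignment


def sparse_matvec(encoded: List[SparseRow], x: List[float]) -> List[float]:
    """Multiply the encoded matrix by a dense vector, touching only nonzeros."""
    return [sum(v * x[c] for c, v in zip(idx, vals)) for idx, vals in encoded]


if __name__ == "__main__":
    W = [[1.0, 2.0, 3.0, 4.0],
         [5.0, 6.0, 7.0, 8.0],
         [9.0, 1.0, 2.0, 3.0]]
    masked = group_mask(W, row_groups=[0, 1, 0], group_cols=[[0, 2], [1, 2, 3]])
    enc = encode_rows(masked)
    print(allocate_rows(enc, num_cores=2))      # [[1], [0, 2]]
    print(sparse_matvec(enc, [1.0] * 4))        # [4.0, 21.0, 11.0]
```

In this toy run, row 1 keeps three weights and rows 0 and 2 keep two each, so the greedy allocator gives row 1 to one core and rows 0 and 2 to the other (3 versus 4 nonzeros). Balancing by nonzero count rather than by row count is one simple way to keep parallel cores similarly busy when rows have uneven sparsity.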




Procrustes: a Dataflow and Accelerator for Sparse Deep Neural Network Training

The success of DNN pruning has led to the development of energy-efficien...

SpOctA: A 3D Sparse Convolution Accelerator with Octree-Encoding-Based Map Search and Inherent Sparsity-Aware Processing

Point-cloud-based 3D perception has attracted great attention in various...

Callipepla: Stream Centric Instruction Set and Mixed Precision for Accelerating Conjugate Gradient Solver

The continued growth in the processing power of FPGAs coupled with high ...

Efficient Recurrent Neural Networks using Structured Matrices in FPGAs

Recurrent Neural Networks (RNNs) are becoming increasingly important for...

SPEC2: SPECtral SParsE CNN Accelerator on FPGAs

To accelerate inference of Convolutional Neural Networks (CNNs), various...

RFC-HyPGCN: A Runtime Sparse Feature Compress Accelerator for Skeleton-Based GCNs Action Recognition Model with Hybrid Pruning

Skeleton-based Graph Convolutional Networks (GCNs) models for action rec...

Centaur: A Chiplet-based, Hybrid Sparse-Dense Accelerator for Personalized Recommendations

Personalized recommendations are the backbone machine learning (ML) algo...
