Subgraph Stationary Hardware-Software Inference Co-Design

by   Payman Behnam, et al.

A growing number of applications depend on Machine Learning (ML) functionality and benefits from both higher quality ML predictions and better timeliness (latency) at the same time. A growing body of research in computer architecture, ML, and systems software literature focuses on reaching better latency-accuracy tradeoffs for ML models. Efforts include compression, quantization, pruning, early-exit models, mixed DNN precision, as well as ML inference accelerator designs that minimize latency and energy, while preserving delivered accuracy. All of them, however, yield improvements for a single static point in the latency-accuracy tradeoff space. We make a case for applications that operate in dynamically changing deployment scenarios, where no single static point is optimal. We draw on a recently proposed weight-shared SuperNet mechanism to enable serving a stream of queries that uses (activates) different SubNets within this weight-shared construct. This creates an opportunity to exploit the inherent temporal locality with our proposed SubGraph Stationary (SGS) optimization. We take a hardware-software co-design approach with a real implementation of SGS in SushiAccel and the implementation of a software scheduler SushiSched controlling which SubNets to serve and what to cache in real-time. Combined, they are vertically integrated into SUSHI-an inference serving stack. For the stream of queries, SUSHI yields up to 25 improvement in latency, 0.98 to 78.7


Low-Precision Hardware Architectures Meet Recommendation Model Inference at Scale

Tremendous success of machine learning (ML) and the unabated growth in M...

Willump: A Statistically-Aware End-to-end Optimizer for Machine Learning Inference

Machine learning (ML) has become increasingly important and performance-...

The Framework Tax: Disparities Between Inference Efficiency in Research and Deployment

Increased focus on the deployment of machine learning systems has led to...

Accelerating Deep Learning Inference via Learned Caches

Deep Neural Networks (DNNs) are witnessing increased adoption in multipl...

Desiderata for next generation of ML model serving

Inference is a significant part of ML software infrastructure. Despite t...

Towards Designing a Self-Managed Machine Learning Inference Serving System inPublic Cloud

We are witnessing an increasing trend towardsusing Machine Learning (ML)...

INFaaS: Managed & Model-less Inference Serving

The number of applications relying on inference from machine learning mo...

Please sign up or login with your details

Forgot password? Click here to reset