Fluid Batching: Exit-Aware Preemptive Serving of Early-Exit Neural Networks on Edge NPUs

by Alexandros Kouris, et al.

With deep neural networks (DNNs) emerging as the backbone of a multitude of computer vision tasks, their adoption in real-world consumer applications broadens continuously. Given the abundance and omnipresence of smart devices, "smart ecosystems" are forming where sensing happens simultaneously across devices rather than in isolation. This shifts the on-device inference paradigm towards deploying centralised neural processing units (NPUs) at the edge, where multiple devices (e.g. in smart homes or autonomous vehicles) can stream their data for processing at dynamic rates. While this provides enhanced potential for input batching, naive solutions can lead to subpar performance and quality of experience, especially under spiking loads. At the same time, the deployment of dynamic DNNs comprising stochastic computation graphs (e.g. early-exit (EE) models) introduces a new dimension of dynamic behaviour into such systems. In this work, we propose a novel early-exit-aware scheduling algorithm that allows sample preemption at run time, to account for the dynamicity introduced by both the arrival and early-exiting processes. At the same time, we introduce two novel dimensions to the design space of the NPU hardware architecture, namely Fluid Batching and Stackable Processing Elements, which enable run-time adaptability to different batch sizes and significantly improve NPU utilisation even at small batch sizes. Our evaluation shows that our system achieves an average 1.97x improvement in average latency and a 6.7x improvement in tail-latency SLO satisfaction over state-of-the-art DNN streaming systems.
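To make the core idea concrete, the following is a minimal, hypothetical sketch (not the paper's implementation) of exit-aware batched serving: each sample in a batch may leave at an intermediate exit head, and its freed batch slot is immediately refilled from the waiting queue instead of idling until the whole batch completes. The stage count, the confidence model in `run_stage`, and the `EXIT_THRESHOLD` value are all illustrative assumptions.

```python
import random
from collections import deque

# Hypothetical confidence threshold at each early-exit head.
EXIT_THRESHOLD = 0.8

def run_stage(sample_id, stage, rng):
    """Stand-in for running one backbone stage plus its exit head;
    returns a fake exit-confidence score for illustration."""
    return rng.random()

def fluid_batch_serve(arrivals, batch_size, num_stages=3, seed=0):
    """Exit-aware serving loop: samples that exit early release their
    batch slot at the next stage boundary, and the slot is refilled
    from the queue (the 'fluid batching' intuition, sketched)."""
    rng = random.Random(seed)
    queue = deque(arrivals)
    in_flight = []   # list of (sample_id, current_stage)
    stages_run = {}  # sample_id -> number of stages executed before exiting
    while queue or in_flight:
        # Refill freed slots from the waiting queue.
        while queue and len(in_flight) < batch_size:
            in_flight.append((queue.popleft(), 0))
        next_round = []
        for sample_id, stage in in_flight:
            conf = run_stage(sample_id, stage, rng)
            if conf >= EXIT_THRESHOLD or stage == num_stages - 1:
                stages_run[sample_id] = stage + 1  # sample exits here
            else:
                next_round.append((sample_id, stage + 1))
        in_flight = next_round
    return stages_run

# Example: serve 8 samples with a batch of 4; each exits after 1-3 stages.
exits = fluid_batch_serve(list(range(8)), batch_size=4)
```

A fixed-batch baseline would instead hold every slot until the slowest sample in the batch finishes all stages; the refill step above is what exploits the early-exit process to keep the accelerator occupied.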


