VirtualFlow: Decoupling Deep Learning Model Execution from Underlying Hardware

09/20/2020
by   Andrew Or, et al.

State-of-the-art deep learning systems tightly couple model execution with the underlying hardware. This coupling poses important challenges in a world where the scale of deep learning workloads is growing rapidly: workloads with high resource requirements are inaccessible to most users, experimentation on smaller test beds is impossible, and results are difficult to reproduce across different hardware. We propose VirtualFlow, a novel system approach that leverages virtual node processing to decouple model execution from the hardware. In each execution step, the batch is divided and processed with data parallelism on many virtual nodes instead of physical devices (GPUs, TPUs), and the gradients are aggregated and applied to the model after all virtual nodes finish processing their data. With multiple virtual nodes mapped to each device, the system allows users to run models at much larger batch sizes that would otherwise exceed the memory limits of the underlying physical resources. VirtualFlow significantly improves the reproducibility of model training across different hardware, and enables models to run on shared clusters with dynamically changing resources for better efficiency. Our implementation of VirtualFlow enables virtual node processing with elasticity for TensorFlow. Evaluation with representative deep learning models (ResNet, BERT, Transformer) demonstrates strong convergence guarantees on different hardware with out-of-the-box hyperparameters, and up to 48% shorter completion times with resource elasticity.
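The core idea above can be illustrated with a minimal sketch: split the global batch across virtual nodes, compute a gradient per virtual node (sequentially here, standing in for virtual nodes time-sliced onto one device), then aggregate and apply a single update. This is a hypothetical illustration using least-squares linear regression in NumPy, not the authors' TensorFlow implementation; the function name and model are assumptions.

```python
import numpy as np

def virtual_flow_step(w, x_batch, y_batch, num_virtual_nodes, lr=0.1):
    """One hypothetical training step in the style described in the
    abstract: the batch is divided across virtual nodes, each virtual
    node computes the gradient on its sub-batch, and the averaged
    gradient is applied to the model once at the end of the step.
    Model here is simple least-squares regression: loss = ||Xw - y||^2 / 2n."""
    sub_x = np.array_split(x_batch, num_virtual_nodes)
    sub_y = np.array_split(y_batch, num_virtual_nodes)
    grads = []
    for xs, ys in zip(sub_x, sub_y):
        err = xs @ w - ys                    # per-virtual-node forward pass
        grads.append(xs.T @ err / len(xs))   # sub-batch gradient
    g = np.mean(grads, axis=0)               # aggregate across virtual nodes
    return w - lr * g                        # single model update per step
```

When the batch divides evenly across virtual nodes, the averaged sub-batch gradients equal the full-batch gradient, which is why virtual node processing can preserve convergence behavior while each physical device only ever holds one sub-batch in memory at a time.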

