Characterizing Deep Learning Training Workloads on Alibaba-PAI

10/14/2019
by Mengdi Wang, et al.

Modern deep learning models have been exploited in various domains, including computer vision (CV), natural language processing (NLP), and search and recommendation. In practical AI clusters, workloads training these models are run using software frameworks such as TensorFlow, Caffe, PyTorch and CNTK. One critical issue for efficiently operating practical AI clouds is to characterize the computing and data transfer demands of these workloads, and more importantly, the training performance given the underlying software framework and hardware configurations. In this paper, we characterize deep learning training workloads from the Platform of Artificial Intelligence (PAI) in Alibaba. We establish an analytical framework to investigate the detailed execution time breakdown of various workloads using different training architectures, in order to identify performance bottlenecks. Results show that weight/gradient communication during training takes almost 62% of the total execution time among all our workloads on average. The computation part, involving both GPU computing and memory access, is not the biggest bottleneck based on the collective behavior of the workloads. We further evaluate the attainable performance of the workloads on various potential software/hardware mappings, and explore the implications on software architecture selection and hardware configurations. We identify that 60% of PS/Worker workloads can be potentially sped up when ported to the AllReduce architecture exploiting high-speed NVLink for GPU interconnect, and that on average a 1.7X speedup can be achieved when Ethernet bandwidth is upgraded from 25 Gbps to 100 Gbps.
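As a rough illustration of the kind of breakdown the paper studies, the sketch below models one training step as GPU compute time plus weight/gradient communication time, and compares 25 Gbps against 100 Gbps Ethernet. All concrete values here (per-step compute time, model size) and the round-trip traffic assumption in comm_factor are hypothetical stand-ins, not the paper's methodology or measurements.

```python
# Illustrative back-of-envelope model of a training step's time breakdown.
# All numbers below are made up; they are not measurements from PAI.

def step_time(compute_s, model_bytes, bandwidth_gbps, comm_factor=2.0):
    """Per-step time = compute + weight/gradient communication.

    comm_factor=2.0 assumes a push-gradients / pull-weights round trip
    (a PS/Worker-style pattern); real traffic depends on the architecture.
    """
    comm_s = comm_factor * model_bytes * 8 / (bandwidth_gbps * 1e9)
    return compute_s + comm_s

# Hypothetical workload: 0.15 s of GPU compute per step, 400 MB of weights.
compute_s, model_bytes = 0.15, 400e6

t_25g = step_time(compute_s, model_bytes, 25)
t_100g = step_time(compute_s, model_bytes, 100)
print(f"communication share at 25 Gbps: {1 - compute_s / t_25g:.0%}")
print(f"speedup from 25 -> 100 Gbps:    {t_25g / t_100g:.2f}x")
```

With these made-up inputs, the toy model reproduces the qualitative picture in the abstract: communication dominates the step time at 25 Gbps, and upgrading the link converts most of that overhead into a faster step.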

