Check-N-Run: A Checkpointing System for Training Recommendation Models

10/17/2020
by   Assaf Eisenman, et al.

Checkpoints play an important role in training recommendation systems at scale. They support many use cases, including failure recovery to ensure rapid training progress, and online training to improve inference prediction accuracy. Checkpoints are typically written to remote, persistent storage. Given the typically large and ever-increasing recommendation model sizes, checkpoint frequency and effectiveness are often bottlenecked by the storage write bandwidth and capacity, as well as the network bandwidth. We present Check-N-Run, a scalable checkpointing system for training large recommendation models. Check-N-Run uses two primary approaches to address these challenges. First, it applies incremental checkpointing, which tracks and checkpoints the modified part of the model. Second, it leverages quantization techniques to significantly reduce the checkpoint size without degrading training accuracy. These techniques allow Check-N-Run to reduce the required write bandwidth by 6-17x and the required capacity by 2.5-8x on real-world models at Facebook, thereby significantly improving checkpoint capabilities while reducing the total cost of ownership.
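
The two ideas named in the abstract, incremental checkpointing and quantization, can be illustrated with a toy sketch. This is not Check-N-Run's implementation: the abstract does not describe how modified rows are tracked or which quantization scheme is used, so the sketch assumes a simple dirty-row set and uniform 8-bit per-row quantization. Every name here (`TrackedEmbeddingTable`, `quantize_row`, `incremental_checkpoint`, `restore`) is hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Set, Tuple

Row = List[float]
QuantizedRow = Tuple[float, float, List[int]]  # (row minimum, scale, integer codes)


class TrackedEmbeddingTable:
    """Toy embedding table that remembers which rows were modified (assumed design)."""

    def __init__(self, rows: List[Row]) -> None:
        self.rows = [list(r) for r in rows]
        self.dirty: Set[int] = set()

    def update(self, index: int, delta: Row) -> None:
        # Apply an update to one row and mark that row as modified.
        self.rows[index] = [w + d for w, d in zip(self.rows[index], delta)]
        self.dirty.add(index)


def quantize_row(row: Row, bits: int = 8) -> QuantizedRow:
    """Uniform per-row quantization (an assumption, not the paper's scheme)."""
    levels = (1 << bits) - 1
    lo, hi = min(row), max(row)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [min(levels, max(0, round((w - lo) / scale))) for w in row]
    return lo, scale, codes


def dequantize_row(lo: float, scale: float, codes: List[int]) -> Row:
    """Map integer codes back to approximate floating-point weights."""
    return [lo + c * scale for c in codes]


def incremental_checkpoint(table: TrackedEmbeddingTable, bits: int = 8) -> Dict[int, QuantizedRow]:
    """Quantize and emit only the rows modified since the last checkpoint."""
    ckpt = {i: quantize_row(table.rows[i], bits) for i in sorted(table.dirty)}
    table.dirty.clear()
    return ckpt


def restore(base_rows: List[Row], checkpoints: List[Dict[int, QuantizedRow]]) -> List[Row]:
    """Rebuild the table from a baseline plus incremental checkpoints, oldest first."""
    rows = [list(r) for r in base_rows]
    for ckpt in checkpoints:
        for i, (lo, scale, codes) in ckpt.items():
            rows[i] = dequantize_row(lo, scale, codes)
    return rows
```

In this sketch, a checkpoint taken after updating one row of a three-row table contains a single quantized row, which is the source of the bandwidth saving; the restored values match the originals only up to half a quantization step per weight.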

research
04/19/2023

MTrainS: Improving DLRM training efficiency using heterogeneous memories

Recommendation models are very large, requiring terabytes (TB) of memory...
research
05/04/2021

Alternate Model Growth and Pruning for Efficient Training of Recommendation Systems

Deep learning recommendation systems at scale have provided remarkable g...
research
10/17/2022

A GPU-specialized Inference Parameter Server for Large-Scale Deep Recommendation Models

Recommendation systems are of crucial importance for a variety of modern...
research
11/05/2020

CPR: Understanding and Improving Failure Tolerant Training for Deep Learning Recommendation with Partial Recovery

The paper proposes and optimizes a partial recovery training system, CPR...
research
04/11/2022

A note on occur-check (extended report)

We weaken the notion of "not subject to occur-check" (NSTO), on which mo...
research
07/05/2023

An Equivalent Graph Reconstruction Model and its Application in Recommendation Prediction

Recommendation algorithm plays an important role in recommendation syste...
research
08/20/2021

Understanding and Co-designing the Data Ingestion Pipeline for Industry-Scale RecSys Training

The data ingestion pipeline, responsible for storing and preprocessing t...
