Polynomially Coded Regression: Optimal Straggler Mitigation via Data Encoding

05/24/2018
by Songze Li, et al.

We consider the problem of training a least-squares regression model on a large dataset using gradient descent. The computation is carried out on a distributed system consisting of a master node and multiple worker nodes. Such systems are significantly slowed down by slow-running machines (stragglers) and by various communication bottlenecks. We propose "polynomially coded regression" (PCR), which substantially reduces the effect of stragglers and lessens the communication burden in such systems. The key idea of PCR is to encode the partial data stored at each worker so that the computations at the workers can be viewed as evaluating a polynomial at distinct points. This allows the master to compute the final gradient by interpolating this polynomial. PCR significantly reduces the recovery threshold, defined as the number of workers the master has to wait for before it can compute the gradient. In particular, PCR achieves a recovery threshold that is inversely proportional to the amount of computation/storage available at each worker. In comparison, state-of-the-art straggler-mitigation schemes require a much higher recovery threshold that decreases only linearly in the per-worker computation/storage load. We prove that PCR's recovery threshold is near minimal: it is within a factor of two of the best possible scheme. Our experiments on Amazon EC2 demonstrate that, compared with state-of-the-art schemes, PCR improves the run-time by 1.50x to 2.36x with naturally occurring stragglers, and by as much as 2.58x to 4.29x with artificial stragglers.
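The encode, evaluate, and interpolate idea can be illustrated with a small NumPy sketch. This is a simplified illustration under stated assumptions, not the construction from the full text: it assumes each worker stores a single coded block built with Lagrange basis polynomials, and the function names (`lagrange_basis`, `coded_gram_times_w`), the evaluation points, and the block layout are all hypothetical. It computes only the term X^T X w of the least-squares gradient X^T (X w - y); the term X^T y does not depend on w and could be computed once in advance.

```python
import numpy as np

def lagrange_basis(points, j, z):
    """Value at z of the Lagrange basis polynomial that is 1 at points[j]
    and 0 at every other entry of points."""
    value = 1.0
    for k, p in enumerate(points):
        if k != j:
            value *= (z - p) / (points[j] - p)
    return value

def coded_gram_times_w(X, w, num_blocks, num_workers, finished):
    """Recover X^T X w from the workers listed in `finished` (assumed layout).

    The rows of X are split into num_blocks equal blocks X_0..X_{K-1}.
    Worker i stores the coded block u(alpha_i), where u is the matrix
    polynomial of degree K-1 with u(beta_j) = X_j.
    """
    K = num_blocks
    threshold = 2 * K - 1  # h(z) = u(z)^T u(z) w has degree 2(K-1)
    if len(finished) < threshold:
        raise ValueError("not enough finished workers to interpolate")

    blocks = np.split(X, K)                         # requires rows divisible by K
    betas = np.arange(K, dtype=float)               # points where u equals the raw blocks
    alphas = np.linspace(0.0, K - 1, num_workers)   # one distinct point per worker

    # Encoding: each worker's stored block is a linear combination of raw blocks.
    coded = [sum(lagrange_basis(betas, j, a) * blocks[j] for j in range(K))
             for a in alphas]

    # Worker computation: evaluate h at the worker's own point.
    used = list(finished)[:threshold]
    results = {i: coded[i].T @ (coded[i] @ w) for i in used}

    # Master: interpolate h from the received evaluations, then sum h(beta_j),
    # which equals sum_j X_j^T X_j w = X^T X w.
    used_alphas = [alphas[i] for i in used]
    total = np.zeros(X.shape[1])
    for beta in betas:
        for pos, i in enumerate(used):
            total += lagrange_basis(used_alphas, pos, beta) * results[i]
    return total

rng = np.random.default_rng(0)
X_demo = rng.standard_normal((12, 4))
w_demo = rng.standard_normal(4)
recovered = coded_gram_times_w(X_demo, w_demo, num_blocks=3, num_workers=8,
                               finished=[0, 2, 3, 5, 7])
print(np.allclose(recovered, X_demo.T @ X_demo @ w_demo))
```

In this sketch each worker stores 1/K of the data and the master needs any 2K - 1 of the workers, so the remaining workers can straggle without delaying the iteration. Real-valued interpolation at equally spaced points becomes numerically fragile as K grows, so the point choice here is only suitable for small demonstrations.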

research
10/27/2017

Near-Optimal Straggler Mitigation for Distributed Gradient Methods

Modern learning algorithms use gradient descent updates to train inferen...
research
01/31/2018

On the Optimal Recovery Threshold of Coded Matrix Multiplication

We provide novel coded computation strategies for distributed matrix-mat...
research
07/10/2023

Coded Distributed Image Classification

In this paper, we present a coded computation (CC) scheme for distribute...
research
06/02/2020

Age-Based Coded Computation for Bias Reduction in Distributed Learning

Coded computation can be used to speed up distributed learning in the pr...
research
05/16/2021

LocalNewton: Reducing Communication Bottleneck for Distributed Learning

To address the communication bottleneck problem in distributed optimizat...
research
01/21/2020

Serverless Straggler Mitigation using Local Error-Correcting Codes

Inexpensive cloud services, such as serverless computing, are often vuln...
research
04/10/2020

Coded Secure Multi-Party Computation for Massive Matrices with Adversarial Nodes

In this work, we consider the problem of secure multi-party computation ...
