DRACO: Robust Distributed Training via Redundant Gradients

03/27/2018
by Lingjiao Chen, et al.

Distributed model training is vulnerable to worst-case system failures and adversarial compute nodes, i.e., nodes that use malicious updates to corrupt the global model stored at a parameter server (PS). To tolerate node failures and adversarial attacks, recent work suggests using variants of the geometric median to aggregate distributed updates at the PS, in place of bulk averaging. Although median-based update rules are robust to adversarial nodes, their computational cost can be prohibitive in large-scale settings and their convergence guarantees often require relatively strong assumptions. In this work, we present DRACO, a scalable framework for robust distributed training that uses ideas from coding theory. In DRACO, each compute node evaluates redundant gradients that are then used by the parameter server to eliminate the effects of adversarial updates. We present problem-independent robustness guarantees for DRACO and show that the model it produces is identical to the one trained in the adversary-free setup. We provide extensive experiments on real datasets and distributed setups across a variety of large-scale models, where we show that DRACO is several times to orders of magnitude faster than median-based approaches.
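To make the redundancy idea concrete, below is a minimal sketch (not DRACO's actual implementation) of one simple instantiation: an r-fold repetition assignment of data batches to compute nodes, with majority-vote decoding at the parameter server. With r = 2s + 1 copies of every gradient, up to s adversarial nodes per group cannot alter the decoded value, so the aggregate matches the adversary-free average exactly. All function names and the group layout here are illustrative assumptions.

```python
# Minimal sketch (illustrative, not DRACO's actual API): r-fold repetition of
# each batch gradient across r compute nodes, with majority-vote decoding at
# the parameter server (PS).
import numpy as np


def assign_groups(num_batches, r):
    """Assign each batch to a group of r distinct compute nodes."""
    # Group g holds nodes g*r, ..., g*r + r - 1 (an assumed, simple layout).
    return [list(range(g * r, (g + 1) * r)) for g in range(num_batches)]


def compute_node_updates(groups, true_gradients, adversarial_nodes):
    """Honest nodes return the true gradient of their batch; adversarial
    nodes return an arbitrary vector of the same shape."""
    updates = {}
    for g, nodes in enumerate(groups):
        for node in nodes:
            if node in adversarial_nodes:
                updates[node] = np.random.randn(*true_gradients[g].shape) * 100.0
            else:
                updates[node] = true_gradients[g].copy()
    return updates


def majority_decode(groups, updates):
    """PS-side decoding: for each batch, keep the gradient reported by a strict
    majority of its group, then average the decoded batch gradients."""
    decoded = []
    for nodes in groups:
        candidates = [updates[n] for n in nodes]
        for cand in candidates:
            matches = sum(np.array_equal(cand, other) for other in candidates)
            if matches > len(nodes) // 2:
                decoded.append(cand)
                break
    return np.mean(decoded, axis=0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    num_batches, dim, r = 4, 5, 3          # r = 3 tolerates s = 1 adversary per group
    true_grads = [rng.standard_normal(dim) for _ in range(num_batches)]
    groups = assign_groups(num_batches, r)
    updates = compute_node_updates(groups, true_grads, adversarial_nodes={0, 7})
    aggregated = majority_decode(groups, updates)
    # Unlike median-style aggregation, the result equals the clean average exactly.
    assert np.allclose(aggregated, np.mean(true_grads, axis=0))
    print("decoded aggregate matches clean average:", aggregated)
```

The redundancy factor r is the price paid for exact recovery: each gradient is computed r times, but the PS-side decoding is cheap compared with computing a geometric median over all updates.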

Related research

07/29/2019 · DETOX: A Redundancy-based Framework for Faster and More Robust Gradient Aggregation
To improve the resilience of distributed training to worst-case, or Byza...

08/20/2015 · AdaDelay: Delay Adaptive Distributed Stochastic Convex Optimization
We study distributed stochastic convex optimization under the delayed gr...

06/10/2020 · Anytime MiniBatch: Exploiting Stragglers in Online Distributed Optimization
Distributed optimization is vital in solving large-scale machine learnin...

08/05/2021 · Aspis: A Robust Detection System for Distributed Learning
State of the art machine learning models are routinely trained on large ...

09/21/2014 · Distributed Robust Learning
We propose a framework for distributed robust statistical learning on b...

02/12/2023 · Flag Aggregator: Scalable Distributed Training under Failures and Augmented Losses using Convex Optimization
Modern ML applications increasingly rely on complex deep learning models...

04/01/2022 · Robust and Efficient Aggregation for Distributed Learning
Distributed learning paradigms, such as federated and decentralized lear...
