Towards Geo-Distributed Machine Learning

03/30/2016
by Ignacio Cano, et al.

Latency to end-users and regulatory requirements push large companies to build data centers all around the world. The resulting data is "born" geographically distributed. On the other hand, many machine learning applications require a global view of such data in order to achieve the best results. These types of applications form a new class of learning problems, which we call Geo-Distributed Machine Learning (GDML). Such applications need to cope with: 1) scarce and expensive cross-data center bandwidth, and 2) growing privacy concerns that are pushing for stricter data sovereignty regulations. Current solutions to learning from geo-distributed data sources revolve around the idea of first centralizing the data in one data center, and then training locally. As machine learning algorithms are communication-intensive, the cost of centralizing the data is thought to be offset by the lower cost of intra-data center communication during training. In this work, we show that the current centralized practice can be far from optimal, and propose a system for doing geo-distributed training. Furthermore, we argue that the geo-distributed approach is structurally more amenable to dealing with regulatory constraints, as raw data never leaves the source data center. Our empirical evaluation on three real datasets confirms the general validity of our approach, and shows that GDML is not only possible but also advisable in many scenarios.
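
To make the abstract's trade-off concrete, below is a minimal, illustrative back-of-the-envelope sketch in Python comparing the cross-data-center (WAN) traffic of the centralized approach (ship all raw data to one data center, then train locally) against a geo-distributed approach (keep raw data in place and exchange model-sized updates each round). All parameter values and function names are hypothetical and not taken from the paper's evaluation; they only illustrate the reasoning.

```python
# Illustrative back-of-the-envelope comparison of cross-data-center traffic.
# All parameters below are hypothetical and chosen only to show the trade-off
# described in the abstract; they are not taken from the paper's evaluation.

def centralized_wan_gb(raw_data_gb_per_dc: float, num_dcs: int) -> float:
    """Ship all raw data from the other data centers to one central site."""
    return raw_data_gb_per_dc * (num_dcs - 1)

def geo_distributed_wan_gb(model_mb: float, rounds: int, num_dcs: int) -> float:
    """Keep raw data in place; each round, every remote data center sends and
    receives one model-sized update to/from a coordinating site."""
    per_round_gb = 2 * model_mb / 1024 * (num_dcs - 1)
    return per_round_gb * rounds

if __name__ == "__main__":
    num_dcs = 4
    raw_data_gb_per_dc = 500.0   # raw training data held at each data center
    model_mb = 100.0             # size of the model/update exchanged per round
    rounds = 200                 # communication rounds until convergence

    central = centralized_wan_gb(raw_data_gb_per_dc, num_dcs)
    geo = geo_distributed_wan_gb(model_mb, rounds, num_dcs)
    print(f"centralize-then-train WAN traffic: {central:.0f} GB")
    print(f"geo-distributed training WAN traffic: {geo:.0f} GB")
```

Under these assumed numbers, geo-distributed training moves an order of magnitude less data across the WAN, and the raw data never leaves its source data center. If the model is very large relative to the data, or many communication rounds are needed, the comparison can flip, which is consistent with the abstract's claim that the geo-distributed approach is advisable in many, but not all, scenarios.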

Related research

10/18/2019  Machine Learning Systems for Highly-Distributed and Rapidly-Growing Data
The usability and practicality of any machine learning (ML) applications...

04/11/2019  Robust Coreset Construction for Distributed Machine Learning
Motivated by the need of solving machine learning problems over distribu...

08/30/2019  GADMM: Fast and Communication Efficient Framework for Distributed Machine Learning
When the data is distributed across multiple servers, efficient data exc...

05/03/2021  OCTOPUS: Overcoming Performance and Privatization Bottlenecks in Distributed Learning
The diversity and quantity of the data warehousing, gathering data from ...

02/08/2021  Communication-efficient k-Means for Edge-based Machine Learning
We consider the problem of computing the k-means centers for a large hig...

01/09/2023  Distributed Sparse Linear Regression under Communication Constraints
In multiple domains, statistical tasks are performed in distributed sett...

02/22/2018  SparCML: High-Performance Sparse Communication for Machine Learning
One of the main drivers behind the rapid recent advances in machine lear...
