Towards Geo-Distributed Machine Learning

by Ignacio Cano, et al.

Latency to end-users and regulatory requirements push large companies to build data centers all around the world. The resulting data is "born" geographically distributed. On the other hand, many machine learning applications require a global view of such data in order to achieve the best results. These types of applications form a new class of learning problems, which we call Geo-Distributed Machine Learning (GDML). Such applications need to cope with: 1) scarce and expensive cross-data center bandwidth, and 2) growing privacy concerns that are pushing for stricter data sovereignty regulations. Current solutions to learning from geo-distributed data sources revolve around the idea of first centralizing the data in one data center, and then training locally. As machine learning algorithms are communication-intensive, the cost of centralizing the data is thought to be offset by the lower cost of intra-data center communication during training. In this work, we show that the current centralized practice can be far from optimal, and propose a system for doing geo-distributed training. Furthermore, we argue that the geo-distributed approach is structurally more amenable to dealing with regulatory constraints, as raw data never leaves the source data center. Our empirical evaluation on three real datasets confirms the general validity of our approach, and shows that GDML is not only possible but also advisable in many scenarios.




