Parallel training of DNNs with Natural Gradient and Parameter Averaging

10/27/2014
by Daniel Povey et al.

We describe the neural-network training framework used in the Kaldi speech recognition toolkit, which is geared towards training DNNs with large amounts of training data using multiple GPU-equipped or multi-core machines. In order to be as hardware-agnostic as possible, we needed a way to use multiple machines without generating excessive network traffic. Our method is to average the neural network parameters periodically (typically every minute or two), and redistribute the averaged parameters to the machines for further training. Each machine sees different data. By itself, this method does not work very well. However, we have another method, an approximate and efficient implementation of Natural Gradient for Stochastic Gradient Descent (NG-SGD), which seems to allow our periodic-averaging method to work well, as well as substantially improving the convergence of SGD on a single machine.
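To make the averaging scheme concrete, here is a minimal sketch of the idea in Python/NumPy, using a toy linear model and synthetic data. The worker count, averaging interval, and hyperparameters are illustrative assumptions, not Kaldi's actual implementation; in the real framework, each "worker" is a separate GPU-equipped or multi-core machine and the averaging happens every minute or two of training.

```python
# Toy sketch of periodic parameter averaging (not Kaldi's actual code).
import numpy as np

rng = np.random.default_rng(0)
true_w = rng.normal(size=10)

def make_shard(n=2000):
    """Synthetic data shard; each worker sees different data."""
    X = rng.normal(size=(n, 10))
    y = X @ true_w + 0.1 * rng.normal(size=n)
    return X, y

def sgd_steps(w, X, y, lr=0.01, steps=200, batch=32):
    """Plain minibatch SGD on squared error, run locally on one worker."""
    for _ in range(steps):
        idx = rng.integers(0, len(y), size=batch)
        grad = 2 * X[idx].T @ (X[idx] @ w - y[idx]) / batch
        w = w - lr * grad
    return w

n_workers = 4
shards = [make_shard() for _ in range(n_workers)]
w_global = np.zeros(10)

for outer in range(10):   # one outer iteration ~ "every minute or two"
    # redistribute the averaged parameters, then train locally in parallel
    local = [sgd_steps(w_global.copy(), X, y) for X, y in shards]
    # average the workers' parameters to form the new global model
    w_global = np.mean(local, axis=0)
    print(outer, np.linalg.norm(w_global - true_w))
```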

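For the natural-gradient side, the sketch below illustrates only the general idea: preconditioning each stochastic gradient with an approximate inverse Fisher matrix, here a simple diagonal running estimate chosen for clarity. Kaldi's NG-SGD uses an efficient low-rank factored Fisher approximation rather than this diagonal one, so treat this as a conceptual stand-in, not the paper's algorithm.

```python
# Conceptual sketch of natural-gradient preconditioning with a *diagonal*
# Fisher approximation (an assumption for clarity; Kaldi's NG-SGD uses an
# efficient low-rank factored approximation instead).
import numpy as np

rng = np.random.default_rng(1)
dim = 10
w = np.zeros(dim)
fisher_diag = np.ones(dim)   # running estimate of E[g * g], per coordinate
decay, eps, lr = 0.95, 1e-8, 0.05

def stochastic_grad(w):
    """Stand-in noisy gradient of a toy quadratic; substitute a real model."""
    return 2.0 * (w - 1.0) + 0.1 * rng.normal(size=dim)

for step in range(500):
    g = stochastic_grad(w)
    # update the Fisher estimate from the current stochastic gradient
    fisher_diag = decay * fisher_diag + (1.0 - decay) * g * g
    # natural-gradient direction: scale each coordinate by inverse Fisher
    pg = g / (fisher_diag + eps)
    # rescale so preconditioning changes the direction, not the magnitude
    # (the paper applies a similar norm-preserving rescale for stability)
    pg *= np.linalg.norm(g) / (np.linalg.norm(pg) + eps)
    w -= lr * pg
```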

Related research

07/13/2020 · Adaptive Periodic Averaging: A Practical Approach to Reducing Communication in Distributed Learning
Stochastic Gradient Descent (SGD) is the key learning algorithm for many...

04/23/2023 · Hierarchical Weight Averaging for Deep Neural Networks
Despite the simplicity, stochastic gradient descent (SGD)-like algorithm...

12/06/2018 · Elastic Gossip: Distributing Neural Network Training Using Gossip-like Protocols
Distributing Neural Network training is of particular interest for sever...

07/05/2015 · Experiments on Parallel Training of Deep Neural Network using Model Averaging
In this work we apply model averaging to parallel training of deep neura...

03/17/2017 · Empirical Evaluation of Parallel Training Algorithms on Acoustic Modeling
Deep learning models (DLMs) are state-of-the-art techniques in speech re...

03/22/2021 · Data Cleansing for Deep Neural Networks with Storage-efficient Approximation of Influence Functions
Identifying the influence of training data for data cleansing can improv...

05/04/2023 · A Bootstrap Algorithm for Fast Supervised Learning
Training a neural network (NN) typically relies on some type of curve-fo...
