Accumulated Gradient Normalization

Joeri R. Hermans; Gerasimos Spanakis; Rico Möckel

Advanced Computing Sciences

Research output: Chapter in Book/Report/Conference proceeding › Chapter › Academic

Abstract

This work addresses the instability in asynchronous data-parallel optimization. It does so by introducing a novel distributed optimizer that can efficiently optimize a centralized model under communication constraints. The optimizer achieves this by pushing a normalized sequence of first-order gradients to a parameter server. This implies that the magnitude of a worker delta is smaller than that of an accumulated gradient, and that the delta provides a better direction towards a minimum than first-order gradients. This in turn forces possible implicit momentum fluctuations to be more aligned, since we assume that all workers contribute towards a single minimum. As a result, our approach mitigates the parameter staleness problem more effectively, since staleness in asynchrony induces (implicit) momentum, and achieves a better convergence rate than other optimizers such as asynchronous EASGD, as we show empirically.
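The worker update described in the abstract can be sketched as follows. This is a minimal, single-worker illustration under stated assumptions, not the authors' implementation: the toy quadratic objective, the function names `grad`, `worker_delta`, and `server_apply`, and the choice to normalize by dividing the accumulated update by the number of local steps are all assumptions made for this example. Asynchrony and multiple workers are deliberately left out.

```python
def grad(theta):
    # Gradient of the toy objective f(theta) = 0.5 * sum(theta_i ** 2).
    # The objective is an assumption chosen only to keep the sketch runnable.
    return list(theta)


def worker_delta(theta, lr, steps):
    """Run `steps` local first-order updates starting from `theta`,
    accumulate them, and return the accumulated update divided by `steps`."""
    local = list(theta)
    accumulated = [0.0] * len(theta)
    for _ in range(steps):
        g = grad(local)
        for i, g_i in enumerate(g):
            step = -lr * g_i
            accumulated[i] += step   # sequence of first-order updates
            local[i] += step         # worker keeps exploring locally
    # Normalization (assumed form): divide by the number of local steps.
    return [a / steps for a in accumulated]


def server_apply(theta, delta):
    # Hypothetical parameter-server commit: add the worker delta to the
    # centralized parameters.
    return [t + d for t, d in zip(theta, delta)]


center = [4.0, -2.0]
for _ in range(50):
    delta = worker_delta(center, lr=0.1, steps=5)
    center = server_apply(center, delta)
```

In this sketch the committed delta is smaller in magnitude than the raw accumulated update by a factor of `steps`, while keeping its direction, which matches the abstract's description of a smaller worker delta than an accumulated gradient.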

Original language	English
Title of host publication	Proceedings of the 9th Asian Conference on Machine Learning
Editors	Min-Ling Zhang, Yung-Kyun Noh
Publisher	Proceedings of Machine Learning Research
Pages	439-454
Number of pages	16
Volume	77
Publication status	Published - Nov 2017

Publication series

Series	Proceedings of Machine Learning Research

Access to Document

Full text: final published version, 4.72 MB. Licence: Taverne

http://proceedings.mlr.press/v77/hermans17a.html

Cite this

@inbook{d4abd964a4e24090829b05b41fc8f7fb,

title = "Accumulated Gradient Normalization",

abstract = "This work addresses the instability in asynchronous data-parallel optimization. It does so by introducing a novel distributed optimizer that can efficiently optimize a centralized model under communication constraints. The optimizer achieves this by pushing a normalized sequence of first-order gradients to a parameter server. This implies that the magnitude of a worker delta is smaller than that of an accumulated gradient, and that the delta provides a better direction towards a minimum than first-order gradients. This in turn forces possible implicit momentum fluctuations to be more aligned, since we assume that all workers contribute towards a single minimum. As a result, our approach mitigates the parameter staleness problem more effectively, since staleness in asynchrony induces (implicit) momentum, and achieves a better convergence rate than other optimizers such as asynchronous {EASGD}, as we show empirically.",

author = "Hermans, {Joeri R.} and Gerasimos Spanakis and Rico M{\"o}ckel",

year = "2017",

month = nov,

language = "English",

volume = "77",

series = "Proceedings of Machine Learning Research",

publisher = "Proceedings of Machine Learning Research",

pages = "439--454",

editor = "Min-Ling Zhang and Yung-Kyun Noh",

booktitle = "Proceedings of the 9th Asian Conference on Machine Learning",

}

Accumulated Gradient Normalization. / Hermans, Joeri R.; Spanakis, Gerasimos; Möckel, Rico.
Proceedings of the 9th Asian Conference on Machine Learning. ed. / Min-Ling Zhang; Yung-Kyun Noh. Vol. 77 Proceedings of Machine Learning Research, 2017. p. 439-454 (Proceedings of Machine Learning Research).

TY - CHAP

T1 - Accumulated Gradient Normalization

AU - Hermans, Joeri R.

AU - Spanakis, Gerasimos

AU - Möckel, Rico

PY - 2017/11

Y1 - 2017/11

AB - This work addresses the instability in asynchronous data-parallel optimization. It does so by introducing a novel distributed optimizer that can efficiently optimize a centralized model under communication constraints. The optimizer achieves this by pushing a normalized sequence of first-order gradients to a parameter server. This implies that the magnitude of a worker delta is smaller than that of an accumulated gradient, and that the delta provides a better direction towards a minimum than first-order gradients. This in turn forces possible implicit momentum fluctuations to be more aligned, since we assume that all workers contribute towards a single minimum. As a result, our approach mitigates the parameter staleness problem more effectively, since staleness in asynchrony induces (implicit) momentum, and achieves a better convergence rate than other optimizers such as asynchronous EASGD, as we show empirically.

M3 - Chapter

VL - 77

T3 - Proceedings of Machine Learning Research

SP - 439

EP - 454

BT - Proceedings of the 9th Asian Conference on Machine Learning

A2 - Zhang, Min-Ling

A2 - Noh, Yung-Kyun

PB - Proceedings of Machine Learning Research

ER -