Paper Title

AdaSGD: Bridging the gap between SGD and Adam

Paper Authors

Jiaxuan Wang, Jenna Wiens

Paper Abstract

In the context of stochastic gradient descent (SGD) and adaptive moment estimation (Adam), researchers have recently proposed optimization techniques that transition from Adam to SGD with the goal of improving both convergence and generalization performance. However, precisely how each approach trades off early progress and generalization is not well understood; thus, it is unclear when, or even if, one should transition from one approach to the other. In this work, by first studying the convex setting, we identify potential contributors to observed differences in performance between SGD and Adam. In particular, we provide theoretical insights for when and why Adam outperforms SGD and vice versa. We address the performance gap by adapting a single global learning rate for SGD, which we refer to as AdaSGD. We justify this proposed approach with empirical analyses in non-convex settings. On several datasets that span three different domains, we demonstrate how AdaSGD combines the benefits of both SGD and Adam, eliminating the need for approaches that transition from Adam to SGD.
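The abstract does not spell out the exact AdaSGD update rule, but its core idea, adapting a single global learning rate for SGD rather than Adam's per-coordinate rates, can be illustrated with a minimal sketch. In the snippet below, the names `sgd_step`, `AdamLike`, and `AdaSGDLike` are illustrative and not from the paper; the AdaSGD-like step is an assumption in which the step size is rescaled by one scalar second-moment estimate of the whole gradient, whereas the actual algorithm is defined in the paper itself.

```python
import numpy as np

def sgd_step(w, g, lr=0.1):
    """Plain SGD: one fixed global learning rate."""
    return w - lr * g

class AdamLike:
    """Standard Adam update with per-coordinate adaptive learning rates."""
    def __init__(self, dim, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
        self.lr, self.b1, self.b2, self.eps = lr, b1, b2, eps
        self.m = np.zeros(dim)   # first-moment estimate
        self.v = np.zeros(dim)   # per-coordinate second-moment estimate
        self.t = 0

    def step(self, w, g):
        self.t += 1
        self.m = self.b1 * self.m + (1 - self.b1) * g
        self.v = self.b2 * self.v + (1 - self.b2) * g * g
        m_hat = self.m / (1 - self.b1 ** self.t)   # bias correction
        v_hat = self.v / (1 - self.b2 ** self.t)
        return w - self.lr * m_hat / (np.sqrt(v_hat) + self.eps)

class AdaSGDLike:
    """Hypothetical AdaSGD-style step (an assumption, not the paper's code):
    SGD whose single global learning rate is rescaled by one scalar
    second-moment estimate of the whole gradient."""
    def __init__(self, lr=0.1, b2=0.999, eps=1e-8):
        self.lr, self.b2, self.eps = lr, b2, eps
        self.v = 0.0   # one scalar, shared by all coordinates
        self.t = 0

    def step(self, w, g):
        self.t += 1
        self.v = self.b2 * self.v + (1 - self.b2) * float(g @ g) / g.size
        v_hat = self.v / (1 - self.b2 ** self.t)   # bias correction
        return w - self.lr * g / (np.sqrt(v_hat) + self.eps)

if __name__ == "__main__":
    # Toy quadratic f(w) = 0.5 * ||w||^2, so the gradient at w is w itself.
    w = np.ones(5)
    opt = AdaSGDLike(lr=0.1)
    for _ in range(200):
        w = opt.step(w, w)
    print("norm after 200 steps:", np.linalg.norm(w))
```

The contrast to keep in mind is that `AdamLike` maintains a vector `v` (one adaptive denominator per parameter), while the sketched `AdaSGDLike` keeps a single scalar, so all coordinates still share one global learning rate as in SGD.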
