Paper Title
Gradient Descent, Stochastic Optimization, and Other Tales
Paper Authors
Paper Abstract
The goal of this paper is to debunk and dispel the magic behind black-box optimizers and stochastic optimizers. It aims to build a solid foundation for how and why these techniques work. This manuscript crystallizes that knowledge by deriving, from simple intuitions, the mathematics behind the strategies. The tutorial does not shy away from addressing both the formal and the informal aspects of gradient descent and stochastic optimization methods; in doing so, it hopes to provide readers with a deeper understanding of these techniques, as well as the when, the how, and the why of applying them. Gradient descent is one of the most popular algorithms for performing optimization and by far the most common way to optimize machine learning tasks. Its stochastic version has received attention in recent years, particularly for optimizing deep neural networks, where the gradient computed from a single sample or a mini-batch of samples is used to save computational resources and to escape saddle points. In 1951, Robbins and Monro published \textit{A Stochastic Approximation Method}, one of the first modern treatments of stochastic optimization, which estimates local gradients from new batches of samples. Stochastic optimization has since become a core technology in machine learning, largely due to the development of the backpropagation algorithm for fitting neural networks. The sole aim of this article is to give a self-contained introduction to the concepts and mathematical tools of gradient descent and stochastic optimization.
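To make the contrast mentioned in the abstract concrete, the following minimal sketch (not taken from the paper; all names, learning rates, and batch sizes are illustrative) compares full-batch gradient descent with a mini-batch stochastic variant on a simple least-squares objective $f(w) = \frac{1}{2n}\lVert Xw - y\rVert^2$, whose gradient is $\nabla f(w) = \frac{1}{n}X^\top(Xw - y)$.

```python
# Sketch: full-batch gradient descent vs. mini-batch SGD on least squares.
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)


def gradient_descent(X, y, lr=0.1, iters=200):
    """Full-batch gradient descent: each step uses the exact gradient."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        grad = X.T @ (X @ w - y) / len(y)  # exact gradient over all n samples
        w -= lr * grad
    return w


def sgd(X, y, lr=0.1, epochs=20, batch_size=32):
    """Mini-batch SGD: each step estimates the gradient from a random batch."""
    w = np.zeros(X.shape[1])
    n = len(y)
    for _ in range(epochs):
        perm = rng.permutation(n)
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]
            # noisy gradient estimate from the current mini-batch
            grad = X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)
            w -= lr * grad
    return w


print("GD error: ", np.linalg.norm(gradient_descent(X, y) - w_true))
print("SGD error:", np.linalg.norm(sgd(X, y) - w_true))
```

Both routines converge to roughly the same solution here; the point of the sketch is that SGD touches only a mini-batch per update, which is what makes it attractive for large-scale and deep-learning problems, at the cost of noisier steps.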