Paper Title
Two-Level K-FAC Preconditioning for Deep Learning
Paper Authors
Paper Abstract
In the context of deep learning, many optimization methods use gradient covariance information to accelerate the convergence of Stochastic Gradient Descent. In particular, starting with Adagrad, a seemingly endless line of research advocates the use of diagonal approximations of the so-called empirical Fisher matrix in stochastic gradient-based algorithms, with Adam arguably being the most prominent example. However, in recent years, several works have cast doubt on the theoretical basis of preconditioning with the empirical Fisher matrix, and it has been shown that more sophisticated approximations of the actual Fisher matrix more closely resemble the theoretically well-motivated Natural Gradient Descent. One particularly successful variant of such methods is the so-called K-FAC optimizer, which uses a Kronecker-factored block-diagonal Fisher approximation as a preconditioner. In this work, drawing inspiration from two-level domain decomposition methods used as preconditioners in the field of scientific computing, we extend K-FAC by enriching it with off-diagonal (i.e., global) curvature information in a computationally efficient way. We achieve this by adding a coarse-space correction term to the preconditioner, which captures the global Fisher information matrix at a coarser scale. We present a small set of experimental results suggesting improved convergence behaviour of our proposed method.
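The abstract does not spell out the exact form of the two-level preconditioner. As a rough sketch, assuming the coarse-space correction is combined additively with the layer-wise Kronecker-factored blocks, in the spirit of two-level additive Schwarz preconditioners, the preconditioned update could look as follows; the symbols (restriction operator R_0, damped global Fisher matrix F, layer blocks F_ell) are illustrative assumptions and not notation taken from the paper.

\[
P^{-1} \;=\; \underbrace{\operatorname{blockdiag}\!\left(F_1^{-1},\,\dots,\,F_L^{-1}\right)}_{\text{layer-wise K-FAC blocks}}
\;+\;
\underbrace{R_0^{\top}\!\left(R_0 F R_0^{\top}\right)^{-1} R_0}_{\text{coarse-space correction}},
\qquad
F_\ell \approx A_{\ell-1} \otimes G_\ell ,
\]

\[
\theta \;\leftarrow\; \theta - \eta\, P^{-1} \nabla_\theta \mathcal{L}(\theta).
\]

Here \(A_{\ell-1}\) and \(G_\ell\) are the usual K-FAC covariance factors of layer inputs and pre-activation gradients; the coarse term \(R_0^{\top}(R_0 F R_0^{\top})^{-1} R_0\) only requires forming and inverting a small matrix of the coarse-space dimension, which is what keeps the global correction computationally cheap in this assumed formulation.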