Paper Title
MSTGD: A Memory Stochastic sTratified Gradient Descent Method with an Exponential Convergence Rate
Paper Authors
Paper Abstract
The fluctuation of the gradient expectation and variance caused by parameter updates between consecutive iterations is neglected or conflated by current mainstream gradient optimization algorithms. Exploiting this fluctuation effect, combined with a stratified sampling strategy, this paper designs a novel \underline{M}emory \underline{S}tochastic s\underline{T}ratified Gradient Descent (\underline{MST}GD) algorithm with an exponential convergence rate. Specifically, MSTGD uses two strategies for variance reduction: the first is to perform variance reduction according to the proportion $p$ of reused historical gradients, which is estimated from the mean and variance of the sample gradients before and after an iteration; the other is stratified sampling by category. The statistic $\bar{G}_{mst}$ designed under these two strategies can be adaptively unbiased, and its variance decays at a geometric rate. This enables MSTGD, based on $\bar{G}_{mst}$, to attain an exponential convergence rate of the form $\lambda^{2(k-k_0)}$, where $\lambda \in (0,1)$, $k$ is the number of iteration steps, and $\lambda$ is a variable related to the proportion $p$. Unlike most other algorithms that claim an exponential convergence rate, the convergence rate of MSTGD is independent of parameters such as the dataset size $N$ and the batch size $n$, and it can be achieved with a constant step size. Theoretical and experimental results show the effectiveness of MSTGD.
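The abstract gives no pseudocode, so the following is a minimal illustrative sketch of the two strategies it describes: stratified sampling by category, and mixing a proportion $p$ of the remembered historical gradient into the update. The function names (`stratified_indices`, `mstgd_step`), the convex mixing rule `p * g_mem + (1 - p) * g_fresh`, and the equal per-class allocation are assumptions made for illustration, not the authors' exact method; in particular, the paper's estimator for $p$ (from gradient means and variances before and after an update) is not reproduced here.

```python
import numpy as np

# Illustrative sketch (not the paper's exact algorithm) of the two
# variance-reduction strategies named in the abstract:
#   (1) reuse a proportion p of the remembered historical gradient, and
#   (2) compute the fresh gradient on a minibatch stratified by class.

def stratified_indices(labels, per_class, rng):
    """Sample `per_class` indices from each class (stratified by category)."""
    idx = []
    for c in np.unique(labels):
        members = np.flatnonzero(labels == c)
        idx.extend(rng.choice(members, size=per_class, replace=False))
    return np.array(idx)

def mstgd_step(w, X, y, grad_fn, g_mem, p, lr, per_class, rng):
    """One illustrative MSTGD-style update.

    g_mem : remembered gradient from earlier iterations (the "memory").
    p     : assumed mixing proportion of the historical gradient; in the
            paper it is estimated adaptively from the mean and variance of
            sample gradients before and after the parameter update.
    """
    idx = stratified_indices(y, per_class, rng)
    g_fresh = grad_fn(w, X[idx], y[idx])      # gradient on stratified batch
    g_mst = p * g_mem + (1.0 - p) * g_fresh   # memory/fresh convex mix
    w_new = w - lr * g_mst                    # constant step size
    return w_new, g_mst
```

In a training loop one would call `mstgd_step` repeatedly, feeding back the returned `g_mst` as the next iteration's `g_mem` (e.g. with `rng = np.random.default_rng(0)` and `grad_fn` returning the average loss gradient over the minibatch); this matches the abstract's claim that the method runs at a constant step size, with the variance reduction coming from the statistic rather than from a decaying learning rate.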