Paper Title
RNN Training along Locally Optimal Trajectories via Frank-Wolfe Algorithm
Paper Authors
Paper Abstract
We propose a novel and efficient training method for RNNs that iteratively seeks a local minimum on the loss surface within a small region and leverages the resulting direction vector for the update in an outer loop. We propose to utilize the Frank-Wolfe (FW) algorithm in this context. Although FW implicitly involves normalized gradients, which can lead to a slow convergence rate, we develop a novel RNN training method whose overall training cost, surprisingly, is empirically observed to be lower than that of back-propagation, even with the additional inner-loop computation. Our method leads to a new Frank-Wolfe variant that is in essence an SGD algorithm with a restart scheme. We prove that under certain conditions our algorithm has a sublinear convergence rate of $O(1/\epsilon)$ for $\epsilon$ error. We then conduct empirical experiments on several benchmark datasets, including ones that exhibit long-term dependencies, and show significant performance improvements. We also experiment with deep RNN architectures and show efficient training performance. Finally, we demonstrate that our training method is robust to noisy data.
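As a concrete illustration of the two-loop scheme the abstract describes, below is a minimal sketch assuming a PyTorch model trained with standard autograd. The choice of an L2 ball as the feasible region, its radius `delta`, the number of inner FW steps `inner_steps`, and the outer step size `outer_lr` are all illustrative assumptions, not details taken from the paper.

```python
import torch

# Illustrative sketch only: the L2-ball region, `delta`, `inner_steps`,
# and `outer_lr` are assumed hyperparameters, not values from the paper.

def fw_inner_direction(model, loss_fn, inputs, targets,
                       delta=0.1, inner_steps=5):
    """Run Frank-Wolfe over a small L2 ball of radius `delta` centered at
    the current parameters, then return the displacement found (the
    'locally optimal' direction) and restore the original parameters."""
    params = [p for p in model.parameters() if p.requires_grad]
    center = [p.detach().clone() for p in params]
    for t in range(inner_steps):
        loss = loss_fn(model(inputs), targets)
        grads = torch.autograd.grad(loss, params)
        gamma = 2.0 / (t + 2.0)  # classic FW step-size schedule
        with torch.no_grad():
            # Joint gradient norm over all parameters; dividing by it is
            # where FW's implicit gradient normalization shows up.
            gnorm = torch.cat([g.reshape(-1) for g in grads]).norm() + 1e-12
            for p, c, g in zip(params, center, grads):
                # Linear-minimization oracle for the L2 ball:
                # s = argmin_{||s - c|| <= delta} <g, s> = c - delta * g / ||g||
                s = c - delta * g / gnorm
                p.mul_(1.0 - gamma).add_(gamma * s)
    direction = [p.detach() - c for p, c in zip(params, center)]
    with torch.no_grad():  # move back to the ball's center before returning
        for p, c in zip(params, center):
            p.copy_(c)
    return direction

def train_step(model, loss_fn, inputs, targets, outer_lr=1.0):
    """Outer-loop update along the direction found by the inner FW loop."""
    direction = fw_inner_direction(model, loss_fn, inputs, targets)
    with torch.no_grad():
        for p, d in zip(model.parameters(), direction):
            p.add_(outer_lr * d)
```

With `outer_lr = 1.0`, the outer update simply adopts the inner loop's local minimizer and restarts the next FW search from there, which matches the abstract's characterization of the method as an SGD algorithm with a restart scheme.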