Paper Title
Characterizing & Finding Good Data Orderings for Fast Convergence of Sequential Gradient Methods
Paper Authors
Paper Abstract
While SGD, which samples from the data with replacement, is widely studied in theory, a variant called Random Reshuffling (RR) is more common in practice. RR iterates through random permutations of the dataset and has been shown to converge faster than SGD. When the order is chosen deterministically, a variant called incremental gradient descent (IG), existing convergence bounds show an improvement over SGD but are worse than those for RR. However, these bounds do not differentiate between good and bad orderings and hold for the worst choice of order. Meanwhile, in some cases, choosing the right order when using IG can lead to faster convergence than RR. In this work, we quantify the effect of order on convergence speed, obtaining convergence bounds based on the chosen sequence of permutations while also recovering previous results for RR. In addition, we show the benefits of using structured shuffling, both in theory and in practice, when various levels of abstraction (e.g., tasks, classes, augmentations) exist in the dataset. Finally, relying on our measure, we develop a greedy algorithm for choosing good orders during training, achieving superior performance (by more than 14 percent in accuracy) over RR.
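The three sampling schemes the abstract contrasts differ only in how indices are drawn each epoch. A minimal sketch on a least-squares objective, assuming per-sample gradient steps; the function names and the quadratic loss are illustrative choices, not the paper's setup:

```python
import numpy as np

def sgd_epoch(w, X, y, lr, rng):
    """SGD with replacement: each step samples an index uniformly at random."""
    n = len(y)
    for _ in range(n):
        i = rng.integers(n)
        w = w - lr * (X[i] @ w - y[i]) * X[i]  # gradient of 0.5*(x_i^T w - y_i)^2
    return w

def rr_epoch(w, X, y, lr, rng):
    """Random Reshuffling (RR): a fresh random permutation each epoch."""
    for i in rng.permutation(len(y)):
        w = w - lr * (X[i] @ w - y[i]) * X[i]
    return w

def ig_epoch(w, X, y, lr, order):
    """Incremental gradient (IG): a fixed, deterministic order every epoch.
    The convergence behavior depends on which `order` is supplied."""
    for i in order:
        w = w - lr * (X[i] @ w - y[i]) * X[i]
    return w
```

Under this framing, the paper's question is which `order` argument to `ig_epoch` (or which structured permutation sequence) makes the iterates converge fastest.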