Paper Title
Beyond Lazy Training for Over-parameterized Tensor Decomposition
Paper Authors
Paper Abstract
Over-parametrization is an important technique in training neural networks. In both theory and practice, training a larger network allows the optimization algorithm to avoid bad local optima. In this paper we study a closely related tensor decomposition problem: given an $l$-th order tensor in $(\mathbb{R}^d)^{\otimes l}$ of rank $r$ (where $r\ll d$), can variants of gradient descent find a rank-$m$ decomposition where $m > r$? We show that in a lazy training regime (similar to the NTK regime for neural networks) one needs at least $m = \Omega(d^{l-1})$, while a variant of gradient descent can find an approximate tensor decomposition when $m = O^*(r^{2.5l}\log d)$. Our results show that gradient descent on an over-parametrized objective could go beyond the lazy training regime and utilize certain low-rank structure in the data.
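To make the over-parameterized decomposition setup concrete, the sketch below fits a ground-truth rank-$r$ symmetric third-order tensor with $m > r$ components by plain gradient descent on the squared Frobenius loss. This is only an illustration of the objective, not the paper's specific gradient-descent variant or its initialization; the dimensions, step size, and iteration count are arbitrary choices for the example.

```python
# Minimal sketch (assumed setup, not the paper's algorithm): gradient descent on an
# over-parameterized symmetric CP decomposition objective for a 3rd-order tensor.
import numpy as np

rng = np.random.default_rng(0)
d, r, m = 20, 3, 12  # ambient dimension, true rank, over-parameterized rank (m > r)

# Ground-truth rank-r symmetric tensor T = sum_i a_i (x) a_i (x) a_i
A = rng.standard_normal((r, d))
A /= np.linalg.norm(A, axis=1, keepdims=True)
T = np.einsum('ia,ib,ic->abc', A, A, A)

# Over-parameterized model with m components u_j, initialized small
U = 0.1 * rng.standard_normal((m, d))

def loss_and_grad(U):
    # Residual R = sum_j u_j (x) u_j (x) u_j - T
    R = np.einsum('ja,jb,jc->abc', U, U, U) - T
    loss = 0.5 * np.sum(R ** 2)
    # For the symmetric model, d loss / d u_j = 3 * R(., u_j, u_j)
    grad = 3.0 * np.einsum('abc,jb,jc->ja', R, U, U)
    return loss, grad

lr = 0.02  # illustrative step size
for t in range(3000):
    loss, grad = loss_and_grad(U)
    U -= lr * grad
    if t % 500 == 0:
        print(f"iter {t:4d}  loss {loss:.6f}")
```

The point of the over-parameterization is in the choice $m > r$: the model has more components than the target tensor needs, mirroring the regime the paper analyzes, where the required $m$ depends on whether training stays in the lazy regime or exploits the low-rank structure.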