Paper Title
Beyond Lazy Training for Over-parameterized Tensor Decomposition
Paper Authors
Paper Abstract
Over-parametrization is an important technique in training neural networks. In both theory and practice, training a larger network allows the optimization algorithm to avoid bad local optima. In this paper we study a closely related tensor decomposition problem: given an $l$-th order tensor in $(\mathbb{R}^d)^{\otimes l}$ of rank $r$ (where $r\ll d$), can variants of gradient descent find a rank-$m$ decomposition where $m > r$? We show that in a lazy training regime (similar to the NTK regime for neural networks) one needs at least $m = \Omega(d^{l-1})$, while a variant of gradient descent can find an approximate tensor decomposition when $m = O^*(r^{2.5l}\log d)$. Our results show that gradient descent on an over-parametrized objective could go beyond the lazy training regime and utilize certain low-rank structure in the data.
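To make the over-parameterized decomposition setup concrete, the sketch below fits a ground-truth rank-$r$ symmetric third-order tensor with $m > r$ components by plain gradient descent on the squared Frobenius loss. This is only an illustration of the objective, not the paper's specific gradient-descent variant or its initialization; the dimensions, step size, and iteration count are arbitrary choices for the example.

```python
# Minimal sketch (assumed setup, not the paper's algorithm): gradient descent on an
# over-parameterized symmetric CP decomposition objective for a 3rd-order tensor.
import numpy as np

rng = np.random.default_rng(0)
d, r, m = 20, 3, 12  # ambient dimension, true rank, over-parameterized rank (m > r)

# Ground-truth rank-r symmetric tensor T = sum_i a_i (x) a_i (x) a_i
A = rng.standard_normal((r, d))
A /= np.linalg.norm(A, axis=1, keepdims=True)
T = np.einsum('ia,ib,ic->abc', A, A, A)

# Over-parameterized model with m components u_j, initialized small
U = 0.1 * rng.standard_normal((m, d))

def loss_and_grad(U):
    # Residual R = sum_j u_j (x) u_j (x) u_j - T
    R = np.einsum('ja,jb,jc->abc', U, U, U) - T
    loss = 0.5 * np.sum(R ** 2)
    # For the symmetric model, d loss / d u_j = 3 * R(., u_j, u_j)
    grad = 3.0 * np.einsum('abc,jb,jc->ja', R, U, U)
    return loss, grad

lr = 0.02  # illustrative step size
for t in range(3000):
    loss, grad = loss_and_grad(U)
    U -= lr * grad
    if t % 500 == 0:
        print(f"iter {t:4d}  loss {loss:.6f}")
```

The point of the over-parameterization is in the choice $m > r$: the model has more components than the target tensor needs, mirroring the regime the paper analyzes, where the required $m$ depends on whether training stays in the lazy regime or exploits the low-rank structure.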