Paper Title

Neural Networks can Learn Representations with Gradient Descent

Paper Authors

Alex Damian, Jason D. Lee, Mahdi Soltanolkotabi

Paper Abstract

Significant theoretical work has established that in specific regimes, neural networks trained by gradient descent behave like kernel methods. However, in practice, it is known that neural networks strongly outperform their associated kernels. In this work, we explain this gap by demonstrating that there is a large class of functions which cannot be efficiently learned by kernel methods but can be easily learned with gradient descent on a two-layer neural network outside the kernel regime, by learning representations that are relevant to the target task. We also demonstrate that these representations allow for efficient transfer learning, which is impossible in the kernel regime. Specifically, we consider the problem of learning polynomials which depend on only a few relevant directions, i.e. of the form $f^\star(x) = g(Ux)$ where $U: \R^d \to \R^r$ with $d \gg r$. When the degree of $f^\star$ is $p$, it is known that $n \asymp d^p$ samples are necessary to learn $f^\star$ in the kernel regime. Our primary result is that gradient descent learns a representation of the data which depends only on the directions relevant to $f^\star$. This results in an improved sample complexity of $n \asymp d^2 r + dr^p$. Furthermore, in a transfer learning setup where the data distributions in the source and target domains share the same representation $U$ but have different polynomial heads, we show that a popular heuristic for transfer learning has a target sample complexity independent of $d$.
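To make the problem setup concrete, below is a minimal NumPy sketch (not the paper's code) of the function class and training procedure the abstract describes: a target $f^\star(x) = g(Ux)$ that depends on only $r \ll d$ directions, fit by a two-layer ReLU network trained with plain gradient descent. The dimensions ($d = 50$, $r = 2$), the cubic head $g$, the hidden width, the learning rate, and the step count are all illustrative assumptions, not values from the paper.

```python
# Sketch: low-dimensional polynomial target f*(x) = g(Ux) with d >> r,
# fit by a two-layer ReLU network trained with plain gradient descent.
import numpy as np

rng = np.random.default_rng(0)

d, r, n = 50, 2, 2000                         # ambient dim, relevant dim (d >> r), samples
U = rng.standard_normal((r, d)) / np.sqrt(d)  # hidden low-rank projection (assumed)

def g(z):
    # Degree-3 polynomial head acting on the r relevant directions (illustrative choice).
    return z[:, 0] ** 3 + z[:, 0] * z[:, 1]

X = rng.standard_normal((n, d))
y = g(X @ U.T)                                # targets f*(x) = g(Ux)

# Two-layer ReLU network: f(x) = a^T relu(W x + b)
m = 200                                       # hidden width (assumed)
W = rng.standard_normal((m, d)) / np.sqrt(d)
b = np.zeros(m)
a = rng.standard_normal(m) / np.sqrt(m)

lr = 1e-2
for step in range(500):
    H = X @ W.T + b                           # (n, m) pre-activations
    A = np.maximum(H, 0.0)                    # ReLU features
    err = A @ a - y                           # residuals
    # Gradients of 0.5 * mean squared error.
    grad_a = A.T @ err / n
    dH = (err[:, None] * a[None, :]) * (H > 0) / n
    grad_W = dH.T @ X
    grad_b = dH.sum(axis=0)
    a -= lr * grad_a
    W -= lr * grad_W
    b -= lr * grad_b

print("train MSE:", np.mean((np.maximum(X @ W.T + b, 0.0) @ a - y) ** 2))
```

In this toy setup the network only recovers $f^\star$ if its first layer learns features aligned with the row space of $U$; the paper's result is that gradient descent finds such a representation with roughly $n \asymp d^2 r + dr^p$ samples, versus the $n \asymp d^p$ needed in the kernel regime.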
