Paper Title
A case where a spindly two-layer linear network whips any neural network with a fully connected input layer
Paper Authors
Paper Abstract
It was conjectured that no neural network, regardless of its structure and with arbitrary differentiable transfer functions at the nodes, can sample-efficiently learn the following problem when trained with gradient descent: the instances are the rows of a $d$-dimensional Hadamard matrix and the target is one of the features, i.e. very sparse. We essentially prove this conjecture: we show that after receiving a random training set of size $k < d$, the expected square loss is still $1-\frac{k}{d-1}$. The only requirements are that the input layer is fully connected and that the initial weight vectors of the input nodes are chosen from a rotation-invariant distribution. Surprisingly, the same type of problem can be solved drastically more efficiently by a simple two-layer linear neural network in which the $d$ inputs are connected to the output node by chains of length 2 (now the input layer has only one edge per input). It has been shown that when such a network is trained with gradient descent, its expected square loss is $\frac{\log d}{k}$. Our lower bounds essentially show that a sparse input layer is needed for gradient descent to sample-efficiently learn sparse targets when the number of examples is less than the number of input features.
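
As an illustration (not part of the original abstract), the following minimal numpy sketch sets up the Hadamard problem described above and compares gradient descent on a densely connected linear predictor with Gaussian (hence rotation-invariant) initialization against the spindly parametrization, where each input reaches the output only through its own length-2 chain, i.e. $\hat{y} = \sum_i u_i v_i x_i$. All concrete choices here are illustrative assumptions: the `hadamard` helper, the sizes $d=64$ and $k=16$, the target feature index, the learning rate, the step count, and the initialization scale. A single run will not reproduce the exact expected-loss bounds stated in the abstract, but it typically shows the qualitative gap.

```python
import numpy as np

def hadamard(d):
    """Sylvester construction of a d x d Hadamard matrix (d must be a power of 2)."""
    H = np.array([[1.0]])
    while H.shape[0] < d:
        H = np.block([[H, H], [H, -H]])
    return H

rng = np.random.default_rng(0)
d, k, lr, steps = 64, 16, 0.02, 2000            # illustrative sizes: k < d training examples

H = hadamard(d)                                  # instances: the d rows of the Hadamard matrix
target = 3                                       # assumed index of the sparse target feature
y = H[:, target]                                 # the target is one of the +/-1 features

train = rng.choice(d, size=k, replace=False)     # random training set of size k
Xtr, ytr = H[train], y[train]

# (a) Dense linear predictor y_hat = x . w with Gaussian (rotation-invariant) init.
w = 0.01 * rng.normal(size=d)
for _ in range(steps):
    w -= lr * 2.0 / k * Xtr.T @ (Xtr @ w - ytr)

# (b) "Spindly" two-layer linear net: input i reaches the output only through its
#     own length-2 chain, so y_hat = sum_i u_i * v_i * x_i.
u = np.full(d, 0.01)
v = np.full(d, 0.01)
for _ in range(steps):
    g = 2.0 / k * Xtr.T @ (Xtr @ (u * v) - ytr)  # gradient w.r.t. the product weights
    u, v = u - lr * g * v, v - lr * g * u        # chain rule through w_i = u_i * v_i

# Average square loss over all d rows (training rows included).
print("dense   mean square loss:", np.mean((H @ w - y) ** 2))        # roughly (d - k)/d
print("spindly mean square loss:", np.mean((H @ (u * v) - y) ** 2))  # much smaller
```

In this sketch the dense model converges to a minimum-norm fit of the $k$ training rows; since distinct rows of a Hadamard matrix are orthogonal, that fit predicts essentially zero on the unseen rows, leaving a mean square loss of roughly $(d-k)/d$, in the spirit of the $1-\frac{k}{d-1}$ lower bound. The multiplicative $u_i v_i$ parametrization, started from a small initialization, is what biases plain gradient descent toward the sparse target and drives its loss far lower.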