Paper Title
Bayesian Interpolation with Deep Linear Networks
Paper Authors
Paper Abstract
Characterizing how neural network depth, width, and dataset size jointly impact model quality is a central problem in deep learning theory. We give here a complete solution in the special case of linear networks with output dimension one trained using zero noise Bayesian inference with Gaussian weight priors and mean squared error as a negative log-likelihood. For any training dataset, network depth, and hidden layer widths, we find non-asymptotic expressions for the predictive posterior and Bayesian model evidence in terms of Meijer-G functions, a class of meromorphic special functions of a single complex variable. Through novel asymptotic expansions of these Meijer-G functions, a rich new picture of the joint role of depth, width, and dataset size emerges. We show that linear networks make provably optimal predictions at infinite depth: the posterior of infinitely deep linear networks with data-agnostic priors is the same as that of shallow networks with evidence-maximizing data-dependent priors. This yields a principled reason to prefer deeper networks when priors are forced to be data-agnostic. Moreover, we show that with data-agnostic priors, Bayesian model evidence in wide linear networks is maximized at infinite depth, elucidating the salutary role of increased depth for model selection. Underpinning our results is a novel emergent notion of effective depth, given by the number of hidden layers times the number of data points divided by the network width; this determines the structure of the posterior in the large-data limit.
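The abstract's closing notion of effective depth is a simple ratio, restated below in formula form. The symbol names (L for hidden layers, P for data points, N for width) are illustrative labels of our own, not notation fixed by the abstract.

```latex
% Effective depth as described in the abstract: the number of hidden
% layers (L) times the number of data points (P), divided by the
% network width (N). Symbol names are illustrative only.
\[
  \lambda_{\mathrm{eff}} \;=\; \frac{L \, P}{N}
\]
```

Large effective depth thus arises either from stacking many hidden layers or from having many data points relative to the width, which is why the abstract describes it as governing the posterior in the large-data limit.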
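Since the posterior and evidence are expressed through Meijer-G functions, a reader may wish to evaluate such functions numerically. The following is a minimal sketch using mpmath's `meijerg` routine; the parameters reproduce a classical identity purely for illustration, since the paper's actual posterior and evidence formulas are not given in the abstract.

```python
# Minimal sketch: numerically evaluating a Meijer-G function with mpmath.
# mpmath.meijerg(a_s, b_s, z) computes G^{m,n}_{p,q}(z) with parameter
# groups a_s = [[a_1..a_n], [a_{n+1}..a_p]] and
#        b_s = [[b_1..b_m], [b_{m+1}..b_q]].
# The parameters below realize the textbook identity
#   G^{1,0}_{0,1}(z | -; 0) = exp(-z),
# chosen only to demonstrate the call, not the paper's expressions.
import mpmath

z = mpmath.mpf(2)
g = mpmath.meijerg([[], []], [[0], []], z)  # G^{1,0}_{0,1}(z | -; 0)
print(g)               # 0.135335283236613
print(mpmath.exp(-z))  # agrees with exp(-2)
```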