Paper Title
An Information-Theoretic Analysis of Compute-Optimal Neural Scaling Laws
Authors
Abstract
We study the compute-optimal trade-off between model and training data set sizes for large neural networks. Our result suggests a linear relation similar to that supported by the empirical analysis of Chinchilla. While that work studies transformer-based large language models (such as Gopher) trained on the MassiveText corpus, as a starting point for the development of a mathematical theory, we focus on a simpler learning model and data generating process, each based on a neural network with a sigmoidal output unit and a single hidden layer of ReLU activation units. We introduce general error upper bounds for a class of algorithms which incrementally update a statistic (e.g., gradient descent). For a particular learning model inspired by Barron (1993), we establish an upper bound on the minimal information-theoretically achievable expected error as a function of model and data set sizes. We then derive allocations of computation that minimize this bound. We present empirical results which suggest that this approximation correctly identifies an asymptotic linear compute-optimal scaling. This approximation also generates new insights. Among other things, it suggests that, as the input dimension or latent space complexity grows, as might be the case for example if a longer history of tokens is taken as input to a language model, a larger fraction of the compute budget should be allocated to growing the learning model rather than the training data.
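To make the compute-allocation idea concrete, the following is a minimal numerical sketch, assuming a hypothetical additive error bound of the form A/P^alpha + B/D^beta in the number of parameters P and training examples D, and a compute budget proportional to P*D. The constants, exponents, and functional form are illustrative assumptions, not the bound derived in the paper; the sketch only shows how, under such an assumption with equal exponents, minimizing the bound over a fixed budget keeps D/P roughly constant, i.e., a linear compute-optimal relation between model and data sizes.

```python
# Illustrative sketch only: a hypothetical error bound err(P, D) = A / P**alpha + B / D**beta
# minimized under a compute budget C ~ P * D. All constants and the functional form are
# assumptions for illustration, not the paper's actual information-theoretic bound.
import numpy as np

A, B = 1.0, 1.0          # hypothetical constants of the assumed error bound
alpha, beta = 0.5, 0.5   # hypothetical exponents; equal exponents yield a linear P-D relation

def expected_error(P, D):
    """Assumed upper bound on expected error as a function of model size P and data size D."""
    return A / P**alpha + B / D**beta

def compute_optimal_split(C, grid_size=10_000):
    """Grid-search the split of a compute budget C ~ P * D that minimizes the assumed bound."""
    P = np.logspace(0, np.log10(C), grid_size)   # candidate model sizes
    D = C / P                                    # data sizes implied by the budget
    errs = expected_error(P, D)
    i = np.argmin(errs)
    return P[i], D[i], errs[i]

if __name__ == "__main__":
    for C in [1e6, 1e8, 1e10]:
        P_star, D_star, _ = compute_optimal_split(C)
        # With alpha == beta the optimal ratio D*/P* stays roughly constant across budgets,
        # i.e. model and data sizes should be scaled up together (linearly).
        print(f"C={C:.0e}  P*={P_star:.3e}  D*={D_star:.3e}  D*/P*={D_star / P_star:.2f}")
```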