Paper Title
Unified Scaling Laws for Routed Language Models
Paper Authors
Paper Abstract
The performance of a language model has been shown to be effectively modeled as a power-law in its parameter count. Here we study the scaling behaviors of Routing Networks: architectures that conditionally use only a subset of their parameters while processing an input. For these models, parameter count and computational requirement form two independent axes along which an increase leads to better performance. In this work we derive and justify scaling laws defined on these two variables which generalize those known for standard language models and describe the performance of a wide range of routing architectures trained via three different techniques. Afterwards we provide two applications of these laws: first deriving an Effective Parameter Count along which all models scale at the same rate, and then using the scaling coefficients to give a quantitative comparison of the three routing techniques considered. Our analysis derives from an extensive evaluation of Routing Networks across five orders of magnitude of size, including models with hundreds of experts and hundreds of billions of parameters.
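To make the abstract's claim concrete: dense language models are commonly fit with a power law in parameter count, and a routed model adds a second, compute-independent axis. The sketch below is an illustrative, hedged guess at what a two-variable generalization could look like; the symbols N (total parameter count), E (expert count, standing in for the routing axis), and the coefficients a, b, c, d are assumptions for exposition, not necessarily the paper's exact parameterization.

```latex
% Single-variable scaling law commonly used for dense language models:
%   L(N) \approx \left( \frac{N_c}{N} \right)^{\alpha_N}
%
% Hedged sketch of a two-variable generalization for routed models,
% bilinear in the logs of parameter count N and expert count E:
\log L(N, E) \;\approx\; a \log N \;+\; b \log E \;+\; c \, \log N \log E \;+\; d
%
% An "Effective Parameter Count" \bar{N} would then be the dense model size
% satisfying L(\bar{N}, 1) = L(N, E), so that all routed and dense models
% collapse onto a single curve in \bar{N}.
```

The interaction term c log N log E is included here only to illustrate how the benefit of adding experts could depend on model size; whether and how such a term appears is determined by the fits reported in the paper itself.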