Paper Title
The Importance of Being Parameters: An Intra-Distillation Method for Serious Gains
Paper Authors
Paper Abstract
Recent model pruning methods have demonstrated the ability to remove redundant parameters without sacrificing model performance. Common methods remove redundant parameters according to parameter sensitivity, a gradient-based measure reflecting the contribution of the parameters. In this paper, however, we argue that redundant parameters can be trained to make beneficial contributions. We first highlight the large sensitivity (contribution) gap between high-sensitivity and low-sensitivity parameters and show that model generalization performance can be significantly improved after balancing the contribution of all parameters. Our goal is to balance the sensitivity of all parameters and encourage all of them to contribute equally. We propose a general task-agnostic method, namely intra-distillation, appended to the regular training loss to balance parameter sensitivity. Moreover, we design a novel adaptive learning method to control the strength of the intra-distillation loss for faster convergence. Our experiments show the strong effectiveness of our methods on machine translation, natural language understanding, and zero-shot cross-lingual transfer across up to 48 languages, e.g., a gain of 3.54 BLEU on average across 8 language pairs from the IWSLT'14 translation dataset.
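The two technical ideas in the abstract, a gradient-based parameter-sensitivity measure and an auxiliary loss that pushes all parameters to contribute, can be illustrated with a short sketch. The snippet below is a hedged illustration, not the authors' implementation: it approximates sensitivity by the first-order term |θ·∂L/∂θ| and adds a generic multi-pass consistency penalty to the task loss in the spirit of intra-distillation; the number of passes K, the KL-to-mean divergence, the weight alpha, and the helper names are all assumptions for this sketch.

```python
# Minimal sketch (not the authors' code) of (i) a gradient-based parameter
# sensitivity proxy and (ii) a consistency term appended to the task loss.
import torch
import torch.nn as nn
import torch.nn.functional as F


def parameter_sensitivity(model: nn.Module, loss: torch.Tensor):
    """Approximate each parameter's contribution as |theta_i * dL/dtheta_i|,
    a first-order estimate of the loss change if theta_i were zeroed out."""
    named = [(n, p) for n, p in model.named_parameters() if p.requires_grad]
    grads = torch.autograd.grad(loss, [p for _, p in named], retain_graph=True)
    return {n: (p.detach() * g.detach()).abs() for (n, p), g in zip(named, grads)}


def intra_distillation_step(model, x, y, task_loss_fn, k: int = 3, alpha: float = 1.0):
    """One training step: average task loss over K stochastic forward passes
    (dropout kept on) plus a penalty on the divergence among the K outputs,
    encouraging all parameters to contribute to the prediction."""
    model.train()                                   # keep dropout active
    logits = [model(x) for _ in range(k)]           # K passes over the same batch
    task_loss = sum(task_loss_fn(l, y) for l in logits) / k
    log_probs = [F.log_softmax(l, dim=-1) for l in logits]
    mean_p = torch.stack([lp.exp() for lp in log_probs]).mean(dim=0)
    # Penalize each pass's deviation from the mean prediction.
    consistency = sum(
        F.kl_div(lp, mean_p, reduction="batchmean") for lp in log_probs
    ) / k
    return task_loss + alpha * consistency
```

A typical usage would compute `loss = intra_distillation_step(model, x, y, F.cross_entropy)` inside the training loop and call `loss.backward()`; `parameter_sensitivity` can then be used diagnostically to inspect how evenly the contribution is spread across parameters.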