Paper Title
AutoDistil: Few-shot Task-agnostic Neural Architecture Search for Distilling Large Language Models
Paper Authors
Paper Abstract
Knowledge distillation (KD) methods compress large models into smaller students with manually-designed student architectures, given a pre-specified computational cost. This requires several trials to find a viable student, and the process must be repeated for every change of student or computational budget. We use Neural Architecture Search (NAS) to automatically distill several compressed students with variable cost from a large model. Current works train a single SuperLM consisting of millions of subnetworks with weight-sharing, resulting in interference between subnetworks of different sizes. Our framework AutoDistil addresses the above challenges with the following steps: (a) incorporates inductive bias and heuristics to partition the Transformer search space into K compact sub-spaces (K=3 for typical student sizes: base, small and tiny); (b) trains one SuperLM for each sub-space using a task-agnostic objective (e.g., self-attention distillation) with weight-sharing among students; (c) performs a lightweight search for the optimal student without re-training. Fully task-agnostic training and search allow students to be reused for fine-tuning on any downstream task. Experiments on the GLUE benchmark against state-of-the-art KD and NAS methods demonstrate that AutoDistil outperforms leading compression techniques with up to a 2.7x reduction in computational cost and negligible loss in task performance.
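Steps (a)-(c) above describe a three-part workflow: partition the search space into K compact sub-spaces, train one weight-shared SuperLM per sub-space, and run a re-training-free search for the best student under a cost budget. The minimal Python sketch below illustrates that workflow only; the sub-space ranges, the parameter-count cost proxy, and the placeholder score function are assumptions made for illustration, not the authors' implementation.

```python
# Illustrative sketch (not the authors' code) of the AutoDistil workflow:
# (a) K compact sub-spaces, (b) sampling weight-shared students during SuperLM
# training, (c) a lightweight search that only scores candidates, no re-training.
import itertools
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class StudentConfig:
    num_layers: int
    hidden_size: int
    num_heads: int
    mlp_ratio: int

    @property
    def cost(self) -> int:
        # Crude parameter-count proxy, used only for illustration.
        return self.num_layers * (self.hidden_size ** 2) * (2 + 2 * self.mlp_ratio)


# (a) K = 3 hand-partitioned sub-spaces (base / small / tiny); ranges are illustrative.
SUB_SPACES = {
    "base":  dict(num_layers=[9, 10, 11, 12], hidden_size=[640, 768], num_heads=[10, 12], mlp_ratio=[3, 4]),
    "small": dict(num_layers=[5, 6, 7, 8],    hidden_size=[384, 512], num_heads=[6, 8],   mlp_ratio=[3, 4]),
    "tiny":  dict(num_layers=[3, 4],          hidden_size=[128, 256], num_heads=[2, 4],   mlp_ratio=[3, 4]),
}


def enumerate_subspace(name: str):
    """Enumerate every student configuration inside one compact sub-space."""
    space = SUB_SPACES[name]
    keys = list(space)
    for values in itertools.product(*(space[k] for k in keys)):
        yield StudentConfig(**dict(zip(keys, values)))


def sample_student(name: str, rng: random.Random) -> StudentConfig:
    """(b) During SuperLM training, a weight-shared student is sampled each step."""
    space = SUB_SPACES[name]
    return StudentConfig(**{k: rng.choice(v) for k, v in space.items()})


def lightweight_search(name: str, score_fn, budget: int) -> StudentConfig:
    """(c) Pick the best student under a cost budget by scoring only (no re-training)."""
    candidates = [c for c in enumerate_subspace(name) if c.cost <= budget]
    return max(candidates, key=score_fn)


if __name__ == "__main__":
    rng = random.Random(0)
    print("sampled small student:", sample_student("small", rng))
    # Stand-in score: in the paper's setting this would instead be a task-agnostic
    # metric such as the self-attention distillation loss on held-out data.
    best = lightweight_search("small", score_fn=lambda c: -c.cost, budget=60_000_000)
    print("best small student under budget:", best)
```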