Paper Title

Knowledge Distillation Transfer Sets and their Impact on Downstream NLU Tasks

Paper Authors

Charith Peris, Lizhen Tan, Thomas Gueudre, Turan Gojayev, Pan Wei, Gokmen Oz

Paper Abstract

Teacher-student knowledge distillation is a popular technique for compressing today's prevailing large language models into manageable sizes that fit low-latency downstream applications. Both the teacher and the choice of transfer set used for distillation are crucial ingredients in creating a high-quality student. Yet, the generic corpora used to pretrain the teacher and the corpora associated with the downstream target domain are often significantly different, which raises a natural question: should the student be distilled over the generic corpora, so as to learn from high-quality teacher predictions, or over the downstream task corpora to align with finetuning? Our study investigates this trade-off using Domain Classification (DC) and Intent Classification/Named Entity Recognition (ICNER) as downstream tasks. We distill several multilingual students from a larger multilingual LM with varying proportions of generic and task-specific datasets, and report their performance after finetuning on DC and ICNER. We observe significant improvements across tasks and test sets when only task-specific corpora are used. We also report on how the impact of adding task-specific data to the transfer set correlates with the similarity between generic and task-specific data. Our results clearly indicate that, while distillation from a generic LM benefits downstream tasks, students learn better using target domain data even if it comes at the price of noisier teacher predictions. In other words, target domain data still trumps teacher knowledge.
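
To make the setup described in the abstract concrete, below is a minimal sketch (not the authors' implementation) of distilling a student from a teacher LM over a transfer set that mixes generic and task-specific text in a chosen proportion. The checkpoint names, the toy corpora, the mixing function, and the temperature-scaled soft-target loss are all illustrative assumptions; the paper's actual models, data, and training recipe differ.

```python
import random
import torch
import torch.nn.functional as F
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Toy stand-ins for the generic and task-specific (target-domain) corpora.
generic_corpus = ["The committee will meet again next quarter.",
                  "Rain is expected across the region tomorrow."]
task_corpus = ["play some jazz in the kitchen",
               "set an alarm for six a m tomorrow"]

def mix_transfer_set(generic_texts, task_texts, task_fraction):
    """Build a transfer set with the requested share of task-specific utterances."""
    n = len(generic_texts) + len(task_texts)
    n_task = min(int(round(n * task_fraction)), len(task_texts))
    n_generic = min(n - n_task, len(generic_texts))
    return random.sample(task_texts, n_task) + random.sample(generic_texts, n_generic)

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Soft-target KD loss: KL divergence between temperature-scaled distributions."""
    t = temperature
    return F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="batchmean",
    ) * (t * t)

# Hypothetical checkpoints sharing one tokenizer/vocabulary; the paper's models are larger.
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
teacher = AutoModelForMaskedLM.from_pretrained("xlm-roberta-large").eval()
student = AutoModelForMaskedLM.from_pretrained("xlm-roberta-base")

optimizer = torch.optim.AdamW(student.parameters(), lr=5e-5)
transfer_set = mix_transfer_set(generic_corpus, task_corpus, task_fraction=0.75)

for text in transfer_set:
    batch = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        teacher_logits = teacher(**batch).logits
    student_logits = student(**batch).logits
    # For simplicity, soft targets are matched at every token position; a real
    # pipeline would typically apply MLM masking as in the teacher's pretraining.
    loss = distillation_loss(student_logits, teacher_logits)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

After distillation over the mixed transfer set, the student would be finetuned separately on the DC and ICNER task data, which is where the paper measures the effect of the generic-versus-task-specific mixing ratio.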
