Paper Title
BERT-EMD: Many-to-Many Layer Mapping for BERT Compression with Earth Mover's Distance
Paper Authors
Paper Abstract
Pre-trained language models (e.g., BERT) have achieved significant success in various natural language processing (NLP) tasks. However, high storage and computational costs prevent pre-trained language models from being effectively deployed on resource-constrained devices. In this paper, we propose a novel BERT distillation method based on many-to-many layer mapping, which allows each intermediate student layer to learn from any intermediate teacher layer. In this way, our model can learn from different teacher layers adaptively for various NLP tasks, motivated by the intuition that different NLP tasks require different levels of linguistic knowledge contained in the intermediate layers of BERT. In addition, we leverage Earth Mover's Distance (EMD) to compute the minimum cumulative cost that must be paid to transfer knowledge from the teacher network to the student network. EMD enables effective matching for many-to-many layer mapping; it can be applied to network layers of different sizes and effectively measures the semantic distance between the teacher and student networks. Furthermore, we propose a cost attention mechanism that automatically learns the layer weights used in EMD, which is expected to further improve the model's performance and accelerate convergence. Extensive experiments on the GLUE benchmark demonstrate that our model achieves competitive performance compared to strong baselines in terms of both accuracy and model compression.
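For illustration, the core EMD computation described above can be sketched as a small transportation problem over layer-to-layer distances. The sketch below is a minimal, self-contained example, not the authors' implementation: the pooled per-layer representations, the MSE ground distance, and the uniform default layer weights are all illustrative assumptions, and it solves the transport plan with scipy.optimize.linprog rather than a dedicated EMD solver.

```python
# Minimal sketch of EMD-based many-to-many layer matching (illustrative
# assumptions throughout; not the paper's implementation).
import numpy as np
from scipy.optimize import linprog

def layer_emd(student_feats, teacher_feats, w_s=None, w_t=None):
    """EMD between a set of student layers and a set of teacher layers.

    student_feats: (m, d) array, one pooled hidden state per student layer.
    teacher_feats: (n, d) array, one pooled hidden state per teacher layer.
    w_s, w_t: layer weights (normalized to sum to 1); uniform if None.
    Returns (emd_value, flow), where flow is the (m, n) optimal transport plan.
    """
    m, n = len(student_feats), len(teacher_feats)
    w_s = np.full(m, 1.0 / m) if w_s is None else w_s / w_s.sum()
    w_t = np.full(n, 1.0 / n) if w_t is None else w_t / w_t.sum()

    # Ground distance: MSE between layer representations (one common choice).
    cost = np.array([[np.mean((s - t) ** 2) for t in teacher_feats]
                     for s in student_feats])

    # Balanced transportation LP: minimize sum_ij f_ij * cost_ij
    # s.t. row sums of f equal w_s, column sums equal w_t, f >= 0.
    A_eq = np.zeros((m + n, m * n))
    for i in range(m):
        A_eq[i, i * n:(i + 1) * n] = 1.0   # row-sum (student weight) constraints
    for j in range(n):
        A_eq[m + j, j::n] = 1.0            # column-sum (teacher weight) constraints
    b_eq = np.concatenate([w_s, w_t])

    res = linprog(cost.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    flow = res.x.reshape(m, n)
    # With normalized weights the total flow is 1, so EMD is the flow-weighted cost.
    return float((flow * cost).sum()), flow

# Example: a 4-layer student distilled from a 12-layer teacher (random features).
rng = np.random.default_rng(0)
emd, plan = layer_emd(rng.normal(size=(4, 768)), rng.normal(size=(12, 768)))
```

In the paper's setting, the layer weights (w_s, w_t above) would be produced by the proposed cost attention mechanism rather than being fixed uniform values, and the resulting EMD would serve as the distillation loss between the intermediate layers.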