Paper Title
Parameter-Efficient Sparsity for Large Language Models Fine-Tuning
Paper Authors
Paper Abstract
With the dramatically increased number of parameters in language models, sparsity methods have received ever-increasing research focus to compress and accelerate the models. While most research focuses on how to accurately retain appropriate weights while maintaining the performance of the compressed model, the computational overhead and memory footprint of sparse training remain challenging when compressing large-scale language models. To address this problem, we propose a Parameter-efficient Sparse Training (PST) method to reduce the number of trainable parameters during sparse-aware training on downstream tasks. Specifically, we first combine the data-free and data-driven criteria to efficiently and accurately measure the importance of weights. Then we investigate the intrinsic redundancy of data-driven weight importance and derive two obvious characteristics, i.e., low-rankness and structuredness. Based on that, two groups of small matrices are introduced to compute the data-driven importance of weights, instead of using the original large importance score matrix, which makes the sparse training resource-efficient and parameter-efficient. Experiments with diverse networks (i.e., BERT, RoBERTa and GPT-2) on dozens of datasets demonstrate that PST performs on par with or better than previous sparsity methods, despite training only a small number of parameters. For instance, compared with previous sparsity methods, our PST requires only 1.5% of the trainable parameters to achieve comparable performance on BERT.
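To make the decomposition idea concrete, the PyTorch sketch below illustrates one way the importance score of a frozen weight matrix could be built from a data-free magnitude term plus a data-driven term factored into a low-rank product and structured row/column scores. It is a minimal illustrative sketch, not the authors' released implementation: the class name `PSTLinear`, the rank, the additive combination of the criteria, and the hard-threshold masking are all assumptions, and a straight-through estimator (omitted here) would be needed for gradients to reach the small matrices during training.

```python
# Hypothetical sketch of a parameter-efficient importance decomposition.
# Only the small matrices A, B, row, col are trainable; the dense weight is frozen.
import torch
import torch.nn as nn


class PSTLinear(nn.Module):
    """Linear layer whose importance score is S = |W| + A @ B + row + col:
    |W| is the data-free term; the data-driven term is factored into a
    low-rank product (A @ B) plus structured row/column scores."""

    def __init__(self, in_features, out_features, rank=8, sparsity=0.5):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(out_features, in_features))
        nn.init.xavier_uniform_(self.weight)
        self.weight.requires_grad_(False)  # frozen pre-trained weight
        # Small trainable matrices replacing a full importance score matrix.
        self.A = nn.Parameter(torch.zeros(out_features, rank))   # low-rank left factor
        self.B = nn.Parameter(torch.zeros(rank, in_features))    # low-rank right factor
        self.row = nn.Parameter(torch.zeros(out_features, 1))    # structured row scores
        self.col = nn.Parameter(torch.zeros(1, in_features))     # structured column scores
        self.sparsity = sparsity

    def importance(self):
        data_free = self.weight.abs()                       # magnitude, needs no data
        data_driven = self.A @ self.B + self.row + self.col  # low-rank + structured
        return data_free + data_driven

    def forward(self, x):
        score = self.importance()
        k = int(score.numel() * self.sparsity)
        # Prune the weights with the k lowest importance scores (forward-only sketch;
        # training would use a straight-through estimator through the mask).
        threshold = torch.kthvalue(score.flatten(), k).values
        mask = (score > threshold).to(self.weight.dtype)
        return nn.functional.linear(x, self.weight * mask)


# Usage: count trainable parameters (A, B, row, col only).
layer = PSTLinear(768, 768, rank=8, sparsity=0.5)
out = layer(torch.randn(4, 768))
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))
```

Under these assumptions, a 768x768 layer trains roughly 2 x 768 x 8 + 2 x 768 values instead of a full 768x768 score matrix, which is the source of the parameter savings the abstract describes.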