Paper Title

Meta-Ensemble Parameter Learning

Paper Authors

Zhengcong Fei, Shuman Tian, Junshi Huang, Xiaoming Wei, Xiaolin Wei

Abstract

Ensembles of machine learning models yield improved performance as well as robustness. However, their memory requirements and inference costs can be prohibitively high. Knowledge distillation is an approach that allows a single model to efficiently capture the approximate performance of an ensemble, but it scales poorly because re-training is required whenever new teacher models are introduced. In this paper, we study whether a meta-learning strategy can be used to directly predict the parameters of a single model with performance comparable to that of an ensemble. To this end, we introduce WeightFormer, a Transformer-based model that predicts student network weights layer by layer in a single forward pass, conditioned on the teacher model parameters. The properties of WeightFormer are investigated on the CIFAR-10, CIFAR-100, and ImageNet datasets with the VGGNet-11, ResNet-50, and ViT-B/32 architectures, where our method achieves the approximate classification performance of an ensemble and outperforms both a single network and standard knowledge distillation. More encouragingly, we show that WeightFormer results can further exceed the average ensemble with minor fine-tuning. Importantly, our task, along with the model and results, can potentially lead to a new, more efficient, and scalable paradigm of ensemble network parameter learning.
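The abstract describes the architecture only at a high level. As a rough illustration of the stated idea — a Transformer that reads one layer's weights from several teacher models and emits the corresponding student layer's weights in a single forward pass — here is a minimal PyTorch sketch. The chunk-based tokenization, dimensions, class name, and the averaging readout are illustrative assumptions, not the paper's actual WeightFormer design.

```python
# Minimal sketch (not the authors' implementation): a Transformer that maps the
# weights of one layer from several teachers to a single student weight matrix.
# Chunk size, model width, and the mean-over-teachers readout are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class WeightFormerSketch(nn.Module):
    """Hypothetical stand-in: predicts one student layer's weights from the
    corresponding weights of several teacher models."""

    def __init__(self, chunk_size=64, d_model=256, n_heads=4, n_layers=2):
        super().__init__()
        self.chunk_size = chunk_size
        self.embed = nn.Linear(chunk_size, d_model)    # weight chunk -> token embedding
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.readout = nn.Linear(d_model, chunk_size)  # token embedding -> weight chunk

    def forward(self, teacher_weights):
        # teacher_weights: list of tensors (one per teacher), all with the same shape.
        shape = teacher_weights[0].shape
        flat = torch.stack([w.reshape(-1) for w in teacher_weights])  # (T, N)
        n_teachers, n = flat.shape
        pad = (-n) % self.chunk_size
        flat = F.pad(flat, (0, pad))
        tokens = flat.reshape(n_teachers, -1, self.chunk_size)        # (T, L, C)
        seq = tokens.reshape(1, -1, self.chunk_size)                  # (1, T*L, C)
        hidden = self.encoder(self.embed(seq))                        # (1, T*L, D)
        chunks = self.readout(hidden).reshape(n_teachers, -1, self.chunk_size)
        student_flat = chunks.mean(dim=0).reshape(-1)[:n]             # fuse teachers
        return student_flat.reshape(shape)


# Usage: predict one layer's student weights from three teachers' weights.
teachers = [torch.randn(128, 100) for _ in range(3)]
student_layer = WeightFormerSketch()(teachers)
print(student_layer.shape)  # torch.Size([128, 100])
```

In the paper's setting this prediction would be repeated layer by layer to assemble the full student network; the sketch above only shows the single-layer mapping.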
