Paper Title
MASTER: Multi-task Pre-trained Bottlenecked Masked Autoencoders are Better Dense Retrievers
Paper Authors
Paper Abstract
Pre-trained Transformers (\eg BERT) are commonly used in existing dense retrieval methods for parameter initialization, and recent studies explore more effective pre-training tasks to further improve the quality of dense vectors. Although various novel and effective tasks have been proposed, their differing input formats and learning objectives make them difficult to integrate for jointly improving model performance. In this work, we aim to unify a variety of pre-training tasks under the bottlenecked masked autoencoder paradigm and integrate them into a multi-task pre-trained model, namely MASTER. Concretely, MASTER utilizes a shared-encoder multi-decoder architecture that constructs a representation bottleneck to compress the rich semantic information across tasks into dense vectors. On top of this architecture, we integrate three types of representative pre-training tasks: corrupted passage recovery, related passage recovery, and PLM output recovery, which respectively characterize inner-passage information, inter-passage relations, and the knowledge within PLMs. Extensive experiments show that our approach outperforms competitive dense retrieval methods. Our code and data are publicly released at \url{https://github.com/microsoft/SimXNS}.
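To make the shared-encoder multi-decoder design concrete, below is a minimal PyTorch sketch of a bottlenecked masked autoencoder with one shallow decoder per pre-training task. This is not the authors' released SimXNS code; the class name, layer counts, hidden size, and the task names in `tasks` are illustrative assumptions.

```python
# Minimal sketch (assumed, not the official implementation) of a shared-encoder,
# multi-decoder bottlenecked masked autoencoder: a deep encoder yields a single
# [CLS]-style vector, and each pre-training task has its own shallow decoder that
# can only read that vector, forcing semantics into the dense bottleneck.
import torch
import torch.nn as nn

class BottleneckedMultiDecoderMAE(nn.Module):
    def __init__(self, vocab_size=30522, hidden=768, n_heads=12,
                 enc_layers=12, dec_layers=2,
                 tasks=("corrupted_recovery", "related_recovery", "plm_output_recovery")):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        enc_layer = nn.TransformerEncoderLayer(hidden, n_heads, batch_first=True)
        # Shared deep encoder used by all pre-training tasks.
        self.encoder = nn.TransformerEncoder(enc_layer, enc_layers)
        # One shallow decoder per task; all of them share the same bottleneck vector.
        self.decoders = nn.ModuleDict({
            t: nn.TransformerEncoder(
                nn.TransformerEncoderLayer(hidden, n_heads, batch_first=True),
                dec_layers)
            for t in tasks
        })
        # Simplified token-prediction head for the recovery objectives.
        self.lm_head = nn.Linear(hidden, vocab_size)

    def forward(self, input_ids, task, decoder_input_ids):
        # Encode the (possibly masked) passage; take position 0 as the bottleneck vector.
        enc_hidden = self.encoder(self.embed(input_ids))
        bottleneck = enc_hidden[:, :1, :]                      # [B, 1, H] dense vector
        # The task-specific decoder sees only the bottleneck plus its own (masked) inputs.
        dec_in = torch.cat([bottleneck, self.embed(decoder_input_ids)], dim=1)
        dec_hidden = self.decoders[task](dec_in)
        # Token logits over the decoder positions, used for the recovery loss.
        return self.lm_head(dec_hidden[:, 1:, :])
```

Because each shallow decoder can attend only to the single bottleneck vector (plus its own heavily masked inputs), the shared encoder is pressured to pack passage semantics into that dense representation, which is the same vector later used for dense retrieval.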