Reweighting Strategy based on Synthetic Data Identification for Sentence Similarity

Paper Title

Reweighting Strategy based on Synthetic Data Identification for Sentence Similarity

Authors

Taehee Kim, ChaeHun Park, Jimin Hong, Radhika Dua, Edward Choi, Jaegul Choo

Abstract

Semantically meaningful sentence embeddings are important for numerous tasks in natural language processing. To obtain such embeddings, recent studies have explored using data synthetically generated by pretrained language models (PLMs) as a training corpus. However, PLMs often generate sentences that differ substantially from those written by humans. We hypothesize that treating all of these synthetic examples equally when training deep neural networks can have an adverse effect on learning semantically meaningful embeddings. To analyze this, we first train a classifier that identifies machine-written sentences, and observe that the linguistic features of the sentences identified as machine-written are significantly different from those of human-written sentences. Based on this, we propose a novel approach that first trains the classifier to measure the importance of each sentence. The distilled information from the classifier is then used to train a reliable sentence embedding model. Through extensive evaluation on four real-world datasets, we demonstrate that our model trained on synthetic data generalizes well and outperforms the existing baselines. Our implementation is publicly available at https://github.com/ddehun/coling2022_reweighting_sts.
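The reweighting idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the repository linked above for that). It assumes a classifier has already produced, for each synthetic sentence pair, a probability that the pair is machine-written, and it assumes one simple weighting rule for illustration: pairs flagged confidently as machine-written receive lower weight. The function names `importance_weights` and `reweighted_loss` are hypothetical.

```python
def importance_weights(p_machine):
    """Map classifier probabilities to per-example weights.

    Assumption for illustration: weight = 1 - P(machine-written), so
    examples that look more human-written count more during training.
    """
    for p in p_machine:
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    return [1.0 - p for p in p_machine]


def reweighted_loss(pred_sims, target_sims, weights):
    """Weighted mean squared error between predicted and target similarities.

    Each example's squared error is scaled by its weight, and the sum is
    normalized by the total weight so the loss scale stays comparable.
    """
    if not (len(pred_sims) == len(target_sims) == len(weights)):
        raise ValueError("inputs must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("at least one weight must be positive")
    weighted_errors = (
        w * (p - t) ** 2 for p, t, w in zip(pred_sims, target_sims, weights)
    )
    return sum(weighted_errors) / total_weight
```

With uniform classifier probabilities the function reduces to ordinary mean squared error. As the probabilities diverge, examples judged more human-like dominate the loss.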
