Paper title
Generate, Discriminate and Contrast: A Semi-Supervised Sentence Representation Learning Framework
Paper authors
Paper abstract
Most sentence embedding techniques rely heavily on expensive human-annotated sentence pairs as supervision signals. Despite their use of large-scale unlabeled data, unsupervised methods typically lag far behind their supervised counterparts on most downstream tasks. In this work, we propose a semi-supervised sentence embedding framework, GenSE, that effectively leverages large-scale unlabeled data. Our method includes three parts: 1) Generate: A generator/discriminator model is jointly trained to synthesize sentence pairs from an open-domain unlabeled corpus; 2) Discriminate: Noisy sentence pairs are filtered out by the discriminator to acquire high-quality positive and negative sentence pairs; 3) Contrast: A prompt-based contrastive approach is presented for sentence representation learning with both annotated and synthesized data. Comprehensive experiments show that GenSE achieves an average correlation score of 85.19 on the STS datasets and consistent performance improvement on four domain adaptation tasks, significantly surpassing the state-of-the-art methods and convincingly corroborating its effectiveness and generalization ability. Code, synthetic data, and models are available at https://github.com/MatthewCYM/GenSE.
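To make the Discriminate and Contrast steps more concrete, the following is a minimal sketch and not the authors' implementation. It assumes that the discriminator yields a confidence score per synthesized pair and that the contrastive objective is a generic InfoNCE-style loss over cosine similarities; the abstract confirms neither detail. The names `filter_pairs` and `contrastive_loss`, the threshold 0.9, and the temperature 0.05 are hypothetical, and the prompt-based encoding is omitted entirely (embeddings are given as plain vectors).

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def filter_pairs(scored_pairs, threshold=0.9):
    """Keep synthesized pairs whose confidence reaches the threshold.

    `scored_pairs` is a list of (pair, confidence) tuples. The confidence
    stands in for a discriminator output (an assumption for illustration).
    """
    return [pair for pair, confidence in scored_pairs if confidence >= threshold]


def contrastive_loss(anchor, positive, negatives, temperature=0.05):
    """InfoNCE-style loss for one anchor (a generic objective, assumed here).

    Computes -log(softmax) of the positive's similarity among the positive
    and all negatives, using a numerically stable log-sum-exp.
    """
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, negative) / temperature for negative in negatives]
    max_logit = max(logits)
    log_denominator = max_logit + math.log(
        sum(math.exp(logit - max_logit) for logit in logits)
    )
    return log_denominator - logits[0]


# Toy usage: one synthesized pair passes the filter, one is dropped as noisy.
kept = filter_pairs([(("a cat sleeps", "a cat is asleep"), 0.95),
                     (("a cat sleeps", "stocks fell"), 0.40)])

# Toy embeddings: the positive points the same way as the anchor,
# the negative is orthogonal to it.
loss = contrastive_loss([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
```

The loss is small when the anchor is much closer to its positive than to any negative and grows when a negative is closer, which is the behavior a contrastive sentence-embedding objective relies on.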