Paper Title
CoCo: Coherence-Enhanced Machine-Generated Text Detection Under Data Limitation With Contrastive Learning
Paper Authors
Abstract
Machine-Generated Text (MGT) detection, the task of discriminating MGT from Human-Written Text (HWT), plays a crucial role in preventing the misuse of text generation models, which have recently excelled at mimicking human writing style. Recently proposed detectors usually take coarse text sequences as input and fine-tune pretrained models with a standard cross-entropy loss. However, these methods fail to consider the linguistic structure of texts. Moreover, they lack the ability to handle the low-resource problem, which can often arise in practice given the enormous amount of textual data online. In this paper, we present a coherence-based contrastive learning model named CoCo to detect possible MGT in low-resource scenarios. To exploit linguistic features, we encode coherence information, in the form of a graph, into the text representation. To tackle the challenge of limited data, we employ a contrastive learning framework and propose an improved contrastive loss that prevents the performance degradation caused by simple samples. Experimental results on two public datasets and two self-constructed datasets show that our approach significantly outperforms state-of-the-art methods. We also find, surprisingly, that in our experiments MGTs generated by up-to-date language models can be easier to detect than those from earlier models, and we propose some preliminary explanations for this counter-intuitive phenomenon. All code and datasets are open-sourced.
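As a rough illustration of the contrastive-learning idea mentioned in the abstract, the sketch below implements a generic supervised contrastive loss in plain Python. It is not the improved loss proposed for CoCo, whose form the abstract does not specify. The function names, the temperature value, and the toy embeddings are all assumptions made for illustration, and embeddings are assumed to be non-zero vectors.

```python
import math


def l2_normalize(vector):
    """Scale a non-zero vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """Generic supervised contrastive loss (hypothetical helper, not CoCo's loss).

    For each anchor, samples sharing its label are positives and all other
    samples form the contrast set. Anchors without any positive are skipped.
    """
    z = [l2_normalize(v) for v in embeddings]
    n = len(z)
    per_anchor = []
    for i in range(n):
        # Temperature-scaled cosine similarity of anchor i to every other sample.
        sims = {
            j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
            for j in range(n)
            if j != i
        }
        positives = [j for j in sims if labels[j] == labels[i]]
        if not positives:
            continue
        # Log-sum-exp over the contrast set, shifted by the max for stability.
        shift = max(sims.values())
        log_denominator = shift + math.log(
            sum(math.exp(s - shift) for s in sims.values())
        )
        # Average negative log-probability assigned to the positives.
        per_anchor.append(
            -sum(sims[p] - log_denominator for p in positives) / len(positives)
        )
    return sum(per_anchor) / len(per_anchor) if per_anchor else 0.0


# Toy 2-D embeddings: the first two points are close, as are the last two.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
aligned_labels = [1, 1, 0, 0]  # labels agree with the geometric clusters
mixed_labels = [1, 0, 1, 0]    # labels cut across the clusters

loss_aligned = supervised_contrastive_loss(toy_embeddings, aligned_labels)
loss_mixed = supervised_contrastive_loss(toy_embeddings, mixed_labels)
print(loss_aligned < loss_mixed)
```

In this toy setup the loss is lower when same-label texts sit close together in embedding space, which is the property a contrastive objective encourages. A detector trained this way would typically combine such a term with the cross-entropy classification loss.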