CoCo: Coherence-Enhanced Machine-Generated Text Detection Under Data Limitation With Contrastive Learning

Paper Title

CoCo: Coherence-Enhanced Machine-Generated Text Detection Under Data Limitation With Contrastive Learning

Authors

Liu, Xiaoming; Zhang, Zhaohan; Wang, Yichen; Pu, Hang; Lan, Yu; Shen, Chao

Abstract

Machine-Generated Text (MGT) detection, the task of discriminating MGT from Human-Written Text (HWT), plays a crucial role in preventing misuse of text generative models, which have recently excelled at mimicking human writing style. Recently proposed detectors usually take coarse text sequences as input and fine-tune pretrained models with a standard cross-entropy loss. However, these methods fail to consider the linguistic structure of texts. Moreover, they lack the ability to handle the low-resource problem, which often arises in practice given the enormous amount of textual data online. In this paper, we present a coherence-based contrastive learning model named CoCo to detect possible MGT in low-resource scenarios. To exploit linguistic features, we encode coherence information, in the form of a graph, into the text representation. To tackle the challenge of limited data, we employ a contrastive learning framework and propose an improved contrastive loss that prevents the performance degradation brought by simple samples. Experimental results on two public datasets and two self-constructed datasets show that our approach significantly outperforms state-of-the-art methods. We also find, surprisingly, that in our experiments MGTs originating from up-to-date language models can be easier to detect than those from previous models, and we propose some preliminary explanations for this counter-intuitive phenomenon. All code and datasets are open-sourced.
