Paper Title

X-Class: Text Classification with Extremely Weak Supervision

Authors

Zihan Wang, Dheeraj Mekala, Jingbo Shang

Abstract

In this paper, we explore text classification with extremely weak supervision, i.e., only relying on the surface text of class names. This is a more challenging setting than the seed-driven weak supervision, which allows a few seed words per class. We opt to attack this problem from a representation learning perspective -- ideal document representations should lead to nearly the same results between clustering and the desired classification. In particular, one can classify the same corpus differently (e.g., based on topics and locations), so document representations should be adaptive to the given class names. We propose a novel framework X-Class to realize the adaptive representations. Specifically, we first estimate class representations by incrementally adding the most similar word to each class until inconsistency arises. Following a tailored mixture of class attention mechanisms, we obtain the document representation via a weighted average of contextualized word representations. With the prior of each document assigned to its nearest class, we then cluster and align the documents to classes. Finally, we pick the most confident documents from each cluster to train a text classifier. Extensive experiments demonstrate that X-Class can rival and even outperform seed-driven weakly supervised methods on 7 benchmark datasets. Our dataset and code are released at https://github.com/ZihanWangKi/XClass/ .
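The pipeline outlined in the abstract — grow a class representation word by word, build attention-weighted document representations, then assign each document to its nearest class — can be illustrated with a toy sketch. Everything below is invented for illustration: the two-dimensional "embeddings", the tiny vocabulary, the fixed word budget (standing in for the paper's inconsistency check), and the max-similarity weighting (standing in for the paper's mixture of class-attention mechanisms) are assumptions, not the actual X-Class method, which uses contextualized BERT representations and a Gaussian-mixture alignment step.

```python
import numpy as np

# Toy static word "embeddings"; in X-Class these would come from a
# pretrained language model. Vocabulary and vectors are invented.
emb = {
    "sports":   np.array([1.0, 0.1]),
    "game":     np.array([0.9, 0.2]),
    "team":     np.array([0.8, 0.3]),
    "politics": np.array([0.1, 1.0]),
    "election": np.array([0.2, 0.9]),
    "vote":     np.array([0.3, 0.8]),
}

def cos(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def class_representation(name, vocab, max_words=3):
    """Incrementally add the word most similar to the running class
    representation, averaging as we go. A fixed word budget replaces
    the paper's stop-on-inconsistency criterion."""
    members = [name]
    rep = emb[name].copy()
    candidates = [w for w in vocab if w != name]
    for _ in range(max_words - 1):
        best = max(candidates, key=lambda w: cos(emb[w], rep))
        members.append(best)
        candidates.remove(best)
        rep = np.mean([emb[w] for w in members], axis=0)
    return rep, members

def document_representation(doc_words, class_reps):
    """Weighted average of word vectors, weighting each word by its
    maximum similarity to any class representation -- a crude stand-in
    for the tailored mixture of class-attention mechanisms."""
    vecs = [emb[w] for w in doc_words]
    weights = np.array([max(cos(v, c) for c in class_reps) for v in vecs])
    return np.average(vecs, axis=0, weights=weights / weights.sum())

vocab = list(emb)
reps = {c: class_representation(c, vocab)[0] for c in ["sports", "politics"]}
doc = ["team", "game", "vote"]
d = document_representation(doc, list(reps.values()))
# Assign the document to its nearest class representation.
pred = max(reps, key=lambda c: cos(d, reps[c]))
print(pred)  # → sports
```

In the full method, these prior assignments seed a clustering step that aligns documents to classes, after which only the most confident documents per cluster are used to train the final classifier.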
