Paper Title

ZeroBERTo: Leveraging Zero-Shot Text Classification by Topic Modeling

Paper Authors

Alexandre Alcoforado, Thomas Palmeira Ferraz, Rodrigo Gerber, Enzo Bustos, André Seidel Oliveira, Bruno Miguel Veloso, Fabio Levy Siqueira, Anna Helena Reali Costa

Paper Abstract

Traditional text classification approaches often require a good amount of labeled data, which is difficult to obtain, especially in restricted domains or less widespread languages. This lack of labeled data has led to the rise of low-resource methods, which assume low data availability in natural language processing. Among them, zero-shot learning stands out; it consists of learning a classifier without any previously labeled data. The best results reported with this approach use language models such as Transformers, but suffer from two problems: high execution time and inability to handle long texts as input. This paper proposes a new model, ZeroBERTo, which leverages an unsupervised clustering step to obtain a compressed data representation before the classification task. We show that ZeroBERTo performs better on long inputs with shorter execution time, outperforming XLM-R by about 12% in F1 score on the FolhaUOL dataset.

Keywords: Low-Resource NLP, Unlabeled Data, Zero-Shot Learning, Topic Modeling, Transformers.
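To make the described pipeline concrete, below is a minimal sketch of the two-stage idea in the abstract: cluster unlabeled documents, compress each cluster into a short keyword representation, then run zero-shot classification once per cluster and propagate the label to the cluster's members. This is not the authors' implementation; the encoder and NLI checkpoints, the use of KMeans, the TF-IDF keyword compression, and the example label set are all illustrative assumptions.

```python
# Hedged sketch of a "cluster first, zero-shot classify the compressed
# topic representations" pipeline. Models and label set are assumptions.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from transformers import pipeline

docs = [
    "The government announced a new tax reform bill today.",
    "The striker scored twice in the championship final.",
    "Parliament will vote on the proposed budget next week.",
    "The home team lost after a penalty shootout.",
    "The central bank raised interest rates to curb inflation.",
    "Stock markets rallied after strong earnings reports.",
]
candidate_labels = ["politics", "sports", "economy"]  # illustrative labels

# 1. Embed documents with a multilingual sentence encoder.
encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
embeddings = encoder.encode(docs)

# 2. Unsupervised clustering step (the topic-modeling stage).
n_topics = len(candidate_labels)
clusters = KMeans(n_clusters=n_topics, n_init=10).fit_predict(embeddings)

# 3. Compress each cluster into its top TF-IDF terms, so the classifier
#    sees one short text per topic instead of every (possibly long) document.
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)
vocab = vectorizer.get_feature_names_out()

topic_repr = {}
for c in range(n_topics):
    members = [i for i, lab in enumerate(clusters) if lab == c]
    scores = tfidf[members].sum(axis=0).A1
    top_terms = [vocab[j] for j in scores.argsort()[::-1][:10]]
    topic_repr[c] = " ".join(top_terms)

# 4. Zero-shot classify each compressed representation once (NLI-based),
#    then assign the winning label to all documents in that cluster.
classifier = pipeline("zero-shot-classification",
                      model="joeddav/xlm-roberta-large-xnli")
topic_label = {c: classifier(text, candidate_labels)["labels"][0]
               for c, text in topic_repr.items()}
doc_labels = [topic_label[c] for c in clusters]
print(list(zip(docs, doc_labels)))
```

Under these assumptions, the sketch illustrates why such a design can address the two problems the abstract names: only the short keyword string passes through the Transformer, so long documents never hit the model's input-length limit, and the number of expensive NLI calls scales with the number of topics rather than the number of documents.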
