Paper Title
ZeroBERTo: Leveraging Zero-Shot Text Classification by Topic Modeling
Paper Authors
Paper Abstract
Traditional text classification approaches often require a substantial amount of labeled data, which is difficult to obtain, especially in restricted domains or less widespread languages. This lack of labeled data has led to the rise of low-resource methods, which assume low data availability in natural language processing. Among these, zero-shot learning stands out: it consists of learning a classifier without any previously labeled data. The best results reported with this approach use language models such as Transformers, but suffer from two problems: high execution time and an inability to handle long texts as input. This paper proposes a new model, ZeroBERTo, which leverages an unsupervised clustering step to obtain a compressed data representation before the classification task. We show that ZeroBERTo performs better on long inputs and has a shorter execution time, outperforming XLM-R by about 12% in F1 score on the FolhaUOL dataset. Keywords: Low-Resource NLP, Unlabeled Data, Zero-Shot Learning, Topic Modeling, Transformers.
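To make the pipeline described in the abstract concrete, below is a minimal sketch of a ZeroBERTo-style approach: embed unlabeled documents, cluster them into topics to obtain a compressed representation, then run zero-shot classification only on each cluster's representative text. The model names, cluster count, and candidate labels are illustrative assumptions, not the paper's exact configuration.

```python
# Hedged sketch: clustering-then-zero-shot pipeline (assumed components,
# not the paper's exact setup).
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin_min
from transformers import pipeline

documents = [
    "The central bank raised interest rates again this quarter.",
    "The striker scored twice in the championship final.",
    "New vaccine trials show a promising immune response.",
]
candidate_labels = ["economy", "sports", "health"]  # assumed label set

# Step 1: compressed data representation via embeddings + clustering.
encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
embeddings = encoder.encode(documents)
kmeans = KMeans(n_clusters=len(candidate_labels), n_init=10, random_state=0)
kmeans.fit(embeddings)

# Step 2: the document closest to each centroid represents its cluster.
closest, _ = pairwise_distances_argmin_min(kmeans.cluster_centers_, embeddings)

# Step 3: zero-shot classify only the short representatives, which avoids
# long-input truncation and reduces execution time versus classifying
# every full document.
classifier = pipeline(
    "zero-shot-classification", model="joeddav/xlm-roberta-large-xnli"
)
cluster_label = {
    c: classifier(documents[idx], candidate_labels)["labels"][0]
    for c, idx in enumerate(closest)
}

# Every document inherits the label assigned to its cluster.
for doc, c in zip(documents, kmeans.labels_):
    print(cluster_label[c], "<-", doc)
```

The design point this illustrates is that the expensive Transformer inference runs once per cluster rather than once per document, which is where the reported speedup over applying XLM-R directly would come from.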