Paper Title
Estimating Confidence of Predictions of Individual Classifiers and Their Ensembles for the Genre Classification Task
Paper Authors
Paper Abstract
Genre identification is a subclass of non-topical text classification. The main difference between this task and topical classification is that genres, unlike topics, usually do not correspond to simple keywords, and thus they need to be defined in terms of their functions in communication. Neural models based on pre-trained transformers, such as BERT or XLM-RoBERTa, demonstrate SOTA results in many NLP tasks, including non-topical classification. However, in many cases, their downstream application to very large corpora, such as those extracted from social media, can lead to unreliable results because of dataset shift, where some raw texts do not match the profile of the training set. To mitigate this problem, we experiment with individual models as well as with their ensembles. To evaluate the robustness of all models, we use a prediction confidence metric, which estimates the reliability of a prediction in the absence of a gold standard label. We can evaluate robustness via the confidence gap between the correctly classified texts and the misclassified ones on a labeled test corpus; a larger gap makes it easier to trust that the classifier made the right decision. Our results show that for all of the classifiers tested in this study there is a confidence gap, but for the ensembles the gap is larger, meaning that ensembles are more robust than their individual models.
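The abstract does not spell out how the confidence metric or the gap is computed. A minimal sketch, assuming the confidence of a prediction is the top class probability (e.g., the maximum of the softmax output) and that an ensemble simply averages the class probabilities of its member models, might look like the following; the function names and the averaging scheme are illustrative assumptions, not the paper's stated method.

```python
import numpy as np

def confidence_gap(probs, gold_labels):
    """Gap between mean confidence on correctly classified texts and on
    misclassified ones (illustrative sketch, not the paper's exact metric).

    probs: (n_texts, n_genres) array of class probabilities for each text,
           e.g. softmax outputs of a single classifier or an ensemble.
    gold_labels: (n_texts,) array of gold genre indices.
    """
    preds = probs.argmax(axis=1)        # predicted genre per text
    confidence = probs.max(axis=1)      # confidence = top class probability
    correct = preds == gold_labels
    # A larger gap means correct predictions are noticeably more confident
    # than wrong ones, which makes the classifier easier to trust downstream.
    return confidence[correct].mean() - confidence[~correct].mean()

def ensemble_probs(member_probs):
    """Average the probability outputs of several models (one simple way to
    build an ensemble; assumed here for illustration only)."""
    return np.mean(np.stack(member_probs, axis=0), axis=0)
```

Under these assumptions, the abstract's claim corresponds to `confidence_gap(ensemble_probs([p1, p2, ...]), y)` being larger than `confidence_gap(p_i, y)` for the individual models on the labeled test corpus.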