Paper Title

Experiments on Generalizability of BERTopic on Multi-Domain Short Text

Authors

Muriël de Groot, Mohammad Aliannejadi, Marcel R. Haas

Abstract

Topic modeling is widely used for analytically evaluating large collections of textual data. One of the most popular topic modeling techniques is Latent Dirichlet Allocation (LDA), which is flexible and adaptive but not optimal for, e.g., short texts from various domains. We explore how the state-of-the-art BERTopic algorithm performs on short multi-domain text and find that it generalizes better than LDA in terms of topic coherence and diversity. We further analyze the performance of the HDBSCAN clustering algorithm used by BERTopic and find that it classifies a majority of the documents as outliers. This crucial yet overlooked problem excludes too many documents from further analysis. When we replace HDBSCAN with k-Means, we achieve similar performance, but without outliers.
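The two quantities the abstract turns on, the share of documents marked as outliers and topic diversity, can be illustrated with a minimal sketch. It assumes that outliers carry the label -1 (the convention HDBSCAN uses for noise points) and that diversity is the fraction of unique words among the top words of all topics, which is one common definition and not necessarily the one used in the paper. The function names, labels, and topics below are hypothetical toy values, not results from the paper.

```python
def outlier_fraction(labels):
    """Share of documents labelled -1 (assumed outlier/noise label)."""
    if not labels:
        return 0.0
    return sum(1 for label in labels if label == -1) / len(labels)


def topic_diversity(topics, top_n=10):
    """Fraction of unique words among the top_n words of every topic.

    Assumed definition: 1.0 means no word is shared between topics,
    values near 0 mean the topics largely repeat the same words.
    """
    top_words = [word for topic in topics for word in topic[:top_n]]
    if not top_words:
        return 0.0
    return len(set(top_words)) / len(top_words)


# Hypothetical cluster assignments for eight documents.
hdbscan_like_labels = [-1, 0, -1, -1, 1, -1, 0, -1]  # density-based: some noise
kmeans_like_labels = [0, 0, 1, 2, 1, 2, 0, 1]        # k-Means assigns every point

# Hypothetical topics, each a ranked list of top words.
toy_topics = [
    ["price", "market", "stock"],
    ["game", "team", "market"],
]

hdbscan_outliers = outlier_fraction(hdbscan_like_labels)
kmeans_outliers = outlier_fraction(kmeans_like_labels)
diversity = topic_diversity(toy_topics, top_n=3)

print(f"outlier share (HDBSCAN-like labels): {hdbscan_outliers:.3f}")
print(f"outlier share (k-Means-like labels): {kmeans_outliers:.3f}")
print(f"topic diversity: {diversity:.3f}")
```

Because k-Means assigns every document to some cluster, its outlier share is zero by construction, which is why swapping the clustering step removes the outlier problem but leaves open whether topic quality holds up.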
