Paper Title
Keyword Assisted Topic Models
Authors
Abstract
In recent years, fully automated content analysis based on probabilistic topic models has become popular among social scientists because of their scalability. The unsupervised nature of the models makes them suitable for exploring topics in a corpus without prior knowledge. However, researchers find that these models often fail to measure specific concepts of substantive interest by inadvertently creating multiple topics with similar content and combining distinct themes into a single topic. In this paper, we empirically demonstrate that providing a small number of keywords can substantially enhance the measurement performance of topic models. An important advantage of the proposed keyword assisted topic model (keyATM) is that the specification of keywords requires researchers to label topics prior to fitting a model to the data. This contrasts with a widespread practice of post-hoc topic interpretation and adjustments that compromises the objectivity of empirical findings. In our application, we find that keyATM provides more interpretable results, has better document classification performance, and is less sensitive to the number of topics than the standard topic models. Finally, we show that keyATM can also incorporate covariates and model time trends. An open-source software package is available for implementing the proposed methodology.