Paper title
A Framework for Authorial Clustering of Shorter Texts in Latent Semantic Spaces
Authors
Abstract
Authorial clustering is the grouping of documents written by the same author or team of authors without any prior positive examples of an author's writing style or thematic preferences. For authorial clustering of shorter texts (paragraph-length texts that are typically shorter than conventional documents), the document representation is particularly important: very high-dimensional feature spaces lead to data sparsity and to serious consequences such as the curse of dimensionality, while feature selection may lead to information loss. We propose a high-level framework that uses a compact data representation in a latent feature space derived with non-parametric topic modeling. Authorial clusters are then identified in two scenarios: (a) fully unsupervised and (b) semi-supervised, where a small number of shorter texts are known to belong to the same author (must-link constraints) or to different authors (cannot-link constraints). We report on experiments with 120 collections in three languages and two genres and show that the topic-based latent feature space provides a promising level of performance while reducing the dimensionality by a factor of 1500 compared to the state of the art. We also demonstrate that, while prior knowledge of the precise number of authors (i.e., authorial clusters) contributes little additional quality, even a little knowledge of constraints on authorial cluster membership leads to clear performance improvements on this difficult task. Thorough experimentation with standard metrics indicates that there remains ample room for improvement in authorial clustering, especially with shorter texts.
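The semi-supervised scenario can be illustrated with a minimal sketch. It is not the paper's algorithm: the abstract does not name the clustering method, so a simple constrained agglomerative procedure is swapped in here. The function names (must_link_groups, satisfies_constraints, constrained_clustering) are hypothetical, and the topic-proportion vectors are made-up stand-ins for the latent representation a topic model would produce.

```python
import math
from itertools import combinations


def must_link_groups(n_docs, must_link):
    """Merge documents joined (transitively) by must-link pairs using union-find."""
    parent = list(range(n_docs))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for a, b in must_link:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)
    return [find(i) for i in range(n_docs)]


def satisfies_constraints(labels, must_link, cannot_link):
    """Return True if a labelling respects every pairwise constraint."""
    ok_must = all(labels[a] == labels[b] for a, b in must_link)
    ok_cannot = all(labels[a] != labels[b] for a, b in cannot_link)
    return ok_must and ok_cannot


def centroid(vectors, members):
    """Mean vector of the documents whose indices are in `members`."""
    dim = len(vectors[0])
    return [sum(vectors[i][d] for i in members) / len(members) for d in range(dim)]


def constrained_clustering(vectors, must_link, cannot_link, n_clusters):
    """Greedy centroid-linkage merging that never joins a cannot-link pair.

    Assumption: the constraints are mutually consistent (no cannot-link
    pair lies inside a must-link group); this sketch does not check that.
    """
    # Start from the must-link groups instead of singleton clusters.
    grouped = {}
    for doc, root in enumerate(must_link_groups(len(vectors), must_link)):
        grouped.setdefault(root, set()).add(doc)
    clusters = list(grouped.values())
    forbidden = {frozenset(pair) for pair in cannot_link}

    def can_merge(c1, c2):
        return not any(frozenset((a, b)) in forbidden for a in c1 for b in c2)

    while len(clusters) > n_clusters:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            if not can_merge(clusters[i], clusters[j]):
                continue
            d = math.dist(centroid(vectors, clusters[i]),
                          centroid(vectors, clusters[j]))
            if best is None or d < best[0]:
                best = (d, i, j)
        if best is None:
            break  # every remaining merge would break a cannot-link
        _, i, j = best
        clusters[i] |= clusters[j]
        del clusters[j]

    labels = [0] * len(vectors)
    for label, members in enumerate(clusters):
        for doc in members:
            labels[doc] = label
    return labels


# Hypothetical topic proportions for five short texts over three topics.
docs = [
    [0.80, 0.10, 0.10],
    [0.70, 0.20, 0.10],
    [0.10, 0.80, 0.10],
    [0.20, 0.70, 0.10],
    [0.45, 0.45, 0.10],  # ambiguous text, equally close to both groups
]
must_link = [(4, 2)]    # texts 4 and 2 are known to share an author
cannot_link = [(0, 4)]  # texts 0 and 4 are known to have different authors

labels = constrained_clustering(docs, must_link, cannot_link, n_clusters=2)
print(labels)  # [0, 0, 1, 1, 1]
```

In this toy input the fifth text sits midway between the two groups, so the two constraints alone decide where it goes. That is the kind of effect the abstract attributes to a small amount of constraint knowledge. The loop may stop with more than n_clusters clusters if the cannot-link pairs block every remaining merge.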