Paper Title
Forensic Authorship Analysis of Microblogging Texts Using N-Grams and Stylometric Features
Paper Authors
Paper Abstract
In recent years, messages and text posted on the Internet have been used in criminal investigations. Unfortunately, the authorship of many of them remains unknown. In some channels, establishing authorship can be even harder because the length of digital texts is limited to a certain number of characters. In this work, we aim to identify the authors of tweets, which are limited to 280 characters. We evaluate popular features traditionally employed in authorship attribution, which capture properties of the writing style at different levels. Our experiments use a self-captured database of 40 users, with 120 to 200 tweets per user. Results on this small set are promising, with the different features providing classification accuracies between 92% and 98.5%. These results are competitive with existing studies that employ short texts such as tweets or SMS.
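To make the idea of n-gram features concrete, the following is a minimal sketch of character n-gram authorship attribution. It is not the pipeline evaluated in the paper: the abstract does not specify the classifier, the n-gram order, or any preprocessing, so the trigram default, the cosine-similarity matching, and all function names (`char_ngrams`, `cosine`, `build_profiles`, `attribute`) are assumptions chosen for illustration only.

```python
from collections import Counter
import math


def char_ngrams(text, n=3):
    """Count overlapping character n-grams in a lowercased text."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_profiles(tweets_by_author, n=3):
    """Merge each author's tweets into one n-gram count profile."""
    profiles = {}
    for author, tweets in tweets_by_author.items():
        profile = Counter()
        for tweet in tweets:
            profile.update(char_ngrams(tweet, n))
        profiles[author] = profile
    return profiles


def attribute(tweet, profiles, n=3):
    """Return the author whose profile is most similar to the tweet."""
    query = char_ngrams(tweet, n)
    return max(profiles, key=lambda author: cosine(query, profiles[author]))


# Toy data invented for this example; it is not from the paper's database.
toy_tweets = {
    "alice": ["omg sooo good!!!", "sooo tired lol!!!"],
    "bob": ["The meeting is at noon.", "The report is attached."],
}
toy_profiles = build_profiles(toy_tweets)
predicted = attribute("sooo happy!!!", toy_profiles)
```

A nearest-profile rule like this is only one simple way to use n-gram counts. With a real collection such as the one described above (40 users, 120 to 200 tweets each), the same counts would more typically be fed to a trained classifier and evaluated on held-out tweets.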