Short Text Classification Approach to Identify Child Sexual Exploitation Material

Paper Title

Short Text Classification Approach to Identify Child Sexual Exploitation Material

Authors

Al-Nabki, Mhd Wesam; Fidalgo, Eduardo; Alegre, Enrique; Alaiz-Rodríguez, Rocío

Abstract

Producing or sharing Child Sexual Exploitation Material (CSEM) is a serious crime that Law Enforcement Agencies (LEAs) fight vigorously. When an LEA seizes a computer from a suspected producer or consumer of CSEM, it must analyze the files on the suspect's hard disk for evidence. However, manually inspecting file contents for CSEM is time-consuming, and in most cases it is infeasible within the time available to the Spanish police under a search warrant. An alternative that can speed up the process is to identify CSEM by analyzing file names and their absolute paths instead of file contents. The main challenge of this task lies in dealing with short text that the owners of this material distort deliberately, using obfuscated words and user-defined naming patterns. This paper presents and compares two approaches based on short text classification to identify CSEM files. The first employs two independent supervised classifiers, one for the file name and the other for the path, whose outputs are then fused into a single score. The second, in contrast, uses only the file name classifier and applies it iteratively over the file's absolute path. Both approaches operate at the character n-gram level, binary and orthographic features enrich the file name representation, and a binary Logistic Regression model performs the classification. The presented file classifier achieved an average class recall of 0.98. This solution could be integrated into forensic tools and services to help Law Enforcement Agencies identify CSEM without processing every file's visual content, which is computationally far more demanding.
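
The two approaches described in the abstract can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation: the abstract does not state how the two classifier outputs are fused or how the per-component scores are aggregated, so the weighted average in `fuse_scores` and the maximum in `iterate_path` are assumptions. The scorer `toy_score` and its placeholder n-grams are hypothetical stand-ins for the trained Logistic Regression model.

```python
from pathlib import PurePosixPath
from typing import Callable, FrozenSet, List


def char_ngrams(text: str, n_min: int = 2, n_max: int = 3) -> List[str]:
    """Return all character n-grams of `text` for n in [n_min, n_max]."""
    text = text.lower()
    return [
        text[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(text) - n + 1)
    ]


def toy_score(name: str, flagged: FrozenSet[str] = frozenset({"xq", "xqz"})) -> float:
    """Hypothetical stand-in for a trained classifier: the fraction of
    n-grams that belong to a placeholder set of flagged n-grams."""
    grams = char_ngrams(name)
    if not grams:
        return 0.0
    return sum(g in flagged for g in grams) / len(grams)


def fuse_scores(name_score: float, path_score: float, w_name: float = 0.5) -> float:
    """Approach 1: fuse file name and path scores into a single score.
    The weighted average is an assumption, not the paper's stated rule."""
    return w_name * name_score + (1.0 - w_name) * path_score


def iterate_path(path: str, score_name: Callable[[str], float]) -> float:
    """Approach 2: apply the file name scorer to every path component.
    Taking the maximum over components is an assumption."""
    parts = [p for p in PurePosixPath(path).parts if p != "/"]
    return max((score_name(p) for p in parts), default=0.0)


if __name__ == "__main__":
    print(char_ngrams("abc"))
    print(fuse_scores(0.8, 0.4))
    print(iterate_path("/home/user/xqz.jpg", toy_score))
```

In practice the n-gram lists would be turned into feature vectors and passed to a trained binary classifier; the toy scorer only keeps the example self-contained.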
