Paper Title

REST: A Thread Embedding Approach for Identifying and Classifying User-specified Information in Security Forums

Authors

Joobin Gharibshah, Evangelos E. Papalexakis, Michalis Faloutsos

Abstract

How can we extract useful information from a security forum? We focus on identifying threads of interest to a security professional: (a) alerts of worrisome events, such as attacks, (b) offers of malicious services and products, (c) hacking information for performing malicious acts, and (d) useful security-related experiences. The analysis of security forums is in its infancy despite several promising recent works. Novel approaches are needed to address the challenges in this domain: (a) the difficulty of specifying the "topics" of interest efficiently, and (b) the unstructured and informal nature of the text. We propose REST, a systematic methodology to: (a) identify threads of interest based on a possibly incomplete bag of words, and (b) classify them into one of the four classes above. The key novelty of the work is a multi-step weighted embedding approach: we project words, threads, and classes into appropriate embedding spaces and establish relevance and similarity there. We evaluate our method with real data from three security forums with a total of 164K posts and 21K threads. First, REST is robust to the initial keyword selection: it can extend the user-provided keyword set and thus recover from missing keywords. Second, REST categorizes the threads into the classes of interest with superior accuracy compared to five other methods: REST exhibits an accuracy between 63.3% and 76.9%. We see our approach as a first step toward harnessing the wealth of information in online forums in a user-friendly way, since the user can loosely specify her keywords of interest.
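
The abstract gives no implementation details, so the following is only a minimal sketch of the general idea it describes: expanding a seed keyword set through embedding similarity, embedding a thread as a weighted average of its word vectors, and assigning the thread to the closest class vector. It is not the authors' method. The toy 2-D vectors, the 0.95 threshold, the class labels `alerts` and `services`, and every function name are assumptions made for illustration.

```python
import math

# Hypothetical hand-made 2-D word vectors; a real system would learn these.
WORD_VECS = {
    "attack": (1.0, 0.1),
    "ddos": (0.9, 0.2),
    "breach": (0.95, 0.05),
    "sell": (0.1, 1.0),
    "price": (0.2, 0.9),
    "botnet": (0.6, 0.6),
}


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand_keywords(seeds, threshold=0.95):
    """Add every vocabulary word whose vector is close to some seed keyword."""
    known_seeds = [s for s in seeds if s in WORD_VECS]
    expanded = set(seeds)
    for word, vec in WORD_VECS.items():
        if any(cosine(vec, WORD_VECS[s]) >= threshold for s in known_seeds):
            expanded.add(word)
    return expanded


def embed_thread(tokens, weights=None):
    """Weighted average of the vectors of known tokens (default weight 1.0)."""
    weights = weights or {}
    total = [0.0, 0.0]
    weight_sum = 0.0
    for tok in tokens:
        if tok in WORD_VECS:
            w = weights.get(tok, 1.0)
            total[0] += w * WORD_VECS[tok][0]
            total[1] += w * WORD_VECS[tok][1]
            weight_sum += w
    if weight_sum == 0:
        return (0.0, 0.0)
    return (total[0] / weight_sum, total[1] / weight_sum)


def classify(thread_vec, class_vecs):
    """Return the label of the class vector most similar to the thread."""
    return max(class_vecs, key=lambda c: cosine(thread_vec, class_vecs[c]))


# Class vectors built from a few representative words (an assumption).
CLASS_VECS = {
    "alerts": embed_thread(["attack", "breach"]),
    "services": embed_thread(["sell", "price"]),
}

expanded = expand_keywords({"attack"})
label = classify(embed_thread(["ddos", "botnet", "hello"]), CLASS_VECS)
print(sorted(expanded), label)
```

In this toy setup the single seed `attack` pulls in `ddos` and `breach`, which illustrates how an incomplete keyword list could still reach related threads. Unknown tokens such as `hello` are simply skipped when the thread is embedded.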
