Paper Title
EGGS: A Flexible Approach to Relational Modeling of Social Network Spam
Paper Authors
Paper Abstract
Social networking websites face a constant barrage of spam, unwanted messages that distract, annoy, and even defraud honest users. These messages tend to be very short, making them difficult to identify in isolation. Furthermore, spammers disguise their messages to look legitimate, tricking users into clicking on links and tricking spam filters into tolerating their malicious behavior. Thus, some spam filters examine relational structure in the domain, such as connections among users and messages, to better identify deceptive content. However, even when it is used, relational structure is often exploited in an incomplete or ad hoc manner. In this paper, we present Extended Group-based Graphical models for Spam (EGGS), a general-purpose method for classifying spam in online social networks. Rather than labeling each message independently, we group related messages together when they have the same author, the same content, or other domain-specific connections. To reason about related messages, we combine two popular methods: stacked graphical learning (SGL) and probabilistic graphical models (PGMs). Both methods capture the idea that messages are more likely to be spammy when related messages are also spammy, but they do so in different ways; SGL uses sequential classifier predictions and PGMs use probabilistic inference. We apply our method to four different social network domains. EGGS is more accurate than an independent model in most experimental settings, especially when the correct label is uncertain. For the PGM implementation, we compare Markov logic networks to probabilistic soft logic and find that both work well with neither one dominating, and the combination of SGL and PGMs usually performs better than either on its own.
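
The stacking idea in the abstract (feeding the predictions for related messages back in as features for a second classifier) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the single `has_link` feature, the same-author grouping, and the toy data are all assumptions made for illustration. For brevity the sketch trains and predicts on the same messages, whereas stacked learning in practice typically builds the relational features from held-out predictions. The PGM component is not covered here.

```python
import math
from collections import defaultdict


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(weights, X):
    # The last weight is the bias; zip stops at the end of each feature row.
    return [sigmoid(sum(w * x for w, x in zip(weights, row)) + weights[-1])
            for row in X]


def train_logreg(X, y, epochs=2000, lr=0.5):
    # Plain batch gradient descent on the mean logistic loss.
    weights = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(weights)
        for row, p, label in zip(X, predict(weights, X), y):
            err = p - label
            for j, x in enumerate(row):
                grad[j] += err * x
            grad[-1] += err
        weights = [w - lr * g / len(X) for w, g in zip(weights, grad)]
    return weights


def relational_feature(groups, preds):
    # For each message: mean predicted spam score of the OTHER messages in
    # its group (here, same author); 0.5 if the message has no relatives.
    members = defaultdict(list)
    for i, g in enumerate(groups):
        members[g].append(i)
    feats = []
    for i, g in enumerate(groups):
        others = [preds[j] for j in members[g] if j != i]
        feats.append(sum(others) / len(others) if others else 0.5)
    return feats


def stacked_fit_predict(X, y, groups, n_stacks=1):
    # Stage 0: independent model. Each later stage retrains on the original
    # features plus a relational feature built from the previous predictions.
    base_preds = predict(train_logreg(X, y), X)
    preds = base_preds
    for _ in range(n_stacks):
        rel = relational_feature(groups, preds)
        X_aug = [row + [r] for row, r in zip(X, rel)]
        preds = predict(train_logreg(X_aug, y), X_aug)
    return base_preds, preds


def log_loss(y, preds, eps=1e-12):
    total = 0.0
    for label, p in zip(y, preds):
        p = min(max(p, eps), 1.0 - eps)
        total -= label * math.log(p) + (1 - label) * math.log(1.0 - p)
    return total / len(y)


# Hypothetical toy data: one feature (has_link), author "a" posts only spam,
# author "b" posts only legitimate messages.
X = [[1], [1], [1], [0], [1], [0], [0], [0]]
y = [1, 1, 1, 1, 0, 0, 0, 0]
groups = ["a"] * 4 + ["b"] * 4

base_preds, stacked_preds = stacked_fit_predict(X, y, groups)
print("independent log loss:", log_loss(y, base_preds))
print("stacked log loss:    ", log_loss(y, stacked_preds))
```

In this toy setting the link feature alone is ambiguous (some spam has no link, some legitimate messages do), so the independent model cannot separate the classes, while the relational feature lets the second-stage model lean on what the same author's other messages look like.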