Paper Title
Clickbait Detection using Multiple Categorization Techniques
Paper Authors
Paper Abstract
Clickbait refers to online articles whose titles are deliberately designed to mislead, luring as many readers as possible into opening the intended web page. Clickbait is used to tempt visitors to click on a particular link, either to monetize the landing page or to spread false news for sensationalism. The presence of clickbait on a news aggregator portal may give readers an unpleasant experience. Automatically detecting clickbait among news headlines has been a challenging problem for the machine learning community. Many methods have been proposed in the recent past for countering clickbait articles. However, the currently available techniques for detecting clickbait are not very robust. This paper proposes a hybrid categorization technique for separating clickbait and non-clickbait articles by integrating different features, sentence structure, and clustering. During preliminary categorization, the headlines are separated using eleven features. After that, the headlines are recategorized using sentence formality and syntactic similarity measures. In the last phase, the headlines are recategorized again by applying clustering using word vector similarity based on the t-distributed Stochastic Neighbor Embedding (t-SNE) approach. After these headlines are categorized, machine learning models are applied to the dataset to evaluate the machine learning algorithms. The experimental results indicate that the proposed hybrid model is more robust, reliable, and efficient than any individual categorization technique on the real-world dataset we used.
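The staged idea in the abstract (a feature-based first pass followed by a similarity-based recategorization) can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration: the abstract does not list the eleven features, so the three surface features, the word list, the thresholds, and the function names are all hypothetical, and plain Jaccard token overlap is swapped in for the paper's sentence-formality and syntactic similarity measures. The final t-SNE clustering phase is omitted because it would require word vectors and a library such as scikit-learn.

```python
import re
from typing import List, Tuple

# Hypothetical cue words; not taken from the paper.
CLICKBAIT_WORDS = {"you", "your", "this", "these", "why", "what"}


def tokenize(headline: str) -> List[str]:
    """Lowercase the headline and split it into word-like tokens."""
    return re.findall(r"[A-Za-z0-9']+", headline.lower())


def extract_features(headline: str) -> dict:
    """Three illustrative surface features (stand-ins for the paper's eleven)."""
    tokens = tokenize(headline)
    return {
        "starts_with_number": bool(tokens) and tokens[0].isdigit(),
        "has_question_or_exclamation": any(c in headline for c in "?!"),
        "has_clickbait_word": any(t in CLICKBAIT_WORDS for t in tokens),
    }


def preliminary_label(headline: str, threshold: int = 2) -> str:
    """Phase 1: label as clickbait when enough features fire."""
    score = sum(extract_features(headline).values())
    return "clickbait" if score >= threshold else "non-clickbait"


def jaccard(a: str, b: str) -> float:
    """Token-set overlap, used here as a simple similarity stand-in."""
    sa, sb = set(tokenize(a)), set(tokenize(b))
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def recategorize_by_similarity(
    headline: str,
    label: str,
    seeds: List[Tuple[str, str]],
    min_sim: float = 0.5,
) -> str:
    """Phase 2: adopt the label of the most similar seed if it is close enough."""
    best = max(seeds, key=lambda s: jaccard(headline, s[0]), default=None)
    if best is not None and jaccard(headline, best[0]) >= min_sim:
        return best[1]
    return label


def categorize(headline: str, seeds: List[Tuple[str, str]]) -> str:
    """Run the two sketched phases in sequence."""
    label = preliminary_label(headline)
    return recategorize_by_similarity(headline, label, seeds)
```

In this sketch a later phase can overturn an earlier label, which is the point of recategorization: a headline that triggers few surface features can still be reassigned if it closely resembles a known clickbait headline.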