Paper title
CARONTE: Crawling Adversarial Resources Over Non-Trusted, High-Profile Environments
Paper authors
Paper abstract
The monitoring of underground criminal activities is often automated to maximize data collection, and ML models are trained to automatically adapt data collection tools to different communities. On the other hand, sophisticated adversaries may adopt crawling-detection capabilities that can significantly jeopardize researchers' opportunities to collect data, for example by putting their accounts under the spotlight and having them expelled from the community. This is particularly undesirable in prominent, high-profile criminal communities where entry costs are significant (either monetary or, for example, in the form of background checks or other trust-building mechanisms). This paper presents CARONTE, a tool that semi-automatically learns virtually any forum structure for parsing and data extraction, while keeping a low profile during data collection and avoiding the need to collect massive datasets to maintain tool scalability. We showcase the tool against four underground forums and compare the network traffic it generates (as seen from the adversary's position, i.e. the underground community's server) against that of state-of-the-art web-crawling tools and of human users.