Paper Title

Modeling Updates of Scholarly Webpages Using Archived Data

Authors

Yasith Jayawardana, Alexander C. Nwala, Gavindya Jayawardena, Jian Wu, Sampath Jayarathna, Michael L. Nelson, C. Lee Giles

Abstract

The vastness of the web imposes a prohibitive cost on building large-scale search engines with limited resources. Crawl frontiers thus need to be optimized to improve the coverage and freshness of crawled content. In this paper, we propose an approach for modeling the dynamics of change in the web using archived copies of webpages. To evaluate its utility, we conduct a preliminary study on the scholarly web using 19,977 seed URLs of authors' homepages obtained from their Google Scholar profiles. We first obtain archived copies of these webpages from the Internet Archive (IA), and estimate when their actual updates occurred. Next, we apply maximum likelihood to estimate their mean update frequency ($λ$) values. Our evaluation shows that $λ$ values derived from a short history of archived data provide a good estimate for the true update frequency in the short-term, and that our method provides better estimations of updates at a fraction of resources compared to the baseline models. Based on this, we demonstrate the utility of archived data to optimize the crawling strategy of web crawlers, and uncover important challenges that inspire future research directions.
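The abstract describes estimating a mean update frequency $λ$ by maximum likelihood from the intervals between archived copies and whether each interval contained an update. A minimal sketch of that estimation step is shown below, assuming a Poisson change model over (possibly irregular) snapshot intervals; the function name, the grid-search range, and the example data are illustrative assumptions, not the paper's implementation:

```python
import math

def estimate_lambda(intervals, changed, steps=10000):
    """Maximum-likelihood estimate of the mean update rate (lambda)
    under a Poisson change model.

    intervals: elapsed time between consecutive archived copies
    changed:   whether the page differed across each interval
    """
    def log_likelihood(lam):
        # P(change over interval t) = 1 - exp(-lam * t)
        ll = 0.0
        for t, c in zip(intervals, changed):
            ll += math.log(1.0 - math.exp(-lam * t)) if c else -lam * t
        return ll

    # Simple grid search over candidate rates in (0, 5] (assumed range);
    # a real implementation would use a numeric optimizer instead.
    best_lam, best_ll = None, -math.inf
    for i in range(1, steps + 1):
        lam = i * (5.0 / steps)
        ll = log_likelihood(lam)
        if ll > best_ll:
            best_lam, best_ll = lam, ll
    return best_lam

# Hypothetical example: four 30-day gaps between snapshots,
# with updates observed in two of them.
lam = estimate_lambda([30, 30, 30, 30], [True, False, True, False])
```

With equal intervals $t$ and a changed fraction $p$, this recovers the closed form $\hat{λ} = -\ln(1-p)/t$ (here $\ln(2)/30 \approx 0.023$ updates per day).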
