用于估计网页变更率的在线算法

论文标题

用于估计网页变更率的在线算法

Online Algorithms for Estimating Change Rates of Web Pages

论文作者

Avrachenkov, Konstantin, Patil, Kishor, Thoppe, Gugan

论文摘要

搜索引擎维护不同网页的本地副本，以提供快速搜索结果。该本地缓存由Web轨道搜寻器保持最新状态，该搜寻器经常访问这些不同的页面以跟踪其中的变化。理想情况下，应在网络上更改页面后立即更新本地副本。但是，有限的带宽可用性和服务器限制限制了可以将不同页面拖延的频率。这带来了以下优化问题：最大化本地高速缓存的新鲜度，但要受到规定范围内的爬行频率的影响。虽然确实存在可解决此问题的可处理算法，但这些问题要么假设确切的页面变更率的知识，要么使用效率低下的方法（例如MLE）来估计相同的方法。我们在这里解决这个问题。我们提供了三个新型方案，用于在线估算页面变更率，所有迭代的运行时间都极低。第一个是基于大数字定律，第二个是基于随机近似的。第三个是第二次的延伸，其中包括一个重球动量术语。所有这些方案只需要有关页面更改过程的部分信息，即，自上次爬行实例以来，他们只需要知道该页面是否已更改。我们的主要理论结果涉及这三个方案的渐近收敛和收敛速率。实际上，当梯度和噪声方差均不统一时，我们的工作是第一个显示原始随机重球方法的收敛性。我们还提供了一些数值实验（基于实际和合成数据），以证明我们所提出的估计量的优越性，而不是MLE等现有估计值。我们强调，我们的算法也很容易适用于数据库和网络库存管理的同步。

A search engine maintains local copies of different web pages to provide quick search results. This local cache is kept up-to-date by a web crawler that frequently visits these different pages to track changes in them. Ideally, the local copy should be updated as soon as a page changes on the web. However, finite bandwidth availability and server restrictions limit how frequently different pages can be crawled. This brings forth the following optimization problem: maximize the freshness of the local cache subject to the crawling frequencies being within prescribed bounds. While tractable algorithms do exist to solve this problem, these either assume the knowledge of exact page change rates or use inefficient methods such as MLE for estimating the same. We address this issue here. We provide three novel schemes for online estimation of page change rates, all of which have extremely low running times per iteration. The first is based on the law of large numbers and the second on stochastic approximation. The third is an extension of the second and includes a heavy-ball momentum term. All these schemes only need partial information about the page change process, i.e., they only need to know if the page has changed or not since the last crawled instance. Our main theoretical results concern asymptotic convergence and convergence rates of these three schemes. In fact, our work is the first to show convergence of the original stochastic heavy-ball method when neither the gradient nor the noise variance is uniformly bounded. We also provide some numerical experiments (based on real and synthetic data) to demonstrate the superiority of our proposed estimators over existing ones such as MLE. We emphasize that our algorithms are also readily applicable to the synchronization of databases and network inventory management.

下载PDF全文

下载文献需遵守相关版权规定

论文标题