Access Trends of In-network Cache for Scientific Data

Paper Title

Access Trends of In-network Cache for Scientific Data

Authors

Ruize Han, Alex Sim, Kesheng Wu, Inder Monga, Chin Guok, Frank Würthwein, Diego Davila, Justas Balcas, Harvey Newman

Abstract

Scientific collaborations increasingly rely on large volumes of data, and many of them employ tiered systems to replicate that data to their worldwide user communities. Each user in a community often selects a different subset of data for their analysis tasks; however, members of a research group often work on related topics that require similar data objects, so a significant amount of data sharing is possible. In this work, we study the access traces of a federated storage cache known as the Southern California Petabyte Scale Cache. By studying the access patterns of this caching system and its potential for reducing network traffic, we aim to explore the predictability of cache use and the potential for more general in-network data caching. Our study shows that this distributed storage cache is able to reduce network traffic volume by a factor of 2.35 during a part of the study period. We further show that machine learning models could predict cache utilization with an accuracy of 0.88. This demonstrates that such cache usage is predictable, which could be useful for managing complex networking resources such as in-network caching.
