Paper Title

DisaggRec: Architecting Disaggregated Systems for Large-Scale Personalized Recommendation

Authors

Liu Ke, Xuan Zhang, Benjamin Lee, G. Edward Suh, Hsien-Hsin S. Lee

Abstract

Deep learning-based personalized recommendation systems are widely used for online user-facing services in production datacenters, where a large amount of hardware resources are procured and managed to reliably provide low-latency services without disruption. As the recommendation models continue to evolve and grow in size, our analysis projects that datacenters deployed with monolithic servers will spend up to 12.4x total cost of ownership (TCO) to meet the requirement of model size and complexity over the next three years. Moreover, through in-depth characterization, we reveal that the monolithic server-based cluster suffers resource idleness and wastes up to 30% TCO by provisioning resources in fixed proportions. To address this challenge, we propose DisaggRec, a disaggregated system for large-scale recommendation serving. DisaggRec achieves the independent decoupled scaling-out of the compute and memory resources to match the changing demands from fast-evolving workloads. It also improves system reliability by segregating the failures of compute nodes and memory nodes. These two main benefits from disaggregation collectively reduce the TCO by up to 49.3%. Furthermore, disaggregation enables flexible and agile provisioning of increasing hardware heterogeneity in future datacenters. By deploying new hardware featuring near-memory processing capability, our evaluation shows that the disaggregated cluster achieves 21%-43.6% TCO savings over the monolithic server-based cluster across a three-year span of model evolution.
