Paper Title
tf.data service: A Case for Disaggregating ML Input Data Processing
Paper Authors
Paper Abstract
Machine learning (ML) computations commonly execute on expensive specialized hardware, such as GPUs and TPUs, which provide high FLOPs and performance-per-watt. For cost efficiency, it is essential to keep these accelerators highly utilized. This requires preprocessing input data at the rate at which the accelerators can ingest and perform ML computations on the data. To avoid data stalls, the host CPU and RAM required for input data processing per accelerator core used for ML computations vary across jobs. Hence, the traditional approach of processing input data on ML accelerator hosts with a fixed hardware ratio leads to under-utilizing either the accelerators or the host CPU and RAM. In this paper, we address these concerns by building a disaggregated ML data processing system. We present tf.data service, an open-source disaggregated input data processing service built on top of tf.data in TensorFlow. We show that disaggregating data preprocessing has three key advantages for large-scale ML training jobs. First, the service can horizontally scale out to right-size CPU/RAM host resources for data processing in each job, saving 32x training time and 26x cost, on average. Second, the service can share ephemeral preprocessed data results across jobs, to optimize CPU usage and reduce redundant computations. Finally, the service supports coordinated reads, a technique that avoids stragglers due to different input sizes in distributed training, reducing training time by 2.2x, on average. Our design is inspired by lessons learned from deploying tf.data service in production, including relaxing data visitation guarantees without impacting model accuracy.
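The disaggregation the abstract describes is exposed through the open-source `tf.data.experimental.service` API in TensorFlow. The sketch below is a minimal single-process illustration: the dispatcher and worker are started in-process as stand-ins for the horizontally scalable pool of CPU/RAM hosts that, in a real deployment, run separately from the accelerator hosts. The pipeline structure and parameter values here are illustrative, not taken from the paper's evaluation setup.

```python
# Minimal sketch of offloading tf.data preprocessing to tf.data service.
# In production, the dispatcher and workers run on dedicated CPU/RAM hosts,
# disaggregated from the ML accelerator hosts that consume the data.
import tensorflow as tf

# Start an in-process dispatcher and a single worker (illustrative stand-ins
# for a fleet of data-processing hosts that can be scaled out per job).
dispatcher = tf.data.experimental.service.DispatchServer()
worker = tf.data.experimental.service.WorkerServer(
    tf.data.experimental.service.WorkerConfig(
        dispatcher_address=dispatcher.target.split("://")[1]))

# An input pipeline whose preprocessing step (the map) is executed remotely
# by the service rather than on the accelerator host.
dataset = tf.data.Dataset.range(5)
dataset = dataset.map(lambda x: x * 2)
dataset = dataset.apply(tf.data.experimental.service.distribute(
    processing_mode="parallel_epochs",  # a data visitation guarantee
    service=dispatcher.target))

result = list(dataset.as_numpy_iterator())
```

With `parallel_epochs`, every consumer sees every element each epoch; relaxing such visitation guarantees is one of the production lessons the abstract mentions.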