Paper Title

XEngine: Optimal Tensor Rematerialization for Neural Networks in Heterogeneous Environments

Authors

Manuela Schuler, Richard Membarth, Philipp Slusallek

Abstract

Memory efficiency is crucial in training deep learning networks on resource-restricted devices. During backpropagation, forward tensors are used to calculate gradients. Despite the option of keeping those dependencies in memory until they are reused in backpropagation, some forward tensors can be discarded and recomputed later from saved tensors, so-called checkpoints. This allows, in particular, resource-constrained heterogeneous environments to make use of all available compute devices. Unfortunately, the definition of these checkpoints is a non-trivial problem and poses a challenge to the programmer: improper or excessive recomputations negate the benefit of checkpointing. In this article, we present XEngine, an approach that schedules network operators to heterogeneous devices in low-memory environments by determining checkpoints and recomputations of tensors. Our approach selects suitable resources per timestep and operator and optimizes the end-to-end time for neural networks, taking the memory limitation of each device into account. For this, we formulate a mixed-integer quadratic program (MIQP) to schedule operators of deep learning networks on heterogeneous systems. We compare our MIQP solver XEngine against Checkmate, a mixed-integer linear programming (MILP) approach that solves recomputation on a single device. Our solver finds solutions that are up to 22.5% faster than the fastest Checkmate schedule, in which the network is computed exclusively on a single device. We also find valid schedules for networks making use of both central processing units and graphics processing units if memory limitations do not allow scheduling exclusively to the graphics processing unit.
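The trade-off the abstract describes, keeping forward tensors in memory versus discarding them and recomputing from checkpoints, can be illustrated with a minimal toy simulation. This is a hypothetical sketch of checkpointing on a linear chain of layers, not the paper's MIQP formulation: the `simulate` function, its checkpoint-spacing policy, and the unit memory cost per tensor are all illustrative assumptions.

```python
# Toy model of tensor rematerialization (checkpointing) on a chain of
# n_layers operators, where each forward tensor costs one memory unit.
# Hypothetical sketch; XEngine instead solves an MIQP over devices,
# timesteps, and operators.

def simulate(n_layers, checkpoint_every):
    """Return (peak_memory, recomputations) for backprop over a chain.

    Tensors at indices i % checkpoint_every == 0 are kept (checkpoints);
    all others are discarded after the forward pass and recomputed from
    the nearest earlier checkpoint when the backward pass needs them.
    """
    kept = {i for i in range(n_layers) if i % checkpoint_every == 0}
    peak = len(kept)          # memory held after the forward pass
    recomputations = 0
    for i in reversed(range(n_layers)):  # backward pass needs tensor i
        if i not in kept:
            # Recompute forward from the nearest checkpoint below i.
            start = max(c for c in kept if c < i)
            recomputations += i - start
            # Transient memory: checkpoints plus the recomputed segment.
            peak = max(peak, len(kept) + (i - start))
    return peak, recomputations

# Keeping everything: maximal memory, zero recomputation.
print(simulate(16, 1))   # (16, 0)
# Checkpoint every 4th tensor: less peak memory, extra compute.
print(simulate(16, 4))   # (7, 24)
```

The two calls show exactly the tension the paper optimizes: checkpointing cuts peak memory (16 units down to 7 here) at the price of recomputed forward work, and choosing where to checkpoint, and on which device, is what makes the scheduling problem non-trivial.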
