Paper Title
Schrödinger's FP: Dynamic Adaptation of Floating-Point Containers for Deep Learning Training
Paper Authors
Paper Abstract
The transfer of tensors from/to memory during neural network training dominates time and energy. To improve energy efficiency and performance, research has been exploring ways to use narrower data representations. So far, these attempts have relied on user-directed trial-and-error to achieve convergence. We present methods that relieve users of this responsibility. Our methods dynamically adjust the size and format of the floating-point containers used for activations and weights during training, achieving adaptivity across three dimensions: i) which datatype to use, ii) on which tensor, and iii) how it changes over time. The different meanings and distributions of exponents and mantissas lead us to tailored approaches for each. We present two pairs of lossy methods to eliminate as many mantissa and exponent bits as possible without affecting accuracy. Quantum Mantissa and Quantum Exponent are machine learning compression methods that tap into the gradient descent algorithm to learn the minimal mantissa and exponent bitlengths at per-layer granularity. They automatically learn that many tensors can use just 1 or 2 mantissa bits and 3 or 4 exponent bits. Overall, the two machine learning methods reduce the footprint by $4.74\times$. Alternatively, BitWave observes changes in the loss function during training to adjust mantissa and exponent bitlengths network-wide, yielding a $3.19\times$ reduction in footprint. Finally, we present an optional method, Gecko, that exploits the naturally emerging, lop-sided exponent distribution to losslessly compress the exponents produced by Quantum Exponent or BitWave, improving compression rates, on average, to $5.64\times$ and $4.56\times$, respectively.
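The abstract's central mechanism, learning bitlengths through the same gradient descent that trains the network, can be illustrated concretely. Below is a minimal, hypothetical PyTorch sketch of the Quantum Mantissa idea, not the authors' implementation: a learnable per-tensor mantissa bitlength fake-quantizes activations via a straight-through rounding estimator, and a bit-count penalty added to the loss drives the bitlength down until accuracy would be affected. The names `LearnedMantissaQuant` and `bit_penalty`, the penalty weight, and the exponent handling are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class LearnedMantissaQuant(nn.Module):
    """Fake-quantize a tensor's mantissa to a learnable number of bits.

    Sketch only: the bitlength is a continuous parameter updated by the
    same gradient descent that trains the network. Straight-through
    rounding lets a gradient reach the bitlength through the
    quantization error.
    """

    def __init__(self, init_bits: float = 8.0):
        super().__init__()
        self.bits = nn.Parameter(torch.tensor(init_bits))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n = self.bits.clamp(1.0, 23.0)         # FP32 carries at most 23 mantissa bits
        with torch.no_grad():                  # frexp-style exponent, treated as a constant
            exp = torch.floor(torch.log2(x.abs().clamp_min(1e-30))) + 1.0
        mant = x * torch.pow(2.0, -exp)        # |mant| in [0.5, 1)
        scale = torch.pow(2.0, n)
        y = mant * scale
        y = y + (torch.round(y) - y).detach()  # straight-through round
        return (y / scale) * torch.pow(2.0, exp)


def bit_penalty(quantizers, weight: float = 1e-3) -> torch.Tensor:
    """Regularizer rewarding shorter mantissas; the weight is an assumption."""
    return weight * sum(q.bits.clamp(1.0, 23.0) for q in quantizers)


# Toy usage: learn the activation bitlength jointly with a small regressor.
torch.manual_seed(0)
fc1, fc2, quant = nn.Linear(16, 32), nn.Linear(32, 1), LearnedMantissaQuant()
params = list(fc1.parameters()) + list(fc2.parameters()) + list(quant.parameters())
opt = torch.optim.SGD(params, lr=0.05)

x, target = torch.randn(64, 16), torch.randn(64, 1)
for _ in range(200):
    opt.zero_grad()
    h = torch.relu(quant(fc1(x)))              # quantize one activation tensor
    loss = nn.functional.mse_loss(fc2(h), target) + bit_penalty([quant])
    loss.backward()
    opt.step()

print(f"learned mantissa bitlength: {quant.bits.item():.2f}")
```

In the paper, an analogous learned bitlength also covers exponents (Quantum Exponent), while BitWave instead monitors the loss trajectory to step bitlengths up or down without extra learned parameters; this sketch only illustrates the gradient-descent mechanism the abstract describes.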