Paper Title


HiMA: A Fast and Scalable History-based Memory Access Engine for Differentiable Neural Computer

Authors

Yaoyu Tao and Zhengya Zhang

Abstract


Memory-augmented neural networks (MANNs) provide better inference performance in many tasks with the help of an external memory. The recently developed differentiable neural computer (DNC) is a MANN that has been shown to excel at representing complicated data structures and learning long-term dependencies. The DNC's higher performance is derived from new history-based attention mechanisms, in addition to the previously used content-based attention mechanisms. History-based mechanisms require a variety of new compute primitives and state memories that are not supported by existing neural network (NN) or MANN accelerators. We present HiMA, a tiled, history-based memory access engine with memories distributed across tiles. HiMA incorporates a multi-mode network-on-chip (NoC) to reduce communication latency and improve scalability. An optimal submatrix-wise memory partition strategy is applied to reduce the amount of NoC traffic, and a two-stage usage sort method leverages distributed tiles to improve computation speed. To make HiMA fundamentally scalable, we create a distributed version of the DNC called DNC-D, which allows almost all memory operations to be applied to local memories, with a trainable weighted summation producing the global memory output. Two approximation techniques, usage skimming and softmax approximation, are proposed to further enhance hardware efficiency. HiMA prototypes are created in RTL and synthesized in a 40nm technology. In simulations, HiMA running DNC and DNC-D demonstrates 6.47x and 39.1x higher speed, 22.8x and 164.3x better area efficiency, and 6.1x and 61.2x better energy efficiency over the state-of-the-art MANN accelerator. Compared to an Nvidia 3080Ti GPU, HiMA demonstrates speedups of up to 437x and 2,646x when running DNC and DNC-D, respectively.
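To illustrate the DNC-D idea described above — each tile operating only on its local memory, with a trainable weighted summation combining per-tile results into the global memory output — here is a minimal NumPy sketch of a distributed content-based read. All function and variable names (`dncd_global_read`, `mix_logits`, etc.) are illustrative assumptions, not the paper's actual interface, and the content-addressing step is a simplified stand-in for the full DNC read head.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e / np.sum(e, axis=-1, keepdims=True)

def dncd_global_read(local_memories, read_key, mix_logits):
    """Sketch of a DNC-D-style distributed read (illustrative only).

    Each tile performs content-based addressing against its own local
    memory; the per-tile read vectors are then combined by a trainable
    weighted sum (weights derived from `mix_logits`, a hypothetical
    learned parameter) to form the global read output.
    """
    tile_reads = []
    for M in local_memories:                        # M: (rows, width), one tile's local memory
        scores = M @ read_key                       # content similarity per row
        scores /= (np.linalg.norm(M, axis=1)
                   * np.linalg.norm(read_key) + 1e-8)  # cosine normalization
        w = softmax(scores)                         # per-tile read weighting
        tile_reads.append(w @ M)                    # local read vector: (width,)
    mix = softmax(mix_logits)                       # trainable mixing weights over tiles
    return np.sum(mix[:, None] * np.stack(tile_reads), axis=0)
```

Because every tile touches only its own memory rows, the expensive addressing work stays local and NoC traffic is limited to the small per-tile read vectors and mixing weights, which is what makes the scheme fundamentally scalable.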
