Paper Title
Algorithm and Hardware Co-Design of Energy-Efficient LSTM Networks for Video Recognition with Hierarchical Tucker Tensor Decomposition
Paper Authors
Paper Abstract
Long short-term memory (LSTM) is a type of powerful deep neural network that has been widely used in many sequence analysis and modeling applications. However, the large model size of LSTM networks makes their practical deployment very challenging, especially for video recognition tasks that require high-dimensional input data. Aiming to overcome this limitation and fully unlock the potential of LSTM models, in this paper we propose to perform algorithm and hardware co-design towards high-performance, energy-efficient LSTM networks. At the algorithm level, we develop a fully decomposed hierarchical Tucker (FDHT) structure-based LSTM, namely FDHT-LSTM, which enjoys ultra-low model complexity while still achieving high accuracy. To fully reap this attractive algorithmic benefit, we further develop a corresponding customized hardware architecture to support efficient execution of the proposed FDHT-LSTM model. With a carefully designed memory access scheme, the complicated matrix transformations can be supported by the underlying hardware on the fly, without any access conflicts. Our evaluation results show that both the proposed ultra-compact FDHT-LSTM models and the corresponding hardware accelerator achieve very high performance. Compared with state-of-the-art compressed LSTM models, FDHT-LSTM enjoys both an order-of-magnitude reduction in model size and significant accuracy improvements across different video recognition datasets. Meanwhile, compared with TIE, the state-of-the-art hardware accelerator for tensor-decomposed models, our proposed FDHT-LSTM architecture achieves better throughput, area efficiency, and energy efficiency on the LSTM-Youtube workload. For the LSTM-UCF workload, our proposed design also outperforms TIE with higher throughput, higher energy efficiency, and comparable area efficiency.
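To give intuition for the order-of-magnitude model-size reduction the abstract claims, the following is a minimal, hypothetical sketch (not the paper's exact FDHT scheme) of why tensor-decomposed layers are so compact: a dense LSTM weight matrix is viewed as a higher-order tensor whose input and output dimensions are factored into small modes, and the full matrix is replaced by a chain of small per-mode factor cores. The mode shapes and the rank below are illustrative assumptions.

```python
import numpy as np

# Assumed factorizations of a 512 x 64 dense weight matrix:
in_modes = (8, 8, 8)    # 512 = 8 * 8 * 8  (input dimension split into modes)
out_modes = (4, 4, 4)   # 64  = 4 * 4 * 4  (output dimension split into modes)
rank = 4                # illustrative internal rank of the decomposition

# Parameters of the uncompressed dense matrix.
dense_params = int(np.prod(in_modes) * np.prod(out_modes))

# Parameters of a factorized form: one small core of shape
# (rank, m_i, n_i, rank) per mode pair, mimicking how tensor
# decompositions store a chain/tree of small factors.
factor_params = sum(rank * m * n * rank for m, n in zip(in_modes, out_modes))

print(dense_params)                   # 32768
print(factor_params)                  # 1536
print(dense_params / factor_params)   # ~21x fewer parameters
```

The compression ratio grows quickly with the input dimension, which is why the gains are largest for video workloads whose frame features are very high-dimensional; the paper's hierarchical Tucker structure exploits the same principle with its own specific factor layout.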