Paper title
Gating creates slow modes and controls phase-space complexity in GRUs and LSTMs
Paper authors
Paper abstract
Recurrent neural networks (RNNs) are powerful dynamical models for data with complex temporal structure. However, training RNNs has traditionally proved challenging due to exploding or vanishing gradients. RNN models such as LSTMs and GRUs (and their variants) significantly mitigate these training issues by introducing various types of gating units into the architecture. While these gates empirically improve performance, how the addition of gates influences the dynamics and trainability of GRUs and LSTMs is not well understood. Here, we take the perspective of studying randomly initialized LSTMs and GRUs as dynamical systems, and ask how the salient dynamical properties are shaped by the gates. We leverage tools from random matrix theory and mean-field theory to study the state-to-state Jacobians of GRUs and LSTMs. We show that the update gate in the GRU and the forget gate in the LSTM can lead to an accumulation of slow modes in the dynamics. Moreover, the GRU update gate can poise the system at a marginally stable point. The reset gate in the GRU and the output and input gates in the LSTM control the spectral radius of the Jacobian, and the GRU reset gate also modulates the complexity of the landscape of fixed points. Furthermore, for the GRU we obtain a phase diagram describing the statistical properties of fixed points. We also provide a preliminary comparison of training performance across the various dynamical regimes realized by varying hyperparameters. Looking to the future, we have introduced a powerful set of techniques which can be adapted to a broad class of RNNs, to study the influence of various architectural choices on dynamics, and potentially motivate the principled discovery of novel architectures.
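The central object of the abstract's analysis, the state-to-state Jacobian of a randomly initialized gated RNN, can be probed numerically. Below is a minimal sketch (not the paper's method) that builds an autonomous GRU update with Gaussian random weights, computes its Jacobian at a random state by finite differences, and reports the spectral radius; the hidden size `N`, weight scale `g`, and seed are illustrative assumptions, not values from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
N = 64   # hidden size (illustrative choice)
g = 1.5  # weight scale hyperparameter (illustrative choice)

# Random GRU weights; no external input, so the dynamics are autonomous.
Wz = g / np.sqrt(N) * rng.standard_normal((N, N))  # update-gate weights
Wr = g / np.sqrt(N) * rng.standard_normal((N, N))  # reset-gate weights
Wh = g / np.sqrt(N) * rng.standard_normal((N, N))  # candidate-state weights

def gru_step(h):
    """One autonomous GRU update h -> h'."""
    z = sigmoid(Wz @ h)                 # update gate
    r = sigmoid(Wr @ h)                 # reset gate
    h_tilde = np.tanh(Wh @ (r * h))     # candidate state
    return (1.0 - z) * h + z * h_tilde

def jacobian(f, h, eps=1e-6):
    """State-to-state Jacobian df/dh via central finite differences."""
    n = len(h)
    J = np.zeros((n, n))
    for j in range(n):
        e = np.zeros(n)
        e[j] = eps
        J[:, j] = (f(h + e) - f(h - e)) / (2.0 * eps)
    return J

h = rng.standard_normal(N)
J = jacobian(gru_step, h)
eigs = np.linalg.eigvals(J)
print("spectral radius:", np.abs(eigs).max())
```

Repeating this over many random draws while sweeping `g` (and the gate biases) is one simple way to observe empirically how the gates shape the Jacobian spectrum, e.g. eigenvalues clustering near 1 when the update gate saturates.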