Paper Title

Statistical Physics of Deep Neural Networks: Initialization toward Optimal Channels

Paper Authors

Weng, Kangyu; Cheng, Aohua; Zhang, Ziyang; Sun, Pei; Tian, Yang

Paper Abstract

In deep learning, neural networks serve as noisy channels between input data and its representation. This perspective naturally relates deep learning to the pursuit of constructing channels with optimal performance in information transmission and representation. While considerable efforts are concentrated on realizing optimal channel properties during network optimization, we study a frequently overlooked possibility that neural networks can be initialized toward optimal channels. Our theory, consistent with experimental validation, identifies the primary mechanisms underlying this possibility and suggests intrinsic connections between statistical physics and deep learning. Unlike conventional theories that characterize neural networks using the classic mean-field approximation, we offer analytic proof that this extensively applied simplification scheme is not valid for studying neural networks as information channels. To fill this gap, we develop a corrected mean-field framework for characterizing the limiting behaviors of information propagation in neural networks without strong assumptions on inputs. Building on this framework, we propose an analytic theory to prove that mutual information maximization is realized between inputs and propagated signals when neural networks are initialized at dynamic isometry, a case where information is transmitted via norm-preserving mappings. These theoretical predictions are validated by experiments on real neural networks, suggesting the robustness of our theory against finite-size effects. Finally, we analyze our findings with information bottleneck theory to confirm the precise relations among dynamic isometry, mutual information maximization, and optimal channel properties in deep learning.
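The abstract's notion of dynamic isometry refers to initializations whose input-output Jacobian has singular values concentrated near 1, so that signals propagate through approximately norm-preserving mappings. The sketch below is a minimal, hypothetical illustration (not the paper's code) of one standard way to approach this regime in a deep linear network, namely orthogonal weight initialization, contrasted with i.i.d. Gaussian initialization; all function and variable names are our own assumptions for illustration.

```python
# Minimal sketch (illustrative assumption, not from the paper): initialize a deep
# linear network at dynamic isometry via orthogonal weights, then check that the
# end-to-end Jacobian's singular values concentrate near 1.
import numpy as np

def orthogonal(shape, rng):
    """Return a random orthogonal matrix via QR decomposition."""
    a = rng.standard_normal(shape)
    q, r = np.linalg.qr(a)
    # Scale columns by the signs of r's diagonal so q is uniformly distributed.
    return q * np.sign(np.diag(r))

rng = np.random.default_rng(0)
width, depth = 256, 20

# Orthogonal initialization: each layer map is exactly norm-preserving for a
# linear network, hence so is their product (the input-output Jacobian).
weights = [orthogonal((width, width), rng) for _ in range(depth)]
jacobian = np.eye(width)
for W in weights:
    jacobian = W @ jacobian
svals = np.linalg.svd(jacobian, compute_uv=False)
print(f"orthogonal init: min={svals.min():.3f}, max={svals.max():.3f}")
# Expected: all singular values equal 1 up to numerical error.

# For comparison, i.i.d. Gaussian initialization (variance 1/width) typically
# yields a broad singular-value spectrum at large depth, breaking isometry.
gauss = [rng.standard_normal((width, width)) / np.sqrt(width) for _ in range(depth)]
jac_g = np.eye(width)
for W in gauss:
    jac_g = W @ jac_g
svals_g = np.linalg.svd(jac_g, compute_uv=False)
print(f"gaussian init:   min={svals_g.min():.3e}, max={svals_g.max():.3e}")
```

For nonlinear networks the same idea applies layer by layer: one tunes the weight and bias variances (and uses orthogonal weights) so that the layer-to-layer Jacobians remain close to isometries, which is the setting in which the paper argues mutual information between inputs and propagated signals is maximized.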
