Paper Title
Model Degradation Hinders Deep Graph Neural Networks
Paper Authors
Paper Abstract
Graph Neural Networks (GNNs) have achieved great success in various graph mining tasks. However, drastic performance degradation is always observed when a GNN is stacked with many layers. As a result, most GNNs only have shallow architectures, which limits their expressive power and their exploitation of deep neighborhoods. Most recent studies attribute the performance degradation of deep GNNs to the \textit{over-smoothing} issue. In this paper, we disentangle the conventional graph convolution operation into two independent operations: \textit{Propagation} (\textbf{P}) and \textit{Transformation} (\textbf{T}). Following this, the depth of a GNN can be split into the propagation depth ($D_p$) and the transformation depth ($D_t$). Through extensive experiments, we find that the major cause of the performance degradation of deep GNNs is the \textit{model degradation} issue caused by large $D_t$, rather than the \textit{over-smoothing} issue mainly caused by large $D_p$. Further, we present \textit{Adaptive Initial Residual} (AIR), a plug-and-play module compatible with all kinds of GNN architectures, to alleviate the \textit{model degradation} and \textit{over-smoothing} issues simultaneously. Experimental results on six real-world datasets demonstrate that GNNs equipped with AIR outperform most GNNs with shallow architectures owing to the benefits of both large $D_p$ and large $D_t$, while the time costs associated with AIR are negligible.
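To make the P/T decoupling and the AIR idea concrete, below is a minimal PyTorch sketch based only on the abstract. The class name `DecoupledGNN`, the per-node sigmoid gate over the concatenation of the propagated features and the initial features, and the choice to apply AIR only during the propagation stage are all assumptions for illustration; the paper's actual formulation of AIR may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoupledGNN(nn.Module):
    """Illustrative sketch (not the authors' released code): a GNN with
    decoupled Propagation (P) and Transformation (T) depths, plus an
    AIR-style adaptive initial residual applied at each propagation step.
    `adj_norm` is assumed to be a normalized sparse adjacency matrix.
    """
    def __init__(self, in_dim, hid_dim, out_dim, d_p, d_t):
        super().__init__()
        # D_t stacked feature transformations (the T operations).
        dims = [in_dim] + [hid_dim] * (d_t - 1) + [out_dim]
        self.transforms = nn.ModuleList(
            nn.Linear(dims[i], dims[i + 1]) for i in range(d_t)
        )
        # One learned gate per propagation step for the adaptive
        # initial-residual mixing (an assumed, simple instantiation).
        self.gates = nn.ModuleList(
            nn.Linear(2 * in_dim, 1) for _ in range(d_p)
        )

    def forward(self, x, adj_norm):
        # --- P: D_p parameter-free propagation steps with AIR ---
        h0, h = x, x
        for gate in self.gates:
            p = torch.sparse.mm(adj_norm, h)              # one P step
            alpha = torch.sigmoid(gate(torch.cat([p, h0], dim=-1)))
            h = alpha * p + (1 - alpha) * h0              # mix back toward h0
        # --- T: D_t transformation steps ---
        for i, lin in enumerate(self.transforms):
            h = lin(h)
            if i < len(self.transforms) - 1:
                h = F.relu(h)
        return h
```

One appeal of this decoupled design is that $D_p$ can be made large almost for free: the P steps are parameter-free apart from the small gates, so deep neighborhoods can be exploited without deepening the learned transformation stack ($D_t$), which the abstract identifies as the source of model degradation.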