窗格：可扩展有效的归因网络嵌入

论文标题

窗格：可扩展有效的归因网络嵌入

PANE: scalable and effective attributed network embedding

论文作者

Yang, Renchi, Shi, Jieming, Xiao, Xiaokui, Yang, Yin, Bhowmick, Sourav S., Liu, Juncheng

论文摘要

给定每个节点与一组属性关联的图G，属性网络嵌入（ANE）将每个节点V中的每个节点V映射到紧凑型向量XV，可用于下游机器学习任务。理想情况下，XV应捕获节点V对每个属性的亲和力，这不仅考虑V的属性关联，而且考虑其沿G中边缘的连接节点的属性关联。将有效的ANE计算扩展到大量图表将问题的难度推向了一个全新的水平。现有的解决方案在此类图上很大程度上失败了，导致成本高昂，低质量的嵌入或两者兼而有之。本文提出了窗格，这是用于大量图表的ANE计算方法的有效且可扩展的方法，该图表可在多个基准数据集上实现最先进的结果质量。窗格通过3种主要算法设计获得了高可扩展性和有效性。首先，它根据属性网络的新型随机步行模型制定了学习目标。其次，窗格包括上述优化问题的高效求解器，其关键模块是对嵌入的精心设计的初始化，从而大大减少了收敛所需的迭代次数。最后，窗格通过上述求解器的非平行平行化利用了多核CPU，该求解器可达到可扩展性，同时保留了所得嵌入的高质量。窗格的性能取决于输入网络中属性的数量。为了处理具有许多属性的大型网络，我们将窗格进一步扩展到窗格++。比较了8个真实数据集上的10种现有方法的广泛实验，表明窗格和窗格++在结果质量方面始终优于所有现有方法，同时更快的数量级。

Given a graph G where each node is associated with a set of attributes, attributed network embedding (ANE) maps each node v in G to a compact vector Xv, which can be used in downstream machine learning tasks. Ideally, Xv should capture node v's affinity to each attribute, which considers not only v's own attribute associations, but also those of its connected nodes along edges in G. It is challenging to obtain high-utility embeddings that enable accurate predictions; scaling effective ANE computation to massive graphs pushes the difficulty of the problem to a whole new level. Existing solutions largely fail on such graphs, leading to prohibitive costs, low-quality embeddings, or both. This paper proposes PANE, an effective and scalable approach to ANE computation for massive graphs that achieves state-of-the-art result quality on multiple benchmark datasets. PANE obtains high scalability and effectiveness through 3 main algorithmic designs. First, it formulates the learning objective based on a novel random walk model for attributed networks. Second, PANE includes a highly efficient solver for the above optimization problem, whose key module is a carefully designed initialization of the embeddings, which drastically reduces the number of iterations required to converge. Finally, PANE utilizes multi-core CPUs through non-trivial parallelization of the above solver, which achieves scalability while retaining the high quality of the resulting embeddings. The performance of PANE depends upon the number of attributes in the input network. To handle large networks with numerous attributes, we further extend PANE to PANE++. Extensive experiments, comparing 10 existing approaches on 8 real datasets, demonstrate that PANE and PANE++ consistently outperform all existing methods in terms of result quality, while being orders of magnitude faster.

下载PDF全文

下载文献需遵守相关版权规定

论文标题