Paper Title

cu_FastTucker: A Faster and Stabler Stochastic Optimization for Parallel Sparse Tucker Decomposition on Multi-GPUs

Paper Author

Li, Zixuan

Paper Abstract

High-Order, High-Dimension, and Sparse Tensor (HOHDST) data originates from real industrial applications, e.g., social networks, recommender systems, bio-information, and traffic information. Sparse Tensor Decomposition (STD) can project HOHDST data into a low-rank space. In this work, a novel STD method is proposed that uses a Kruskal product to approximate the core tensor and a stochastic strategy to approximate the whole gradient; it comprises the following two parts: (1) the matricization (unfolding) order of the Kruskal product for the core tensor follows the multiplication order of the factor matrices, so the proposed theorem reduces the exponential computational overhead to a linear one; (2) the stochastic strategy adopts a one-step random sampling set, whose volume is much smaller than the original one, to approximate the whole gradient. Meanwhile, this method guarantees convergence and saves memory overhead. Due to the compactness of the same-order matrix multiplications and the parallel access enabled by the stochastic strategy, the speed of cuFastTucker can be further reinforced by GPUs. Furthermore, because large-scale HOHDST data cannot be accommodated in a single GPU, a data division and communication strategy of cuFastTucker is proposed for data accommodation on multi-GPU systems. cuFastTucker achieves the fastest speed while keeping the same accuracy and a much lower memory overhead than the SOTA algorithms, e.g., P-Tucker, Vest, and SGD_Tucker. The code and partial datasets are publicly available at https://github.com/ZixuanLi-China/FastTucker.
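
The stochastic strategy described in the abstract (sampling a small set of observed entries per step and updating only the factor-matrix rows they touch) can be illustrated with a minimal sketch. The Python/NumPy code below is a hedged illustration, not the authors' implementation: it uses a plain dense core tensor instead of the paper's Kruskal-structured core and runs on the CPU instead of CUDA; all names (sgd_step_sparse_tucker, coords, vals, factors, core) are hypothetical.

```python
import numpy as np

# Hypothetical sketch: one SGD step over a sampled mini-batch of observed
# entries of a 3rd-order sparse tensor, with a plain dense core tensor.
# This is NOT the paper's Kruskal-structured core or its CUDA kernels.
def sgd_step_sparse_tucker(coords, vals, factors, core, lr=0.01, lam=0.01,
                           batch=256, rng=None):
    rng = rng or np.random.default_rng()
    # one-step random sampling: a batch much smaller than the full entry set
    idx = rng.choice(len(vals), size=min(batch, len(vals)), replace=False)
    A, B, C = factors                     # factor matrices (I x R1, J x R2, K x R3)
    for n in idx:
        i, j, k = coords[n]
        # predicted entry: core contracted with the three touched factor rows
        pred = np.einsum('pqr,p,q,r->', core, A[i], B[j], C[k])
        err = pred - vals[n]
        # gradients of the regularized squared error w.r.t. the touched rows
        gA = err * np.einsum('pqr,q,r->p', core, B[j], C[k]) + lam * A[i]
        gB = err * np.einsum('pqr,p,r->q', core, A[i], C[k]) + lam * B[j]
        gC = err * np.einsum('pqr,p,q->r', core, A[i], B[j]) + lam * C[k]
        A[i] -= lr * gA
        B[j] -= lr * gB
        C[k] -= lr * gC
    return (A, B, C)
```

A caller would store coords/vals as the nonzero indices and values of the HOHDST tensor and repeat this step until convergence. In cuFastTucker, as the abstract states, the same per-entry work is parallelized on GPUs and the core-tensor contraction is reorganized via the Kruskal-product theorem so that the exponential computational overhead becomes linear.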
