论文标题
通用图压缩:随机块模型
Universal Graph Compression: Stochastic Block Models
论文作者
论文摘要
本文由处理大规模图数据(例如社交网络和生物网络)的普遍数据科学应用,本文以标记图的形式研究了对数据的无损压缩。特别是,我们考虑了广泛使用的随机图模型,随机块模型(SBM),该模型捕获了社交网络中的聚类效果。应用了信息理论通用压缩框架,其中一个人旨在设计一个单个压缩机,该压缩机在每个SBM分布的情况下实现渐近最佳的压缩率,而不知道SBM的参数。本文提出了这样的图形压缩机,该论文普遍实现了一类SBM的多项式时间复杂性的最佳压缩率。现有的通用压缩技术主要是为固定的千古序列开发的。但是,SBM的邻接矩阵具有复杂的二维相关性。通过经过精心设计的转换来缓解挑战,该转换将二维相关数据几乎转换为I.I.D.子膜。然后由Krichevsky-Trofimov压缩机压缩了子膜的序列,其长度分析被推广到相同分布但任意相关的序列。在四个基准图数据集中,来自竞争算法的压缩文件的2.4至27倍的空间是拟议方案所需的空间。
Motivated by the prevalent data science applications of processing large-scale graph data such as social networks and biological networks, this paper investigates lossless compression of data in the form of a labeled graph. Particularly, we consider a widely used random graph model, stochastic block model (SBM), which captures the clustering effects in social networks. An information-theoretic universal compression framework is applied, in which one aims to design a single compressor that achieves the asymptotically optimal compression rate, for every SBM distribution, without knowing the parameters of the SBM. Such a graph compressor is proposed in this paper, which universally achieves the optimal compression rate with polynomial time complexity for a wide class of SBMs. Existing universal compression techniques are developed mostly for stationary ergodic one-dimensional sequences. However, the adjacency matrix of SBM has complex two-dimensional correlations. The challenge is alleviated through a carefully designed transform that converts two-dimensional correlated data into almost i.i.d. submatrices. The sequence of submatrices is then compressed by a Krichevsky--Trofimov compressor, whose length analysis is generalized to identically distributed but arbitrarily correlated sequences. In four benchmark graph datasets, the compressed files from competing algorithms take 2.4 to 27 times the space needed by the proposed scheme.