论文标题
与MPI的分布式TERA级相似性搜索:在没有单距离计算的数十亿个超过数十亿个搜索
Distributed Tera-Scale Similarity Search with MPI: Provably Efficient Similarity Search over billions without a Single Distance Computation
论文作者
论文摘要
我们提出了基于MPI(消息传递接口)的分布式系统,以示意斜线(草图的局部性敏感哈希),以通过Terabyte scale数据集进行近似相似性搜索。 Slash提供了流行的LSH(局部敏感哈希)算法的多节点实现,该算法通常在单个计算机上实现。我们展示了如何使用重型击球手素描附加LSH算法,以证明无需单个距离计算即可解决(高)相似性搜索问题。总体而言,我们在数学上表明,在现实的数据假设下,我们可以识别给定查询$ q $ sub-linear($ \ ll o(n)$)的近邻居,仅简单的草图聚合操作。为了使这样的系统实用,我们提供了一种新颖的设计和草图解决方案,以指数级减少机间通信开销。在直接对可比硬件的直接比较中,Slash比Pyspark中流行的LSH软件包快10000倍以上。 Pyspark是针对大型数据集的LSH算法的广泛补充的分布式实现,并部署在商业平台中。最后,我们展示了如何使用超过40亿个样品的系统尺度到TERA级的Criteo数据集。 Slash可以在一个小时的时间内索引此2.3 Terabyte数据,超过20个节点,而查询时间则以毫秒为单位。据我们所知,没有开源系统可以用商品集群对Criteo进行索引并进行相似性搜索。
We present SLASH (Sketched LocAlity Sensitive Hashing), an MPI (Message Passing Interface) based distributed system for approximate similarity search over terabyte scale datasets. SLASH provides a multi-node implementation of the popular LSH (locality sensitive hashing) algorithm, which is generally implemented on a single machine. We show how we can append the LSH algorithm with heavy hitters sketches to provably solve the (high) similarity search problem without a single distance computation. Overall, we mathematically show that, under realistic data assumptions, we can identify the near-neighbor of a given query $q$ in sub-linear ($ \ll O(n)$) number of simple sketch aggregation operations only. To make such a system practical, we offer a novel design and sketching solution to reduce the inter-machine communication overheads exponentially. In a direct comparison on comparable hardware, SLASH is more than 10000x faster than the popular LSH package in PySpark. PySpark is a widely-adopted distributed implementation of the LSH algorithm for large datasets and is deployed in commercial platforms. In the end, we show how our system scale to Tera-scale Criteo dataset with more than 4 billion samples. SLASH can index this 2.3 terabyte data over 20 nodes in under an hour, with query times in a fraction of milliseconds. To the best of our knowledge, there is no open-source system that can index and perform a similarity search on Criteo with a commodity cluster.