与MPI的分布式TERA级相似性搜索：在没有单距离计算的数十亿个超过数十亿个搜索

论文标题

与MPI的分布式TERA级相似性搜索：在没有单距离计算的数十亿个超过数十亿个搜索

Distributed Tera-Scale Similarity Search with MPI: Provably Efficient Similarity Search over billions without a Single Distance Computation

论文作者

Meisburger, Nicholas, Shrivastava, Anshumali

论文摘要

我们提出了基于MPI（消息传递接口）的分布式系统，以示意斜线（草图的局部性敏感哈希），以通过Terabyte scale数据集进行近似相似性搜索。 Slash提供了流行的LSH（局部敏感哈希）算法的多节点实现，该算法通常在单个计算机上实现。我们展示了如何使用重型击球手素描附加LSH算法，以证明无需单个距离计算即可解决（高）相似性搜索问题。总体而言，我们在数学上表明，在现实的数据假设下，我们可以识别给定查询$ q $ sub-linear（$ \ ll o（n）$）的近邻居，仅简单的草图聚合操作。为了使这样的系统实用，我们提供了一种新颖的设计和草图解决方案，以指数级减少机间通信开销。在直接对可比硬件的直接比较中，Slash比Pyspark中流行的LSH软件包快10000倍以上。 Pyspark是针对大型数据集的LSH算法的广泛补充的分布式实现，并部署在商业平台中。最后，我们展示了如何使用超过40亿个样品的系统尺度到TERA级的Criteo数据集。 Slash可以在一个小时的时间内索引此2.3 Terabyte数据，超过20个节点，而查询时间则以毫秒为单位。据我们所知，没有开源系统可以用商品集群对Criteo进行索引并进行相似性搜索。

We present SLASH (Sketched LocAlity Sensitive Hashing), an MPI (Message Passing Interface) based distributed system for approximate similarity search over terabyte scale datasets. SLASH provides a multi-node implementation of the popular LSH (locality sensitive hashing) algorithm, which is generally implemented on a single machine. We show how we can append the LSH algorithm with heavy hitters sketches to provably solve the (high) similarity search problem without a single distance computation. Overall, we mathematically show that, under realistic data assumptions, we can identify the near-neighbor of a given query $q$ in sub-linear ($ \ll O(n)$) number of simple sketch aggregation operations only. To make such a system practical, we offer a novel design and sketching solution to reduce the inter-machine communication overheads exponentially. In a direct comparison on comparable hardware, SLASH is more than 10000x faster than the popular LSH package in PySpark. PySpark is a widely-adopted distributed implementation of the LSH algorithm for large datasets and is deployed in commercial platforms. In the end, we show how our system scale to Tera-scale Criteo dataset with more than 4 billion samples. SLASH can index this 2.3 terabyte data over 20 nodes in under an hour, with query times in a fraction of milliseconds. To the best of our knowledge, there is no open-source system that can index and perform a similarity search on Criteo with a commodity cluster.

下载PDF全文

下载文献需遵守相关版权规定

论文标题