Paper Title
Unsupervised Space Partitioning for Nearest Neighbor Search
Paper Authors
Paper Abstract
Approximate Nearest Neighbor Search (ANNS) in high-dimensional spaces is crucial for many real-life applications (e.g., e-commerce, web, and multimedia) that deal with an abundance of data. This paper proposes an end-to-end learning framework that couples the partitioning step (a critical step of ANNS) and the learning-to-search step using a custom loss function. A key advantage of our proposed solution is that it does not require any expensive pre-processing of the dataset, which is one of the critical limitations of the state-of-the-art approach. We achieve this by formulating a multi-objective custom loss function that quantifies the quality of a given data-space partition without needing ground-truth labels, making it entirely unsupervised. We also propose an ensembling technique that trains an ensemble of models with varying input weights to the loss function, further enhancing search quality. On several standard ANNS benchmarks, we show that our method beats the state-of-the-art space-partitioning method and the ubiquitous K-means clustering method while using fewer parameters and shorter offline training times. We also show that incorporating our space-partitioning strategy into state-of-the-art ANNS techniques such as ScaNN can significantly improve their performance. Finally, we present our unsupervised partitioning approach as a promising alternative to many widely used clustering methods, such as K-means clustering and DBSCAN.
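To make the idea of an unsupervised, multi-objective partition-quality loss concrete, the sketch below is a minimal JAX illustration, not the paper's actual formulation. It assumes a hypothetical soft linear partitioner and two illustrative objectives, bin balance (KL divergence of bin masses from uniform) and within-bin compactness (a soft K-means-style quantization error); neither objective requires ground-truth nearest-neighbor labels, which is the sense in which such a loss is unsupervised.

```python
# Minimal illustrative sketch (an assumption, not the paper's formulation):
# a differentiable, unsupervised loss that scores a soft K-way partition of
# the data using two hypothetical objectives -- balanced bin sizes and
# within-bin compactness -- so no ground-truth neighbor labels are needed.
import jax
import jax.numpy as jnp


def partition_loss(params, X, k, balance_weight=1.0):
    W, b = params                              # soft linear partitioner: d -> k logits
    probs = jax.nn.softmax(X @ W + b, axis=1)  # soft assignment of each point to k bins
    bin_mass = probs.sum(axis=0) + 1e-9        # expected number of points per bin

    # Objective 1: balanced bins -- KL divergence of bin masses from uniform.
    p = bin_mass / bin_mass.sum()
    balance = jnp.sum(p * jnp.log(p * k))

    # Objective 2: compact bins -- soft K-means-style quantization error.
    centroids = (probs.T @ X) / bin_mass[:, None]                      # (k, d) soft centroids
    sq_dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)  # (n, k)
    compactness = jnp.mean(jnp.sum(probs * sq_dists, axis=1))

    return compactness + balance_weight * balance


# Toy usage: fit the partitioner by gradient descent on the unsupervised loss.
key = jax.random.PRNGKey(0)
X = jax.random.normal(key, (1000, 16))          # 1000 synthetic points in 16 dims
k = 8
params = (0.1 * jax.random.normal(key, (16, k)), jnp.zeros(k))
grad_fn = jax.jit(jax.grad(partition_loss))
for _ in range(200):
    grads = grad_fn(params, X, k)
    params = tuple(p - 0.05 * g for p, g in zip(params, grads))
```

In this toy setup the two terms are traded off by a single hypothetical balance_weight parameter; the abstract's ensembling idea of training multiple models under different input weights to the loss could be mimicked here by repeating the training loop with different values of that weight.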