论文标题

索引度量空间以进行精确相似性搜索

Indexing Metric Spaces for Exact Similarity Search

论文作者

Chen, Lu, Gao, Yunjun, Song, Xuan, Li, Zheng, Zhu, Yifan, Miao, Xiaoye, Jensen, Christian S.

论文摘要

随着社会过程的持续数字化,我们看到了可用数据的爆炸。这被称为大数据。在研究环境中,在尝试从大数据中创造价值时,数据的三个方面通常被视为挑战的主要来源:音量,速度和多样性。许多研究涉及数量或速度,而较少的研究涉及多样性。公制空间是解决多样性的理想选择,因为它们可以容纳任何数据,只要它可以配备满足三角形不平等的距离概念即可。为了加速度量空间的搜索,已经提出了用于度量数据的索引技术的集合。但是,现有的调查提供了有限的覆盖范围,并且尚未报告全面的实证研究。我们对支持确切相似性搜索的现有度量指数进行了全面调查:我们总结了公制索引使用的现有分区,修剪和验证技术,以支持确切的相似性搜索;我们提供指数构建的时间和空间复杂性分析;我们提供了他们查询处理性能的经验比较。在评估度量索引性能时,经验研究很重要,因为性能可以在很大程度上取决于可用的修剪和验证的有效性以及数据分布,这意味着复杂性分析通常提供有限的见解。本文旨在揭示不同索引技术的优势和劣势,以提供为给定设置选择适当的索引技术的指导,并为未来的公制索引研究提供指导。

With the continued digitization of societal processes, we are seeing an explosion in available data. This is referred to as big data. In a research setting, three aspects of the data are often viewed as the main sources of challenges when attempting to enable value creation from big data: volume, velocity, and variety. Many studies address volume or velocity, while fewer studies concern the variety. Metric spaces are ideal for addressing variety because they can accommodate any data as long as it can be equipped with a distance notion that satisfies the triangle inequality. To accelerate search in metric spaces, a collection of indexing techniques for metric data have been proposed. However, existing surveys offer limited coverage, and a comprehensive empirical study exists has yet to be reported. We offer a comprehensive survey of existing metric indexes that support exact similarity search: we summarize existing partitioning, pruning, and validation techniques used by metric indexes to support exact similarity search; we provide the time and space complexity analyses of index construction; and we offer an empirical comparison of their query processing performance. Empirical studies are important when evaluating metric indexing performance, because performance can depend highly on the effectiveness of available pruning and validation as well as on the data distribution, which means that complexity analyses often offer limited insights. This article aims at revealing strengths and weaknesses of different indexing techniques to offer guidance on selecting an appropriate indexing technique for a given setting, and to provide directions for future research on metric indexing.

扫码加入交流群

加入微信交流群

微信交流群二维码

扫码加入学术交流群,获取更多资源