Paper Title


PLAID: An Efficient Engine for Late Interaction Retrieval

Paper Authors

Keshav Santhanam, Omar Khattab, Christopher Potts, Matei Zaharia

Paper Abstract


Pre-trained language models are increasingly important components across multiple information retrieval (IR) paradigms. Late interaction, introduced with the ColBERT model and recently refined in ColBERTv2, is a popular paradigm that holds state-of-the-art status across many benchmarks. To dramatically speed up the search latency of late interaction, we introduce the Performance-optimized Late Interaction Driver (PLAID). Without impacting quality, PLAID swiftly eliminates low-scoring passages using a novel centroid interaction mechanism that treats every passage as a lightweight bag of centroids. PLAID uses centroid interaction as well as centroid pruning, a mechanism for sparsifying the bag of centroids, within a highly optimized engine to reduce late interaction search latency by up to 7$\times$ on a GPU and 45$\times$ on a CPU against vanilla ColBERTv2, while continuing to deliver state-of-the-art retrieval quality. This allows the PLAID engine with ColBERTv2 to achieve latency of tens of milliseconds on a GPU and just a few hundred milliseconds on a CPU, even at the largest scale we evaluate with 140M passages.
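The abstract's core idea, treating each passage as a bag of centroid ids and pruning centroids far from every query token, can be illustrated with a minimal NumPy sketch. This is an assumption-laden simplification, not PLAID's actual implementation: the function name, the `prune_threshold` parameter, and the plain-Python loop are illustrative only, and the real engine uses compressed residuals and optimized kernels.

```python
import numpy as np

def centroid_interaction_scores(Q, C, passages, prune_threshold=0.4):
    """Approximate MaxSim scoring where each passage is a bag of centroid ids.

    Q: (num_query_tokens, dim) query token embeddings (unit-normalized)
    C: (num_centroids, dim) centroid embeddings (unit-normalized)
    passages: list of 1-D int arrays, each holding one passage's centroid ids
    """
    # Query-to-centroid similarities, computed once per query rather than
    # once per passage -- this is what makes centroid interaction cheap.
    S = Q @ C.T  # (num_query_tokens, num_centroids)

    # Centroid pruning: keep only centroids close to at least one query token.
    keep = S.max(axis=0) >= prune_threshold

    scores = []
    for ids in passages:
        ids = np.asarray(ids)
        ids = ids[keep[ids]]  # sparsify this passage's bag of centroids
        if ids.size == 0:
            scores.append(0.0)
            continue
        # MaxSim over surviving centroids, summed over query tokens.
        scores.append(float(S[:, ids].max(axis=1).sum()))
    return np.array(scores)
```

In a multi-stage pipeline, scores like these would be used to discard low-scoring passages early, so that exact late-interaction scoring with full token embeddings runs only on a small surviving candidate set.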
