Paper Title

Moving Beyond Downstream Task Accuracy for Information Retrieval Benchmarking

Authors

Keshav Santhanam, Jon Saad-Falcon, Martin Franz, Omar Khattab, Avirup Sil, Radu Florian, Md Arafat Sultan, Salim Roukos, Matei Zaharia, Christopher Potts

Abstract

Neural information retrieval (IR) systems have progressed rapidly in recent years, in large part due to the release of publicly available benchmarking tasks. Unfortunately, some dimensions of this progress are illusory: the majority of the popular IR benchmarks today focus exclusively on downstream task accuracy and thus conceal the costs incurred by systems that trade away efficiency for quality. Latency, hardware cost, and other efficiency considerations are paramount to the deployment of IR systems in user-facing settings. We propose that IR benchmarks structure their evaluation methodology to include not only metrics of accuracy, but also efficiency considerations such as query latency and the corresponding cost budget for a reproducible hardware setting. For the popular IR benchmarks MS MARCO and XOR-TyDi, we show how the best choice of IR system varies according to how these efficiency considerations are chosen and weighed. We hope that future benchmarks will adopt these guidelines toward more holistic IR evaluation.
