Paper Title


Human Preferences as Dueling Bandits

Paper Authors

Xinyi Yan, Chengxi Luo, Charles L. A. Clarke, Nick Craswell, Ellen M. Voorhees, Pablo Castells

Abstract

The dramatic improvements in core information retrieval tasks engendered by neural rankers create a need for novel evaluation methods. If every ranker returns highly relevant items in the top ranks, it becomes difficult to recognize meaningful differences between them and to build reusable test collections. Several recent papers explore pairwise preference judgments as an alternative to traditional graded relevance assessments. Rather than viewing items one at a time, assessors view items side-by-side and indicate the one that provides the better response to a query, allowing fine-grained distinctions. If we employ preference judgments to identify the probably best items for each query, we can measure rankers by their ability to place these items as high as possible. We frame the problem of finding best items as a dueling bandits problem. While many papers explore dueling bandits for online ranker evaluation via interleaving, they have not been considered as a framework for offline evaluation via human preference judgments. We review the literature for possible solutions. For human preference judgments, any usable algorithm must tolerate ties, since two items may appear nearly equal to assessors, and it must minimize the number of judgments required for any specific pair, since each such comparison requires an independent assessor. Since the theoretical guarantees provided by most algorithms depend on assumptions that are not satisfied by human preference judgments, we simulate selected algorithms on representative test cases to provide insight into their practical utility. Based on these simulations, one algorithm stands out for its potential. Our simulations suggest modifications to further improve its performance. Using the modified algorithm, we collect over 10,000 preference judgments for submissions to the TREC 2021 Deep Learning Track, confirming its suitability.
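The dueling-bandits framing described above can be illustrated with a minimal simulation. The sketch below is an illustrative assumption, not the algorithm selected in the paper: it finds a probable best item through noisy pairwise preference judgments that may end in a tie, while tracking how many judgments are spent. The `judge` and `best_item` functions, the tie margin, and the per-pair judgment budget are all hypothetical choices for demonstration.

```python
import random

def judge(i, j, quality, tie_margin=0.05, rng=random):
    """Simulate one assessor's preference between items i and j.

    Returns +1 if i is preferred, -1 if j is preferred, 0 for a tie.
    The judgment is noisy, so items of similar quality often tie,
    mirroring the paper's observation that algorithms must tolerate ties.
    """
    observed = quality[i] - quality[j] + rng.gauss(0.0, 0.1)
    if abs(observed) < tie_margin:
        return 0
    return 1 if observed > 0 else -1

def best_item(items, quality, judgments_per_pair=5, rng=random):
    """Sequential-elimination sweep: each remaining item challenges the
    current champion with a small, fixed judgment budget (each comparison
    stands in for an independent assessor). Ties leave the champion in
    place. Returns (champion, total judgments used)."""
    champion = items[0]
    total = 0
    for challenger in items[1:]:
        score = 0
        for _ in range(judgments_per_pair):
            score += judge(challenger, champion, quality, rng=rng)
            total += 1
        if score > 0:  # challenger preferred on balance
            champion = challenger
    return champion, total

# Hypothetical latent qualities for four items.
rng = random.Random(0)
quality = {"A": 0.9, "B": 0.4, "C": 0.7, "D": 0.2}
winner, n = best_item(list(quality), quality, rng=rng)
```

With four items, the sweep uses exactly `judgments_per_pair × 3` comparisons; real dueling-bandits algorithms instead adapt the per-pair budget, which is what the paper's simulations evaluate.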
