Vote'n'Rank: Revision of Benchmarking with Social Choice Theory

Paper title

Vote'n'Rank: Revision of Benchmarking with Social Choice Theory

Paper authors

Mark Rofin, Vladislav Mikhailov, Mikhail Florinskiy, Andrey Kravchenko, Elena Tutubalina, Tatiana Shavrina, Daniel Karabekyan, Ekaterina Artemova

Abstract

The development of state-of-the-art systems in different applied areas of machine learning (ML) is driven by benchmarks, which have shaped the paradigm of evaluating generalisation capabilities from multiple perspectives. Although the paradigm is shifting towards more fine-grained evaluation across diverse tasks, the delicate question of how to aggregate the performances has received particular interest in the community. In general, benchmarks follow unspoken utilitarian principles, where systems are ranked based on their mean average score over task-specific metrics. Such an aggregation procedure has been viewed as a sub-optimal evaluation protocol, which may have created the illusion of progress. This paper proposes Vote'n'Rank, a framework for ranking systems in multi-task benchmarks under the principles of social choice theory. We demonstrate that our approach can be efficiently utilised to draw new insights on benchmarking in several ML sub-fields and identify the best-performing systems in research and development case studies. Vote'n'Rank's procedures are more robust than the mean average, while also handling missing performance scores and determining the conditions under which a system becomes the winner.
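
To make the contrast in the abstract concrete, the following is a minimal sketch of how mean-score aggregation and a voting-style aggregation can disagree on the winner. It is not the authors' implementation: the Borda count is used here only as a familiar social choice rule, and the system names, scores, and function names (`rank_by_mean`, `rank_by_borda`) are hypothetical.

```python
from statistics import mean

# Hypothetical benchmark: system -> scores on three tasks (higher is better).
scores = {
    "A": [0.99, 0.50, 0.50],  # one outlier task inflates the mean
    "B": [0.60, 0.60, 0.60],
    "C": [0.55, 0.55, 0.55],
}


def rank_by_mean(scores):
    """Utilitarian aggregation: order systems by mean score over tasks."""
    return sorted(scores, key=lambda s: mean(scores[s]), reverse=True)


def rank_by_borda(scores):
    """Borda-style aggregation: each task acts as a voter ranking the systems.

    On each task a system earns one point per system it outscores.
    Ties within a task are not handled specially in this sketch.
    """
    systems = list(scores)
    n_tasks = len(next(iter(scores.values())))
    points = {s: 0 for s in systems}
    for t in range(n_tasks):
        # Sort from worst to best; the position in this order is the point count.
        ordered = sorted(systems, key=lambda s: scores[s][t])
        for pts, s in enumerate(ordered):
            points[s] += pts
    return sorted(systems, key=lambda s: points[s], reverse=True)


mean_ranking = rank_by_mean(scores)
borda_ranking = rank_by_borda(scores)
print("mean: ", mean_ranking)
print("borda:", borda_ranking)
```

In this toy data, system A tops the mean ranking because of a single very high score, while system B tops the Borda-style ranking because it outperforms the others on two of the three tasks. The sketch assumes complete scores for every system; handling missing scores, which the abstract mentions, is outside its scope.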
