Title
Corrected Evaluation Results of the NTCIR WWW-2, WWW-3, and WWW-4 English Subtasks
Authors
Abstract
Unfortunately, the official English (sub)task results reported in the NTCIR-14 WWW-2, NTCIR-15 WWW-3, and NTCIR-16 WWW-4 overview papers are incorrect due to noise in the official qrels files; this paper reports results based on the corrected qrels files. The noise is due to a fatal bug in the backend of our relevance assessment interface. More specifically, at WWW-2, WWW-3, and WWW-4, two versions of pool files were created for each English topic: a PRI ("prioritised") file, which uses the NTCIRPOOL script to prioritise likely relevant documents, and a RND ("randomised") file, which randomises the pooled documents. This was done for the purpose of studying the effect of document ordering on relevance assessors. However, the programmer who wrote the interface backend assumed that the combination of a topic ID and a document rank in the pool file uniquely determines a document ID; this is obviously incorrect, as we have two versions of pool files. The outcome is that all the PRI-based relevance labels for the WWW-2 test collection are incorrect (while all the RND-based relevance labels are correct), and all the RND-based relevance labels for the WWW-3 and WWW-4 test collections are incorrect (while all the PRI-based relevance labels are correct). This bug was finally discovered at the NTCIR-16 WWW-4 task when the first seven authors of this paper served as Gold assessors (i.e., topic creators who define what is relevant) and closely examined the disagreements with Bronze assessors (i.e., non-topic-creators; non-experts). We would like to apologise to the WWW participants and the NTCIR chairs for the inconvenience and confusion caused by this bug.
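The indexing flaw described above can be illustrated with a minimal sketch. The pool contents, topic IDs, and function names below are hypothetical; this is not the actual NTCIR assessment backend, only an assumed reconstruction of the faulty keying scheme and its order-independent fix:

```python
# Two versions of the pool for the same topic: a PRI file (likely relevant
# documents first) and a RND file (same documents, randomised order).
# Document IDs and topic IDs here are made up for illustration.
pri_pool = {"T001": ["docC", "docA", "docB"]}
rnd_pool = {"T001": ["docA", "docB", "docC"]}

def record_label_buggy(store, topic_id, rank, label):
    """Buggy scheme: assumes (topic_id, rank) uniquely identifies a document."""
    store[(topic_id, rank)] = label

def resolve_doc(pool, topic_id, rank):
    """Map a (topic_id, rank) key back to a document via a given pool file."""
    return pool[topic_id][rank]

# An assessor judges rank 0 of the RND file for topic T001, i.e. docA.
labels = {}
record_label_buggy(labels, "T001", 0, "relevant")

# If the stored (topic, rank) key is later resolved against the PRI
# ordering, the label is silently attached to a different document.
assert resolve_doc(rnd_pool, "T001", 0) == "docA"  # the document actually judged
assert resolve_doc(pri_pool, "T001", 0) == "docC"  # the document the bug credits

# Order-independent fix: key labels by (topic_id, doc_id) instead of rank,
# so both pool orderings resolve to the same judgement.
fixed_labels = {("T001", "docA"): "relevant"}
assert fixed_labels[("T001", "docA")] == "relevant"
```

Because the two pool files permute the same documents, any key built from rank alone conflates them; keying by document ID makes the label invariant to the assessor-facing ordering.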