Paper Title
TyDi QA: A Benchmark for Information-Seeking Question Answering in Typologically Diverse Languages
Paper Authors
Paper Abstract
Confidently making progress on multilingual modeling requires challenging, trustworthy evaluations. We present TyDi QA, a question answering dataset covering 11 typologically diverse languages with 204K question-answer pairs. The languages of TyDi QA are diverse with regard to their typology (the set of linguistic features each language expresses), such that we expect models performing well on this set to generalize across a large number of the world's languages. We present a quantitative analysis of the data quality and example-level qualitative linguistic analyses of observed language phenomena that would not be found in English-only corpora. To provide a realistic information-seeking task and avoid priming effects, questions are written by people who want to know the answer but do not yet know it, and the data is collected directly in each language without the use of translation.