Paper Title
QRelScore: Better Evaluating Generated Questions with Deeper Understanding of Context-aware Relevance
Paper Authors
Paper Abstract
Existing metrics for assessing question generation not only require costly human references but also fail to take into account the input context of generation, resulting in a lack of deep understanding of the relevance between the generated questions and their input contexts. As a result, they may wrongly penalize a legitimate and reasonable candidate question when it (i) involves complicated reasoning over the context or (ii) can be grounded by multiple pieces of evidence in the context. In this paper, we propose $\textbf{QRelScore}$, a context-aware $\underline{\textbf{Rel}}$evance evaluation metric for $\underline{\textbf{Q}}$uestion Generation. Based on off-the-shelf language models such as BERT and GPT2, QRelScore employs both word-level hierarchical matching and sentence-level prompt-based generation to cope with complicated reasoning and diverse generation from multiple pieces of evidence, respectively. Compared with existing metrics, our experiments demonstrate that QRelScore achieves a higher correlation with human judgments while being much more robust to adversarial samples.
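To make the "word-level matching" idea concrete, here is a minimal, hedged sketch of the general technique: greedily matching each question token's contextual embedding against the context's token embeddings and averaging the best cosine similarities (in the spirit of BERTScore-style soft matching). This is an illustrative simplification, not the paper's actual QRelScore formulation, and the toy vectors below merely stand in for real BERT outputs.

```python
import numpy as np

def greedy_match_score(question_emb: np.ndarray, context_emb: np.ndarray) -> float:
    """For each question-token embedding, take its best cosine-similarity
    match among the context-token embeddings, then average the maxima.
    A higher score suggests the question's tokens are well grounded in
    the context (a recall-style relevance signal)."""
    # L2-normalize rows so dot products become cosine similarities.
    q = question_emb / np.linalg.norm(question_emb, axis=1, keepdims=True)
    c = context_emb / np.linalg.norm(context_emb, axis=1, keepdims=True)
    sim = q @ c.T                         # pairwise cosine-similarity matrix
    return float(sim.max(axis=1).mean())  # greedy best match per question token

# Toy example: 3 "question" tokens vs. 4 "context" tokens in a 5-dim space.
rng = np.random.default_rng(0)
ctx = rng.normal(size=(4, 5))
ques = ctx[:3] + 0.01 * rng.normal(size=(3, 5))  # nearly identical embeddings
score = greedy_match_score(ques, ctx)
print(round(score, 3))  # near 1.0, since the question tokens echo the context
```

In practice the embeddings would come from a pretrained encoder such as BERT, and QRelScore additionally applies hierarchical matching and a GPT2-based sentence-level component, which this sketch omits.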