Paper title
Reranking Overgenerated Responses for End-to-End Task-Oriented Dialogue Systems
Paper authors
Abstract
End-to-end (E2E) task-oriented dialogue (ToD) systems are prone to fall into the so-called "likelihood trap", resulting in generated responses which are dull, repetitive, and often inconsistent with dialogue history. Comparing ranked lists of multiple generated responses against the "gold response" (from evaluation data) reveals a wide diversity in response quality, with many good responses placed lower in the ranked list. The main challenge, addressed in this work, is then how to reach beyond greedily generated system responses, that is, how to obtain and select such high-quality responses from the list of overgenerated responses at inference without availability of the gold response. To this end, we propose a simple yet effective reranking method which aims to select high-quality items from the lists of responses initially overgenerated by the system. The idea is to use any sequence-level (similarity) scoring function to divide the semantic space of responses into high-scoring versus low-scoring partitions. At training, the high-scoring partition comprises all generated responses whose similarity to the gold response is higher than the similarity of the greedy response to the gold response. At inference, the aim is to estimate the probability that each overgenerated response belongs to the high-scoring partition, given only previous dialogue history. We validate the robustness and versatility of our proposed method on the standard MultiWOZ dataset: our methods improve a state-of-the-art E2E ToD system by 2.0 BLEU, 1.6 ROUGE, and 1.3 METEOR scores, achieving new peak results. Additional experiments on the BiTOD dataset and human evaluation further ascertain the generalisability and effectiveness of the proposed framework.
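The partitioning idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract allows any sequence-level similarity function, so the unigram-overlap F1 used here (`token_f1`) is an assumed stand-in, and the names `label_candidates`, `rerank`, and `prob_high` are hypothetical. At training time, a candidate is labelled high-scoring when it is more similar to the gold response than the greedy response is; at inference, a learned estimator (represented here only by the `prob_high` callable) would score each candidate from the dialogue history alone.

```python
from collections import Counter
from typing import Callable, List, Tuple


def token_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1; an assumed stand-in for any sequence-level score."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand or not ref:
        return 0.0
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(cand), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def label_candidates(
    candidates: List[str],
    greedy: str,
    gold: str,
    score: Callable[[str, str], float] = token_f1,
) -> List[Tuple[str, int]]:
    """Training-time labels: 1 if a candidate beats the greedy response
    in similarity to the gold response, else 0."""
    threshold = score(greedy, gold)
    return [(c, int(score(c, gold) > threshold)) for c in candidates]


def rerank(
    history: str,
    candidates: List[str],
    prob_high: Callable[[str, str], float],
) -> str:
    """Inference-time selection: pick the candidate with the highest estimated
    probability of being in the high-scoring partition. No gold response is used."""
    return max(candidates, key=lambda c: prob_high(history, c))


# Toy data, invented purely for illustration.
gold = "the hotel is in the north and costs 50 pounds"
greedy = "there is a hotel"
candidates = [
    "the hotel is in the north",
    "i can help with that",
    "the hotel costs 50 pounds",
]
labeled = label_candidates(candidates, greedy, gold)
```

In this toy case the first and third candidates land in the high-scoring partition and the second does not. In a real system, `prob_high` would be a classifier trained on such labels; its architecture and training details are not specified in the abstract and are left open here.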