Paper Title
RADDLE: An Evaluation Benchmark and Analysis Platform for Robust Task-oriented Dialog Systems
Paper Authors
Paper Abstract
For task-oriented dialog systems to be maximally useful, they must be able to process conversations in a way that is (1) generalizable from a small number of training examples in new task domains, and (2) robust to user input in various styles, modalities, or domains. In pursuit of these goals, we introduce the RADDLE benchmark, a collection of corpora and tools for evaluating the performance of models across a diverse set of domains. By including tasks with limited training data, RADDLE is designed to favor and encourage models with strong generalization ability. RADDLE also includes a diagnostic checklist that facilitates detailed robustness analysis along aspects such as language variations, speech errors, unseen entities, and out-of-domain utterances. We evaluate recent state-of-the-art systems based on pre-training and fine-tuning, and find that grounded pre-training on heterogeneous dialog corpora performs better than training a separate model per domain. Overall, existing models perform less than satisfactorily in the robustness evaluation, which suggests opportunities for future improvement.
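To make the checklist idea concrete, below is a minimal sketch of how a checklist-style robustness evaluation can be organized: a model is scored on clean inputs and again on inputs perturbed along each robustness aspect. This is an illustration of the general technique only, not the RADDLE toolkit's actual API; every name, perturbation, and data structure here is a hypothetical stand-in.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical sketch of a checklist-style robustness evaluation.
# None of these names come from the RADDLE codebase.

@dataclass
class DialogExample:
    utterance: str        # user input for one turn
    expected_intent: str  # gold label for that turn

def with_speech_errors(text: str) -> str:
    """Toy perturbation: simulate an ASR-style error by dropping the last word."""
    words = text.split()
    return " ".join(words[:-1]) if len(words) > 1 else text

def with_paraphrase(text: str) -> str:
    """Toy perturbation: a trivial stand-in for language variation."""
    return "please " + text

# Each checklist aspect maps to a perturbation applied to clean inputs.
CHECKLIST: Dict[str, Callable[[str], str]] = {
    "speech_errors": with_speech_errors,
    "language_variation": with_paraphrase,
}

def evaluate(model: Callable[[str], str],
             data: List[DialogExample]) -> Dict[str, float]:
    """Report accuracy on clean inputs and under each checklist perturbation."""
    report = {
        "clean": sum(model(ex.utterance) == ex.expected_intent
                     for ex in data) / len(data)
    }
    for aspect, perturb in CHECKLIST.items():
        report[aspect] = sum(model(perturb(ex.utterance)) == ex.expected_intent
                             for ex in data) / len(data)
    return report

if __name__ == "__main__":
    data = [DialogExample("book a table for two", "restaurant_booking")]
    dummy_model = lambda text: "restaurant_booking"  # stand-in for a real dialog model
    print(evaluate(dummy_model, data))
```

Comparing the clean score against each per-aspect score exposes which kinds of input variation a model fails on, which is the sort of detailed robustness analysis the abstract describes.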