Paper Title
Learning New Skills after Deployment: Improving open-domain internet-driven dialogue with human feedback
Paper Authors
Paper Abstract
Frozen models trained to mimic static datasets can never improve their performance. Models that can employ internet retrieval for up-to-date information and obtain feedback from humans during deployment offer the promise of both adapting to new information and improving their performance. In this work, we study how to improve internet-driven conversational skills in such a learning framework. We collect deployment data of human interactions, which we make publicly available, along with various types of human feedback -- including binary quality measurements, free-form text feedback, and fine-grained reasons for failure. We then study various algorithms for improving from such feedback, including standard supervised learning, rejection sampling, model-guiding, and reward-based learning, in order to make recommendations on which types of feedback and algorithms work best. We find that the recently introduced Director model (Arora et al., '22) shows significant improvements over other existing approaches.
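Of the algorithms the abstract names, rejection sampling (best-of-n) is the simplest to illustrate: draw several candidate responses from the dialogue model and return the one a feedback-trained reward model scores highest. The sketch below is a minimal illustration, not the paper's implementation; `sample_response` and `reward` are hypothetical stand-ins for a real generator and a reward model trained on the collected human feedback.

```python
import random

# Hypothetical stand-in for a dialogue model's sampler (assumption, not the
# paper's model): draws one candidate response for the given context.
def sample_response(context: str, rng: random.Random) -> str:
    candidates = [
        "I think the answer is Paris.",
        "Paris is the capital of France.",
        "Not sure, maybe London?",
    ]
    return rng.choice(candidates)

# Hypothetical stand-in for a reward model trained on human feedback
# (e.g. binary quality labels): here, a toy heuristic preferring
# longer, more specific answers.
def reward(context: str, response: str) -> float:
    return len(response) + (5.0 if "Paris" in response else 0.0)

def rejection_sample(context: str, n: int = 8, seed: int = 0) -> str:
    """Best-of-n: draw n candidates, return the highest-reward one."""
    rng = random.Random(seed)
    candidates = [sample_response(context, rng) for _ in range(n)]
    return max(candidates, key=lambda r: reward(context, r))

print(rejection_sample("What is the capital of France?"))
```

In the paper's setting, the same best-of-n machinery would wrap a retrieval-augmented dialogue model, with the reward model distilled from the deployed system's human feedback rather than a hand-written heuristic.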