Paper Title
Controllable Dialogue Simulation with In-Context Learning
Paper Authors
Paper Abstract
Building dialogue systems requires a large corpus of annotated dialogues. Such datasets are usually created via crowdsourcing, which is expensive and time-consuming. In this paper, we propose \textsc{Dialogic}, a novel dialogue simulation method based on large language model in-context learning to automate dataset creation. Seeded with a few annotated dialogues, \textsc{Dialogic} automatically selects in-context examples for demonstration and prompts GPT-3 to generate new dialogues and annotations in a controllable way. Our method can rapidly expand a small set of dialogue data with minimal or zero \textit{human involvement} and \textit{parameter updates}, and is thus much more cost-efficient and time-saving than crowdsourcing. Experimental results on the MultiWOZ dataset demonstrate that training a model on the simulated dialogues leads to even better performance than using the same amount of human-generated dialogues under challenging low-resource settings, with as few as 85 dialogues as a seed. When enough data is available, our method can still serve as an effective data augmentation method. Human evaluation results also show that our simulated dialogues have near-human fluency and annotation accuracy. The code and data are available at \textbf{\url{https://github.com/Leezekun/dialogic}}.