Paper Title
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
Paper Authors
Paper Abstract
We apply preference modeling and reinforcement learning from human feedback (RLHF) to finetune language models to act as helpful and harmless assistants. We find this alignment training improves performance on almost all NLP evaluations, and is fully compatible with training for specialized skills such as Python coding and summarization. We explore an iterated online mode of training, where preference models and RL policies are updated on a weekly cadence with fresh human feedback data, efficiently improving our datasets and models. Finally, we investigate the robustness of RLHF training, and identify a roughly linear relation between the RL reward and the square root of the KL divergence between the policy and its initialization. Alongside our main results, we perform peripheral analyses on calibration, competing objectives, and the use of OOD detection, compare our models with human writers, and provide samples from our models using prompts appearing in recent related work.
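The sketch below is not the paper's implementation; it is a minimal, self-contained illustration of two quantitative ideas the abstract mentions, using synthetic data and hypothetical names (`preference_loss`, the toy score arrays, and the logged `(reward, kl)` points are all assumptions): (1) a pairwise (Bradley-Terry style) loss of the kind used to train a preference model from human comparisons, and (2) fitting the reported roughly linear relation between RL reward and the square root of the KL divergence from the policy's initialization.

```python
# Minimal sketch, assuming synthetic data; not the paper's actual code.
import numpy as np

def preference_loss(r_chosen: np.ndarray, r_rejected: np.ndarray) -> float:
    """Pairwise loss -log sigmoid(r_chosen - r_rejected), averaged over comparisons."""
    margin = r_chosen - r_rejected
    # logaddexp(0, -x) == -log sigmoid(x), computed in a numerically stable way.
    return float(np.mean(np.logaddexp(0.0, -margin)))

rng = np.random.default_rng(0)

# Toy preference-model scores for the chosen vs. rejected response in each comparison.
r_chosen = rng.normal(loc=1.0, scale=1.0, size=1000)
r_rejected = rng.normal(loc=0.0, scale=1.0, size=1000)
print("pairwise preference loss:", preference_loss(r_chosen, r_rejected))

# Hypothetical (KL, reward) measurements logged during RL training; the abstract
# reports reward growing roughly linearly in sqrt(KL(policy || initialization)).
kl = np.linspace(0.0, 50.0, 25)
reward = 0.8 * np.sqrt(kl) + rng.normal(scale=0.05, size=kl.size)

# Fit reward against sqrt(KL) to check the approximately linear relationship.
slope, intercept = np.polyfit(np.sqrt(kl), reward, deg=1)
print(f"fitted reward ~ {slope:.2f} * sqrt(KL) + {intercept:.2f}")
```

Under this assumed setup, a good fit of reward against sqrt(KL) (rather than KL itself) is what the abstract's observation would look like in training logs.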