Paper Title

Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned

Paper Authors

Deep Ganguli, Liane Lovitt, Jackson Kernion, Amanda Askell, Yuntao Bai, Saurav Kadavath, Ben Mann, Ethan Perez, Nicholas Schiefer, Kamal Ndousse, Andy Jones, Sam Bowman, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, Nelson Elhage, Sheer El-Showk, Stanislav Fort, Zac Hatfield-Dodds, Tom Henighan, Danny Hernandez, Tristan Hume, Josh Jacobson, Scott Johnston, Shauna Kravec, Catherine Olsson, Sam Ringer, Eli Tran-Johnson, Dario Amodei, Tom Brown, Nicholas Joseph, Sam McCandlish, Chris Olah, Jared Kaplan, Jack Clark

Paper Abstract

We describe our early efforts to red team language models in order to simultaneously discover, measure, and attempt to reduce their potentially harmful outputs. We make three main contributions. First, we investigate scaling behaviors for red teaming across 3 model sizes (2.7B, 13B, and 52B parameters) and 4 model types: a plain language model (LM); an LM prompted to be helpful, honest, and harmless; an LM with rejection sampling; and a model trained to be helpful and harmless using reinforcement learning from human feedback (RLHF). We find that the RLHF models are increasingly difficult to red team as they scale, and we find a flat trend with scale for the other model types. Second, we release our dataset of 38,961 red team attacks for others to analyze and learn from. We provide our own analysis of the data and find a variety of harmful outputs, which range from offensive language to more subtly harmful non-violent unethical outputs. Third, we exhaustively describe our instructions, processes, statistical methodologies, and uncertainty about red teaming. We hope that this transparency accelerates our ability to work together as a community in order to develop shared norms, practices, and technical standards for how to red team language models.
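
To make the abstract's "LM with rejection sampling" model type concrete, here is a minimal sketch of the general technique, not the paper's actual implementation: sample k completions from a base language model, score each with a harmlessness preference model, and return the highest-scoring one. The `generate` and `harmlessness_score` callables are hypothetical stand-ins for the base LM and the preference model.

```python
from typing import Callable, List


def rejection_sample(
    prompt: str,
    generate: Callable[[str], str],                    # samples one completion from the base LM
    harmlessness_score: Callable[[str, str], float],   # preference-model score (higher = less harmful)
    k: int = 16,
) -> str:
    """Return the completion with the highest harmlessness score among k samples."""
    candidates: List[str] = [generate(prompt) for _ in range(k)]
    scored = [(harmlessness_score(prompt, c), c) for c in candidates]
    _best_score, best_completion = max(scored, key=lambda pair: pair[0])
    return best_completion
```

In this setup the base LM is unchanged; safety comes only from filtering at sampling time, which is why the abstract contrasts it with the RLHF models whose weights are trained directly against the preference signal.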
