Paper Title
Large Language Models can Implement Policy Iteration
Paper Authors
Paper Abstract
This work presents In-Context Policy Iteration (ICPI), an algorithm for performing Reinforcement Learning (RL), in-context, using foundation models. While the application of foundation models to RL has received considerable attention, most approaches rely on either (1) the curation of expert demonstrations (either through manual design or task-specific pretraining) or (2) adaptation to the task of interest using gradient methods (either fine-tuning or training of adapter layers). Both of these techniques have drawbacks. Collecting demonstrations is labor-intensive, and algorithms that rely on them do not outperform the experts from which the demonstrations were derived. All gradient techniques are inherently slow, sacrificing the "few-shot" quality that made in-context learning attractive to begin with. In this work, we present ICPI, an algorithm that learns to perform RL tasks without expert demonstrations or gradients. Instead, we present a policy-iteration method in which the prompt content is the entire locus of learning. ICPI iteratively updates the contents of the prompt from which it derives its policy through trial-and-error interaction with an RL environment. In order to eliminate the role of in-weights learning (on which approaches like Decision Transformer rely heavily), we demonstrate our algorithm using Codex, a language model with no prior knowledge of the domains on which we evaluate it.
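As a rough illustration of the loop the abstract describes (not the paper's actual prompt formats or rollout policy), the sketch below assumes a hypothetical `llm_complete(prompt) -> str` wrapper around a language model such as Codex, an environment exposing `reset()` and `step(action) -> (next_state, reward, done)`, and a toy text format for transitions; `parse_step` and the random in-rollout policy are simplifications introduced here for brevity.

```python
import random
import re


def parse_step(completion):
    """Hypothetical parser: expects text like 'reward 1.0, next 3, done False'."""
    m = re.search(r"reward\s+(-?\d+\.?\d*),\s*next\s+(\S+),\s*done\s+(True|False)", completion)
    if m is None:
        return 0.0, None, True  # fall back to terminating the rollout
    return float(m.group(1)), m.group(2), m.group(3) == "True"


def icpi_sketch(env, actions, llm_complete, episodes=50, rollout_len=8, context=20):
    """Sketch of in-context policy iteration: the prompt, built from collected
    transitions, is the only thing that changes as learning proceeds."""
    buffer = []  # trial-and-error transitions gathered from the environment

    def prompt_for(state, action):
        # Recent transitions become in-context examples, followed by the query.
        lines = [f"{s}, {a} -> reward {r}, next {ns}, done {d}"
                 for s, a, r, ns, d in buffer[-context:]]
        lines.append(f"{state}, {action} -> ")
        return "\n".join(lines)

    def q_estimate(state, action):
        # Policy evaluation: simulate a rollout by repeatedly querying the LLM
        # as a world model and summing predicted rewards. (The paper also derives
        # the rollout policy from the prompt; a random rollout policy is used
        # here purely to keep the sketch short.)
        total, s, a = 0.0, state, action
        for _ in range(rollout_len):
            reward, next_s, done = parse_step(llm_complete(prompt_for(s, a)))
            total += reward
            if done or next_s is None:
                break
            s, a = next_s, random.choice(actions)
        return total

    def greedy(state):
        # Policy improvement: act greedily with respect to the Q estimates
        # derived from the current prompt contents.
        return max(actions, key=lambda a: q_estimate(state, a))

    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            action = greedy(state) if buffer else random.choice(actions)
            next_state, reward, done = env.step(action)
            buffer.append((state, action, reward, next_state, done))
            state = next_state
    return buffer
```

Because no model weights are updated, each pass through the outer loop improves behavior only by changing which transitions appear in the prompt, which is the "entire locus of learning" the abstract refers to.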