Title
Retrospective on the 2021 BASALT Competition on Learning from Human Feedback
Authors
Abstract
We held the first-ever MineRL Benchmark for Agents that Solve Almost-Lifelike Tasks (MineRL BASALT) Competition at the Thirty-fifth Conference on Neural Information Processing Systems (NeurIPS 2021). The goal of the competition was to promote research towards agents that use learning from human feedback (LfHF) techniques to solve open-world tasks. Rather than mandating the use of LfHF techniques, we described four tasks in natural language to be accomplished in the video game Minecraft, and allowed participants to use any approach they wanted to build agents that could accomplish the tasks. Teams developed a diverse range of LfHF algorithms across a variety of possible human feedback types. The three winning teams implemented significantly different approaches while achieving similar performance. Interestingly, their approaches performed well on different tasks, validating our choice of tasks to include in the competition. While the outcomes validated the design of our competition, we did not get as many participants and submissions as our sister competition, MineRL Diamond. We speculate about the causes of this problem and suggest improvements for future iterations of the competition.