On the importance of data collection for training general goal-reaching policies

Paper title

On the importance of data collection for training general goal-reaching policies

Authors

Alexis Jacq, Manu Orsini, Gabriel Dulac-Arnold, Olivier Pietquin, Matthieu Geist, Olivier Bachem

Abstract

Recent advances in ML suggest that the quantity of data available to a model is one of the primary bottlenecks to high performance. Although for language-based tasks there exist almost unlimited amounts of reasonably coherent data to train from, this is generally not the case for Reinforcement Learning, especially when dealing with a novel environment. In effect, even a relatively trivial continuous environment has an almost limitless number of states, but simply sampling random states and actions will likely not provide transitions that are interesting or useful for any potential downstream task. How should one generate massive amounts of useful data given only an MDP with no indication of downstream tasks? Are the quantity and quality of data truly transformative to the performance of a general controller? We propose to answer both of these questions. First, we introduce a principled unsupervised exploration method, ChronoGEM, which aims to achieve uniform coverage over the manifold of achievable states, a goal we believe is the most reasonable one given no prior task information. Second, we investigate the effects of both data quantity and data quality on the training of a downstream goal-achievement policy, and show that both a large quantity of data and high data quality are essential to train a general controller: a high-precision pose-achievement policy capable of attaining a large number of poses over numerous continuous control embodiments, including a humanoid.
