Paper Title

Learning in Sparse Rewards settings through Quality-Diversity algorithms

Paper Author

Paolo, Giuseppe

Paper Abstract

In the Reinforcement Learning (RL) framework, learning is guided by a reward signal. This means that, in situations of sparse rewards, the agent has to focus on exploration in order to discover which action, or set of actions, leads to the reward. RL agents usually struggle with this. Exploration is the focus of Quality-Diversity (QD) methods. In this thesis, we approach the problem of sparse rewards with these algorithms, and in particular with Novelty Search (NS). This is a method that focuses only on the diversity of the possible policies' behaviors. The first part of the thesis focuses on learning a representation of the space in which the diversity of the policies is evaluated. In this regard, we propose the TAXONS algorithm, a method that learns a low-dimensional representation of the search space through an AutoEncoder. While effective, TAXONS still requires information on when to capture the observation used to learn said space. For this, we study multiple ways, and in particular the signature transform, to encode information about the whole trajectory of observations. The thesis continues with the introduction of the SERENE algorithm, a method that can efficiently focus on the interesting parts of the search space. This method separates the exploration of the search space from the exploitation of the reward through a two-alternating-steps approach. The exploration is performed through NS. Any discovered reward is then locally exploited through emitters. The third and final contribution combines TAXONS and SERENE into a single approach: STAX. Throughout this thesis, we introduce methods that lower the amount of prior information needed in sparse rewards settings. These contributions are a promising step towards the development of methods that can autonomously explore and find high-performing policies in a variety of sparse rewards settings.
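For readers unfamiliar with Novelty Search, the core QD ingredient the abstract builds on, the following is a minimal Python sketch of the idea: individuals are selected for how behaviorally novel they are (mean distance to their nearest neighbors in an archive of past behavior descriptors), not for their reward. The `evaluate`, `sample_policy`, and `mutate` callbacks are hypothetical placeholders and not code from the thesis; in a TAXONS-like setup, the descriptor returned by `evaluate` would be a low-dimensional AutoEncoder representation of an observation taken from the policy's rollout.

```python
import numpy as np


def novelty(descriptor, reference_set, k=15):
    """Mean Euclidean distance from `descriptor` to its k nearest
    neighbors in `reference_set` (archive + rest of the population)."""
    if len(reference_set) == 0:
        return float("inf")
    dists = np.linalg.norm(np.asarray(reference_set) - np.asarray(descriptor), axis=1)
    k = min(k, len(dists))
    return float(np.sort(dists)[:k].mean())


def novelty_search(evaluate, sample_policy, mutate,
                   generations=100, pop_size=64, archive_additions=5, seed=0):
    """Minimal Novelty Search loop: selection is driven purely by
    behavioral novelty, never by the (possibly sparse) reward."""
    rng = np.random.default_rng(seed)
    population = [sample_policy(rng) for _ in range(pop_size)]
    archive = []  # behavior descriptors collected so far
    for _ in range(generations):
        descriptors = [evaluate(p) for p in population]
        scores = [
            novelty(d, archive + descriptors[:i] + descriptors[i + 1:])
            for i, d in enumerate(descriptors)
        ]
        order = np.argsort(scores)[::-1]  # most novel first
        # Grow the archive with the most novel behaviors of this generation.
        archive.extend(descriptors[i] for i in order[:archive_additions])
        # Keep the most novel half as parents and mutate them to refill the population.
        parents = [population[i] for i in order[:pop_size // 2]]
        population = parents + [
            mutate(parents[rng.integers(len(parents))], rng)
            for _ in range(pop_size - len(parents))
        ]
    return archive
```

This sketch covers only the exploration side; it omits the reward-exploiting emitters that, according to the abstract, SERENE alternates with the NS exploration step.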
