Paper Title
Entropy Augmented Reinforcement Learning
Paper Authors
Paper Abstract
Deep reinforcement learning was made scalable and efficient by the advent of trust region methods. However, the pessimism of such algorithms, which constrain updates to a trust region by all means, has been shown to suppress exploration and harm performance. Exploratory algorithms such as SAC, while utilizing entropy to encourage exploration, implicitly optimize a different objective. We first observe this inconsistency, and therefore put forward an analogous augmentation technique that combines well with on-policy algorithms when a value critic is involved. Surprisingly, the proposed method consistently satisfies the soft policy improvement theorem while being more extensible. As the analysis suggests, it is crucial to control the temperature coefficient to balance exploration and exploitation. Empirical tests on MuJoCo benchmark tasks show that the agent is encouraged towards higher-reward regions and achieves better performance. Furthermore, we verify the exploration bonus of our method on a set of custom environments.
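To make the abstract's idea concrete, below is a minimal sketch of an entropy-augmented on-policy actor-critic update, assuming the augmentation takes the SAC-style form of adding a temperature-scaled policy entropy bonus to the per-step reward used by the value critic. The names (`policy`, `value_fn`, `alpha`) and the exact form of the bonus are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the paper's code): an entropy-augmented on-policy
# actor-critic update. The augmentation form and the names used here
# (policy, value_fn, alpha) are assumptions for illustration only.
import torch
import torch.nn as nn
from torch.distributions import Categorical

obs_dim, n_actions = 4, 2
alpha, gamma = 0.01, 0.99  # alpha: temperature coefficient (assumed value)

policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
value_fn = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))
optim = torch.optim.Adam(
    list(policy.parameters()) + list(value_fn.parameters()), lr=3e-4
)

def update(obs, actions, rewards, next_obs, dones):
    dist = Categorical(logits=policy(obs))
    log_prob = dist.log_prob(actions)
    entropy = dist.entropy()

    # Entropy-augmented target: the per-step reward is bonused with the
    # policy entropy, scaled by the temperature alpha (assumed form).
    with torch.no_grad():
        aug_reward = rewards + alpha * entropy
        target = aug_reward + gamma * (1.0 - dones) * value_fn(next_obs).squeeze(-1)

    value = value_fn(obs).squeeze(-1)
    advantage = (target - value).detach()

    policy_loss = -(log_prob * advantage).mean()   # policy gradient step
    value_loss = (target - value).pow(2).mean()    # critic regression
    loss = policy_loss + 0.5 * value_loss

    optim.zero_grad()
    loss.backward()
    optim.step()
    return loss.item()
```

As the abstract notes, the temperature `alpha` controls the exploration/exploitation trade-off in this sketch: a larger value rewards high-entropy behavior more strongly, while `alpha = 0` recovers the plain on-policy actor-critic update.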