Paper Title


Discrete Action On-Policy Learning with Action-Value Critic

Paper Authors

Yuguang Yue, Yunhao Tang, Mingzhang Yin, Mingyuan Zhou

Paper Abstract


Reinforcement learning (RL) in discrete action space is ubiquitous in real-world applications, but its complexity grows exponentially with the action-space dimension, making it challenging to apply existing on-policy gradient based deep RL algorithms efficiently. To effectively operate in multidimensional discrete action spaces, we construct a critic to estimate action-value functions, apply it on correlated actions, and combine these critic estimated action values to control the variance of gradient estimation. We follow rigorous statistical analysis to design how to generate and combine these correlated actions, and how to sparsify the gradients by shutting down the contributions from certain dimensions. These efforts result in a new discrete action on-policy RL algorithm that empirically outperforms related on-policy algorithms relying on variance control techniques. We demonstrate these properties on OpenAI Gym benchmark tasks, and illustrate how discretizing the action space could benefit the exploration phase and hence facilitate convergence to a better local optimal solution thanks to the flexibility of discrete policy.
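The abstract's core idea is a score-function policy gradient for a multidimensional discrete policy in which an action-value critic, evaluated on correlated action samples, supplies a control variate that reduces gradient variance. The following is a minimal, hypothetical sketch of that general recipe, not the authors' exact estimator: the critic, the mean-of-critic-values baseline, and all names (critic_q, policy_logits, n_samples) are illustrative assumptions.

```python
# Sketch: variance-reduced policy gradient for a factorized discrete policy,
# using an action-value critic as a control variate. Assumed, not the paper's
# exact construction of correlated actions or their combination weights.
import numpy as np

rng = np.random.default_rng(0)
D, K = 4, 5  # action dimensions, categories per dimension

def softmax(x):
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def critic_q(state, action):
    # Stand-in critic: in practice a learned Q(s, a); here a fixed toy map.
    return float(np.sin(action @ np.arange(1, D + 1) + state))

def policy_gradient(state, policy_logits, n_samples=16):
    """Monte Carlo gradient of E_pi[Q(s, a)] w.r.t. the per-dimension logits,
    with the mean critic value over the sampled actions as a simple baseline
    (the baseline choice here is an assumption for illustration)."""
    probs = softmax(policy_logits)                      # (D, K)
    actions = np.array([[rng.choice(K, p=probs[d]) for d in range(D)]
                        for _ in range(n_samples)])     # (n_samples, D)
    q_vals = np.array([critic_q(state, a) for a in actions])
    baseline = q_vals.mean()                            # control variate
    grad = np.zeros_like(policy_logits)
    for a, q in zip(actions, q_vals):
        for d in range(D):
            score = -probs[d].copy()
            score[a[d]] += 1.0                          # grad of log pi_d(a_d)
            grad[d] += (q - baseline) * score
    return grad / n_samples

g = policy_gradient(state=0.3, policy_logits=np.zeros((D, K)))
print(g.shape)  # (4, 5)
```

In the paper's setting, the "sparsify the gradients" step would correspond to zeroing the contribution of selected dimensions d before averaging; the sketch above omits that selection rule since the abstract does not specify it.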
