Title
Exact Reduction of Huge Action Spaces in General Reinforcement Learning
Authors
Abstract
The reinforcement learning (RL) framework formalizes the notion of learning with interactions. Many real-world problems have large state-spaces and/or action-spaces, such as in Go, StarCraft, protein folding, and robotics, or are non-Markovian, which causes significant challenges to RL algorithms. In this work we address the large action-space problem by sequentializing actions, which can reduce the action-space size significantly, even down to two actions, at the expense of an increased planning horizon. We provide explicit and exact constructions and equivalence proofs for all quantities of interest for arbitrary history-based processes. In the case of MDPs, this could help RL algorithms that bootstrap. In this work we show how action-binarization in the non-MDP case can significantly improve Extreme State Aggregation (ESA) bounds. ESA allows casting any (non-MDP, non-ergodic, history-based) RL problem into a fixed-sized non-Markovian state-space with the help of a surrogate Markovian process. On the upside, ESA enjoys similar optimality guarantees as Markovian models do. But a downside is that the size of the aggregated state-space becomes exponential in the size of the action-space. In this work, we patch this issue by binarizing the action-space. We provide an upper bound on the number of states of this binarized ESA that is logarithmic in the original action-space size, a double-exponential improvement.
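To make the sequentialization idea concrete, below is a minimal illustrative sketch, not the paper's construction: the function names, the fixed-length bit encoding, and the clamping of unused codes are assumptions for illustration only. It encodes each of A original actions as a sequence of ceil(log2 A) binary sub-actions, so a surrogate agent that chooses between only two actions, over a horizon stretched by that factor, can reproduce any original action.

```python
# Illustrative sketch (assumed encoding, not the paper's exact construction):
# sequentialize a finite action space of size A into binary decisions.
import math

def binarize_action(a: int, num_actions: int) -> list[int]:
    """Encode action index `a` as a fixed-length sequence of binary sub-actions."""
    num_bits = max(1, math.ceil(math.log2(num_actions)))
    return [(a >> i) & 1 for i in reversed(range(num_bits))]

def debinarize_action(bits: list[int], num_actions: int) -> int:
    """Decode a completed sequence of binary sub-actions back to an action index."""
    a = 0
    for b in bits:
        a = (a << 1) | b
    # Unused codes >= num_actions are clamped to a valid action (an assumption
    # made here for simplicity; the paper may handle padding differently).
    return min(a, num_actions - 1)

if __name__ == "__main__":
    A = 18  # e.g. an environment with 18 primitive actions
    for a in range(A):
        assert debinarize_action(binarize_action(a, A), A) == a
    print(f"{A} actions encoded with {len(binarize_action(0, A))} binary steps each")
```

Under such an encoding the effective action space has size two and the planning horizon grows by a factor of ceil(log2 A); the abstract's claim is that the resulting binarized ESA state bound then depends only logarithmically on the original action-space size, rather than exponentially, which is the stated double-exponential improvement.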