Paper Title
Is Long Horizon Reinforcement Learning More Difficult Than Short Horizon Reinforcement Learning?
Paper Authors
Paper Abstract
Learning to plan for long horizons is a central challenge in episodic reinforcement learning problems. A fundamental question is to understand how the difficulty of the problem scales as the horizon increases. Here, the natural measure of sample complexity is a normalized one: we are interested in the number of episodes it takes to provably discover a policy whose value is within $\varepsilon$ of the optimal value, where the value is measured by the normalized cumulative reward in each episode. In a COLT 2018 open problem, Jiang and Agarwal conjectured that, for tabular, episodic reinforcement learning problems, there exists a sample complexity lower bound which exhibits a polynomial dependence on the horizon -- a conjecture which is consistent with all known sample complexity upper bounds. This work refutes this conjecture, proving that tabular, episodic reinforcement learning is possible with a sample complexity that scales only logarithmically with the planning horizon. In other words, when the values are appropriately normalized (to lie in the unit interval), this result shows that long horizon RL is no more difficult than short horizon RL, at least in a minimax sense. Our analysis introduces two ideas: (i) the construction of an $\varepsilon$-net for optimal policies whose log-covering number scales only logarithmically with the planning horizon, and (ii) the Online Trajectory Synthesis algorithm, which adaptively evaluates all policies in a given policy class using a sample complexity that scales with the log-covering number of the given policy class. Both may be of independent interest.
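For concreteness, one common way to formalize the normalized objective described in the abstract is sketched below. The notation ($H$ for the planning horizon, $r_h$ for the per-step reward, $\hat{\pi}$ for the learned policy, $\pi^\star$ for an optimal policy) is our own shorthand and is not fixed by the abstract itself; the key assumption is that rewards are rescaled so each episode's cumulative reward lies in the unit interval.

% Sketch of the normalized episodic objective, assuming cumulative rewards are rescaled to [0, 1].
\[
V^{\pi} \;=\; \mathbb{E}_{\pi}\!\Big[\textstyle\sum_{h=1}^{H} r_h\Big] \in [0,1],
\qquad
\text{goal: find } \hat{\pi} \text{ such that } V^{\hat{\pi}} \;\ge\; V^{\pi^\star} - \varepsilon .
\]

Under this normalization, the sample complexity is the number of episodes needed to guarantee the condition on the right, and the result above states that this number need only grow logarithmically in $H$.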