Paper Title
Stochastic Gradient Descent with Dependent Data for Offline Reinforcement Learning
Paper Authors
Paper Abstract
In reinforcement learning (RL), offline learning decouples learning from data collection, helps in dealing with the exploration-exploitation tradeoff, and enables data reuse in many applications. In this work, we study two offline learning tasks: policy evaluation and policy learning. For policy evaluation, we formulate it as a stochastic optimization problem and show that it can be solved using approximate stochastic gradient descent (aSGD) with time-dependent data. We show that aSGD achieves $\tilde O(1/t)$ convergence when the loss function is strongly convex, and that the rate is independent of the discount factor $\gamma$. This result can be extended to include algorithms that make approximately contractive iterations, such as TD(0). The policy evaluation algorithm is then combined with the policy iteration algorithm to learn the optimal policy. To achieve an accuracy of $\epsilon$, the complexity of the algorithm is $\tilde O(\epsilon^{-2}(1-\gamma)^{-5})$, which matches the complexity bound of classic online RL algorithms such as Q-learning.
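To make the policy-evaluation step more concrete, the following is a minimal sketch of TD(0) with linear value-function approximation run on a single dependent (non-i.i.d.) data stream, i.e., the kind of approximately contractive, SGD-like iteration the abstract refers to. It is not the paper's exact algorithm, and the names `trajectory`, `phi`, `dim`, and `alpha0` are illustrative assumptions rather than the paper's notation.

```python
import numpy as np

def td0_policy_evaluation(trajectory, phi, dim, gamma=0.9, alpha0=1.0):
    """TD(0) with linear approximation V(s) ~ phi(s) @ theta.

    `trajectory` is a single dependent stream of (state, reward, next_state)
    tuples generated by the policy being evaluated; `phi` maps a state to a
    feature vector of length `dim`. The step sizes alpha_t = alpha0 / (t + 1)
    are the kind of decaying schedule associated with O(1/t)-type rates.
    """
    theta = np.zeros(dim)
    for t, (s, r, s_next) in enumerate(trajectory):
        alpha = alpha0 / (t + 1)
        # Temporal-difference error, used as an approximate gradient signal.
        delta = r + gamma * phi(s_next) @ theta - phi(s) @ theta
        theta += alpha * delta * phi(s)
    return theta
```

With a tabular feature map (`phi(s)` returning a one-hot vector), the iterate `theta` is simply a per-state value estimate; the same update also covers generic linear features.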