Paper Title

Agnostic Q-learning with Function Approximation in Deterministic Systems: Tight Bounds on Approximation Error and Sample Complexity

Paper Authors

Du, Simon S., Lee, Jason D., Mahajan, Gaurav, Wang, Ruosong

Paper Abstract

The current paper studies the problem of agnostic $Q$-learning with function approximation in deterministic systems, where the optimal $Q$-function is approximable by a function in the class $\mathcal{F}$ with approximation error $\delta \ge 0$. We propose a novel recursion-based algorithm and show that if $\delta = O\left(\rho/\sqrt{\dim_E}\right)$, then one can find the optimal policy using $O\left(\dim_E\right)$ trajectories, where $\rho$ is the gap between the optimal $Q$-value of the best actions and that of the second-best actions, and $\dim_E$ is the Eluder dimension of $\mathcal{F}$. Our result has two implications: 1) In conjunction with the lower bound in [Du et al., ICLR 2020], our upper bound suggests that the condition $\delta = \widetilde{\Theta}\left(\rho/\sqrt{\dim_E}\right)$ is necessary and sufficient for algorithms with polynomial sample complexity. 2) In conjunction with the lower bound in [Wen and Van Roy, NIPS 2013], our upper bound suggests that the sample complexity $\widetilde{\Theta}\left(\dim_E\right)$ is tight even in the agnostic setting. Therefore, we settle the open problem on agnostic $Q$-learning proposed in [Wen and Van Roy, NIPS 2013]. We further extend our algorithm to the stochastic reward setting and obtain similar results.
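For quick reference, the main guarantee stated in the abstract can be summarized as a single implication. The sup-norm formalization of the approximation error below is an assumption made here for concreteness, and constants and logarithmic factors are left implicit:

\[
\min_{f \in \mathcal{F}} \left\| f - Q^{*} \right\|_{\infty} \le \delta
\quad\text{and}\quad
\delta = O\!\left(\frac{\rho}{\sqrt{\dim_E}}\right)
\;\Longrightarrow\;
\text{the optimal policy is found within } O\!\left(\dim_E\right) \text{ trajectories,}
\]

where $\rho$ is the gap between the optimal $Q$-values of the best and second-best actions and $\dim_E$ is the Eluder dimension of $\mathcal{F}$. Per the abstract, the matching lower bounds show that the threshold $\rho/\sqrt{\dim_E}$ on $\delta$ and the trajectory count $\dim_E$ are both tight up to logarithmic factors.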
