Paper Title
On the Near-Optimality of Local Policies in Large Cooperative Multi-Agent Reinforcement Learning
Paper Authors
Paper Abstract
We show that in a cooperative $N$-agent network, one can design locally executable policies for the agents such that the resulting discounted sum of average rewards (value) closely approximates the optimal value computed over all (including non-local) policies. Specifically, we prove that, if $|\mathcal{X}|$ and $|\mathcal{U}|$ denote the sizes of the state and action spaces of individual agents, then for a sufficiently small discount factor, the approximation error is $\mathcal{O}(e)$ where $e\triangleq \frac{1}{\sqrt{N}}\left[\sqrt{|\mathcal{X}|}+\sqrt{|\mathcal{U}|}\right]$. Moreover, in a special case where the reward and state transition functions are independent of the action distribution of the population, the error improves to $\mathcal{O}(e)$ where $e\triangleq \frac{1}{\sqrt{N}}\sqrt{|\mathcal{X}|}$. Finally, we also devise an algorithm to explicitly construct a local policy. With the help of our approximation results, we further establish that the constructed local policy is within $\mathcal{O}(\max\{e,\epsilon\})$ distance of the optimal policy, and that the sample complexity to achieve such a local policy is $\mathcal{O}(\epsilon^{-3})$ for any $\epsilon>0$.
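As an illustration only (not part of the paper), the following minimal Python sketch evaluates the general-case bound $e = \frac{1}{\sqrt{N}}\left[\sqrt{|\mathcal{X}|}+\sqrt{|\mathcal{U}|}\right]$ for hypothetical values of $N$, $|\mathcal{X}|$, and $|\mathcal{U}|$, showing how the approximation error shrinks at rate $1/\sqrt{N}$ as the number of agents grows.

import math

# Sketch of the abstract's general-case bound: e = (sqrt(|X|) + sqrt(|U|)) / sqrt(N).
# The per-agent state/action sizes and the agent counts below are hypothetical,
# chosen only to illustrate the scaling; they do not come from the paper.
def error_bound(num_agents: int, state_size: int, action_size: int) -> float:
    """Approximation error bound e for N agents, each with |X| states and |U| actions."""
    return (math.sqrt(state_size) + math.sqrt(action_size)) / math.sqrt(num_agents)

for num_agents in (10, 100, 1000, 10000):
    e = error_bound(num_agents, state_size=10, action_size=5)
    print(f"N={num_agents:>5}: e ~ {e:.4f}")  # bound decays as 1/sqrt(N)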