Paper Title
Best-of-Both-Worlds Algorithms for Partial Monitoring
Paper Authors
Paper Abstract
This study considers the partial monitoring problem with $k$ actions and $d$ outcomes and provides the first best-of-both-worlds algorithms, whose regrets are favorably bounded both in the stochastic and adversarial regimes. In particular, we show that for non-degenerate locally observable games, the regret is $O(m^2 k^4 \log(T) \log(k_Π T) / Δ_{\min})$ in the stochastic regime and $O(m k^{2/3} \sqrt{T \log(T) \log k_Π})$ in the adversarial regime, where $T$ is the number of rounds, $m$ is the maximum number of distinct observations per action, $Δ_{\min}$ is the minimum suboptimality gap, and $k_Π$ is the number of Pareto optimal actions. Moreover, we show that for globally observable games, the regret is $O(c_{\mathcal{G}}^2 \log(T) \log(k_Π T) / Δ_{\min}^2)$ in the stochastic regime and $O((c_{\mathcal{G}}^2 \log(T) \log(k_Π T))^{1/3} T^{2/3})$ in the adversarial regime, where $c_{\mathcal{G}}$ is a game-dependent constant. We also provide regret bounds for a stochastic regime with adversarial corruptions. Our algorithms are based on the follow-the-regularized-leader framework and are inspired by the approach of exploration by optimization and the adaptive learning rate in the field of online learning with feedback graphs.
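The abstract does not spell out the algorithm, but the follow-the-regularized-leader (FTRL) framework it builds on is standard. As generic background only (not the paper's method, and ignoring the partial-monitoring feedback structure), a minimal sketch of FTRL over $k$ actions with a negative-entropy regularizer, which reduces to exponential weights with a time-varying learning rate, might look like this; the toy loop, learning-rate schedule, and loss means are illustrative assumptions:

```python
import numpy as np

def ftrl_neg_entropy(cum_losses, eta):
    """One FTRL step with a negative-entropy regularizer.

    With this regularizer, the FTRL minimizer over the probability
    simplex is exponential weighting of cumulative loss estimates.
    """
    logits = -eta * cum_losses
    logits -= logits.max()        # shift for numerical stability
    w = np.exp(logits)
    return w / w.sum()

# Toy full-information run (hypothetical setup): 3 actions,
# action 0 has the smallest mean loss.
rng = np.random.default_rng(0)
k, T = 3, 2000
mean_losses = np.array([0.2, 0.5, 0.5])
cum = np.zeros(k)
for t in range(1, T + 1):
    eta = np.sqrt(np.log(k) / t)          # a standard decreasing rate
    p = ftrl_neg_entropy(cum, eta)
    cum += rng.binomial(1, mean_losses)   # observe Bernoulli losses
# Probability mass should concentrate on the best action (index 0).
```

The paper's contribution lies elsewhere: in choosing the learning rate adaptively and combining FTRL with exploration by optimization so that a single algorithm is near-optimal in both the stochastic and adversarial regimes; the sketch above only fixes the FTRL skeleton those ideas plug into.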