Paper Title
Best-of-Both-Worlds Algorithms for Partial Monitoring
Paper Authors
Paper Abstract
This study considers the partial monitoring problem with $k$ actions and $d$ outcomes and provides the first best-of-both-worlds algorithms, whose regrets are favorably bounded both in the stochastic and adversarial regimes. In particular, we show that for non-degenerate locally observable games, the regret is $O(m^2 k^4 \log(T) \log(k_Π T) / Δ_{\min})$ in the stochastic regime and $O(m k^{2/3} \sqrt{T \log(T) \log k_Π})$ in the adversarial regime, where $T$ is the number of rounds, $m$ is the maximum number of distinct observations per action, $Δ_{\min}$ is the minimum suboptimality gap, and $k_Π$ is the number of Pareto optimal actions. Moreover, we show that for globally observable games, the regret is $O(c_{\mathcal{G}}^2 \log(T) \log(k_Π T) / Δ_{\min}^2)$ in the stochastic regime and $O((c_{\mathcal{G}}^2 \log(T) \log(k_Π T))^{1/3} T^{2/3})$ in the adversarial regime, where $c_{\mathcal{G}}$ is a game-dependent constant. We also provide regret bounds for a stochastic regime with adversarial corruptions. Our algorithms are based on the follow-the-regularized-leader framework and are inspired by the approach of exploration by optimization and the adaptive learning rate in the field of online learning with feedback graphs.
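The abstract does not spell out the algorithm, but the follow-the-regularized-leader (FTRL) framework it builds on is standard. As generic background only (not the paper's method, and ignoring the partial-monitoring feedback structure), a minimal sketch of FTRL over $k$ actions with a negative-entropy regularizer, which reduces to exponential weights with a time-varying learning rate, might look like this; the toy loop, learning-rate schedule, and loss means are illustrative assumptions:

```python
import numpy as np

def ftrl_neg_entropy(cum_losses, eta):
    """One FTRL step with a negative-entropy regularizer.

    With this regularizer, the FTRL minimizer over the probability
    simplex is exponential weighting of cumulative loss estimates.
    """
    logits = -eta * cum_losses
    logits -= logits.max()        # shift for numerical stability
    w = np.exp(logits)
    return w / w.sum()

# Toy full-information run (hypothetical setup): 3 actions,
# action 0 has the smallest mean loss.
rng = np.random.default_rng(0)
k, T = 3, 2000
mean_losses = np.array([0.2, 0.5, 0.5])
cum = np.zeros(k)
for t in range(1, T + 1):
    eta = np.sqrt(np.log(k) / t)          # a standard decreasing rate
    p = ftrl_neg_entropy(cum, eta)
    cum += rng.binomial(1, mean_losses)   # observe Bernoulli losses
# Probability mass should concentrate on the best action (index 0).
```

The paper's contribution lies elsewhere: in choosing the learning rate adaptively and combining FTRL with exploration by optimization so that a single algorithm is near-optimal in both the stochastic and adversarial regimes; the sketch above only fixes the FTRL skeleton those ideas plug into.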