Paper Title
Q-greedyUCB: a New Exploration Policy for Adaptive and Resource-efficient Scheduling
Paper Authors
Paper Abstract
This paper proposes a learning algorithm to find a scheduling policy that achieves an optimal delay-power trade-off in communication systems. Reinforcement learning (RL) is used to minimize the expected latency under a given energy constraint in settings where the environment, such as the traffic arrival rate or the channel condition, can change over time. To this end, the problem is formulated as an infinite-horizon Markov Decision Process (MDP) with constraints, and the constrained optimization problem is handled via the Lagrangian relaxation technique. We then propose Q-greedyUCB, a variant of Q-learning that combines an \emph{average}-reward Q-learning algorithm with the Upper Confidence Bound (UCB) exploration policy to solve this decision-making problem. We prove through mathematical analysis that the Q-greedyUCB algorithm converges. Simulation results show that Q-greedyUCB finds an optimal scheduling strategy and is more efficient than Q-learning with $\varepsilon$-greedy exploration and the Average-payoff RL algorithm in terms of both the cumulative reward (i.e., the weighted sum of delay and energy) and the convergence speed. We also show that our algorithm reduces the regret by up to 12% compared to Q-learning with $\varepsilon$-greedy exploration and the Average-payoff RL algorithm.
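To make the abstract's ingredients concrete, the sketch below shows one plausible way to combine a Lagrangian-relaxed reward (a weighted sum of delay and energy with multiplier lam), an average-reward Q-learning update, and UCB action selection. This is a minimal illustrative sketch only: the class name, step sizes, exploration constant, and the exact update rules are assumptions for exposition and are not taken from the paper.

```python
import math
import numpy as np

# Hypothetical sketch: average-reward Q-learning with UCB exploration,
# in the spirit of the Q-greedyUCB idea described in the abstract.
# State/action spaces, the Lagrange multiplier lam, and all step sizes
# are placeholder assumptions, not the paper's exact formulation.
class QGreedyUCBSketch:
    def __init__(self, n_states, n_actions, alpha=0.1, beta=0.01, c=2.0, lam=0.5):
        self.Q = np.zeros((n_states, n_actions))  # relative action-value estimates
        self.N = np.ones((n_states, n_actions))   # visit counts (start at 1 to avoid div by zero)
        self.rho = 0.0                             # estimate of the long-run average reward
        self.t = 1                                 # global time step
        self.alpha, self.beta, self.c, self.lam = alpha, beta, c, lam

    def select_action(self, s):
        # UCB rule: value estimate plus an exploration bonus that shrinks
        # as a state-action pair is visited more often.
        bonus = self.c * np.sqrt(math.log(self.t) / self.N[s])
        return int(np.argmax(self.Q[s] + bonus))

    def reward(self, delay, energy):
        # Lagrangian-relaxed objective: weighted sum of delay and energy,
        # negated because the agent maximizes reward while minimizing cost.
        return -(delay + self.lam * energy)

    def update(self, s, a, r, s_next):
        # Average-reward (relative value) Q-learning update.
        td_error = r - self.rho + np.max(self.Q[s_next]) - self.Q[s, a]
        self.Q[s, a] += self.alpha * td_error
        # Track the average-reward estimate with a slower step size.
        self.rho += self.beta * td_error
        self.N[s, a] += 1
        self.t += 1
```

In this kind of scheme, exploration is driven by the visit-count bonus rather than by $\varepsilon$-greedy randomization, which is the distinction the abstract's comparison hinges on; the actual convergence proof and parameter choices are in the paper itself.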