Paper Title

POMRL: No-Regret Learning-to-Plan with Increasing Horizons

Authors

Khimya Khetarpal, Claire Vernade, Brendan O'Donoghue, Satinder Singh, Tom Zahavy

Abstract

We study the problem of planning under model uncertainty in an online meta-reinforcement learning (RL) setting where an agent is presented with a sequence of related tasks with limited interactions per task. The agent can use its experience in each task and across tasks to estimate both the transition model and the distribution over tasks. We propose an algorithm to meta-learn the underlying structure across tasks, utilize it to plan in each task, and upper-bound the regret of the planning loss. Our bound suggests that the average regret over tasks decreases as the number of tasks increases and as the tasks become more similar. In the classical single-task setting, it is known that the planning horizon should depend on the estimated model's accuracy, that is, on the number of samples within the task. We generalize this finding to meta-RL and study how the planning horizon should depend on the number of tasks. Based on our theoretical findings, we derive heuristics for selecting slowly increasing discount factors and validate their significance empirically.
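The abstract's key prescription is that the discount factor, and hence the effective planning horizon 1/(1 - gamma), should grow slowly as more tasks are observed, because the meta-learned model estimate becomes more accurate. The sketch below only illustrates that idea; the square-root schedule, the constant c, the cap gamma_max, and the function name discount_schedule are assumptions for illustration and are not the schedule derived in the paper.

```python
import numpy as np

def discount_schedule(task_index, n_samples_per_task, c=1.0, gamma_max=0.99):
    """Hypothetical slowly increasing discount schedule (illustrative only).

    More observed tasks -> more data for the meta-learned model -> a larger
    discount factor, i.e. a longer effective planning horizon 1/(1 - gamma),
    can be afforded. The sqrt form, c, and gamma_max are assumptions; the
    paper's actual heuristic is not specified in the abstract.
    """
    gamma = 1.0 - c / np.sqrt(max(1, task_index) * n_samples_per_task)
    return float(np.clip(gamma, 0.0, gamma_max))

# Example: the effective planning horizon lengthens over a sequence of tasks.
for k in [1, 5, 20, 100]:
    g = discount_schedule(k, n_samples_per_task=10)
    print(f"task {k:3d}: gamma = {g:.3f}, horizon ~ {1.0 / (1.0 - g):.1f}")
```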
