Paper Title

Objective Mismatch in Model-based Reinforcement Learning

Authors

Nathan Lambert, Brandon Amos, Omry Yadan, Roberto Calandra

Abstract

Model-based reinforcement learning (MBRL) has been shown to be a powerful framework for data-efficiently learning control of continuous tasks. Recent work in MBRL has mostly focused on using more advanced function approximators and planning schemes, with little development of the general framework. In this paper, we identify a fundamental issue of the standard MBRL framework -- what we call the objective mismatch issue. Objective mismatch arises when one objective is optimized in the hope that a second, often uncorrelated, metric will also be optimized. In the context of MBRL, we characterize the objective mismatch between training the forward dynamics model w.r.t. the likelihood of the one-step ahead prediction, and the overall goal of improving performance on a downstream control task. For example, this issue can emerge with the realization that dynamics models effective for a specific task do not necessarily need to be globally accurate, and vice versa, globally accurate models might not be sufficiently accurate locally to obtain good control performance on a specific task. In our experiments, we study this objective mismatch issue and demonstrate that the likelihood of one-step ahead predictions is not always correlated with control performance. This observation highlights a critical limitation in the MBRL framework which will require further research to be fully understood and addressed. We propose an initial method to mitigate the mismatch issue by re-weighting dynamics model training. Building on it, we conclude with a discussion about other potential directions of research for addressing this issue.
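To make the two objectives in the abstract concrete, here is a minimal toy sketch (not the authors' implementation): a linear dynamics model is fit once with a uniform one-step prediction loss, and once with a re-weighted loss that up-weights transitions near a task-relevant state, in the spirit of "re-weighting dynamics model training." The goal state and the exponential weighting scheme are hypothetical choices for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A = 4, 2                                   # state / action dimensions
W_true = rng.normal(size=(S, S + A))          # ground-truth linear dynamics

# Synthetic transitions (s, a) -> s' with small observation noise
states = rng.normal(size=(256, S))
actions = rng.normal(size=(256, A))
x = np.concatenate([states, actions], axis=1)
next_states = x @ W_true.T + 0.01 * rng.normal(size=(256, S))

def fit(x, y, w=None):
    """Weighted least squares: argmin_W sum_i w_i * ||W x_i - y_i||^2."""
    if w is None:
        w = np.ones(len(x))
    sw = np.sqrt(w)[:, None]
    W, *_ = np.linalg.lstsq(sw * x, sw * y, rcond=None)
    return W.T

# (a) Uniform one-step objective: every transition counts equally.
W_uniform = fit(x, next_states)

# (b) Re-weighted objective: up-weight transitions close to a hypothetical
# task-relevant goal state, so model capacity is spent where control needs it.
goal = np.zeros(S)
weights = np.exp(-np.linalg.norm(states - goal, axis=1))
W_weighted = fit(x, next_states, weights)

# Compare one-step error only in the task-relevant region near the goal:
# a globally fitted model and a locally re-weighted one can rank differently
# here, which is the crux of the objective mismatch.
near = np.linalg.norm(states - goal, axis=1) < 1.0
err_uniform = np.mean((x[near] @ W_uniform.T - next_states[near]) ** 2)
err_weighted = np.mean((x[near] @ W_weighted.T - next_states[near]) ** 2)
```

The sketch only contrasts the loss functions; it does not simulate a controller, so it illustrates the mechanism of the mismatch rather than reproducing the paper's experiments.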
