Scaling Marginalized Importance Sampling to High-Dimensional State-Spaces via State Abstraction

Paper Title

Scaling Marginalized Importance Sampling to High-Dimensional State-Spaces via State Abstraction

Authors

Brahma S. Pavse, Josiah P. Hanna

Abstract

We consider the problem of off-policy evaluation (OPE) in reinforcement learning (RL), where the goal is to estimate the performance of an evaluation policy, $π_e$, using a fixed dataset, $\mathcal{D}$, collected by one or more policies that may be different from $π_e$. Current OPE algorithms may produce poor OPE estimates under policy distribution shift, i.e., when the probability of a particular state-action pair occurring under $π_e$ is very different from the probability of that same pair occurring in $\mathcal{D}$ (Voloshin et al. 2021; Fu et al. 2021). In this work, we propose to improve the accuracy of OPE estimators by projecting the high-dimensional state-space into a low-dimensional state-space using concepts from the state abstraction literature. Specifically, we consider marginalized importance sampling (MIS) OPE algorithms, which compute state-action distribution correction ratios to produce their OPE estimate. In the original ground state-space, these ratios may have high variance, which may lead to high-variance OPE. However, we prove that in the lower-dimensional abstract state-space the ratios can have lower variance, resulting in lower-variance OPE. We then highlight the challenges that arise when estimating the abstract ratios from data, identify sufficient conditions to overcome these issues, and present a minimax optimization problem whose solution yields these abstract ratios. Finally, our empirical evaluation on difficult, high-dimensional state-space OPE tasks shows that the abstract ratios enable MIS OPE estimators to achieve lower mean-squared error and to be more robust to hyperparameter tuning than the ground ratios.
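The variance claim in the abstract can be illustrated with a small tabular toy. The sketch below is not the paper's method: it assumes the state-action distributions under $π_e$ and in $\mathcal{D}$ are known exactly (the paper instead estimates the ratios from data via a minimax problem), and the sizes `S`, `A`, `K`, the abstraction `phi`, and the helper names are all hypothetical. It also assumes the reward depends only on the abstract state and action, which is one simple condition under which the abstract ratios leave the estimated value unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, K = 12, 2, 3            # toy sizes: ground states, actions, abstract states
phi = np.arange(S) % K        # hypothetical abstraction: ground state -> abstract state

# State-action distributions, assumed known here purely for illustration.
d_e = rng.dirichlet(np.ones(S * A)).reshape(S, A)   # visitation under pi_e
d_b = rng.dirichlet(np.ones(S * A)).reshape(S, A)   # visitation in the dataset D

def aggregate(d, phi, K):
    """Sum ground state-action mass into each abstract state-action pair."""
    out = np.zeros((K, d.shape[1]))
    np.add.at(out, phi, d)    # unbuffered add: rows sharing an abstract state accumulate
    return out

# Ground ratios, and abstract ratios broadcast back onto the ground states.
w_ground = d_e / d_b
w_abs = (aggregate(d_e, phi, K) / aggregate(d_b, phi, K))[phi]

def var_under(d, w):
    """Variance of w when (s, a) is drawn from the distribution d."""
    mean = np.sum(d * w)
    return np.sum(d * (w - mean) ** 2)

var_ground = var_under(d_b, w_ground)
var_abs = var_under(d_b, w_abs)

# Assumption: the reward is a function of the abstract state and action only.
r = rng.uniform(size=(K, A))[phi]
true_value = np.sum(d_e * r)                 # expected reward under pi_e
value_ground = np.sum(d_b * w_ground * r)    # reweighted expectation, ground ratios
value_abs = np.sum(d_b * w_abs * r)          # reweighted expectation, abstract ratios

print(f"ratio variance: ground={var_ground:.4f}, abstract={var_abs:.4f}")
print(f"values: true={true_value:.4f}, ground={value_ground:.4f}, abstract={value_abs:.4f}")
```

In this toy, the abstract ratio is the conditional mean of the ground ratio within each abstract state-action pair, so by the law of total variance its variance under the data distribution cannot exceed that of the ground ratio, while both reweighted expectations match the true value.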
