Paper Title
GradientDICE: Rethinking Generalized Offline Estimation of Stationary Values
Paper Authors
Paper Abstract
We present GradientDICE for estimating the density ratio between the state distribution of the target policy and the sampling distribution in off-policy reinforcement learning. GradientDICE fixes several problems of GenDICE (Zhang et al., 2020), the state-of-the-art for estimating such density ratios. Namely, the optimization problem in GenDICE is not a convex-concave saddle-point problem once nonlinearity in the optimization variable parameterization is introduced to ensure positivity, so no primal-dual algorithm is guaranteed to converge or to find the desired solution. However, such nonlinearity is essential to ensure the consistency of GenDICE even with a tabular representation. This is a fundamental contradiction, resulting from GenDICE's original formulation of the optimization problem. In GradientDICE, we optimize a different objective from GenDICE by using the Perron-Frobenius theorem and eliminating GenDICE's use of divergence. Consequently, nonlinearity in parameterization is not necessary for GradientDICE, which is provably convergent under linear function approximation.
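As a toy illustration of the quantity being estimated (a sketch of the setting, not the paper's algorithm): for a finite Markov chain induced by the target policy, the stationary distribution is the Perron-Frobenius (principal left) eigenvector of the transition matrix, and the density ratio reweights samples drawn from the off-policy sampling distribution into expectations under the target's stationary distribution. The transition matrix, sampling distribution, and test function below are made up for illustration.

```python
import numpy as np

# Hypothetical 3-state Markov chain induced by the target policy.
P = np.array([[0.5, 0.4, 0.1],
              [0.2, 0.6, 0.2],
              [0.1, 0.3, 0.6]])

# Stationary distribution d_pi: the Perron-Frobenius (principal left)
# eigenvector of P, computed here by power iteration.
d_pi = np.ones(3) / 3
for _ in range(1000):
    d_pi = d_pi @ P
d_pi /= d_pi.sum()

# An off-policy sampling distribution d_mu (assumed, for illustration).
d_mu = np.array([0.5, 0.3, 0.2])

# The density ratio that DICE-style estimators target.
tau = d_pi / d_mu

# Sanity check: reweighting samples from d_mu by tau recovers
# expectations under d_pi for an arbitrary test function g.
g = np.array([1.0, -2.0, 3.0])
lhs = np.sum(d_mu * tau * g)   # E_{s ~ d_mu}[tau(s) g(s)]
rhs = np.sum(d_pi * g)         # E_{s ~ d_pi}[g(s)]
assert np.isclose(lhs, rhs)
```

In the off-policy setting the expectation over d_mu is the only one available from data, which is why estimating tau (rather than d_pi itself) is the object of interest.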