Paper Title

Variance Reduction for Score Functions Using Optimal Baselines

Paper Authors

Ronan Keane and H. Oliver Gao

Paper Abstract

Many problems involve the use of models which learn probability distributions or incorporate randomness in some way. In such problems, because computing the true expected gradient may be intractable, a gradient estimator is used to update the model parameters. When the model parameters directly affect a probability distribution, the gradient estimator will involve score function terms. This paper studies baselines, a variance reduction technique for score functions. Motivated primarily by reinforcement learning, we derive for the first time an expression for the optimal state-dependent baseline, the baseline which results in a gradient estimator with minimum variance. Although we show that there exist examples where the optimal baseline may be arbitrarily better than a value function baseline, we find that the value function baseline usually performs similarly to an optimal baseline in terms of variance reduction. Moreover, the value function can also be used for bootstrapping estimators of the return, leading to additional variance reduction. Our results give new insight and justification for why value function baselines and the generalized advantage estimator (GAE) work well in practice.
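
To make the abstract's setup concrete, below is a minimal sketch of a score-function (REINFORCE) gradient estimator with a baseline in a one-step bandit setting. It is illustrative only, not the paper's method: the policy parameterization, reward values, and variable names are assumptions for the demo, and the closed-form baseline shown is the classic variance-minimizing *constant* baseline (returns weighted by the squared score norm), not the state-dependent optimal baseline derived in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# One-step bandit: the policy pi_theta is a softmax over three actions and
# the reward depends only on the sampled action (illustrative values).
theta = np.array([0.5, -0.2, 0.1])
rewards = np.array([1.0, 2.0, 5.0])

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def estimator_variance(baseline, n=200_000):
    """Empirical variance (trace of the covariance) of the score-function
    estimator (R(a) - baseline) * grad log pi_theta(a)."""
    pi = softmax(theta)
    actions = rng.choice(3, size=n, p=pi)
    scores = np.eye(3)[actions] - pi                # grad log softmax, per sample
    grads = (rewards[actions] - baseline)[:, None] * scores
    return grads.var(axis=0).sum()

pi = softmax(theta)
# Classic variance-minimizing constant baseline (Weaver & Tao; Greensmith et al.):
# b* = E[|g|^2 R] / E[|g|^2], with g = grad log pi_theta.
score_sq = ((np.eye(3) - pi) ** 2).sum(axis=1)
b_opt = (pi * score_sq * rewards).sum() / (pi * score_sq).sum()

for name, b in [("no baseline", 0.0),
                ("value baseline", rewards @ pi),   # b = E[R], the "value"
                ("optimal baseline", b_opt)]:
    print(f"{name:>16}: b = {b:5.2f}, variance ~ {estimator_variance(b):.3f}")
```

Subtracting any fixed baseline leaves the estimator unbiased, since E[grad log pi_theta(a)] = 0; the baseline only changes the variance, which the script measures empirically for no baseline, the value baseline E[R], and the optimal constant baseline.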
