Paper Title


Meta-Gradients in Non-Stationary Environments

Authors

Jelena Luketina, Sebastian Flennerhag, Yannick Schroecker, David Abel, Tom Zahavy, Satinder Singh

Abstract

Meta-gradient methods (Xu et al., 2018; Zahavy et al., 2020) offer a promising solution to the problem of hyperparameter selection and adaptation in non-stationary reinforcement learning problems. However, the properties of meta-gradients in such environments have not been systematically studied. In this work, we bring new clarity to meta-gradients in non-stationary environments. Concretely, we ask: (i) how much information should be given to the learned optimizers, so as to enable faster adaptation and generalization over a lifetime, (ii) what meta-optimizer functions are learned in this process, and (iii) whether meta-gradient methods provide a bigger advantage in highly non-stationary environments. To study the effect of information provided to the meta-optimizer, as in recent works (Flennerhag et al., 2021; Almeida et al., 2021), we replace the tuned meta-parameters of fixed update rules with learned meta-parameter functions of selected context features. The context features carry information about agent performance and changes in the environment and hence can inform learned meta-parameter schedules. We find that adding more contextual information is generally beneficial, leading to faster adaptation of meta-parameter values and increased performance over a lifetime. We support these results with a qualitative analysis of resulting meta-parameter schedules and learned functions of context features. Lastly, we find that without context, meta-gradients do not provide a consistent advantage over the baseline in highly non-stationary environments. Our findings suggest that contextualizing meta-gradients can play a pivotal role in extracting high performance from meta-gradients in non-stationary settings.
