Paper title
Doubly Inhomogeneous Reinforcement Learning
Paper authors
Abstract
This paper studies reinforcement learning (RL) in doubly inhomogeneous environments, that is, environments with both temporal non-stationarity and subject heterogeneity. In many applications, datasets are generated by system dynamics that may change over time and across the population, which makes high-quality sequential decision making challenging. Nonetheless, most existing RL solutions require either temporal stationarity or subject homogeneity, and would yield sub-optimal policies if both assumptions were violated. To address both challenges simultaneously, we propose an original algorithm to determine the "best data chunks" that display similar dynamics over time and across individuals for policy learning. The algorithm alternates between detecting the most recent change point and identifying clusters. Our method is general, and works with a wide range of clustering and change point detection algorithms. It is multiply robust in the sense that it takes multiple initial estimators as input and only requires one of them to be consistent. Moreover, by borrowing information over time and population, it allows us to detect weaker signals and has better convergence properties than applying the clustering algorithm at each time point or the change point detection algorithm to each subject separately. Empirically, we demonstrate the usefulness of our method through extensive simulations and a real data application.
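The alternating idea described in the abstract can be illustrated with a toy sketch. This is not the paper's algorithm: it assumes scalar observations instead of RL transition dynamics, exactly two clusters, and a single mean shift per cluster. The helper names (`most_recent_change_point`, `two_means_1d`, `alternate`) and all parameter values are hypothetical and chosen only for illustration.

```python
import numpy as np


def most_recent_change_point(series, min_seg=5):
    """Toy detector: return the split index that minimizes the two-segment
    squared error. Assumes exactly one mean shift exists; a real detector
    would also test whether any change is present."""
    T = len(series)
    best_t, best_cost = 0, np.inf
    for t in range(min_seg, T - min_seg + 1):
        left, right = series[:t], series[t:]
        cost = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t


def two_means_1d(values, n_iter=20):
    """Tiny 1-D 2-means, initialized at the smallest and largest value."""
    centers = np.array([values.min(), values.max()], dtype=float)
    labels = np.zeros(len(values), dtype=int)
    for _ in range(n_iter):
        labels = np.abs(values[:, None] - centers[None, :]).argmin(axis=1)
        for k in range(2):
            if np.any(labels == k):
                centers[k] = values[labels == k].mean()
    return labels


def alternate(x, n_rounds=3):
    """Alternate between clustering subjects and locating each cluster's
    change point. x has shape (subjects, time)."""
    n_subjects, _ = x.shape
    tau = np.zeros(n_subjects, dtype=int)  # start by using all data
    labels = np.zeros(n_subjects, dtype=int)
    for _ in range(n_rounds):
        # Clustering step: summarize each subject by its mean after tau.
        feats = np.array([x[i, tau[i]:].mean() for i in range(n_subjects)])
        labels = two_means_1d(feats)
        # Change point step: pool subjects within a cluster (borrowing
        # information across the population), then split the averaged series.
        for k in np.unique(labels):
            members = labels == k
            tau[members] = most_recent_change_point(x[members].mean(axis=0))
    return labels, tau


# Synthetic panel: 20 subjects, 100 time points, two groups whose means
# shift at different times and in different directions (assumed setup).
rng = np.random.default_rng(0)
x = rng.normal(size=(20, 100))
x[:10, 60:] += 2.0   # group A shifts up at t = 60
x[10:, 40:] -= 2.0   # group B shifts down at t = 40

labels, tau = alternate(x)
print(labels)
print(tau)
```

In this toy setting, the "best data chunk" for a subject is the set of observations after its estimated change point from all subjects that share its cluster label. Averaging within a cluster before searching for the split is a simple stand-in for the abstract's point about borrowing information across the population.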