Title
Parallel bandit architecture based on laser chaos for reinforcement learning
Authors
Abstract
Accelerating artificial intelligence by photonics is an active field of study aiming to exploit the unique properties of photons. Reinforcement learning is an important branch of machine learning, and photonic decision-making principles have been demonstrated for multi-armed bandit problems. However, reinforcement learning can involve a massive number of states, unlike previously demonstrated bandit problems, where the number of states is only one. Q-learning is a well-known approach in reinforcement learning that can deal with many states. The architecture of Q-learning, however, does not fit photonic implementations well because it separates the update rule from the action selection. In this study, we organize a new architecture for multi-state reinforcement learning as a parallel array of bandit problems in order to benefit from photonic decision-makers, which we call parallel bandit architecture for reinforcement learning, or PBRL for short. Taking a cart-pole balancing problem as an instance, we demonstrate that PBRL adapts to the environment in fewer time steps than Q-learning. Furthermore, PBRL yields faster adaptation when operated with a chaotic laser time series than with uniformly distributed pseudorandom numbers; the autocorrelation inherent in the laser chaos provides a positive effect. We also find that the variety of states that the system undergoes during the learning phase exhibits completely different properties between PBRL and Q-learning. The insights obtained through the present study are also beneficial for existing computing platforms, not just photonic realizations, in accelerating performance via the PBRL algorithm and correlated random sequences.
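The core idea described above — treating action selection in each state as its own independent bandit problem — can be sketched in a few lines. This is a minimal illustrative sketch, not the paper's actual method: it assumes an epsilon-greedy bandit per discretized state as a stand-in for the photonic (laser-chaos-based) decision-maker, and the class and method names (`PerStateBandit`, `select`, `update`) are hypothetical.

```python
import random

class PerStateBandit:
    """A parallel array of bandit learners, one per discretized state.

    Each state keeps its own per-action pull counts and incremental
    reward estimates; there is no bootstrapped value propagation as in
    Q-learning. Epsilon-greedy exploration here is only a placeholder
    for the photonic decision-maker assumed in the lead-in.
    """

    def __init__(self, n_actions, epsilon=0.1):
        self.n_actions = n_actions
        self.epsilon = epsilon
        self.counts = {}   # state -> list of per-action pull counts
        self.values = {}   # state -> list of per-action reward estimates

    def select(self, state, rng=random):
        """Pick an action for this state's bandit (epsilon-greedy)."""
        values = self.values.setdefault(state, [0.0] * self.n_actions)
        self.counts.setdefault(state, [0] * self.n_actions)
        if rng.random() < self.epsilon:
            return rng.randrange(self.n_actions)
        return max(range(self.n_actions), key=lambda a: values[a])

    def update(self, state, action, reward):
        """Incremental sample-average update for one state's bandit."""
        counts = self.counts.setdefault(state, [0] * self.n_actions)
        values = self.values.setdefault(state, [0.0] * self.n_actions)
        counts[action] += 1
        values[action] += (reward - values[action]) / counts[action]
```

In use, an agent would discretize the cart-pole observation into a state index, call `select` to act, and call `update` with the observed reward; swapping the `rng` argument for a source driven by a correlated sequence is where a chaotic laser time series would enter in the paper's setting.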