Paper Title

Gaussian Process Self-triggered Policy Search in Weakly Observable Environments

Authors

Hikaru Sasaki, Terushi Hirabayashi, Kaoru Kawabata, Takamitsu Matsubara

Abstract

Large industrial machines, such as waste cranes in waste incineration plants, often operate in weakly observable environments, where the observations contain little information about the environmental state because of technical difficulty or maintenance cost (e.g., no sensors observe the state of the garbage to be handled). Based on the finding that skilled operators in such environments choose predetermined control strategies (e.g., grasping and scattering) and their durations based on sensor values, thereby improving the robustness of their actions, we propose a novel non-parametric policy search algorithm: Gaussian process self-triggered policy search (GPSTPS). GPSTPS has two types of control policies: action and duration. A gating mechanism either maintains the action selected by the action policy for the duration specified by the duration policy, or updates the action and duration by passing new observations to the policies; the method is therefore categorized as self-triggered. GPSTPS learns both policies simultaneously by trial and error, using sparse GP priors and variational learning to maximize the return. To verify the performance of the proposed method, we conducted experiments on a garbage-grasping-scattering task for a waste crane with weak observations, using both a simulation and a robotic waste crane system. In the experiments, the proposed method acquired suitable policies that determine the action and duration based on the garbage's characteristics.
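The gating behaviour described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the integer-step notion of duration, and the toy hand-written policies are all assumptions made for illustration, and the GP-based policies and the variational learning step are deliberately left out. The sketch only shows the self-triggered control flow, in which a new observation is consulted when the previously chosen duration has elapsed.

```python
from typing import Callable, List, Optional, Tuple

Observation = float
Action = str


def run_self_triggered_episode(
    observe: Callable[[int], Observation],
    action_policy: Callable[[Observation], Action],
    duration_policy: Callable[[Observation], int],
    horizon: int,
) -> List[Tuple[int, Optional[Action], bool]]:
    """Return (time step, applied action, triggered) for every step.

    Assumption: durations are measured in whole control steps.
    """
    log: List[Tuple[int, Optional[Action], bool]] = []
    action: Optional[Action] = None
    remaining = 0  # steps left before the gate passes a new observation

    for t in range(horizon):
        triggered = remaining == 0
        if triggered:
            # Gate opens: pass a fresh observation to both policies.
            obs = observe(t)
            action = action_policy(obs)
            remaining = max(1, int(duration_policy(obs)))
        # Otherwise the gate stays closed and the previous action is held.
        log.append((t, action, triggered))
        remaining -= 1
    return log


# Hypothetical stand-ins for the sensor and the two learned policies.
def toy_observe(t: int) -> Observation:
    return float(t)


def toy_action_policy(obs: Observation) -> Action:
    return "grasp" if obs < 3 else "scatter"


def toy_duration_policy(obs: Observation) -> int:
    return 3 if obs < 3 else 2


episode = run_self_triggered_episode(
    toy_observe, toy_action_policy, toy_duration_policy, horizon=7
)
for step in episode:
    print(step)
```

In this toy run the policies are queried only at the steps where the gate opens, and the held action is applied in between, which is the sense in which the controller decides for itself when to look at the next observation.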
