Paper Title

Coordination for Connected and Automated Vehicles at Non-signalized Intersections: A Value Decomposition-based Multiagent Deep Reinforcement Learning Approach

Authors

Zihan Guo, Yan Wu, Lifang Wang, Junzhi Zhang

Abstract

The recent proliferation of research on multi-agent deep reinforcement learning (MDRL) offers an encouraging way to coordinate multiple connected and automated vehicles (CAVs) passing through an intersection. In this paper, we apply a value decomposition-based MDRL approach (QMIX) to control CAVs in mixed-autonomy traffic of different densities so that they pass a non-signalized intersection efficiently and safely with favorable fuel consumption. Implementation tricks, including network-level improvements, Q-value updates by TD($λ$), and a reward clipping operation, are added to the pure QMIX framework and are expected to improve the convergence speed and asymptotic performance of the original version. The efficacy of our approach is demonstrated by several evaluation metrics: average speed, the number of collisions, and average fuel consumption per episode. The experimental results show that the convergence speed and asymptotic performance of our approach exceed those of the original QMIX and of proximal policy optimization (PPO), a state-of-the-art reinforcement learning baseline applied to the non-signalized intersection. Moreover, under the lower traffic flow, CAVs controlled by our method improve their average speed without collisions while consuming the least fuel. Training is additionally conducted under doubled traffic density, where the learning reward still converges. The resulting model with maximal reward and minimum crashes can still guarantee low fuel consumption, but it slightly reduces vehicle efficiency and induces more collisions than the lower-traffic counterpart, implying the difficulty of generalizing the RL policy to more demanding scenarios.
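To make the three ingredients named in the abstract concrete, the sketch below illustrates, in generic form, a QMIX-style monotonic mixing network, TD($λ$) return targets, and reward clipping. This is a minimal illustrative sketch, not the authors' implementation: all class names, tensor shapes, and hyperparameter values (embedding size, γ, λ, clip bound) are assumptions chosen for demonstration.

```python
# Illustrative sketch only -- not the paper's code. It shows a QMIX-style
# monotonic mixer, backward-recursive TD(lambda) targets, and reward clipping.
import torch
import torch.nn as nn


class MonotonicMixer(nn.Module):
    """Mixes per-agent Q values into a joint Q_tot; monotonicity in each
    agent's Q value is enforced by taking the absolute value of the
    hypernetwork-generated mixing weights (the core QMIX idea)."""

    def __init__(self, n_agents: int, state_dim: int, embed_dim: int = 32):
        super().__init__()
        self.n_agents = n_agents
        self.embed_dim = embed_dim
        # Hypernetworks condition the mixing weights on the global state.
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)
        self.hyper_w2 = nn.Linear(state_dim, embed_dim)
        self.hyper_b2 = nn.Sequential(nn.Linear(state_dim, embed_dim),
                                      nn.ReLU(),
                                      nn.Linear(embed_dim, 1))

    def forward(self, agent_qs: torch.Tensor, state: torch.Tensor) -> torch.Tensor:
        # agent_qs: (batch, n_agents), state: (batch, state_dim)
        b = agent_qs.size(0)
        w1 = torch.abs(self.hyper_w1(state)).view(b, self.n_agents, self.embed_dim)
        b1 = self.hyper_b1(state).view(b, 1, self.embed_dim)
        hidden = torch.relu(torch.bmm(agent_qs.unsqueeze(1), w1) + b1)
        w2 = torch.abs(self.hyper_w2(state)).view(b, self.embed_dim, 1)
        b2 = self.hyper_b2(state).view(b, 1, 1)
        return (torch.bmm(hidden, w2) + b2).view(b)  # Q_tot, shape (batch,)


def td_lambda_targets(rewards: torch.Tensor, q_next: torch.Tensor,
                      gamma: float = 0.99, lam: float = 0.8,
                      clip: float = 1.0) -> torch.Tensor:
    """TD(lambda) targets for one episode via the backward recursion
    G_t = r_t + gamma * ((1 - lam) * Q(s_{t+1}) + lam * G_{t+1}),
    with rewards clipped to [-clip, clip] before bootstrapping."""
    rewards = rewards.clamp(-clip, clip)   # reward clipping
    T = rewards.size(0)
    targets = torch.zeros(T)
    g = q_next[-1]                         # bootstrap from the final step
    for t in reversed(range(T)):
        g = rewards[t] + gamma * ((1 - lam) * q_next[t] + lam * g)
        targets[t] = g
    return targets
```

In this kind of setup, per-agent Q values from the individual agent networks would be passed through the mixer to obtain Q_tot, which is then regressed toward the TD($λ$) targets; the absolute-value constraint on the mixing weights keeps the joint argmax consistent with each agent's local argmax.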
