Paper Title
Reinforcement Learning in the Wild: Scalable RL Dispatching Algorithm Deployed in Ridehailing Marketplace
Paper Authors
Paper Abstract
In this study, a real-time dispatching algorithm based on reinforcement learning is proposed and, for the first time, deployed at large scale. Current dispatching methods in ridehailing platforms are predominantly based on myopic or rule-based non-myopic approaches. Reinforcement learning enables dispatching policies that are informed by historical data and able to use the learned information to optimize the returns of expected future trajectories. Previous studies in this field yielded promising results, yet left room for further improvement in terms of performance gain, self-dependency, transferability, and scalable deployment mechanisms. The present study proposes a standalone RL-based dispatching solution equipped with multiple mechanisms to ensure robust and efficient on-policy learning and inference while remaining adaptable for full-scale deployment. A new form of value updating based on temporal differences is proposed that is better suited to the inherent uncertainty of the problem. For driver-order assignment, a customized utility function is proposed that, when tuned to the statistics of the market, yields a remarkable improvement in performance and interpretability. In addition, to reduce the risk of cancellation after driver assignment, an adaptive graph pruning strategy based on the multi-armed bandit problem is introduced. The method is evaluated in offline simulation with real data and yields notable performance improvement. Furthermore, the algorithm has been deployed online for A/B testing in multiple cities under DiDi's operation and launched in one of the major international markets as the primary mode of dispatch. The deployed algorithm shows an improvement of over 1.3% in total driver income in A/B testing. In addition, causal inference analysis detects as much as a 5.3% improvement in major performance metrics after full-scale deployment.
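The abstract names three algorithmic components: a temporal-difference value update, a utility function for driver-order assignment, and multi-armed-bandit-based graph pruning. The Python sketch below illustrates one plausible arrangement of these pieces under common assumptions from the spatiotemporal dispatching literature: a tabular value function over grid cells and time bins, a trip-duration-discounted TD(0) update, a maximum-weight bipartite matching solved with scipy's linear_sum_assignment, and a UCB1 bandit over candidate pickup radii. The constants, the exact utility form, and the bandit reward are illustrative assumptions, not the paper's formulation.

    # Minimal sketch of an RL-style dispatching loop (illustrative assumptions,
    # not the deployed algorithm).
    import numpy as np
    from scipy.optimize import linear_sum_assignment

    GAMMA = 0.95                  # per-time-step discount factor (assumed)
    ALPHA = 0.05                  # TD learning rate (assumed)
    N_CELLS, N_BINS = 100, 144    # toy spatial grid x 10-minute time bins

    # Tabular spatiotemporal value function V[cell, time_bin].
    V = np.zeros((N_CELLS, N_BINS))

    def td_update(origin, t0, dest, t1, reward):
        """TD(0)-style update for a completed trip, discounted by trip duration."""
        duration = max(t1 - t0, 1)
        target = reward + (GAMMA ** duration) * V[dest, t1 % N_BINS]
        V[origin, t0 % N_BINS] += ALPHA * (target - V[origin, t0 % N_BINS])

    def utility(driver_cell, t, order):
        """Assignment utility: immediate fare plus discounted value of the
        drop-off state, minus the value of the driver's current state."""
        dest, eta, fare = order["dest"], order["eta"], order["fare"]
        return fare + (GAMMA ** eta) * V[dest, (t + eta) % N_BINS] - V[driver_cell, t % N_BINS]

    class RadiusBandit:
        """UCB1 over candidate pruning radii; reward is an observed outcome such
        as 1 - cancellation rate (stand-in for adaptive graph pruning)."""
        def __init__(self, radii=(1.0, 2.0, 3.0, 5.0)):
            self.radii = radii
            self.counts = np.zeros(len(radii))
            self.values = np.zeros(len(radii))

        def select(self):
            t = self.counts.sum() + 1
            ucb = self.values + np.sqrt(2 * np.log(t) / np.maximum(self.counts, 1e-9))
            return int(np.argmax(ucb))

        def update(self, arm, reward):
            self.counts[arm] += 1
            self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

    def dispatch(drivers, orders, t, max_pickup_km=3.0):
        """One matching round: build the utility matrix, prune long pickups,
        and solve the maximum-weight bipartite assignment."""
        U = np.full((len(drivers), len(orders)), -np.inf)
        for i, d in enumerate(drivers):
            for j, o in enumerate(orders):
                if o["pickup_km"][i] <= max_pickup_km:   # graph pruning by radius
                    U[i, j] = utility(d["cell"], t, o)
        # Pruned edges get a large cost so the solver stays feasible.
        cost = np.where(np.isfinite(U), -U, 1e6)
        rows, cols = linear_sum_assignment(cost)
        return [(i, j) for i, j in zip(rows, cols) if np.isfinite(U[i, j])]

In a live loop, one would call RadiusBandit.select() once per dispatch window, pass the chosen radius as max_pickup_km, and feed the observed cancellation outcome back through RadiusBandit.update(), while td_update() runs on completed trips to keep the value table current.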