FRL-FI: Transient Fault Analysis for Federated Reinforcement Learning-Based Navigation Systems

Paper title

FRL-FI: Transient Fault Analysis for Federated Reinforcement Learning-Based Navigation Systems

Authors

Zishen Wan, Aqeel Anwar, Abdulrahman Mahmoud, Tianyu Jia, Yu-Shun Hsiao, Vijay Janapa Reddi, Arijit Raychowdhury

Abstract

Swarm intelligence is being increasingly deployed in autonomous systems, such as drones and unmanned vehicles. Federated reinforcement learning (FRL), a key swarm intelligence paradigm where agents interact with their own environments and cooperatively learn a consensus policy while preserving privacy, has recently shown potential advantages and gained popularity. However, transient faults are increasing in the hardware system with continuous technology node scaling and can pose threats to FRL systems. Meanwhile, conventional redundancy-based protection methods are challenging to deploy on resource-constrained edge applications. In this paper, we experimentally evaluate the fault tolerance of FRL navigation systems at various scales with respect to fault models, fault locations, learning algorithms, layer types, communication intervals, and data types at both training and inference stages. We further propose two cost-effective fault detection and recovery techniques that can achieve up to 3.3x improvement in resilience with <2.7% overhead in FRL systems.
