Garfield: System Support for Byzantine Machine Learning

Title

Garfield: System Support for Byzantine Machine Learning

Authors

Rachid Guerraoui, Arsany Guirguis, Jérémy Max Plassmann, Anton Alexandre Ragot, Sébastien Rouault

Abstract

We present Garfield, a library to transparently make machine learning (ML) applications, initially built with popular (but fragile) frameworks, e.g., TensorFlow and PyTorch, Byzantine-resilient. Garfield relies on a novel object-oriented design, reducing the coding effort, and addressing the vulnerability of the shared-graph architecture followed by classical ML frameworks. Garfield encompasses various communication patterns and supports computations on CPUs and GPUs, which makes it possible to address the general question of the practical cost of Byzantine resilience in SGD-based ML applications. We report on the usage of Garfield on three main ML architectures: (a) a single server with multiple workers, (b) several servers and workers, and (c) peer-to-peer settings. Using Garfield, we highlight several interesting facts about the cost of Byzantine resilience. In particular, (a) Byzantine resilience, unlike crash resilience, induces an accuracy loss, (b) the throughput overhead comes more from communication than from robust aggregation, and (c) tolerating Byzantine servers costs more than tolerating Byzantine workers.
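
The abstract contrasts plain SGD with "robust aggregation" on a server that collects gradients from several workers. The sketch below illustrates that idea with the coordinate-wise median, one well-known robust rule. It is not Garfield's API: the function names and the toy gradients are hypothetical, and the abstract does not say which aggregation rules the library provides.

```python
from statistics import median

def mean_aggregate(gradients):
    """Plain averaging: coordinate-wise mean over all worker gradients."""
    n = len(gradients)
    return [sum(coords) / n for coords in zip(*gradients)]

def median_aggregate(gradients):
    """Coordinate-wise median: a simple robust aggregation rule (illustrative)."""
    return [median(coords) for coords in zip(*gradients)]

# Hypothetical data: four honest workers report similar gradients,
# one Byzantine worker reports an arbitrary vector.
honest = [[1.0, -2.0], [1.5, -2.5], [0.5, -1.5], [1.0, -2.0]]
byzantine = [[1000.0, 1000.0]]
gradients = honest + byzantine

print("mean:  ", mean_aggregate(gradients))    # dominated by the Byzantine vector
print("median:", median_aggregate(gradients))  # stays close to the honest gradients
```

In this toy setting a single faulty worker is enough to pull the average far from the honest gradients, while the median stays within their range. The median also discards part of what the honest workers reported, which gives some intuition for why a robust rule can cost accuracy even when no worker misbehaves.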
