Paper Title
Exploring the Limits of Differentially Private Deep Learning with Group-wise Clipping
Paper Authors
Paper Abstract
Differentially private deep learning has recently witnessed advances in computational efficiency and privacy-utility trade-off. We explore whether further improvements along the two axes are possible and provide affirmative answers leveraging two instantiations of \emph{group-wise clipping}. To reduce the compute time overhead of private learning, we show that \emph{per-layer clipping}, where the gradient of each neural network layer is clipped separately, allows clipping to be performed in conjunction with backpropagation in differentially private optimization. This results in private learning that is as memory-efficient and almost as fast per training update as non-private learning for many workflows of interest. While per-layer clipping with constant thresholds tends to underperform standard flat clipping, per-layer clipping with adaptive thresholds matches or outperforms flat clipping under given training epoch constraints, hence attaining similar or better task performance within less wall time. To explore the limits of scaling (pretrained) models in differentially private deep learning, we privately fine-tune the 175 billion-parameter GPT-3. We bypass scaling challenges associated with clipping gradients that are distributed across multiple devices with \emph{per-device clipping} that clips the gradient of each model piece separately on its host device. Privately fine-tuning GPT-3 with per-device clipping achieves a task performance at $ε=1$ better than what is attainable by non-privately fine-tuning the largest GPT-2 on a summarization task.
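The abstract contrasts standard flat clipping with group-wise (per-layer) clipping of per-example gradients. Below is a minimal NumPy sketch of that contrast; the function names, toy gradient shapes, and the fixed per-layer thresholds are illustrative assumptions, not the paper's implementation (which fuses clipping into backpropagation, uses adaptive thresholds, and for GPT-3 clips per device).

```python
# Minimal sketch: flat clipping vs. group-wise (per-layer) clipping of
# per-example gradients, as in DP-SGD-style training. Illustrative only.
import numpy as np

def flat_clip(per_example_grads, clip_norm):
    """Flat clipping: rescale each example's full gradient so that its
    global L2 norm across all layers is at most clip_norm."""
    clipped = []
    for grads in per_example_grads:              # grads: list of per-layer arrays
        total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
        scale = min(1.0, clip_norm / (total_norm + 1e-12))
        clipped.append([g * scale for g in grads])
    return clipped

def per_layer_clip(per_example_grads, per_layer_norms):
    """Group-wise (per-layer) clipping: clip each layer's gradient to its own
    threshold, independently of the other layers, so the clipping of a layer
    needs only that layer's gradient."""
    clipped = []
    for grads in per_example_grads:
        example = []
        for g, c in zip(grads, per_layer_norms):
            norm = np.sqrt(np.sum(g ** 2))
            example.append(g * min(1.0, c / (norm + 1e-12)))
        clipped.append(example)
    return clipped

# Toy usage: 3 examples, 2 "layers" with different shapes (hypothetical values).
rng = np.random.default_rng(0)
grads = [[rng.normal(size=(4, 4)), rng.normal(size=(4,))] for _ in range(3)]
flat = flat_clip(grads, clip_norm=1.0)
layered = per_layer_clip(grads, per_layer_norms=[0.7, 0.7])
```

The practical point of the per-layer variant is visible in the dependency structure: each layer's clipping uses only that layer's gradient, so it can be executed as soon as that gradient is produced during the backward pass, whereas flat clipping must wait for the full gradient before any rescaling can happen.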