Paper Title


Towards Practical Plug-and-Play Diffusion Models

Authors

Hyojun Go, Yunsung Lee, Jin-Young Kim, Seunghyun Lee, Myeongho Jeong, Hyun Seung Lee, Seungtaek Choi

Abstract

Diffusion-based generative models have achieved remarkable success in image generation. Their guidance formulation allows an external model to plug-and-play control the generation process for various tasks without finetuning the diffusion model. However, directly using publicly available off-the-shelf models for guidance fails due to their poor performance on noisy inputs. To address this, the existing practice is to fine-tune the guidance models with labeled data corrupted with noise. In this paper, we argue that this practice has limitations in two aspects: (1) handling inputs across widely varying noise levels is too hard for a single guidance model; (2) collecting labeled datasets hinders scaling up to various tasks. To tackle these limitations, we propose a novel strategy that leverages multiple experts, where each expert is specialized in a particular noise range and guides the reverse process of the diffusion at its corresponding timesteps. However, as it is infeasible to manage multiple networks and utilize labeled data, we present a practical guidance framework termed Practical Plug-And-Play (PPAP), which leverages parameter-efficient fine-tuning and data-free knowledge transfer. We exhaustively conduct ImageNet class-conditional generation experiments to show that our method can successfully guide diffusion with few trainable parameters and no labeled data. Finally, we show that image classifiers, depth estimators, and semantic segmentation models can guide the publicly available GLIDE through our framework in a plug-and-play manner. Our code is available at https://github.com/riiid/PPAP.
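The core idea of the multi-expert strategy can be illustrated with a minimal sketch: the diffusion timesteps are partitioned into contiguous noise ranges, each assigned to its own guidance expert, and each reverse step shifts the denoised estimate along the selected expert's guidance signal. All function and variable names below (`select_expert`, `guided_reverse_step`, `denoise`, `scale`) are illustrative assumptions, not the paper's exact formulation, and the guidance is written as a generic mean shift rather than PPAP's full framework:

```python
import numpy as np

def select_expert(t, num_timesteps, experts):
    """Pick the guidance expert whose noise range covers timestep t.

    Experts are ordered from low noise (small t) to high noise (large t);
    timesteps are split into len(experts) equal contiguous ranges.
    """
    idx = min(t * len(experts) // num_timesteps, len(experts) - 1)
    return experts[idx]

def guided_reverse_step(x, t, denoise, experts, num_timesteps, scale=1.0):
    """One guided reverse diffusion step (simplified sketch).

    `denoise(x, t)` stands in for the diffusion model's proposed mean,
    and each expert returns a guidance gradient (e.g. grad of log p(y|x_t))
    that nudges the sample toward the external condition.
    """
    mean = denoise(x, t)
    expert = select_expert(t, num_timesteps, experts)
    grad = expert(x, t)
    return mean + scale * grad
```

Because experts only ever see inputs from their own noise range, each one faces a much narrower distribution than a single model fine-tuned across all timesteps would.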
