Paper Title
LANIT: Language-Driven Image-to-Image Translation for Unlabeled Data
Paper Authors
Paper Abstract
Existing techniques for image-to-image translation commonly suffer from two critical problems: heavy reliance on per-sample domain annotation and/or inability to handle multiple attributes per image. Recent truly unsupervised methods adopt clustering approaches to easily provide per-sample one-hot domain labels. However, they cannot account for the real-world setting in which one sample may have multiple attributes. In addition, the semantics of the clusters are not easily aligned with human understanding. To overcome these limitations, we present a LANguage-driven Image-to-image Translation model, dubbed LANIT. We leverage easy-to-obtain candidate attributes given in text for a dataset: the similarity between images and attributes indicates per-sample domain labels. This formulation naturally enables multi-hot labels, so that users can specify the target domain with a set of attributes in language. To account for cases where the initial prompts are inaccurate, we also present prompt learning. We further present a domain regularization loss that enforces that translated images are mapped to the corresponding domain. Experiments on several standard benchmarks demonstrate that LANIT achieves performance comparable or superior to existing models.
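The abstract's central idea, that image-attribute similarity yields multi-hot per-sample domain labels, can be sketched as follows. This is a minimal illustration assuming a CLIP-like joint embedding space; the function name, the similarity threshold, and the toy orthogonal embeddings are illustrative stand-ins, not LANIT's actual procedure.

```python
import numpy as np

def multi_hot_domain_labels(image_emb, attr_embs, threshold=0.5):
    """Sketch: cosine similarity between an image embedding and candidate
    attribute text embeddings; attributes whose similarity clears the
    (illustrative) threshold are switched on, giving a multi-hot label."""
    img = image_emb / np.linalg.norm(image_emb)
    attrs = attr_embs / np.linalg.norm(attr_embs, axis=1, keepdims=True)
    sims = attrs @ img                      # one cosine similarity per attribute
    return (sims >= threshold).astype(int)  # multi-hot domain label vector

# Toy example with orthogonal stand-in embeddings; a real system would use
# a pretrained vision-language encoder (e.g. CLIP) for images and prompts.
attr_embs = np.eye(5, 8)                 # 5 candidate attribute embeddings
image_emb = attr_embs[1] + attr_embs[3]  # image "has" attributes 1 and 3
labels = multi_hot_domain_labels(image_emb, attr_embs, threshold=0.5)
print(labels.tolist())  # → [0, 1, 0, 1, 0]
```

Note that, unlike clustering-based one-hot labels, nothing constrains the output to a single active attribute, which is what lets a user specify a target domain as a set of attributes.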