Paper Title
StyleT2I: Toward Compositional and High-Fidelity Text-to-Image Synthesis
Paper Authors
Paper Abstract
Although progress has been made in text-to-image synthesis, previous methods fall short of generalizing to unseen or underrepresented attribute compositions in the input text. A lack of compositionality can have severe implications for robustness and fairness, e.g., an inability to synthesize face images of underrepresented demographic groups. In this paper, we introduce a new framework, StyleT2I, to improve the compositionality of text-to-image synthesis. Specifically, we propose a CLIP-guided Contrastive Loss to better distinguish the different compositions among different sentences. To further improve compositionality, we design a novel Semantic Matching Loss and a Spatial Constraint to identify attributes' latent directions for intended spatial-region manipulations, leading to better-disentangled latent representations of attributes. Based on the identified latent directions of attributes, we propose Compositional Attribute Adjustment to adjust the latent code, resulting in better compositionality of image synthesis. In addition, we leverage an $\ell_2$-norm regularization of the identified latent directions (norm penalty) to strike a good balance between image-text alignment and image fidelity. In the experiments, we devise a new dataset split and a new evaluation metric to assess the compositionality of text-to-image synthesis models. The results show that StyleT2I outperforms previous approaches in terms of consistency between the input text and the synthesized images, and achieves higher fidelity.
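To illustrate the CLIP-guided Contrastive Loss idea, here is a minimal sketch of a symmetric InfoNCE-style contrastive loss between synthesized-image embeddings and sentence embeddings from a pretrained CLIP model. The function name, the plain cosine-similarity logits, and the temperature value are assumptions made for clarity, not the paper's exact formulation.

```python
# Hypothetical sketch (not the paper's code): a CLIP-guided contrastive loss
# that pulls each synthesized image toward its own sentence embedding and
# pushes it away from the other sentences in the batch, so that different
# attribute compositions are better distinguished.
import torch
import torch.nn.functional as F

def clip_guided_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """image_emb, text_emb: (batch, dim) CLIP embeddings of matched pairs."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature  # cosine-similarity logits
    targets = torch.arange(image_emb.size(0), device=image_emb.device)
    # Symmetric cross-entropy over image-to-text and text-to-image directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```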
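Similarly, the Compositional Attribute Adjustment and the norm penalty can be pictured with the following sketch, assuming a StyleGAN-style latent code and pre-identified per-attribute latent directions. The additive update rule, the variable names, and the penalty weight `lam` are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch (assumptions, not the paper's code): shift a latent code
# along pre-identified attribute directions, and penalize the l2 norms of
# those directions to balance image-text alignment against image fidelity.
import torch

def compositional_attribute_adjustment(w, directions, strengths):
    """Shift latent code w along each attribute's latent direction.

    w:          (latent_dim,) latent code of a pretrained generator
    directions: dict mapping attribute name -> (latent_dim,) direction
    strengths:  dict mapping attribute name -> scalar step size
    """
    w_adj = w.clone()
    for attr, d in directions.items():
        w_adj = w_adj + strengths[attr] * d
    return w_adj

def norm_penalty(directions, lam=0.1):
    """l2-norm regularization on the identified latent directions: directions
    with large norms tend to push the latent code away from the generator's
    well-modeled region and hurt fidelity, so their norms are penalized."""
    return lam * sum(d.norm(p=2) ** 2 for d in directions.values())
```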