Paper Title
Masked Vision-Language Transformer in Fashion
Paper Authors
Paper Abstract
We present a masked vision-language transformer (MVLT) for fashion-specific multi-modal representation. Technically, we simply adopt a vision transformer architecture in place of BERT in the pre-training model, making MVLT the first end-to-end framework for the fashion domain. In addition, we design masked image reconstruction (MIR) to obtain a fine-grained understanding of fashion. MVLT is an extensible and convenient architecture that admits raw multi-modal inputs without extra pre-processing models (e.g., ResNet), implicitly modeling the vision-language alignments. More importantly, MVLT generalizes easily to various matching and generative tasks. Experimental results show clear improvements in retrieval (rank@5: 17%) and recognition (accuracy: 3%) tasks over the Fashion-Gen 2018 winner Kaleido-BERT. Code is made available at https://github.com/GewelsJI/MVLT.
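The core pre-training idea named in the abstract, masked image reconstruction (MIR), amounts to masking raw image patches and regressing their pixel content from the transformer's output. The sketch below illustrates that kind of masked-patch objective in PyTorch; it is a minimal toy under stated assumptions (the patch size, mask ratio, tiny encoder, and all class and parameter names are illustrative), not the authors' MVLT implementation.

```python
# Minimal sketch of a masked-image-reconstruction (MIR) style objective.
# NOT the authors' MVLT code; hyper-parameters and module names are assumptions.
import torch
import torch.nn as nn

class MaskedPatchReconstructor(nn.Module):
    def __init__(self, img_size=224, patch_size=16, dim=256, depth=4, heads=8):
        super().__init__()
        self.patch_size = patch_size
        self.num_patches = (img_size // patch_size) ** 2
        patch_dim = 3 * patch_size * patch_size
        self.to_tokens = nn.Linear(patch_dim, dim)           # patch embedding
        self.pos_emb = nn.Parameter(torch.zeros(1, self.num_patches, dim))
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.to_pixels = nn.Linear(dim, patch_dim)           # reconstruction head

    def patchify(self, imgs):
        # (B, 3, H, W) -> (B, N, 3*p*p) flattened patches
        p = self.patch_size
        B, C, H, W = imgs.shape
        x = imgs.reshape(B, C, H // p, p, W // p, p)
        return x.permute(0, 2, 4, 1, 3, 5).reshape(B, (H // p) * (W // p), C * p * p)

    def forward(self, imgs, mask_ratio=0.5):
        patches = self.patchify(imgs)                        # ground-truth pixels
        tokens = self.to_tokens(patches) + self.pos_emb
        # Randomly replace a subset of patch tokens with a learnable mask token.
        mask = torch.rand(tokens.shape[:2], device=imgs.device) < mask_ratio
        tokens = torch.where(mask.unsqueeze(-1),
                             self.mask_token.expand_as(tokens), tokens)
        decoded = self.to_pixels(self.encoder(tokens))
        # Reconstruction loss is computed only on the masked patches.
        return ((decoded - patches) ** 2)[mask].mean()

model = MaskedPatchReconstructor()
loss = model(torch.randn(2, 3, 224, 224))
loss.backward()
```

In the full MVLT setting, the abstract indicates such an image-side objective is trained end-to-end together with language-side masking, so that vision-language alignment is learned implicitly from raw multi-modal inputs.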