Paper Title
Masked Vision-Language Transformer in Fashion
Paper Authors
Paper Abstract
We present a masked vision-language transformer (MVLT) for fashion-specific multi-modal representation. Technically, we simply adopt a vision transformer architecture in place of BERT in the pre-training model, making MVLT the first end-to-end framework for the fashion domain. In addition, we design masked image reconstruction (MIR) to obtain a fine-grained understanding of fashion. MVLT is an extensible and convenient architecture that admits raw multi-modal inputs without extra pre-processing models (e.g., ResNet), implicitly modeling the vision-language alignments. More importantly, MVLT generalizes easily to various matching and generative tasks. Experimental results show clear improvements in retrieval (rank@5: 17%) and recognition (accuracy: 3%) tasks over the Fashion-Gen 2018 winner Kaleido-BERT. Code is made available at https://github.com/GewelsJI/MVLT.
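The core pre-training idea named in the abstract, masked image reconstruction (MIR), amounts to masking raw image patches and regressing their pixel content from the transformer's output. The sketch below illustrates that kind of masked-patch objective in PyTorch; it is a minimal toy under stated assumptions (the patch size, mask ratio, tiny encoder, and all class and parameter names are illustrative), not the authors' MVLT implementation.

```python
# Minimal sketch of a masked-image-reconstruction (MIR) style objective.
# NOT the authors' MVLT code; hyper-parameters and module names are assumptions.
import torch
import torch.nn as nn

class MaskedPatchReconstructor(nn.Module):
    def __init__(self, img_size=224, patch_size=16, dim=256, depth=4, heads=8):
        super().__init__()
        self.patch_size = patch_size
        self.num_patches = (img_size // patch_size) ** 2
        patch_dim = 3 * patch_size * patch_size
        self.to_tokens = nn.Linear(patch_dim, dim)           # patch embedding
        self.pos_emb = nn.Parameter(torch.zeros(1, self.num_patches, dim))
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.to_pixels = nn.Linear(dim, patch_dim)           # reconstruction head

    def patchify(self, imgs):
        # (B, 3, H, W) -> (B, N, 3*p*p) flattened patches
        p = self.patch_size
        B, C, H, W = imgs.shape
        x = imgs.reshape(B, C, H // p, p, W // p, p)
        return x.permute(0, 2, 4, 1, 3, 5).reshape(B, (H // p) * (W // p), C * p * p)

    def forward(self, imgs, mask_ratio=0.5):
        patches = self.patchify(imgs)                        # ground-truth pixels
        tokens = self.to_tokens(patches) + self.pos_emb
        # Randomly replace a subset of patch tokens with a learnable mask token.
        mask = torch.rand(tokens.shape[:2], device=imgs.device) < mask_ratio
        tokens = torch.where(mask.unsqueeze(-1),
                             self.mask_token.expand_as(tokens), tokens)
        decoded = self.to_pixels(self.encoder(tokens))
        # Reconstruction loss is computed only on the masked patches.
        return ((decoded - patches) ** 2)[mask].mean()

model = MaskedPatchReconstructor()
loss = model(torch.randn(2, 3, 224, 224))
loss.backward()
```

In the full MVLT setting, the abstract indicates such an image-side objective is trained end-to-end together with language-side masking, so that vision-language alignment is learned implicitly from raw multi-modal inputs.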