Paper Title

Perceiver-VL: Efficient Vision-and-Language Modeling with Iterative Latent Attention

Paper Authors

Zineng Tang, Jaemin Cho, Jie Lei, Mohit Bansal

Paper Abstract

We present Perceiver-VL, a vision-and-language framework that efficiently handles high-dimensional multimodal inputs such as long videos and text. Powered by the iterative latent cross-attention of Perceiver, our framework scales with linear complexity, in contrast to the quadratic complexity of self-attention used in many state-of-the-art transformer-based models. To further improve the efficiency of our framework, we also study applying LayerDrop on cross-attention layers and introduce a mixed-stream architecture for cross-modal retrieval. We evaluate Perceiver-VL on diverse video-text and image-text benchmarks, where Perceiver-VL achieves the lowest GFLOPs and latency while maintaining competitive performance. In addition, we also provide comprehensive analyses of various aspects of our framework, including pretraining data, scalability of latent size and input size, dropping cross-attention layers at inference to reduce latency, modality aggregation strategy, positional encoding, and weight initialization strategy. Our code and checkpoints are available at: https://github.com/zinengtang/Perceiver_VL
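To illustrate why the latent cross-attention described above scales linearly, here is a minimal PyTorch sketch of Perceiver-style iterative latent attention. It is not the authors' implementation (see the GitHub link above for that); the class name, parameter names (num_latents, num_iters), and default sizes are hypothetical choices for this example. A fixed-size latent array attends to the multimodal input, so the cost is O(N·M) in input length N and latent size M, with M fixed and much smaller than N, rather than the O(N²) of full self-attention over the input.

```python
# Minimal sketch (assumption: PyTorch >= 1.9 for batch_first attention).
import torch
import torch.nn as nn

class IterativeLatentAttention(nn.Module):
    def __init__(self, dim=256, num_latents=64, num_heads=4, num_iters=3):
        super().__init__()
        # Learned latent array: its size is fixed regardless of input length.
        self.latents = nn.Parameter(torch.randn(num_latents, dim))
        # Cross-attention: latents are queries, raw inputs are keys/values,
        # so this layer costs O(N * M) per iteration.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Self-attention over latents only is cheap: O(M^2) with small, fixed M.
        self.latent_self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.num_iters = num_iters

    def forward(self, inputs):
        # inputs: (batch, N, dim), e.g., video patches and text tokens
        # embedded into a shared space and concatenated along N.
        b = inputs.size(0)
        z = self.latents.unsqueeze(0).expand(b, -1, -1)  # (batch, M, dim)
        kv = self.norm_kv(inputs)
        for _ in range(self.num_iters):
            # Iteratively refine the latents by cross-attending to the inputs.
            q = self.norm_q(z)
            z = z + self.cross_attn(q, kv, kv, need_weights=False)[0]
            z = z + self.latent_self_attn(z, z, z, need_weights=False)[0]
        return z  # (batch, M, dim): fixed-size multimodal representation

# Usage: 10,000 input tokens are compressed into 64 latents.
model = IterativeLatentAttention()
x = torch.randn(2, 10000, 256)
print(model(x).shape)  # torch.Size([2, 64, 256])
```

Because the latent size stays constant as the input grows, doubling the number of video patches only doubles the cross-attention cost; this is also what makes it possible to drop some cross-attention layers at inference (as the LayerDrop study in the abstract describes) without touching the latent self-attention stack.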
