Paper Title
Transformer-Based Multi-modal Proposal and Re-Rank for Wikipedia Image-Caption Matching
Paper Authors
Paper Abstract
With the increased accessibility of the web and online encyclopedias, the amount of data to manage is constantly growing. In Wikipedia, for example, there are millions of pages written in multiple languages. These pages contain images that often lack textual context, remaining conceptually floating and therefore harder to find and manage. In this work, we present the system we designed for participating in the Wikipedia Image-Caption Matching challenge on Kaggle, whose objective is to use the data associated with an image (its URL and visual content) to find the correct caption among a large pool of candidates. A system able to perform this task would improve the accessibility and completeness of multimedia content on large online encyclopedias. Specifically, we propose a cascade of two models, both powered by the recent Transformer architecture, able to efficiently and effectively infer a relevance score between the query image data and the captions. We verify through extensive experimentation that the proposed two-model approach is an effective way to handle a large pool of images and captions while keeping the overall computational complexity at inference time bounded. Our approach achieves remarkable results, obtaining a normalized Discounted Cumulative Gain (nDCG) value of 0.53 on the private leaderboard of the Kaggle challenge.
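To make the two-stage cascade concrete, below is a minimal sketch of a generic proposal-and-re-rank pipeline, not the authors' implementation. It assumes a cheap bi-encoder stage that retrieves the top-k captions for a query over the whole pool, followed by an expensive cross-scoring stage applied only to those k proposals; `bi_encode_query`, `bi_encode_captions`, and `cross_score` are hypothetical placeholders for Transformer-based encoders.

```python
# Sketch of a two-stage proposal + re-rank pipeline (assumed structure, not the
# paper's actual system). The encoders are passed in as hypothetical callables.
from typing import Callable, List, Sequence, Tuple

import torch
import torch.nn.functional as F


def propose_and_rerank(
    query: str,                                   # image data (e.g. URL text) serialized for this sketch
    captions: Sequence[str],                      # large pool of candidate captions
    bi_encode_query: Callable[[str], torch.Tensor],
    bi_encode_captions: Callable[[Sequence[str]], torch.Tensor],
    cross_score: Callable[[str, str], float],
    top_k: int = 100,
) -> List[Tuple[int, float]]:
    """Return caption indices with scores, ranked by the second-stage model."""
    # Stage 1 (proposal): cheap bi-encoder retrieval over the whole caption pool.
    q = F.normalize(bi_encode_query(query), dim=-1)        # shape (d,)
    c = F.normalize(bi_encode_captions(captions), dim=-1)  # shape (n, d)
    similarities = c @ q                                    # cosine similarities, shape (n,)
    k = min(top_k, len(captions))
    _, top_idx = torch.topk(similarities, k)

    # Stage 2 (re-rank): expensive pairwise scoring, restricted to the k proposals,
    # which keeps inference-time complexity bounded as the pool grows.
    reranked = sorted(
        ((int(i), cross_score(query, captions[int(i)])) for i in top_idx),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return reranked
```

The design choice mirrored here is the usual retrieval trade-off: the first model keeps the per-query cost roughly linear in the pool size with cheap dot products, while the second, more accurate model only ever sees a small fixed number of candidates.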