Paper Title

Noise Estimation Using Density Estimation for Self-Supervised Multimodal Learning

Authors

Elad Amrani, Rami Ben-Ari, Daniel Rotman, Alex Bronstein

Abstract

One of the key factors of enabling machine learning models to comprehend and solve real-world tasks is to leverage multimodal data. Unfortunately, annotation of multimodal data is challenging and expensive. Recently, self-supervised multimodal methods that combine vision and language were proposed to learn multimodal representations without annotation. However, these methods often choose to ignore the presence of high levels of noise and thus yield sub-optimal results. In this work, we show that the problem of noise estimation for multimodal data can be reduced to a multimodal density estimation task. Using multimodal density estimation, we propose a noise estimation building block for multimodal representation learning that is based strictly on the inherent correlation between different modalities. We demonstrate how our noise estimation can be broadly integrated and achieves comparable results to state-of-the-art performance on five different benchmark datasets for two challenging multimodal tasks: Video Question Answering and Text-To-Video Retrieval. Furthermore, we provide a theoretical probabilistic error bound substantiating our empirical results and analyze failure cases. Code: https://github.com/elad-amrani/ssml.
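
The idea of treating noise estimation as density estimation can be illustrated with a toy example. The following is a minimal sketch under stated assumptions, not the authors' implementation (the linked repository is the reference for that). It assumes each sample is a pair of embeddings from two modalities, and it scores each pair by its mean cosine similarity to its k nearest neighbours in the concatenated space, so that pairs in sparse regions receive low scores and can be treated as likely noisy. The function names `knn_density_scores` and `make_toy_pairs`, the choice of score, and the synthetic data are all hypothetical.

```python
import numpy as np


def l2_normalize(a):
    """Scale each row to unit Euclidean norm."""
    return a / np.linalg.norm(a, axis=1, keepdims=True)


def knn_density_scores(x, y, k=5):
    """Illustrative density proxy (an assumption, not the paper's estimator).

    x, y: arrays of shape (n, d) holding paired embeddings of two modalities.
    Returns, for each pair, the mean cosine similarity to its k nearest
    neighbours in the concatenated space. Higher means a denser region.
    """
    # Concatenate the normalized modalities; dividing by sqrt(2) keeps unit norm.
    z = np.concatenate([l2_normalize(x), l2_normalize(y)], axis=1) / np.sqrt(2.0)
    sim = z @ z.T                      # pairwise cosine similarities
    np.fill_diagonal(sim, -np.inf)     # exclude each pair's similarity to itself
    topk = np.partition(sim, -k, axis=1)[:, -k:]  # k largest values per row
    return topk.mean(axis=1)


def make_toy_pairs(seed=0, d=8, per_concept=20, n_noisy=10):
    """Synthetic data: 3 'concepts' with consistent pairs, plus mismatched pairs."""
    rng = np.random.default_rng(seed)
    centers_x = rng.normal(size=(3, d))
    centers_y = rng.normal(size=(3, d))
    labels = np.repeat(np.arange(3), per_concept)
    # Correctly associated pairs: both modalities come from the same concept.
    x_clean = centers_x[labels] + 0.05 * rng.normal(size=(labels.size, d))
    y_clean = centers_y[labels] + 0.05 * rng.normal(size=(labels.size, d))
    # Noisy pairs: first modality from a concept, second modality unrelated.
    noisy_labels = rng.integers(0, 3, size=n_noisy)
    x_noisy = centers_x[noisy_labels] + 0.05 * rng.normal(size=(n_noisy, d))
    y_noisy = rng.normal(size=(n_noisy, d))
    x = np.concatenate([x_clean, x_noisy])
    y = np.concatenate([y_clean, y_noisy])
    is_noisy = np.concatenate(
        [np.zeros(labels.size, dtype=bool), np.ones(n_noisy, dtype=bool)]
    )
    return x, y, is_noisy


x, y, is_noisy = make_toy_pairs()
scores = knn_density_scores(x, y, k=5)
print("mean score, consistent pairs:", scores[~is_noisy].mean())
print("mean score, mismatched pairs:", scores[is_noisy].mean())
```

In this toy setting the mismatched pairs land in sparser regions of the joint space and therefore score lower on average. A score of this kind could be used to down-weight suspect pairs in a training loss, although how the paper integrates its estimate is not described in the abstract.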
