Paper Title
New Ideas and Trends in Deep Multimodal Content Understanding: A Review
Paper Authors
Paper Abstract
The focus of this survey is on the analysis of two modalities in multimodal deep learning: image and text. Unlike classic reviews of deep learning, in which monomodal image classifiers such as VGG, ResNet, and Inception are the central topics, this paper examines recent multimodal deep models and structures, including auto-encoders, generative adversarial nets, and their variants. These models go beyond simple image classifiers in that they can perform uni-directional (e.g., image captioning, image generation) and bi-directional (e.g., cross-modal retrieval, visual question answering) multimodal tasks. In addition, we analyze two aspects of the challenge of achieving better content understanding in deep multimodal applications. We then introduce current ideas and trends in deep multimodal feature learning, such as feature embedding approaches and objective function design, which are crucial for overcoming the aforementioned challenges. Finally, we outline several promising directions for future research.