Paper Title
Diverse and Styled Image Captioning Using SVD-Based Mixture of Recurrent Experts
Paper Authors
Paper Abstract
With great advances in vision and natural language processing, generating image captions has become a practical need. In a recent paper, Mathews, Xie, and He [1] extended a model to generate styled captions by separating semantics and style. Continuing this line of work, we develop a new captioning model consisting of an image encoder that extracts features, a mixture of recurrent networks that embeds the extracted features into a set of words, and a sentence generator that combines the obtained words into a stylized sentence. The resulting system, entitled Mixture of Recurrent Experts (MoRE), uses a new training algorithm that applies singular value decomposition (SVD) to the weight matrices of the recurrent neural networks (RNNs) to increase the diversity of captions. Each decomposition step depends on a distinctive factor based on the number of RNNs in MoRE. Since the sentence generator is trained on a stylized language corpus without paired images, our captioning model can do the same. Moreover, the styled and diverse captions are obtained without training on a densely labeled or styled dataset. To validate the captioning model, we use Microsoft COCO, a standard factual image captioning dataset. We show that the proposed model can generate diverse and stylized image captions without the need for extra labeling. The results also show better descriptions in terms of content accuracy.
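The abstract states that SVD is applied to the RNN weight matrices, with each decomposition step depending on a distinctive per-expert factor; the exact procedure is not specified here. As a rough illustration only (the per-expert scaling rule below is a hypothetical stand-in, not the paper's algorithm), a minimal NumPy sketch of decomposing a shared weight matrix and deriving one variant per expert:

```python
import numpy as np

def expert_weight_variant(W, expert_idx, num_experts):
    """Illustrative sketch: factor W with SVD and rescale its singular
    values by a hypothetical per-expert factor to derive a variant."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    factor = (expert_idx + 1) / num_experts  # hypothetical distinctive factor
    return U @ np.diag(s * factor) @ Vt

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8))  # stand-in for an RNN weight matrix
variants = [expert_weight_variant(W, k, 4) for k in range(4)]
# All variants share W's singular vectors but differ in singular-value scale,
# giving each recurrent expert a distinct (here, uniformly rescaled) weight matrix.
```

Since multiplying every singular value by a scalar c simply rescales the matrix (U diag(c·s) Vᵀ = c·W), a realistic scheme would vary the factor per singular value; this uniform version is kept only to show the decomposition mechanics.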