Paper Title

Semantic-aware Modular Capsule Routing for Visual Question Answering

Paper Authors

Yudong Han, Jianhua Yin, Jianlong Wu, Yinwei Wei, Liqiang Nie

Paper Abstract

Visual Question Answering (VQA) is fundamentally compositional in nature, and many questions can be answered simply by decomposing them into modular sub-problems. The recently proposed Neural Module Networks (NMNs) adopt this strategy for question answering, yet they rely heavily on off-the-shelf layout parsers or additional expert policies for network architecture design rather than learning the architecture from data. These strategies adapt poorly to the semantically complicated variance of the inputs, thereby limiting the representational capacity and generalizability of the model. To tackle this problem, we propose a Semantic-aware modUlar caPsulE Routing framework, termed SUPER, to better capture instance-specific vision-semantic characteristics and refine discriminative representations for prediction. In particular, five powerful specialized modules together with dynamic routers are tailored in each layer of the SUPER network, and compact routing spaces are constructed so that a variety of customizable routes can be sufficiently exploited and the vision-semantic representations can be explicitly calibrated. We comparatively justify the effectiveness and generalization ability of the proposed SUPER scheme on five benchmark datasets, as well as its parameter-efficiency advantage. It is worth emphasizing that this work does not pursue state-of-the-art results in VQA; instead, we expect our model to provide a novel perspective on architecture learning and representation calibration for VQA.
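To make the routing idea in the abstract concrete, here is a minimal sketch of a layer in which a dynamic router softly selects among several specialized modules per input instance. This is not the authors' implementation: the module count (five, as the abstract states), the linear modules, the router, and all names (`super_layer`, `router_w`) are hypothetical simplifications, with NumPy standing in for a real deep-learning stack.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over a 1-D score vector
    e = np.exp(x - np.max(x))
    return e / e.sum()

rng = np.random.default_rng(0)
D, M = 8, 5  # feature dimension; M = 5 specialized modules per layer (per the abstract)

# Hypothetical "specialized modules": each is a simple nonlinear map here.
modules = [rng.standard_normal((D, D)) * 0.1 for _ in range(M)]
# Hypothetical dynamic router: scores the M modules from the input itself.
router_w = rng.standard_normal((D, M)) * 0.1

def super_layer(h):
    """One layer: the router produces an instance-specific distribution
    over the M modules, and the layer output is the weighted combination
    of all module outputs (a soft route through the layer)."""
    weights = softmax(h @ router_w)                        # (M,) routing weights
    outputs = np.stack([np.tanh(h @ W) for W in modules])  # (M, D) module outputs
    return weights @ outputs                               # (D,) routed representation

h = rng.standard_normal(D)   # stand-in for a fused vision-semantic feature
for _ in range(3):           # stack a few layers; routes differ per instance
    h = super_layer(h)
print(h.shape)               # (8,)
```

Because the routing weights depend on the input, two different question-image pairs traverse different (soft) paths through the same module pool, which is the adaptability the abstract contrasts with fixed, parser-derived layouts.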
