Paper notes: Hierarchical Question-Image Co-Attention for Visual Question Answering


Hierarchical Question-Image Co-Attention for Visual Question Answering
Jiasen Lu∗, Jianwei Yang∗, Dhruv Batra∗†, Devi Parikh∗†
∗Virginia Tech, †Georgia Institute of Technology
{jiasenlu, jw2yang, dbatra, parikh}@vt.edu
Abstract
A number of recent works have proposed attention models for Visual Question Answering (VQA) that generate spatial maps highlighting image regions relevant to answering the question. In this paper, we argue that in addition to modeling “where to look” or visual attention, it is equally important to model “what words to listen to” or question attention. We present a novel co-attention model for VQA that jointly reasons about image and question attention. In addition, our model reasons about the question (and consequently the image via the co-attention mechanism) in a hierarchical fashion via a novel 1-dimensional convolution neural networks (CNN). Our model improves the state-of-the-art on the VQA dataset from 60.3% to 60.5%, and from 61.6% to 63.3% on the COCO-QA dataset. By using ResNet, the performance is further improved to 62.1% for VQA and 65.4% for COCO-QA.

arXiv:1606.00061v5 [cs.CV] 19 Jan 2017

Attention models for VQA typically generate spatial maps that highlight the image regions relevant to answering the question.

The paper argues that, besides visual attention ("where to look"), question attention ("which words to listen to") is equally important.

For VQA, the paper proposes a co-attention model that jointly reasons about image attention and question attention.

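In addition, the model reasons about the question hierarchically via a 1-D CNN (and, through the co-attention mechanism, about the image as well).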

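Co-attention: the question (text) and the image are naturally symmetric. The question representation can be used to guide image attention, and conversely the image representation can be used to guide question attention.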
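One way to make this symmetry concrete is the variant the paper calls parallel co-attention, which attends to both inputs at once through a question-image affinity matrix. Below is a minimal NumPy sketch, assuming image features V of shape (d, N) and question features Q of shape (d, T). The function and weight names (W_b, W_v, W_q, w_hv, w_hq) and the hidden size k are illustrative choices based on my reading of the paper, not something these notes specify.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D vector.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def parallel_coattention(V, Q, W_b, W_v, W_q, w_hv, w_hq):
    """V: (d, N) image region features; Q: (d, T) question features.
    W_b: (d, d); W_v, W_q: (k, d); w_hv, w_hq: (k,).
    Returns attended image vector, attended question vector, and both attention maps."""
    C = np.tanh(Q.T @ W_b @ V)                  # (T, N) affinity between words and regions
    H_v = np.tanh(W_v @ V + (W_q @ Q) @ C)      # (k, N) image hidden state, guided by the question
    H_q = np.tanh(W_q @ Q + (W_v @ V) @ C.T)    # (k, T) question hidden state, guided by the image
    a_v = softmax(w_hv @ H_v)                   # (N,) attention over image regions
    a_q = softmax(w_hq @ H_q)                   # (T,) attention over question positions
    return V @ a_v, Q @ a_q, a_v, a_q
```

The affinity matrix C is what ties the two directions together: the same matrix carries question information into the image branch and, transposed, image information into the question branch.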

Question hierarchy: the paper co-attends to the question (text) and the image at three levels, namely word, phrase, and question (sentence).

Word level: each word is represented by its word embedding.

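Phrase level: a 1-D CNN captures the information contained in unigrams, bigrams, and trigrams. Specifically, the word-level representations are convolved with filters of different window sizes, and the n-gram responses are pooled into a single, simple phrase-level representation.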
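The phrase-level step can be sketched in NumPy as follows. This is an illustration under stated assumptions: the pooling is taken to be an element-wise max across the three n-gram sizes at each word position (my reading of the paper), zero-padding is placed at the end of the sequence (an arbitrary choice here), and `phrase_features` and the `filters` dictionary are hypothetical names.

```python
import numpy as np

def phrase_features(words, filters):
    """words: (T, d) word embeddings.
    filters: dict mapping window size s (1, 2, 3) to a weight matrix of shape (s * d, d).
    Returns (T, d) phrase-level features, one vector per word position."""
    T, d = words.shape
    responses = []
    for s, W in sorted(filters.items()):
        # Zero-pad at the end so every position t has a full window of s words.
        padded = np.vstack([words, np.zeros((s - 1, d))])
        # Window t is the concatenation of words t .. t+s-1, flattened to length s*d.
        windows = np.stack([padded[t:t + s].reshape(-1) for t in range(T)])
        responses.append(np.tanh(windows @ W))   # (T, d) response for this n-gram size
    # Max over n-gram sizes, independently for each position and each channel.
    return np.max(np.stack(responses), axis=0)
```

Because the output keeps one vector per word position, it has the same length as the word-level sequence and can be fed to the question-level encoder without any further alignment.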

Question level: an RNN encodes the entire question (text).

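At each level of the question representation in this architecture, the model builds co-attention maps that connect the question (text) and the image; these are then combined recursively to predict the final distribution over answers.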
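The recursive combination can be sketched as below. This is a hedged illustration, assuming each level has already produced an attended question vector and an attended image vector of the same size d, that the two are summed, and that each level's hidden state is concatenated into the next one. The names `predict_answer`, W_w, W_p, W_s, and W_h are illustrative and the exact layer shapes are my assumption.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D vector.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def predict_answer(attended, W_w, W_p, W_s, W_h):
    """attended: dict with keys 'word', 'phrase', 'question'; each value is a
    pair (q_hat, v_hat) of attended question and image vectors, both of shape (d,).
    W_w: (k, d); W_p, W_s: (k, d + k); W_h: (num_answers, k).
    Returns a probability distribution over candidate answers."""
    q, v = attended["word"]
    h_w = np.tanh(W_w @ (q + v))                         # word-level hidden state
    q, v = attended["phrase"]
    h_p = np.tanh(W_p @ np.concatenate([q + v, h_w]))    # phrase level sees the word-level state
    q, v = attended["question"]
    h_s = np.tanh(W_s @ np.concatenate([q + v, h_p]))    # question level sees the phrase-level state
    return softmax(W_h @ h_s)
```

Feeding each level's hidden state into the next is what makes the combination recursive: the final answer distribution depends on all three levels, not only on the question-level attention.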

Two kinds of co-attention: parallel and alternating.
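The alternating variant can be sketched as three applications of one guided-attention operator: first summarise the question with no guidance, then attend to the image guided by that summary, then attend to the question again guided by the attended image. The sketch below is a minimal NumPy illustration under assumptions: the names `attend` and `alternating_coattention` are hypothetical, and separate weights are used for each step because these notes do not say whether the weights are shared.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D vector.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attend(X, g, W_x, W_g, w_hx):
    """Guided attention. X: (d, n) features; g: (d,) guidance vector.
    W_x, W_g: (k, d); w_hx: (k,). Returns the attended vector (d,) and weights (n,)."""
    H = np.tanh(W_x @ X + (W_g @ g)[:, None])   # guidance is broadcast to every column of X
    a = softmax(w_hx @ H)
    return X @ a, a

def alternating_coattention(V, Q, step1, step2, step3):
    """V: (d, N) image features; Q: (d, T) question features.
    step1..step3: tuples (W_x, W_g, w_hx), one per attention step (assumed unshared)."""
    d = Q.shape[0]
    s, _ = attend(Q, np.zeros(d), *step1)   # 1) summarise the question, no guidance
    v_hat, a_v = attend(V, s, *step2)       # 2) attend to the image, guided by the question summary
    q_hat, a_q = attend(Q, v_hat, *step3)   # 3) attend to the question, guided by the attended image
    return v_hat, q_hat, a_v, a_q
```

Compared with the parallel variant, no question-image affinity matrix is formed; the two modalities interact only through the single guidance vector passed from one step to the next.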

0 0