Deep Visual-Semantic Alignments for Generating Image Descriptions (Translation)

Abstract

We present a model that generates natural language descriptions of images and their regions. Our approach leverages datasets of images and their sentence descriptions to learn about the inter-modal correspondences between language and visual data. Our alignment model is based on a novel combination of Convolutional Neural Networks over image regions, bidirectional Recurrent Neural Networks over sentences, and a structured objective that aligns the two modalities through a multimodal embedding. We then describe a Multimodal Recurrent Neural Network architecture that uses the inferred alignments to learn to generate novel descriptions of image regions. We demonstrate that our alignment model produces state-of-the-art results in retrieval experiments on the Flickr8K, Flickr30K and MSCOCO datasets. We then show that the generated descriptions significantly outperform retrieval baselines on both full images and on a new dataset of region-level annotations.
1. Introduction

A quick glance at an image is sufficient for a human to point out and describe an immense amount of detail about the visual scene [14]. However, this remarkable ability has proven to be an elusive task for our visual recognition models. The majority of previous work in visual recognition has focused on labeling images with a fixed set of visual categories, and great progress has been achieved in these endeavors [45, 11]. However, while closed vocabularies of visual concepts constitute a convenient modeling assumption, they are vastly restrictive when compared to the enormous amount of rich descriptions that a human can compose.
Some pioneering approaches that address the challenge of generating image descriptions have been developed [29, 13]. However, these models often rely on hard-coded visual concepts and sentence templates, which imposes limits on their variety. Moreover, the focus of these works has been on reducing complex visual scenes into a single sentence, which we consider to be an unnecessary restriction.

[Figure 1]

Figure 1. Motivation/Concept Figure: Our model treats language as a rich label space and generates descriptions of image regions.

In this work, we strive to take a step towards the goal of generating dense descriptions of images (Figure 1). The primary challenge towards this goal is in the design of a model that is rich enough to simultaneously reason about the contents of images and their representation in the domain of natural language. Additionally, the model should be free of assumptions about specific hard-coded templates, rules or categories and instead rely on learning from the training data. The second, practical challenge is that datasets of image captions are available in large quantities on the internet [21, 58, 37], but these descriptions multiplex mentions of several entities whose locations in the images are unknown.
Our core insight is that we can leverage these large image-sentence datasets by treating the sentences as weak labels, in which contiguous segments of words correspond to some particular, but unknown, location in the image. Our approach is to infer these alignments and use them to learn a generative model of descriptions. Concretely, our contributions are twofold:
1) We develop a deep neural network model that infers the latent alignment between segments of sentences and the region of the image that they describe. Our model associates the two modalities through a common, multimodal embedding space and a structured objective. We validate the effectiveness of this approach on image-sentence retrieval experiments in which we surpass the state of the art.

2) We introduce a multimodal Recurrent Neural Network architecture that takes an input image and generates its description in text. Our experiments show that the generated sentences significantly outperform retrieval-based baselines and produce sensible qualitative predictions. We then train the model on the inferred correspondences and evaluate its performance on a new dataset of region-level annotations.

We make code, data and annotations publicly available.
2. Related Work

Dense image annotations. Our work shares the high-level goal of densely annotating the contents of images with many works before us. Barnard et al. [2] and Socher et al. [48] studied the multimodal correspondence between words and images to annotate segments of images. Several works [34, 18, 15, 33] studied the problem of holistic scene understanding, in which the scene type, objects and their spatial support in the image are inferred. However, the focus of these works is on correctly labeling scenes, objects and regions with a fixed set of categories, while our focus is on richer and higher-level descriptions of regions.
Generating descriptions. The task of describing images with sentences has also been explored. A number of approaches pose the task as a retrieval problem, where the most compatible annotation in the training set is transferred to a test image [21, 49, 13, 43, 23], or where training annotations are broken up and stitched together [30, 35, 31]. Several approaches generate image captions based on fixed templates that are filled in based on the content of the image [19, 29, 13, 55, 56, 9, 1] or on generative grammars [42, 57], but this approach limits the variety of possible outputs. Most closely related to us, Kiros et al. [26] developed a log-bilinear model that can generate full sentence descriptions for images, but their model uses a fixed window context, while our Recurrent Neural Network (RNN) model conditions the probability distribution over the next word in a sentence on all previously generated words. Multiple closely related preprints appeared on Arxiv during the submission of this work, some of which also use RNNs to generate image descriptions [38, 54, 8, 25, 12, 5]. Our RNN is simpler than most of these approaches but also suffers in performance. We quantify this comparison in our experiments.
Grounding natural language in images. A number of approaches have been developed for grounding text in the visual domain [27, 39, 60, 36]. Our approach is inspired by Frome et al. [16], who associate words and images through a semantic embedding. More closely related is the work of Karpathy et al. [24], who decompose images and sentences into fragments and infer their inter-modal alignment using a ranking objective. In contrast to their model, which is based on grounding dependency tree relations, our model aligns contiguous segments of sentences, which are more meaningful, interpretable, and not fixed in length.

Neural networks in visual and language domains. Multiple approaches have been developed for representing images and words in higher-level representations. On the image side, Convolutional Neural Networks (CNNs) [32, 28] have recently emerged as a powerful class of models for image classification and object detection [45]. On the sentence side, our work takes advantage of pretrained word vectors [41, 22, 3] to obtain low-dimensional representations of words. Finally, Recurrent Neural Networks have previously been used in language modeling [40, 50], but we additionally condition these models on images.
3. Our Model

Overview. The ultimate goal of our model is to generate descriptions of image regions. During training, the input to our model is a set of images and their corresponding sentence descriptions (Figure 2). We first present a model that aligns sentence snippets to the visual regions that they describe through a multimodal embedding. We then treat these correspondences as training data for a second, multimodal Recurrent Neural Network model that learns to generate the snippets.

3.1. Learning to align visual and language data

Our alignment model assumes an input dataset of images and their sentence descriptions. Our key insight is that sentences written by people make frequent references to some particular, but unknown, location in the image. For example, in Figure 2, the words "Tabby cat is leaning" refer to the cat, the words "wooden table" refer to the table, etc. We would like to infer these latent correspondences, with the eventual goal of later learning to generate these snippets from image regions. We build on the approach of Karpathy et al. [24], who learn to ground dependency tree relations to image regions with a ranking objective. Our contribution is in the use of a bidirectional recurrent neural network to compute word representations in the sentence, dispensing with the need to compute dependency trees and allowing unbounded interactions of words and their context in the sentence. We also substantially simplify their objective and show that both modifications improve ranking performance.
We first describe neural networks that map words and image regions into a common, multimodal embedding. Then we introduce our novel objective, which learns the embedding representations so that semantically similar concepts across the two modalities occupy nearby regions of the space.

3.1.1 Representing images

Following prior work [29, 24], we observe that sentence descriptions make frequent references to objects and their attributes. Thus, we follow the method of Girshick et al. [17] to detect objects in every image with a Region Convolutional Neural Network (RCNN). The CNN is pretrained on ImageNet [6] and finetuned on the 200 classes of the ImageNet Detection Challenge [45]. Following Karpathy et al. [24], we use the top 19 detected locations in addition to the whole image and compute the representations based on the pixels I_b inside each bounding box as follows:

v = W_m [CNN_{θ_c}(I_b)] + b_m

where CNN(I_b) transforms the pixels inside bounding box I_b into the 4096-dimensional activations of the fully connected layer immediately before the classifier. The CNN parameters θ_c contain approximately 60 million parameters. The matrix W_m has dimensions h × 4096, where h is the size of the multimodal embedding space (h ranges from 1000 to 1600 in our experiments). Every image is thus represented as a set of h-dimensional vectors {v_i | i = 1 ... 20}.
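To make the projection concrete, here is a minimal numpy sketch of this region-embedding step. It is not the authors' released code: the parameter values and the RCNN activations are random stand-ins, and the names (embed_regions, W_m, b_m) simply follow the notation above.

```python
import numpy as np

# Dimensions from the text: 4096-d CNN activations, an h-dimensional
# multimodal embedding (h between 1000 and 1600), and 20 regions per
# image (the top 19 RCNN detections plus the whole image).
D_CNN, H, N_REGIONS = 4096, 1000, 20

rng = np.random.default_rng(0)
W_m = rng.normal(scale=0.01, size=(H, D_CNN))          # stand-in learned projection
b_m = np.zeros(H)                                      # stand-in learned bias
cnn_activations = rng.normal(size=(N_REGIONS, D_CNN))  # stand-in CNN(I_b) per region

def embed_regions(cnn_feats, W_m, b_m):
    """v = W_m [CNN_{theta_c}(I_b)] + b_m, applied to every region."""
    return cnn_feats @ W_m.T + b_m

V = embed_regions(cnn_activations, W_m, b_m)           # shape (20, H)
```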
3.1.2 Representing sentences

To establish the inter-modal relationships, we would like to represent the words in the sentence in the same h-dimensional embedding space that the image regions occupy. The simplest approach might be to project every individual word directly into this embedding. However, this approach does not consider any ordering and word context information in the sentence. An extension to this idea is to use word bigrams, or dependency tree relations as previously proposed [24]. However, this still imposes an arbitrary maximum size of the context window and requires the use of Dependency Tree Parsers that might be trained on unrelated text corpora. To address these concerns, we propose to use a Bidirectional Recurrent Neural Network (BRNN) [46] to compute the word representations. The BRNN takes a sequence of N words (encoded in a 1-of-k representation) and transforms each one into an h-dimensional vector. However, the representation of each word is enriched by a variably-sized context around that word. Using the index t = 1 ... N to denote the position of a word in a sentence, the precise form of the BRNN is as follows:

x_t = W_w 𝟙_t
e_t = f(W_e x_t + b_e)
h_t^f = f(e_t + W_f h_{t-1}^f + b_f)
h_t^b = f(e_t + W_b h_{t+1}^b + b_b)
s_t = f(W_d (h_t^f + h_t^b) + b_d)

Here, 𝟙_t is an indicator column vector that has a single one at the index of the t-th word in a word vocabulary. The weights W_w specify a word embedding matrix that we initialize with 300-dimensional word2vec [41] weights and keep fixed due to overfitting concerns. However, in practice we find little change in final performance when these vectors are trained, even from random initialization. Note that the BRNN consists of two independent streams of processing, one moving left to right (h_t^f) and the other right to left (h_t^b) (see Figure 3 for a diagram). The final h-dimensional representation s_t for the t-th word is a function of both the word at that location and also its surrounding context in the sentence. Technically, every s_t is a function of all words in the entire sentence, but our empirical finding is that the final word representations (s_t) align most strongly to the visual concept of the word at that location (𝟙_t). We learn the parameters W_e, W_f, W_b, W_d and the respective biases b_e, b_f, b_b, b_d. A typical size of the hidden representation in our experiments ranges between 300 and 600 dimensions. We set the activation function f to the rectified linear unit (ReLU), which computes f : x → max(0, x).
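The recurrences above translate almost line for line into code. The following is a small numpy sketch under the same notation (W_w, W_e, W_f, W_b, W_d and their biases); the parameter shapes and the toy vocabulary at the bottom are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def brnn_sentence(word_indices, W_w, W_e, b_e, W_f, b_f, W_b, b_b, W_d, b_d):
    """Compute BRNN word representations s_t for one sentence.

    word_indices: the 1-of-k encodings given as integer indices.
    W_w is stored as (vocab_size, 300), so x_t = W_w 1_t is a row lookup.
    """
    N, h = len(word_indices), b_d.shape[0]
    x = W_w[word_indices]                    # x_t, shape (N, 300)
    e = relu(x @ W_e.T + b_e)                # e_t, shape (N, h)

    h_f = np.zeros((N, h))                   # forward stream (left to right)
    h_b = np.zeros((N, h))                   # backward stream (right to left)
    for t in range(N):
        prev = h_f[t - 1] if t > 0 else np.zeros(h)
        h_f[t] = relu(e[t] + prev @ W_f.T + b_f)
    for t in reversed(range(N)):
        nxt = h_b[t + 1] if t < N - 1 else np.zeros(h)
        h_b[t] = relu(e[t] + nxt @ W_b.T + b_b)

    # s_t combines both directions, so every word sees its full sentence context.
    return relu((h_f + h_b) @ W_d.T + b_d)   # shape (N, h)

# Toy usage: vocabulary of 10 words, hidden size h = 4 (the paper uses 300-600).
rng = np.random.default_rng(0)
V_SIZE, D_W, H = 10, 300, 4
params = dict(
    W_w=rng.normal(size=(V_SIZE, D_W)),
    W_e=rng.normal(size=(H, D_W)), b_e=np.zeros(H),
    W_f=rng.normal(size=(H, H)), b_f=np.zeros(H),
    W_b=rng.normal(size=(H, H)), b_b=np.zeros(H),
    W_d=rng.normal(size=(H, H)), b_d=np.zeros(H),
)
s = brnn_sentence([1, 4, 7], **params)       # (3, H) word representations
```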
3.1.3 Alignment objective

We have described the transformations that map every image and sentence into a set of vectors in a common h-dimensional space. Since the supervision is at the level of entire images and sentences, our strategy is to formulate an image-sentence score as a function of the individual region-word scores. Intuitively, a sentence-image pair should have a high matching score if its words have confident support in the image. The model of Karpathy et al. [24] interprets the dot product v_i^T s_t between the i-th region and the t-th word as a measure of similarity and uses it to define the score between image k and sentence l as:

S_{kl} = Σ_{t ∈ g_l} Σ_{i ∈ g_k} max(0, v_i^T s_t)

Here, g_k is the set of image fragments in image k and g_l is the set of sentence fragments in sentence l. The indices k, l range over the images and sentences in the training set. Together with their additional Multiple Instance Learning objective, this score carries the interpretation that a sentence fragment aligns to a subset of the image regions whenever the dot product is positive. We found that the following reformulation simplifies the model and alleviates the need for additional objectives and their hyperparameters:

S_{kl} = Σ_{t ∈ g_l} max_{i ∈ g_k} v_i^T s_t
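As a sanity check on the two formulations, here is a small numpy sketch that computes both the original double-sum score of [24] and the simplified max-over-regions score above, given region embeddings v_i and BRNN word embeddings s_t. The random inputs are placeholders, not learned values.

```python
import numpy as np

def score_original(V, S):
    """S_kl of Karpathy et al. [24]: sum over all region-word pairs of
    max(0, v_i^T s_t). V: (num_regions, h), S: (num_words, h)."""
    dots = S @ V.T                       # (num_words, num_regions) dot products
    return np.maximum(0.0, dots).sum()

def score_simplified(V, S):
    """Reformulated score: each word aligns to its single best region,
    S_kl = sum_t max_i v_i^T s_t."""
    dots = S @ V.T
    return dots.max(axis=1).sum()

# Placeholder embeddings: 20 regions and 5 words in an 8-d embedding space.
rng = np.random.default_rng(0)
V = rng.normal(size=(20, 8))
S = rng.normal(size=(5, 8))
print(score_original(V, S), score_simplified(V, S))
```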
