DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition


Abstract


We evaluate whether features extracted from the activation of a deep convolutional network trained in a fully supervised fashion on a large, fixed set of object recognition tasks can be repurposed to novel generic tasks.

Our generic tasks may differ significantly from the originally trained tasks and there may be insufficient labeled or unlabeled data to conventionally train or adapt a deep architecture to the new tasks.

We investigate and visualize the semantic clustering of deep convolutional features with respect to a variety of such tasks, including scene recognition, domain adaptation, and fine-grained recognition challenges.

We compare the efficacy of relying on various network levels to define a fixed feature, and report novel results that significantly outperform the state-of-the-art on several important vision challenges.

We are releasing DeCAF, an open-source implementation of these deep convolutional activation features, along with all associated network parameters, to enable vision researchers to conduct experiments with deep representations across a range of visual concept learning paradigms.


1. Introduction  

Discovery of effective representations that capture salient semantics for a given task is a key goal of perceptual learning.

Performance with conventional visual representations, based on flat feature representations involving quantized gradient filters, has been impressive but has likely plateaued in recent years.

It has long been argued that deep or layered compositional architectures should be able to capture salient aspects of a given domain through discovery of salient clusters, parts, mid-level features, and/or hidden units (Hinton & Salakhutdinov, 2006; Fidler & Leonardis, 2007; Zhu et al., 2007; Singh et al., 2012; Krizhevsky et al., 2012).

Such models have been able to perform better than traditional hand-engineered representations in many domains, especially those where good features have not already been engineered (Le et al., 2011).

Recent results have shown that moderately deep unsupervised models outperform the state-of-the-art gradient histogram features in part-based detection models (Ren & Ramanan, 2013).


 

Deep models have recently been applied to large-scale visual recognition tasks, trained via back-propagation through layers of convolutional filters (LeCun et al., 1989).

These models perform extremely well in domains with large amounts of training data, and had early success in digit classification tasks (LeCun et al., 1998).

With the advent of large scale sources of category-level training data, e.g., (Deng et al., 2009), and efficient implementation with on-line approximate model averaging (dropout) (Krizhevsky et al., 2012), they have recently outperformed all known methods on a large scale recognition challenge (Berg et al., 2012).

With limited training data, however, fully-supervised deep architectures with the representational capacity of (Krizhevsky et al., 2012) will generally dramatically overfit the training data.

In fact, many conventional visual recognition challenges have tasks with few training examples; e.g., when a user is defining a category on-the-fly using specific examples, or for fine-grained recognition challenges (Welinder et al., 2010), attributes (Bourdev et al., 2011), and/or domain adaptation (Saenko et al., 2010).

In this paper we investigate semi-supervised multi-task learning of deep convolutional representations, where representations are learned on a set of related problems but applied to new tasks which have too few training examples to learn a full deep representation.

Our model can either be considered as a deep architecture for transfer learning based on a supervised pre-training phase, or simply as a new visual feature, DeCAF, defined by the convolutional network weights learned on a set of pre-defined object recognition tasks. Our work is also related to representation learning schemes in computer vision which form an intermediate representation based on learning classifiers on related tasks (Li et al., 2010; Torresani et al., 2010; Quattoni et al., 2008).
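This fixed-feature view is simple to state in code. The following is a minimal sketch, not the authors' released DeCAF implementation: a modern pretrained AlexNet from torchvision (assuming torchvision >= 0.13) stands in for the Krizhevsky-style network, its fc7 activations are used as a frozen feature, and a simple classifier is fit on a hypothetical small target task.

```python
# Sketch: reuse ImageNet-trained convolutional weights as a fixed feature
# extractor for a small target task. torchvision's AlexNet stands in for
# the Krizhevsky-style network of the paper; train_images / train_labels
# are hypothetical placeholders for the target task's data.
import torch
import torchvision.models as models
import torchvision.transforms as T
from sklearn.linear_model import LogisticRegression

model = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
model.eval()  # disable dropout so the extracted features are deterministic

# Truncate the classifier after the second fully connected layer (fc7)
# and its ReLU, roughly analogous to the paper's DeCAF_7 feature.
feature_net = torch.nn.Sequential(
    model.features,
    model.avgpool,
    torch.nn.Flatten(),
    *list(model.classifier.children())[:6],
).eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def extract(images):
    """Map a list of PIL images to fixed 4096-d activation features."""
    batch = torch.stack([preprocess(im) for im in images])
    with torch.no_grad():
        return feature_net(batch).numpy()

# With only a handful of labeled target examples, a linear model fit on
# the frozen features takes the place of training a deep network:
# X = extract(train_images)
# clf = LogisticRegression(max_iter=1000).fit(X, train_labels)
```

Only the linear classifier's parameters are estimated on the target task, which is precisely the regime motivated above: too few examples to learn a full deep representation, but enough to fit a shallow model on top of one.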


 

Our main result is the empirical validation that a generic visual feature based on convolutional network weights trained on ImageNet outperforms a host of conventional visual representations on standard benchmark object recognition tasks, including Caltech-101 (Fei-Fei et al., 2004), the Office domain adaptation dataset (Saenko et al., 2010), the Caltech-UCSD Birds fine-grained recognition dataset (Welinder et al., 2010), and the SUN-397 scene recognition database (Xiao et al., 2010).

 

Further, we analyze the semantic salience of deep convolutional representations, comparing visual features defined from such networks to conventional representations.

In Section 3, we visualize the semantic clustering properties of deep convolutional features compared to baseline representations, and find that convolutional features appear to cluster semantic topics more readily than conventional features.

Finally, while conventional deep learning can be computationally expensive, we note that the run-time and resource consumption of deep-learned convolutional features are not exceptional, compared with features such as HOG (Dalal & Triggs, 2005) or KDES (Bo et al., 2010).


 

2. Related work

Deep convolutional networks have a long history in computer vision, with early examples showing successful results on using supervised back-propagation networks to perform digit recognition (LeCun et al., 1989).

More recently, these networks, in particular the convolutional network proposed by Krizhevsky et al. (2012), have achieved competition-winning numbers on large benchmark datasets consisting of more than one million images, such as ImageNet (Berg et al., 2012). Learning from related tasks also has a long history in machine learning beginning with Caruana (1997) and Thrun (1996).

Later works such as Argyriou et al. (2006) developed efficient frameworks for optimizing representations from related tasks, and Ando & Zhang (2005) explored how to transfer parameter manifolds to new tasks.

In computer vision, forming a representation based on sets of trained classifiers on related tasks has recently been shown to be effective in a variety of retrieval and classification settings, specifically using classifiers based on visual category detectors (Torresani et al., 2010; Li et al., 2010).

A key question for such learning problems is to find a feature representation that captures the object category related information while discarding noise irrelevant to object category information such as illumination.

Transfer learning across tasks using deep representations has been extensively studied, especially in an unsupervised setting (Raina et al., 2007; Mesnil et al., 2012).

However, reported successes with such models in convolutional networks have been limited to relatively small datasets such as CIFAR and MNIST, and efforts on larger datasets have had only modest success (Le et al., 2012).

We investigate the supervised pre-training approach proven successful in computer vision and multimedia settings using a concept-bank paradigm (Kennedy & Hauptmann, 2006; Li et al., 2010; Torresani et al., 2010) by learning the features on large-scale data in a supervised setting, then transferring them to different tasks with different labels.

To evaluate the generality of a representation formed from a deep convolutional feature trained on generic recognition tasks, we consider training and testing on datasets known to have a degree of dataset bias with respect to ImageNet.

We evaluate on the SUN-397 scene dataset, as well as datasets used to evaluate domain adaptation performance directly (Chopra et al., 2013; Kulis et al., 2011).

This evaluates whether the learned features could undo the domain bias by capturing the real semantic information instead of overfitting to domain-specific appearances.


 

3. Deep Convolutional Activation Features

In our approach, a deep convolutional model is first trained in a fully supervised setting using the state-of-the-art method of Krizhevsky et al. (2012).

We then extract various features from this network, and evaluate the efficacy of these features on generic vision tasks.
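Concretely, extracting various features from this network means reading out activations at several depths. Below is a hedged sketch using forward hooks, again with torchvision's AlexNet as a stand-in; the mapping of layers onto the DeCAF5/DeCAF6/DeCAF7 naming is our approximation, not the original implementation.

```python
# Sketch: capture activations at several depths of a pretrained network
# with forward hooks. The layer choices are an approximate mapping onto
# torchvision's AlexNet (final max-pool ~ DeCAF_5, fc6 ~ DeCAF_6,
# fc7 ~ DeCAF_7), not the paper's exact network.
import torch
import torchvision.models as models

model = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1).eval()
activations = {}

def save_to(name):
    def hook(module, inputs, output):
        activations[name] = output.detach().flatten(1)
    return hook

model.features[12].register_forward_hook(save_to("decaf5"))   # last max-pool
model.classifier[1].register_forward_hook(save_to("decaf6"))  # fc6 (pre-ReLU)
model.classifier[4].register_forward_hook(save_to("decaf7"))  # fc7 (pre-ReLU)

with torch.no_grad():
    _ = model(torch.randn(1, 3, 224, 224))  # stand-in for a real image batch

for name, feat in activations.items():
    print(name, tuple(feat.shape))  # decaf5 is 9216-d; decaf6/7 are 4096-d
```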

While the forward pass computed by the architecture discussed in this section does achieve state-of-the-art performance on ILSVRC-2012, at least two important questions remain:

Do features extracted from the CNN generalize to other datasets?

How does performance vary with network depth?

We address these questions both qualitatively and quantitatively, via visualizations of semantic clusters below, and experimental comparison to current baselines in the following section.
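The clustering visualizations referred to here amount to projecting the high-dimensional activations to two dimensions and coloring points by semantic label. A minimal sketch with scikit-learn's t-SNE follows; the choice of projection is illustrative, and `features` (an N x D array) and `labels` (N,) are assumed to come from an extractor like the ones sketched above.

```python
# Sketch: embed activation features in 2-D to inspect whether images of
# the same semantic class fall into the same cluster. t-SNE is used here
# for illustration; features and labels are assumed to come from a
# feature extractor such as the ones above.
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_semantic_clusters(features, labels):
    coords = TSNE(n_components=2, init="pca", random_state=0).fit_transform(features)
    plt.scatter(coords[:, 0], coords[:, 1], c=labels, s=5, cmap="tab10")
    plt.title("2-D embedding of activation features, colored by class")
    plt.show()
```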


 
