Object Detection: Reading Notes on the R-CNN, Fast R-CNN, and Faster R-CNN Papers


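1. Introduction

In 2014 Ross Girshick (rbg) proposed the R-CNN architecture, a milestone for the object detection field, where performance had not improved substantially for several years. These are my study notes from reading the R-CNN series of papers.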

2. R-CNN

2.1 Introduction to R-CNN

R-CNN combines two key insights:
(1) one can apply high-capacity convolutional neural networks (CNNs) to bottom-up region proposals in order to localize and segment objects and (2) when labeled training data is scarce, supervised pre-training for an auxiliary task, followed by domain-specific fine-tuning, yields a significant performance boost.

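The overall architecture of the R-CNN system:
[Figure: R-CNN system overview; image not preserved]
The detailed R-CNN network architecture: [figure not preserved]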

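Basic pipeline:
(1) Input image (of arbitrary size).

(2) Extract region proposals (also called RoIs, regions of interest), roughly 2000 per image. The region proposal algorithm is Selective Search.

(3) Warped image regions: each region proposal is rescaled to 227×227 and fed into a CNN (the 227×227 size matches the AlexNet used in the original paper; deeper networks such as VGG16 or VGG19 can also serve as the backbone). The CNN extracts a fixed-length feature vector from each region proposal. It is usually pre-trained on the large ImageNet dataset and fine-tuned afterwards.
Note: the regions are resized to a fixed size because fc6 and fc7 are fully connected layers and need a fixed-size input.

(4) The outputs of pool5, fc6, and fc7 can each serve as the feature (the paper analyzes in detail which layer to use). The features are finally fed to SVMs for classification.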
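
Step (3) can be sketched in a few lines of NumPy. This is only a minimal illustration, assuming boxes given as integer pixel coordinates (x1, y1, x2, y2) and a nearest-neighbour resize; the helper name `warp_region` is made up for these notes, and the context padding that the paper adds around each box is left out.

```python
import numpy as np

def warp_region(image, box, out_size=227):
    """Crop box = (x1, y1, x2, y2) from an H x W x C image and warp it
    to out_size x out_size with nearest-neighbour sampling."""
    x1, y1, x2, y2 = box
    crop = image[y1:y2, x1:x2]
    h, w = crop.shape[:2]
    # For every output row/column, pick the source row/column it maps back to.
    rows = np.arange(out_size) * h // out_size
    cols = np.arange(out_size) * w // out_size
    return crop[rows[:, None], cols[None, :]]

rng = np.random.default_rng(0)
image = rng.random((375, 500, 3))                        # toy image, arbitrary size
proposals = [(10, 20, 210, 120), (300, 50, 480, 370)]    # hypothetical proposals
batch = np.stack([warp_region(image, b) for b in proposals])
print(batch.shape)  # (2, 227, 227, 3)
```

Whatever the aspect ratio of the proposal, every warped region has the same shape, which is what lets the fully connected layers accept it.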

2.2 Strengths and weaknesses of R-CNN

2.2.1 Strengths

(1) On the 200-class ILSVRC2013 detection dataset, R-CNN outperforms OverFeat (31.4% vs. 24.3% mAP).
(2) Our system is also quite efficient. The only class-specific computations are a reasonably small matrix-vector product and greedy non-maximum suppression. This computational property follows from features that are shared across all categories and that are also two orders of magnitude lower-dimensional than previously used region features.
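
The two class-specific computations mentioned in (2) can be sketched as follows. This is a toy version: the feature and weight matrices are random stand-ins with tiny dimensions, the function names are made up for these notes, and the IoU threshold of 0.3 is just an illustrative value.

```python
import numpy as np

def iou_one_to_many(box, boxes):
    """IoU between one box and an (M, 4) array of boxes, all as (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def greedy_nms(boxes, scores, iou_threshold=0.3):
    """Keep the best-scoring box, drop boxes that overlap it too much, repeat."""
    order = np.argsort(-scores)
    keep = []
    while order.size > 0:
        best, rest = order[0], order[1:]
        keep.append(int(best))
        order = rest[iou_one_to_many(boxes[best], boxes[rest]) <= iou_threshold]
    return keep

# Class scoring is one matrix product: (proposals x feature_dim) @ (feature_dim x classes).
rng = np.random.default_rng(0)
features = rng.standard_normal((3, 8))      # stand-in for the CNN features of 3 proposals
svm_weights = rng.standard_normal((8, 2))   # stand-in for per-class linear SVM weights
class_scores = features @ svm_weights       # shape (3, 2)

# NMS is then run per class; here with hand-picked scores for one class.
boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
kept = greedy_nms(boxes, scores)
print(kept)  # [0, 2]
```

The second box overlaps the first with IoU of about 0.68, so it is suppressed; the third box does not overlap and survives.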

2.2.2 Weaknesses

(1) Training is complicated: it is a multi-stage pipeline.
R-CNN first finetunes a ConvNet on object proposals using log loss. Then, it fits SVMs to ConvNet features. These SVMs act as object detectors, replacing the softmax classifier learnt by fine-tuning. In the third training stage, bounding-box regressors are learned.
I. Supervised pre-training: the CNN is first pre-trained on a large dataset (e.g. ILSVRC2012).
II. Domain-specific fine-tuning: the CNN is adapted to the new task (detection) and the new domain (warped proposal windows) by replacing the CNN's ImageNet-specific 1000-way classification layer with an (N+1)-way classification layer, where N is the number of object classes and the extra 1 is for background.
III. Training the SVM object classifiers.
IV. Training the bounding-box regressors.
(2) Expensive in space and time:
Training is expensive in space and time. For SVM and bounding-box regressor training, features are extracted from each object proposal in each image and written to disk. With very deep networks, such as VGG16, this process takes 2.5 GPU-days for the 5k images of the VOC07 trainval set. These features require hundreds of gigabytes of storage.
(3) Slow:
Training is very slow (84 h), and so is inference. At test-time, features are extracted from each object proposal in each test image. Detection with VGG16 takes 47s / image (on a GPU).

3. Fast R-CNN

To address the R-CNN weaknesses described in 2.2.2, Girshick single-handedly developed Fast R-CNN in 2015, building on R-CNN and SPPnet. Compared with R-CNN, it improves not only training and test speed but also detection accuracy. Fast R-CNN trains the very deep VGG16 network 9× faster than R-CNN, is 213× faster at test-time, and achieves a higher mAP on PASCAL VOC 2012. Compared to SPPnet, Fast R-CNN trains VGG16 3× faster, tests 10× faster, and is more accurate.

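3.1 Introduction to Fast R-CNN

1) Network architecture at test time:
[Figure: Fast R-CNN test-time architecture; image not preserved]

Test-time workflow:

Given an input image, a region proposal method (e.g. Selective Search or EdgeBoxes) first produces the regions of interest (RoIs). The image and its set of RoIs are then fed into the ConvNet. The feature map output by the last conv layer goes into the RoI pooling layer, which is a simplified version of SPPnet: SPPnet pools over a pyramid with several levels, whereas the RoI pooling layer is a pyramid with a single level. For each RoI, the RoI pooling layer outputs a fixed-size H×W feature (H and W are hyperparameters; Fast R-CNN sets both to 7 for VGG16). This fixed-length feature then feeds two sibling branches. The softmax branch outputs a discrete probability distribution over the classes; it has one more output than the number of object classes, classnum, because of the background class. The bounding-box regression branch, a fully connected layer, outputs offsets t^k = (t^k_x, t^k_y, t^k_w, t^k_h) for every class k.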
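
For reference, the offsets t^k use the parameterization from the R-CNN paper: a scale-invariant translation of the box center plus a log-space shift of width and height, both relative to the proposal. A minimal sketch, assuming boxes stored as (center x, center y, width, height); the function names `encode` and `decode` are made up for these notes.

```python
import numpy as np

def encode(proposal, gt):
    """Regression target t that maps a proposal onto a ground-truth box."""
    px, py, pw, ph = proposal
    gx, gy, gw, gh = gt
    return np.array([(gx - px) / pw, (gy - py) / ph,
                     np.log(gw / pw), np.log(gh / ph)])

def decode(proposal, t):
    """Apply predicted offsets t = (tx, ty, tw, th) to a proposal."""
    px, py, pw, ph = proposal
    tx, ty, tw, th = t
    return np.array([px + pw * tx, py + ph * ty,
                     pw * np.exp(tw), ph * np.exp(th)])

proposal = (50.0, 50.0, 20.0, 40.0)   # hypothetical proposal
gt = (55.0, 48.0, 30.0, 36.0)         # hypothetical ground-truth box
t = encode(proposal, gt)
refined = decode(proposal, t)         # recovers gt up to floating-point error
```

At test time only `decode` is needed: the network predicts t for the chosen class and the proposal is shifted and rescaled accordingly.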

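2) Network architecture at training time:
[Figure: Fast R-CNN training-time architecture; image not preserved]
3) Overall architecture of the Fast R-CNN system: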

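R-CNN is slow mainly because it runs the ConvNet on every region proposal separately, without sharing computation. Fast R-CNN borrows the idea of SPPnets (spatial pyramid pooling networks) to speed up R-CNN, which amounts to sharing computation. For SPPnet, see Appendix A at the end of this article.

SPPnet has notable drawbacks of its own. Like R-CNN, its training is multi-stage (extracting features, fine-tuning a network with log loss, training SVMs, and finally fitting bounding-box regressors), and the features must first be written to disk. In addition, when fine-tuning SPPnet, the convolutional layers that precede the spatial pyramid pooling cannot be updated, because each RoI has a very large receptive field, often spanning the entire image.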

Fast R-CNN architecture in detail:
(1) RoI pooling layer
It is in effect a simplified SPPnet with a single pyramid level. SPPnet pools each region proposal at several pyramid scales, whereas the RoI pooling layer in Fast R-CNN uses max pooling with a fixed spatial extent (e.g. H×W = 7×7) to convert the features inside any valid RoI into a small feature map.

In this paper, an RoI is a rectangular window into a conv feature map. Each RoI is defined by a four-tuple (r, c, h, w) that specifies its top-left corner (r, c) and its height and width (h, w).
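
RoI max pooling can be sketched in NumPy as below. It is a simplified illustration: the RoI is assumed to be given directly in feature-map cells as (r, c, h, w), and the way bin edges are rounded is my own choice, not taken from the paper or from any particular implementation.

```python
import numpy as np

def roi_max_pool(feature_map, roi, out_h=7, out_w=7):
    """feature_map: (C, H, W) array; roi: (r, c, h, w) in feature-map cells.
    Returns a (C, out_h, out_w) array, whatever the RoI size."""
    r, c, h, w = roi
    window = feature_map[:, r:r + h, c:c + w]
    out = np.empty((feature_map.shape[0], out_h, out_w), dtype=feature_map.dtype)
    # Split the h x w window into an out_h x out_w grid of roughly equal bins.
    row_edges = np.linspace(0, h, out_h + 1)
    col_edges = np.linspace(0, w, out_w + 1)
    for i in range(out_h):
        r0 = int(np.floor(row_edges[i]))
        r1 = max(int(np.ceil(row_edges[i + 1])), r0 + 1)   # at least one row per bin
        for j in range(out_w):
            c0 = int(np.floor(col_edges[j]))
            c1 = max(int(np.ceil(col_edges[j + 1])), c0 + 1)
            out[:, i, j] = window[:, r0:r1, c0:c1].max(axis=(1, 2))
    return out

fmap = np.arange(2 * 14 * 14, dtype=float).reshape(2, 14, 14)
pooled_full = roi_max_pool(fmap, (0, 0, 14, 14))    # 2x2 bins
pooled_small = roi_max_pool(fmap, (2, 3, 5, 9))     # uneven, overlapping bins
print(pooled_full.shape, pooled_small.shape)  # (2, 7, 7) (2, 7, 7)
```

Both RoIs end up as 7×7 maps per channel, so the fully connected layers that follow always see an input of the same size.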

(2) Initializing from pre-trained networks

When a pre-trained network initializes a Fast R-CNN network, it undergoes three transformations.
I. The last max pooling layer is replaced by a RoI pooling layer that is configured by setting H and W to be compatible with the net's first fully connected layer (e.g., H = W = 7 for VGG16).
II. The network's last fully connected layer and softmax (which were trained for 1000-way ImageNet classification) are replaced with the two sibling layers described earlier (a fully connected layer and softmax over K + 1 categories and category-specific bounding-box regressors).
III. The network is modified to take two data inputs: a list of images and a list of RoIs in those images.

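(3) Multi-task loss
L_cls and L_loc are added together into one loss, so that the network is trained end to end.
For bounding-box regression, the loss is defined as:

$$L_{loc}(t^u, v) = \sum_{i \in \{x, y, w, h\}} \mathrm{smooth}_{L_1}(t^u_i - v_i)$$

where

$$\mathrm{smooth}_{L_1}(x) = \begin{cases} 0.5x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases}$$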
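
A small NumPy sketch of this loss. In the Fast R-CNN paper the smooth L1 penalty is quadratic for |x| < 1 and equal to |x| − 0.5 otherwise, and the combined loss is L = L_cls + λ[u ≥ 1]L_loc with L_cls = −log p_u, so background RoIs (u = 0) contribute no localization term. The function names and the toy numbers below are made up for these notes.

```python
import numpy as np

def smooth_l1(x):
    """Quadratic near zero, linear for |x| >= 1."""
    x = np.asarray(x, dtype=float)
    absx = np.abs(x)
    return np.where(absx < 1.0, 0.5 * x ** 2, absx - 0.5)

def loc_loss(t_u, v):
    """L_loc: smooth L1 summed over the four box coordinates."""
    return float(smooth_l1(np.asarray(t_u) - np.asarray(v)).sum())

def multi_task_loss(p, u, t_u, v, lam=1.0):
    """L = L_cls + lam * [u >= 1] * L_loc, with L_cls = -log p[u]."""
    l_cls = -np.log(p[u])
    l_loc = loc_loss(t_u, v) if u >= 1 else 0.0
    return float(l_cls + lam * l_loc)

p = np.array([0.5, 0.25, 0.25])          # softmax output: background + 2 classes
t_u = np.array([0.5, 0.0, 0.0, 2.0])     # predicted offsets for the true class
v = np.zeros(4)                          # regression targets
loss_fg = multi_task_loss(p, 1, t_u, v)  # -log(0.25) + (0.125 + 1.5)
loss_bg = multi_task_loss(p, 0, t_u, v)  # -log(0.5), no localization term
```

Because the penalty grows only linearly for large errors, a single badly localized RoI produces a bounded gradient, unlike a plain squared loss.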
(4) Speed-up trick
Splitting one fully connected layer into two not only compresses the network but also speeds up detection when the number of RoIs is large.
Truncated SVD for faster detection
For whole-image classification, the time spent computing the fully connected layers is small compared to the conv layers. On the contrary, for detection the number of RoIs to process is large and nearly half of the forward pass time is spent computing the fully connected layers. Large fully connected layers are easily accelerated by compressing them with truncated SVD.

The weight matrix W can be approximately factorized as follows, keeping only the t largest singular values:

$$W \approx U \Sigma_t V^T$$
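
The factorization can be tried out with `numpy.linalg.svd`. Following the Fast R-CNN paper, the single layer with weights W (u × v) becomes two layers with no non-linearity in between: the first uses Σ_t V^T, the second uses U, and the parameter count drops from uv to t(u + v). The sizes below are toy values and `compress_fc` is a name made up for these notes; biases are omitted.

```python
import numpy as np

def compress_fc(W, t):
    """Split a (u, v) weight matrix into two smaller layers via truncated SVD."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    first = s[:t, None] * Vt[:t]    # Sigma_t V^T, shape (t, v)
    second = U[:, :t]               # U truncated to t columns, shape (u, t)
    return first, second

rng = np.random.default_rng(0)
u, v, t = 64, 128, 16               # toy sizes, far smaller than a real fc layer
W = rng.standard_normal((u, v))
x = rng.standard_normal(v)

first, second = compress_fc(W, t)
y_full = W @ x                      # original layer
y_fast = second @ (first @ x)       # two thin layers, an approximation of y_full

params_before = u * v               # 8192
params_after = t * (u + v)          # 3072
```

A random matrix like this one compresses poorly, so `y_fast` is only a rough approximation here; the point of the sketch is the shapes and the parameter count, not the accuracy.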

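3.2 Comparison with R-CNN

R-CNN trains in three separate stages, whereas Fast R-CNN replaces the SVM classifiers with softmax directly (experiments show softmax performs slightly better than the SVMs) and folds bounding-box regression into the network through the multi-task loss. In addition, Fast R-CNN also fine-tunes some of the convolutional layers during fine-tuning, which yields better performance.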

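3.3 Strengths and weaknesses of Fast R-CNN

3.3.1 Strengths

1) Higher detection quality (mAP) than R-CNN and SPPnet
2) Training is single-stage, using a multi-task loss
3) Training can update all network layers
4) No disk storage is required for feature caching

3.3.2 Weaknesses

A bottleneck remains: Selective Search has to find all the candidate boxes (about 2000 per image), which is very wasteful. To address this weakness, Girshick and several collaborators went on to develop Faster R-CNN (v3 of the paper is from 2016).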

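4. Faster R-CNN

Section 3.3.2 noted the weakness of Fast R-CNN: the region proposal algorithm (SS, Selective Search) is the bottleneck that keeps Fast R-CNN slow. Girshick and a group of collaborators released Faster R-CNN v1 in 2015, followed by v3 in 2016. The paper (v3 is the version read here) introduces a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, which makes region proposals nearly cost-free. The RPN and Fast R-CNN are merged into a single network that shares convolutional features, with the RPN acting like an attention mechanism that tells the network where to look. Experiments use about 300 region proposals per image, far fewer than the roughly 2000 used by R-CNN and Fast R-CNN.

The RPN is built by adding a few extra convolutional layers on top of the convolutional feature maps, so the RPN is a fully convolutional network (FCN). The RPN shares convolutional features with Fast R-CNN: the same features used by region-based detectors can also be used for generating region proposals.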

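4.1 Faster R-CNN network architecture (v3)

[Figure: Faster R-CNN network architecture; image not preserved]

The RPN is a fully convolutional network that simultaneously predicts object bounds and objectness scores at each position. In other words, the convolutional features can be used not only by the detection network, as in Fast R-CNN, but also to generate region proposals. The RPN is built by adding a few convolutional layers after the CNN that simultaneously regress region bounds and objectness scores.
[Figure: image not preserved]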
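
To make "fully convolutional" concrete, here is a rough NumPy sketch of an RPN head. It follows the layout described in the Faster R-CNN paper (a 3×3 sliding window over the shared feature map, then two sibling 1×1 convolutions giving 2k objectness scores and 4k box coordinates per position, with k = 9 reference boxes), but the weights are random, biases are omitted, the channel counts are toy values, and the name `rpn_head` is made up for these notes.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def rpn_head(feat, w_conv, w_cls, w_reg):
    """feat: (C, H, W) shared conv features.
    w_conv: (D, C, 3, 3), w_cls: (2k, D), w_reg: (4k, D)."""
    # 3x3 convolution with zero padding, so the output keeps the H x W layout.
    padded = np.pad(feat, ((0, 0), (1, 1), (1, 1)))
    windows = sliding_window_view(padded, (3, 3), axis=(1, 2))   # (C, H, W, 3, 3)
    hidden = np.maximum(np.einsum('chwij,dcij->dhw', windows, w_conv), 0.0)  # ReLU
    # Two sibling 1x1 convolutions: objectness scores and box regression.
    scores = np.einsum('dhw,od->ohw', hidden, w_cls)   # (2k, H, W)
    deltas = np.einsum('dhw,od->ohw', hidden, w_reg)   # (4k, H, W)
    return scores, deltas

rng = np.random.default_rng(0)
C, D, k = 8, 16, 9                       # toy channel counts; k boxes per position
w_conv = rng.standard_normal((D, C, 3, 3))
w_cls = rng.standard_normal((2 * k, D))
w_reg = rng.standard_normal((4 * k, D))

scores, deltas = rpn_head(rng.standard_normal((C, 5, 6)), w_conv, w_cls, w_reg)
print(scores.shape, deltas.shape)  # (18, 5, 6) (36, 5, 6)
```

The same weights work for a feature map of any spatial size, since nothing in the head depends on H or W; that is why the RPN can run on the full-image features it shares with the detection network.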