Deep Learning Detection Models: Mask R-CNN

Source: Internet. Editor: 程序博客网. Time: 2024/05/22 01:51

We present a conceptually simple, flexible, and general framework for object instance segmentation. Our approach efficiently detects objects in an image while simultaneously generating a high-quality segmentation mask for each instance.
The method, called Mask R-CNN, extends Faster R-CNN by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition.

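Key ideas

Built on the Faster R-CNN framework: an instance segmentation branch [a small FCN applied to each RoI] is added in parallel with the final classification and box-regression layers
RoI Pooling in Faster R-CNN is replaced with RoIAlign
The final feature layers are extracted with FPN (Feature Pyramid Network)
ResNet-101 serves as the backbone network
RPN anchors span 5 scales and 3 aspect ratios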
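
The 5 scales and 3 aspect ratios give 15 anchor shapes in total. A minimal NumPy sketch of how such a set could be enumerated follows; the concrete scale and ratio values, and the name `make_anchors`, are assumptions for illustration, since the notes only state the counts.

```python
import numpy as np

def make_anchors(scales, ratios):
    """Return (len(scales) * len(ratios), 4) anchors as (x1, y1, x2, y2), centered at the origin."""
    anchors = []
    for s in scales:
        for r in ratios:            # r is height / width
            w = s / np.sqrt(r)      # choose w and h so that the area stays s * s
            h = s * np.sqrt(r)
            anchors.append([-w / 2, -h / 2, w / 2, h / 2])
    return np.array(anchors)

# Assumed values; only the counts (5 scales, 3 ratios) come from the text.
SCALES = (32, 64, 128, 256, 512)
RATIOS = (0.5, 1.0, 2.0)

anchors = make_anchors(SCALES, RATIOS)
print(anchors.shape)
```

Each anchor keeps the area of its scale while the aspect ratio changes, so the ratio only redistributes width and height.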

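Notes:

Instance segmentation
- Mask R-CNN extends the classification and regression heads at the end of Faster R-CNN with a segmentation task applied to every RoI. This task is a small FCN.
RoIAlign
- RoIPool is too coarse: its quantization causes a large misalignment between the feature map and the original image [this is the main problem with Fast/Faster R-CNN]. RoIAlign was proposed to preserve spatial precision [preserves exact spatial locations].
- The change looks minor but has a large effect: it improves mask accuracy by a relative 10% to 50%.
Decoupling
- Mask prediction is decoupled from class prediction
  - The RoI classification branch predicts the class
  - A conventional FCN performs per-pixel multi-class classification [segmentation], which couples segmentation and classification; the mask branch of Mask R-CNN instead predicts an independent binary mask for each class.
    - The mask branch outputs K mask layers, one per class, with a per-pixel sigmoid (logistic) output; thresholding at 0.5 binarizes the selected layer into a foreground/background mask
Flexibility
- With very small changes, the framework can perform human pose estimation
- Each human keypoint is treated as its own class for training and detection
Speed
- Because the mask branch is only a small addition to Faster R-CNN, the extra computation is small, and the system runs at about 5 fps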

Mask R-CNN

Faster R-CNN has two outputs for each candidate object, a class label and a bounding-box offset; to this we add a third branch that outputs the object mask.
Mask R-CNN is thus a natural and intuitive idea. But the additional mask output is distinct from the class and box outputs, requiring extraction of much finer spatial layout of an object.

RoI Loss:

$$L = L_{cls} + L_{box} + L_{mask}$$

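where:

L_cls is the classification loss
L_box is the bounding-box regression loss, the same as in Fast R-CNN
L_mask is the instance segmentation (mask) loss
- Output size: K·m² per RoI, where K is the number of classes and m is the resolution of the m×m mask predicted from the RoIAlign features
- A per-pixel sigmoid is applied, and L_mask is the average binary cross-entropy over all pixels of the RoI
- Backpropagation: the loss is computed and backpropagated only on the mask layer of the ground-truth class. This avoids competition among classes and decouples mask prediction from classification [the authors' experiments confirm that this decoupling helps (see Table 2b); earlier segmentation approaches such as FCN apply a per-pixel softmax and then a cross-entropy loss]
- Only the m×m pixels of the ground-truth layer k enter the cross-entropy, which is then averaged
  - For an RoI associated with ground-truth class k, L_mask is only defined on the k-th mask (other mask outputs do not contribute to the loss)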
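
The definition of the mask loss above can be sketched in NumPy for a single positive RoI. This is an illustrative sketch, not the authors' implementation; the function name `mask_loss` and the array layout are assumptions.

```python
import numpy as np

def mask_loss(logits, gt_mask, gt_class):
    """Average binary cross-entropy on the ground-truth class layer only.

    logits:   (K, m, m) raw mask-branch outputs for one positive RoI
    gt_mask:  (m, m) array of 0/1 target values
    gt_class: integer index k of the ground-truth class
    """
    z = logits[gt_class]  # the other K - 1 layers do not contribute
    # Numerically stable form of -[y*log(sigmoid(z)) + (1-y)*log(1-sigmoid(z))]
    bce = np.maximum(z, 0) - z * gt_mask + np.log1p(np.exp(-np.abs(z)))
    return bce.mean()

# Tiny example with K=3 classes and a 4x4 mask.
logits = np.zeros((3, 4, 4))
target = np.zeros((4, 4))
target[1:3, 1:3] = 1
loss = mask_loss(logits, target, gt_class=1)
```

Because only `logits[gt_class]` is read, changing the outputs of the other classes leaves the loss unchanged, which is the decoupling the notes describe.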

Mask Representation

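Other RoI heads collapse each RoI into a fixed-length feature vector; Mask R-CNN instead predicts an m×m mask for each RoI
- SPPNet, for example, outputs a fixed-length feature vector and loses spatial information
- Predicting an m×m mask preserves the spatial layout, but it requires more precise pixel-to-pixel alignment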

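RoIAlign

RoIAlign was proposed to meet the precision that the mask representation requires
The standard RoIPool operation extracts a small feature map [e.g., 7x7] from each RoI
- It quantizes the floating-point RoI to the discrete grid of the feature map, divides it into bins, and max-pools each bin to produce a fixed-size feature map
- During quantization, a continuous coordinate x is mapped to [x/16] on the feature map, where 16 is the feature stride and [·] denotes rounding
- This shifts the position of the RoI [proposal] on the feature map and introduces misalignment
- While this may not impact classification, which is robust to small translations, it has a large negative effect on predicting pixel-accurate masks.
RoIAlign
- Aligns the extracted features precisely with the input
- Uses x/16 instead of [x/16]
- Uses bilinear interpolation to compute the feature values at four regularly sampled locations in each RoI bin, then aggregates them [max or average]
- This greatly improves the network's accuracy [see Table 2c]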
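
To make the difference from quantized pooling concrete, here is a simplified single-channel RoIAlign sketch in NumPy. It assumes feature values sit at integer coordinates, 2×2 sample points per bin, and average aggregation; the names `bilinear` and `roi_align` are illustrative, and real implementations differ in details such as the pixel-center convention.

```python
import numpy as np

def bilinear(feat, y, x):
    """Bilinearly interpolate a 2-D feature map at a continuous (y, x)."""
    h, w = feat.shape
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = feat[y0, x0] * (1 - dx) + feat[y0, x1] * dx
    bot = feat[y1, x0] * (1 - dx) + feat[y1, x1] * dx
    return top * (1 - dy) + bot * dy

def roi_align(feat, roi, out_size=2, stride=16, samples=2):
    """roi is (x1, y1, x2, y2) in image pixels; returns an (out_size, out_size) array."""
    x1, y1, x2, y2 = [c / stride for c in roi]  # x/16 with no rounding
    bin_h = (y2 - y1) / out_size
    bin_w = (x2 - x1) / out_size
    out = np.zeros((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            vals = []
            for a in range(samples):        # regularly spaced sample points in the bin
                for b in range(samples):
                    sy = y1 + (i + (a + 0.5) / samples) * bin_h
                    sx = x1 + (j + (b + 0.5) / samples) * bin_w
                    vals.append(bilinear(feat, sy, sx))
            out[i, j] = np.mean(vals)       # average aggregation
    return out

# A feature map whose value equals its x coordinate makes the effect easy to read.
ramp = np.tile(np.arange(8, dtype=float), (8, 1))
aligned = roi_align(ramp, (16, 16, 80, 80))
shifted = roi_align(ramp, (20, 16, 84, 80))   # RoI moved by 4 image pixels
```

Shifting the RoI by 4 image pixels (a quarter of the stride) shifts the pooled values by 0.25, whereas rounding x/16 would map both RoIs to the same feature cells.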

Network Architecture

A shared backbone network [feature extraction]
Multi-task prediction then runs on each RoI
- bounding-box recognition (cls & reg)
- mask prediction
Backbone networks
- ResNet
- ResNeXt
- FPN (Feature Pyramid Network)
  - FPN uses a top-down architecture with lateral connections to build an in-network feature pyramid from a single-scale input
- Using ResNet + FPN as the backbone gives the best results

Implementation Detail

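Training
- Sample generation:
  - Positive: RoIs with IoU of at least 0.5 with a ground-truth box
  - Negative: all other RoIs
- Mask loss
  - Computed only on positive RoIs
  - Mask target (ground truth): the intersection between the RoI and its associated ground-truth mask
- Hyperparameters
  - Mini-batch of 2 images per GPU
  - N RoIs are sampled per image, with a positive-to-negative ratio of 1:3
    - N=64 for C4
    - N=256 for FPN
    - Learning rate: 0.02, lowered to 0.002 at 120k iterations
    - Iterations: 160k
    - Momentum: 0.9
    - Weight decay: 0.0001
- RPN parameters:
  - 5 scales
  - 3 aspect ratios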
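
The sample-generation rule (IoU threshold of 0.5, positive-to-negative ratio of 1:3) can be sketched as follows. The helper names `iou_matrix` and `sample_rois` are hypothetical, boxes are assumed to have positive area, and filling the batch with extra negatives when positives are scarce is an assumption borrowed from common practice rather than something the notes specify.

```python
import numpy as np

def iou_matrix(rois, gts):
    """IoU between (R, 4) RoIs and (G, 4) ground-truth boxes, both as (x1, y1, x2, y2)."""
    x1 = np.maximum(rois[:, None, 0], gts[None, :, 0])
    y1 = np.maximum(rois[:, None, 1], gts[None, :, 1])
    x2 = np.minimum(rois[:, None, 2], gts[None, :, 2])
    y2 = np.minimum(rois[:, None, 3], gts[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_r = (rois[:, 2] - rois[:, 0]) * (rois[:, 3] - rois[:, 1])
    area_g = (gts[:, 2] - gts[:, 0]) * (gts[:, 3] - gts[:, 1])
    return inter / (area_r[:, None] + area_g[None, :] - inter)

def sample_rois(rois, gts, n=64, pos_fraction=0.25, seed=0):
    """Return (positive indices, negative indices) for up to n sampled RoIs."""
    rng = np.random.default_rng(seed)
    max_iou = iou_matrix(rois, gts).max(axis=1)   # best overlap with any ground-truth box
    pos = np.flatnonzero(max_iou >= 0.5)
    neg = np.flatnonzero(max_iou < 0.5)
    n_pos = min(int(n * pos_fraction), len(pos))  # 1:3 means 25% positives
    n_neg = min(n - n_pos, len(neg))              # assumption: fill the rest with negatives
    return (rng.choice(pos, n_pos, replace=False),
            rng.choice(neg, n_neg, replace=False))

rois = np.array([[0, 0, 10, 10], [0, 0, 10, 5], [20, 20, 30, 30], [100, 100, 110, 110]], dtype=float)
gts = np.array([[0, 0, 10, 10]], dtype=float)
pos_idx, neg_idx = sample_rois(rois, gts, n=4)
```

Only the sampled positive RoIs would go on to contribute to the mask loss.
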
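Inference
- Proposals
  - 300 for C4
  - 1000 for FPN
  - The box branch runs on these proposals, followed by NMS
  - The mask branch runs only on the 100 highest-scoring detection boxes and outputs K masks per RoI [only the k-th mask is used, where k is the class predicted by the classification branch]
    - This differs from training, but it speeds up inference
- The m×m mask output is resized to the RoI size and binarized [threshold 0.5]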
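
The final resize-and-threshold step can be sketched in NumPy as below. This is a simplified illustration under stated assumptions: the box has integer coordinates, the resize is bilinear with a pixel-center convention chosen for this sketch, and `paste_mask` is a hypothetical name.

```python
import numpy as np

def paste_mask(prob, box, threshold=0.5):
    """Resize an (m, m) probability mask to the box size and binarize it.

    prob: sigmoid output of the mask layer for the predicted class
    box:  integer (x1, y1, x2, y2) of the detection
    """
    m = prob.shape[0]
    bx1, by1, bx2, by2 = box
    h, w = by2 - by1, bx2 - bx1
    # Position of each output pixel center expressed in mask coordinates.
    ys = np.clip((np.arange(h) + 0.5) * m / h - 0.5, 0, m - 1)
    xs = np.clip((np.arange(w) + 0.5) * m / w - 0.5, 0, m - 1)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, m - 1)
    x1 = np.minimum(x0 + 1, m - 1)
    dy = (ys - y0)[:, None]
    dx = (xs - x0)[None, :]
    top = prob[y0][:, x0] * (1 - dx) + prob[y0][:, x1] * dx
    bot = prob[y1][:, x0] * (1 - dx) + prob[y1][:, x1] * dx
    resized = top * (1 - dy) + bot * dy
    return (resized >= threshold).astype(np.uint8)

# A 2x2 mask whose left half is foreground, pasted into a 4x4 box.
prob = np.array([[0.9, 0.1], [0.9, 0.1]])
binary = paste_mask(prob, (0, 0, 4, 4))
```

The binary result would then be written into the full-image mask at the box location.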

Experiments

Experiment 1

Instance Segmentation

Notes:

Mask R-CNN delivers a clear improvement
The ResNeXt-101-FPN backbone performs best

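Experiment 2

Ablation of the key design choices

Notes:

Backbone Architecture
- Deeper networks perform better
Multinomial vs. Independent Masks
- Decoupling mask prediction from classification gives better results
Class-Specific vs. Class-Agnostic Masks:
- Predicting K m×m masks is compared with predicting a single m×m mask [class-agnostic]; class-specific masks perform better
- AP rises from 29.7 to 30.3
RoIAlign
- Panels (c) and (d) of the ablation results show that RoIAlign improves AP by about 3 points
- RoIWarp also uses bilinear interpolation, but it does not align the input and feature-map coordinates, so it performs much worse than RoIAlign
Mask Branch
- For mask prediction, an FCN outperforms an MLP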

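Experiment 3

Bounding Box Detection

Notes:

Effect of RoIAlign:
- Faster R-CNN with RoIAlign alone improves detection by about 1% mAP
- This shows that RoIAlign also helps object detection to some extent
Multi-task training:
- Adding the mask branch improves Faster R-CNN detection by 1-2% mAP
Backbone:
- ResNeXt improves Faster R-CNN detection by about 1% mAP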

Experiment 4

Timing

Notes:

Inference
- ResNet-101-FPN, with features shared between the RPN and Mask R-CNN stages, runs in about 195 ms [Nvidia Tesla M40 GPU] plus 15 ms [resizing the outputs to the original resolution]
- Because the ResNet-101-C4 backbone takes about 400 ms, the authors do not recommend it
Training
- ResNet-50-FPN: COCO trainval35k -> 32h
- ResNet-101-FPN: COCO trainval35k -> 44h

Mask R-CNN for Human Pose Estimation

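With simple modifications, the Mask R-CNN framework can perform human pose estimation

Number of classes: the number of keypoints in the human pose
The label/target of each m×m mask becomes one-hot
- Target: only a single pixel is labeled as foreground
- The loss is the cross-entropy over the m×m outputs, treated as an m²-way softmax [this encourages the network to detect a single point]
- The output is still K m×m masks
ResNet-FPN serves as the backbone,
- followed by a stack of 3x3 512-d convolutions to fuse features,
- then deconvolution (deconv) upsamples the output to 56x56 [versus 28x28 for masks]
  - Higher resolution gives better results
Training
- Dataset: COCO trainval35k
- Image scale randomly sampled from [640, 800] pixels
  - Fixed at 800 pixels for inference
- Iterations: 90k
- Learning rate: 0.02, lowered to 0.002 at 60k iterations and to 0.0002 at 80k
- NMS threshold: 0.5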
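
The one-hot keypoint target and its softmax cross-entropy can be sketched as below. This is a minimal illustration, not the authors' code: `keypoint_loss` is a hypothetical name, keypoints are assumed to be given as integer (row, col) positions inside the m×m map, and keypoint visibility handling is ignored.

```python
import numpy as np

def keypoint_loss(logits, keypoints):
    """Mean m*m-way softmax cross-entropy over the K keypoints of one RoI.

    logits:    (K, m, m) raw outputs, one map per keypoint type
    keypoints: (K, 2) integer (row, col) of the single foreground pixel per map
    """
    k, m, _ = logits.shape
    flat = logits.reshape(k, m * m)
    flat = flat - flat.max(axis=1, keepdims=True)              # stabilize the softmax
    log_prob = flat - np.log(np.exp(flat).sum(axis=1, keepdims=True))
    target = keypoints[:, 0] * m + keypoints[:, 1]             # flat index of the one-hot pixel
    return -log_prob[np.arange(k), target].mean()

# K=3 keypoint types on a 4x4 map, with uninformative (all-zero) logits.
kp = np.array([[0, 0], [1, 2], [3, 3]])
uniform_loss = keypoint_loss(np.zeros((3, 4, 4)), kp)

# Logits that peak exactly at the targets drive the loss toward zero.
peaked = np.zeros((3, 4, 4))
peaked[np.arange(3), kp[:, 0], kp[:, 1]] = 30.0
peaked_loss = keypoint_loss(peaked, kp)
```

Because the softmax runs over all m² positions of a map, probability mass placed on any other pixel is penalized, which pushes each map toward a single peak.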

Results

Notes:

Compared with other methods, this approach is simple, effective, and fast at 5 fps
- More importantly, we have a unified model that can simultaneously predict boxes, segments, and keypoints while running at 5 fps.
Multi-task training works better
- Adding a segment branch (for the person category) improves the APkp to 63.1.

Notes:

Multi-task training helps
- Adding the mask branch to the box-only (i.e. Faster R-CNN) or keypoint-only versions consistently improves these tasks.
- However, adding the keypoint branch slightly reduces AP [this holds for this experiment only and does not generalize to other tasks]

Notes:

RoIAlign still plays a large role in human pose estimation

Experiments on Cityscapes

Notes:

The approach generalizes well

References

https://arxiv.org/abs/1703.06870
