Semantic Image Segmentation


1. FCN

2. CRF as RNN

http://www.cv-foundation.org/openaccess/content_iccv_2015/papers/Zheng_Conditional_Random_Fields_ICCV_2015_paper.pdf

motivation

The authors point out that applying CNNs to semantic image segmentation runs into two problems, which can be summarized as "blurred where it should be fine-grained, and fragmented where it should be smooth": (1) stacked convolutions produce large receptive fields, and max-pooling makes the output even coarser, leading to blurred boundaries and blob-like artifacts; (2) CNNs impose no smoothness constraints based on pixel similarity or spatial consistency, so the segmentation output can contain small spurious regions. In the authors' words:

Firstly, traditional CNNs have convolutional filters with large receptive fields and hence produce coarse outputs when restructured to produce pixel-level labels [35]. Presence of maxpooling layers in CNNs further reduces the chance of getting a fine segmentation output [9]. This, for instance, can result in non-sharp boundaries and blob-like shapes in semantic segmentation tasks. Secondly, CNNs lack smoothness constraints that encourage label agreement between similar pixels, and spatial and appearance consistency of the labelling output. Lack of such smoothness constraints can result in poor object delineation and small spurious regions in the segmentation output [57, 56, 30, 37].

Moreover, earlier work that uses a CRF to refine CNN segmentation does not integrate the CRF into the deep network, so the strength of the CRF is not fully exploited. In the authors' words:

One way to utilize CRFs to improve the semantic labelling results produced by a CNN is to apply CRF inference as a post-processing step disconnected from the training of the CNN [9]. Arguably, this does not fully harness the strength of CRFs since it is not integrated with the deep network – the deep network cannot adapt its weights to the CRF behaviour during the training phase.

Motivated by this, the paper proposes an FCN-CRF network. The key trick is to view mean-field inference in a CRF with Gaussian pairwise potentials as an RNN, so that it can be trained jointly with the FCN.

remark1: CRF

Build a graph G = (V, E), where V corresponds to the random variables X_1, X_2, ..., X_N and N is the number of pixels in the image. Let the global observation (here, the image) be I. A CRF characterized by a Gibbs distribution models (I, X) as

$$P(X = x \mid I) = \frac{1}{Z(I)} \exp\!\big(-E(x \mid I)\big),$$

where Z(I) is the partition function and E(x) is the energy function (for notational convenience the conditioning on I is dropped below).
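As a toy illustration (not from the paper; all numbers are made up), here is a minimal sketch of such a Gibbs distribution on a two-pixel "image" with two labels, using unary energies plus a Potts-style pairwise penalty:

```python
import numpy as np
from itertools import product

# Toy Gibbs/CRF distribution on a 2-pixel image with 2 labels.
unary = np.array([[0.2, 1.0],   # pixel 0: energy of label 0, label 1
                  [0.9, 0.3]])  # pixel 1
pairwise_weight = 0.5           # penalty when the two pixels disagree

def energy(x):
    """E(x) = psi(x_0) + psi(x_1) + pairwise penalty."""
    return unary[0, x[0]] + unary[1, x[1]] + pairwise_weight * (x[0] != x[1])

labelings = list(product([0, 1], repeat=2))
Z = sum(np.exp(-energy(x)) for x in labelings)          # partition function Z(I)
probs = {x: np.exp(-energy(x)) / Z for x in labelings}  # P(X = x | I)

for x, p in probs.items():
    print(x, round(p, 3))
print("sum =", round(sum(probs.values()), 3))           # sanity check: 1.0
```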

remark2: mean field

A few basic concepts first:
(1) Maximum a posteriori (MAP)

$$x^{*} = \arg\max_{x}\, P(X = x \mid I)$$

(2) Maximum posterior marginals (MPM)

$$x_i^{*} = \arg\max_{x_i}\, P(X_i = x_i \mid I)$$

(3) Variational inference
That is, minimize the KL divergence between a variational distribution Q(X) and the target distribution P(X):

$$D(Q \,\|\, P) = \mathbb{E}_{Q}\!\left[\log \frac{Q(X)}{P(X)}\right]$$

Such problems are intractable, so one usually resorts to approximate inference. Mean field is one of the most commonly used methods: it restricts the variational distribution to a factorized form, $Q(X) = \prod_i Q_i(X_i)$, so that the required integrals decompose into products of low-dimensional ones, greatly reducing the cost of inference.

Let X denote the observed data and Z the collection of latent variables and parameters; the quantity to be optimized is Q(Z), i.e.

$$\min_{Q}\; D(Q \,\|\, P) = -\int Q(Z)\,\ln\frac{P(Z \mid X)}{Q(Z)}\,dZ$$

Consider the decomposition

$$\ln P(X) \;=\; \int Q(Z)\,\ln\frac{P(Z, X)}{Q(Z)}\,dZ \;-\; \int Q(Z)\,\ln\frac{P(Z \mid X)}{Q(Z)}\,dZ \;=\; \mathcal{L}(Q) + D(Q \,\|\, P).$$

Since the evidence ln P(X) is fixed, minimizing the KL divergence is equivalent to maximizing L(Q). Exploiting the factorized form of Q, one obtains (see [CRF infer] and Bishop06, Chapter 10, for the detailed derivation) that the optimal update for the factor currently being estimated is the exponentiated expectation of the log joint taken with respect to all the other factors, hence the name mean field:

$$Q_j^{*}(Z_j) \;\propto\; \exp\!\big( \mathbb{E}_{i \neq j}\!\left[\, \ln P(X, Z) \,\right] \big)$$
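A minimal NumPy sketch of this coordinate-ascent update on a toy two-variable joint distribution (all numbers invented for illustration); the KL divergence printed at each step should be non-increasing:

```python
import numpy as np

# Mean-field coordinate ascent on a toy joint P(Z1, Z2) over two binary
# variables (the conditioning on X is absorbed into the table).
logP = np.log(np.array([[0.30, 0.10],
                        [0.05, 0.55]]))   # P(Z1=i, Z2=j), already normalized

q1 = np.array([0.5, 0.5])                 # factor Q1(Z1)
q2 = np.array([0.5, 0.5])                 # factor Q2(Z2)

def kl(q1, q2, logP):
    Q = np.outer(q1, q2)                  # factorized Q(Z1,Z2) = Q1(Z1) Q2(Z2)
    return np.sum(Q * (np.log(Q) - logP))

for it in range(10):
    # Q1*(z1) ∝ exp( E_{Q2}[ ln P(z1, Z2) ] )
    q1 = np.exp(logP @ q2)
    q1 /= q1.sum()
    # Q2*(z2) ∝ exp( E_{Q1}[ ln P(Z1, z2) ] )
    q2 = np.exp(logP.T @ q1)
    q2 /= q2.sum()
    print(f"iter {it}: KL(Q||P) = {kl(q1, q2, logP):.4f}")

print("Q1 =", np.round(q1, 3), " Q2 =", np.round(q2, 3))
```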

formulation

Here the energy function is defined as

$$E(x) = \sum_{i} \psi_u(x_i) + \sum_{i < j} \psi_p(x_i, x_j),$$

$$\psi_p(x_i, x_j) = \mu(x_i, x_j) \sum_{m=1}^{M} w^{(m)} k_G^{(m)}(f_i, f_j),$$

where $k_G^{(m)}$ is a Gaussian kernel, $f_i$ is the feature vector of pixel $i$ (e.g. its spatial position and RGB values), and $\mu(x_i, x_j)$ is the label compatibility function.
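In the fully connected CRF of Krähenbühl and Koltun that this formulation builds on, M = 2 kernels are typically used: an appearance kernel and a smoothness kernel, with $p_i$ the position and $I_i$ the colour vector of pixel $i$ and $\theta_\alpha, \theta_\beta, \theta_\gamma$ hyperparameters:

$$k_G^{(1)}(f_i, f_j) = \exp\!\left(-\frac{\lVert p_i - p_j \rVert^2}{2\theta_\alpha^2} - \frac{\lVert I_i - I_j \rVert^2}{2\theta_\beta^2}\right), \qquad k_G^{(2)}(f_i, f_j) = \exp\!\left(-\frac{\lVert p_i - p_j \rVert^2}{2\theta_\gamma^2}\right)$$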

In this paper the unary potential measures the cost of assigning label $x_i$ to pixel $i$, i.e. the negative of the label probability produced by the FCN (a cross-entropy-style term), while the pairwise potential $\psi_p(x_i, x_j)$ measures the cost of jointly assigning labels $x_i, x_j$ to pixels $i, j$; it is defined so as to encourage smoothness and consistency between pixels. In the authors' words:

unary energies are obtained from a CNN, which, roughly speaking, predicts labels for pixels without considering the smoothness and the consistency of the label assignments. The pairwise energies provide an image data-dependent smoothing term that encourages assigning similar labels to pixels with similar properties.

Within one iteration, the mean-field update decomposes into steps that each correspond to a standard CNN operation: message passing (filtering $Q$ with the $M$ Gaussian kernels), re-weighting the filter outputs with $w^{(m)}$, the compatibility transform with $\mu$, adding the unary potentials, and normalization (a softmax). One iteration therefore becomes one step of an RNN, and stacking several iterations yields the recurrent network that is trained jointly with the FCN.
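A simplified NumPy sketch of these steps (not the authors' implementation: the dense Gaussian filtering is replaced by explicit precomputed kernel matrices, whereas the paper uses permutohedral-lattice filtering, and all inputs are random placeholders):

```python
import numpy as np

# One mean-field iteration of a dense CRF, in the spirit of CRF-as-RNN.
n, L, M = 6, 3, 2                        # pixels, labels, Gaussian kernels
rng = np.random.default_rng(0)

U = rng.normal(size=(n, L))              # FCN class scores, i.e. -psi_u(x_i)
K = rng.random(size=(M, n, n))           # kernel matrices k_G^(m)(f_i, f_j)
for m in range(M):
    np.fill_diagonal(K[m], 0.0)          # no message from a pixel to itself
w = np.array([0.8, 0.4])                 # kernel weights w^(m)
mu = 1.0 - np.eye(L)                     # Potts label compatibility mu(x_i, x_j)

def softmax(a):
    e = np.exp(a - a.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

Q = softmax(U)                           # initialization: softmax over the unaries
for _ in range(5):                       # T mean-field iterations = T RNN steps
    msg = np.stack([K[m] @ Q for m in range(M)])   # 1. message passing
    weighted = np.tensordot(w, msg, axes=(0, 0))   # 2. weighting filter outputs
    pairwise = weighted @ mu                       # 3. compatibility transform
    Q = softmax(U - pairwise)                      # 4.+5. add unaries, normalize
print(np.round(Q, 3))                    # approximate marginals Q_i(x_i)
```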

3. dilation

n. weakly-supervised semantic segmentation

At the SIDAS 2017 conference (hosted by Tsinghua University) I heard Cheng Mingming (http://mmcheng.net/zh/) give a keynote titled "Weakly Supervised Image Understanding" (WSIU). He proposed a technical roadmap for image understanding: advance step by step from simple low-level tasks to complex high-level ones. Since the simple tasks do not require large amounts of fine-grained annotation, and they in turn help the high-level tasks, this suggests a weakly-supervised route to solving the high-level tasks.
[Figure: the WSIU technical roadmap]
The roadmap sounds appealing, but the talk did not spell out the concrete techniques for going from low level to high level, which left things rather vague, so it is better to read the paper: Bottom-Up Top-Down Cues for Weakly-Supervised Semantic Segmentation (https://arxiv.org/abs/1612.02101).

technical route

The motivation of this paper is simple: if semantic segmentation required only object-level (image-level) annotation, it would be far more widely and easily applicable in practice, so the first part of the paper lays out the technical route.
Through a CNN-EM framework, the authors achieve weakly-supervised semantic segmentation using only object-category annotations (rather than pixel-level annotations). The key lies in how to initialize the EM algorithm well, namely by using the two cues named in the title: the bottom-up cues are class-agnostic saliency maps and the top-down cues are class-specific attention maps; combining them yields an approximate ground truth on which the initial model is trained. In the authors' words:

We provide an informed initialization to the EM algorithm by training an initial model for the semantic segmentation task using an approximate ground-truth obtained using the combination of class-agnostic saliency map [15] and class-specific attention maps [36] on a set of simple images with one object category (ImageNet [11] classification dataset).

[Figure: combining the bottom-up and top-down cues]
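A rough sketch of one plausible way to fuse the two cues into an approximate pixel-level ground truth (the exact combination rule and thresholds in the paper may differ; all names and numbers here are assumptions):

```python
import numpy as np

# Hypothetical fusion of a class-agnostic saliency map with class-specific
# attention maps for a simple image containing a single object category.
H, W, C = 4, 4, 3            # image size and number of object classes
BACKGROUND = 0               # label l0

rng = np.random.default_rng(1)
saliency = rng.random((H, W))            # bottom-up cue, class-agnostic, in [0, 1]
attention = rng.random((C, H, W))        # top-down cues, one map per class
image_labels = [2]                       # image-level annotation z (classes present)

approx_gt = np.full((H, W), BACKGROUND)  # start from all-background
for c in image_labels:
    # a pixel gets class c if it is both salient and attended to by that class
    mask = (saliency > 0.5) & (attention[c] > 0.5)
    approx_gt[mask] = c + 1              # labels l1..lc; l0 stays background

print(approx_gt)                         # approximate ground truth for training
```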

For more work on weakly-supervised semantic segmentation, see https://www.zhihu.com/question/53263115 or Section 2 of the paper. Based on their experimental results, the authors criticize most of this prior work because it relies on stronger supervision (such as bounding boxes). In the authors' words:

Experimentally we have found that this simple way of combining bottom-up with top-down cues on the ImageNet dataset (with no images from PASCAL VOC 2012) allows us to obtain an initial model capable of outperforming all current state-of-the-art algorithms for the weakly-supervised semantic segmentation task on the PASCAL VOC 2012 dataset. This is surprising since these algorithms are significantly more complex and they rely on higher degrees of supervision such as bounding boxes, points/squiggles and superpixels. This clearly indicates the importance of learning from simple images before delving into more complex ones.

formulation

Let $\mathcal{L} = \{l_0, l_1, \ldots, l_c\}$ be the discrete set of semantic labels, where $c$ is the number of object categories and $l_0$ denotes background. Let the object label set be $\mathcal{Z} = \mathcal{L} \setminus \{l_0\}$ and the dataset be $\mathcal{D} = \{I_i, z_i\}$, where $I_i$ is the $i$-th image and $z_i \subseteq \mathcal{Z}$ is the set of object labels present in it.
The segmentation result $y = \{y_1, y_2, \ldots, y_n\}$ (with $n$ the number of pixels) is modelled with a graphical model as

$$P(I, y, z; \theta) = P(I)\, P(y \mid I, z; \theta)\, P(z)$$
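As one concrete instantiation (my assumption for illustration, not necessarily the paper's exact parameterization), the segmentation term is often factorized over pixels, with each pixel's distribution a softmax over hypothetical CNN scores $f_{i,l}(I;\theta)$ restricted to the labels present in the image plus background:

$$P(y \mid I, z; \theta) = \prod_{i=1}^{n} P(y_i \mid I, z; \theta), \qquad P(y_i = l \mid I, z; \theta) \propto \begin{cases} \exp\!\big(f_{i,l}(I;\theta)\big), & l \in z \cup \{l_0\} \\ 0, & \text{otherwise.} \end{cases}$$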

reference

[CRF infer] http://www.slideserve.com/sileas/mean-field-approximation-for-crf-inference
