Brief Notes on EnhanceNet


Paper: EnhanceNet: Single Image Super-Resolution Through Automated Texture Synthesis
Venue: ICCV 2017. Authors: Mehdi S. M. Sajjadi et al.

1. Difficulties (questions I ran into while reading the paper):

  • How are E, P, T, and A combined? Are they simply summed, or combined some other way?
    (E: MSE, P: perceptual similarity, T: texture matching, A: adversarial training)
  • What are the concrete procedures for T and A?

2. Problem:

  • Traditional methods are built on pixel-wise reconstruction measures such as PSNR, but the images these measures favor do not match our visual perception.
    (That is, even when the generated image earns a high "score" under the metric and is judged good, it looks over-smoothed to us and has lost some of its high-frequency information.)
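For reference, PSNR is just a log-scaled pixel-wise MSE; a minimal sketch, assuming float images with values in [0, 1]:

```python
import numpy as np

def psnr(img, ref, peak=1.0):
    """Peak signal-to-noise ratio: a log-scaled pixel-wise MSE.
    Assumes float arrays with values in [0, peak]."""
    mse = np.mean((img - ref) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)
```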

3. Proposed improvements:

  1. Rework the loss function so as to create realistic texture (as the title says, Through Automated Texture Synthesis).
  2. For performance evaluation, replace the traditional PSNR, SSIM, and similar metrics with object recognition performance.

4. Method details:

4.1 Network Architecture

[Figure: EnhanceNet network architecture]

The authors call out several aspects of this architecture:

(1). The body of the network uses residual blocks, because they converge faster than plainly stacked convolutional layers (a minimal sketch follows the references below).

References: residual learning introduced in [2]; residuals first applied to SR in [3].
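A minimal PyTorch sketch of such a residual block (the 64-channel width and the two-conv layout are illustrative assumptions, not the paper's exact configuration):

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two 3x3 convolutions with a ReLU in between, plus an identity
    shortcut, so the block only has to learn a residual correction."""
    def __init__(self, channels=64):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return x + self.conv2(self.relu(self.conv1(x)))
```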

(2). The authors discuss why they chose nearest-neighbor upsampling:

A. Bicubic interpolation introduces redundancies into the input image and leads to higher computational cost.

B. Transposed convolution layers (which upsample the feature activations inside the network) produce checkerboard artifacts in the output, which have to be corrected with an extra regularization term, adding computational cost.

C. Nearest-neighbor upsampling followed by a convolution can replace transposed convolutional layers. It can still produce checkerboard artifacts under certain specific models, but for most complex models no extra regularization term is needed (see the sketch after the references below).

References: bicubic interpolation used in [4]; transposed convolution layers used in [5]; nearest-neighbor upsampling analyzed in [6].
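A sketch contrasting options B and C above (channel counts are illustrative):

```python
import torch.nn as nn

# Option B, prone to checkerboard artifacts: a transposed convolution
# whose stride-2 kernel overlaps unevenly across output pixels.
transposed_upsample = nn.ConvTranspose2d(64, 64, kernel_size=4, stride=2, padding=1)

# Option C, chosen by the authors: nearest-neighbor upsampling followed
# by a plain convolution, which sidesteps the uneven kernel overlap.
nn_upsample = nn.Sequential(
    nn.Upsample(scale_factor=2, mode='nearest'),
    nn.Conv2d(64, 64, kernel_size=3, padding=1),
)
```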

(3). The input is the low-resolution image and the output is a residual image, so the network does not need to learn the identity function for I_LR (a sketch follows).
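A sketch of this residual formulation, assuming (as in the paper) that the bicubically upsampled input is added back to the network's output:

```python
import torch.nn.functional as F

def super_resolve(net, i_lr, scale=4):
    """`net` maps the LR input to an HR-resolution residual image; adding
    back the bicubic upsampling of I_LR means the network never has to
    learn the identity mapping."""
    bicubic = F.interpolate(i_lr, scale_factor=scale,
                            mode='bicubic', align_corners=False)
    return bicubic + net(i_lr)
```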




4.2 Training and loss functions (the key part):

  • Pixel-wise loss in image space: the traditional MSE-based approach.
  • Perceptual loss in feature space: map the final generated image into a feature space, then take the MSE there.
  • Texture matching loss: mapping into a feature space is not enough; on top of it, match the fine-grained textures.
  • Adversarial training: under a given discriminative model, make the generated images impossible to identify as generated.

(1): The traditional MSE-based loss function:

L_E = \|I_{est} - I_{HR}\|_2^2 \qquad (1)
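As code (note that F.mse_loss averages over pixels rather than summing, which only rescales the loss weight):

```python
import torch.nn.functional as F

def pixel_loss(i_est, i_hr):
    """Eq. (1): plain MSE between the estimate and the HR ground truth."""
    return F.mse_loss(i_est, i_hr)
```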

(2): Perceptual similarity measure:

Both I_{est} and I_{HR} are first mapped into a feature space by a differentiable function ϕ before computing their distance.

L_P = \|\phi(I_{est}) - \phi(I_{HR})\|_2^2 \qquad (2)

Purpose: encourage the network to produce images that have similar feature representations.

Implementation: ϕ is a pre-trained implementation of the VGG-19 network.

Details: both I_{est} and I_{HR} are mapped to the second and the fifth pooling layers (so as to extract low-level and high-level features at the same time), and the l2 norm is computed between each corresponding pair of feature maps.

https://www.cs.toronto.edu/~frossard/post/vgg16/
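A sketch of this loss with torchvision's VGG-19. The feature indices 9 (pool2) and 36 (pool5) reflect torchvision's layer ordering and are my own bookkeeping, not values from the paper:

```python
import torch.nn.functional as F
from torchvision import models

# phi: a frozen, pre-trained VGG-19 feature extractor
# (on older torchvision, use pretrained=True instead of weights=...).
_vgg = models.vgg19(weights=models.VGG19_Weights.DEFAULT).features.eval()
for p in _vgg.parameters():
    p.requires_grad_(False)

def vgg_features(x, layers=(9, 36)):  # 9 = pool2, 36 = pool5
    feats = []
    for i, layer in enumerate(_vgg):
        x = layer(x)
        if i in layers:
            feats.append(x)
    return feats

def perceptual_loss(i_est, i_hr):
    """Eq. (2): MSE between VGG feature maps at pool2 and pool5."""
    return sum(F.mse_loss(fe, fh)
               for fe, fh in zip(vgg_features(i_est), vgg_features(i_hr)))
```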



(3): Texture matching

Explanation: Eq. (2) only coarsely compares the generated image and the ground-truth image after mapping them into a feature space. Those features are highly abstract, and matching them still cannot clearly establish that the two images contain the same concrete textures. This part solves exactly that problem, by taking the MSE between their concrete texture statistics, where G(·) denotes the Gram matrix of the feature activations. The loss function is:

L_T = \|G(\phi(I_{est})) - G(\phi(I_{HR}))\|_2^2 \qquad (3)

https://arxiv.org/pdf/1409.1556.pdf

Steps: using the MSCOCO image set, crop out 256×256 images. Combining this with the VGG-19 diagram above, derive the output size of every layer:

input:               256 × 256 × 3
conv1 (2 × Conv3):   256 × 256 × 64    →  pool1: 128 × 128 × 64
conv2 (2 × Conv3):   128 × 128 × 128   →  pool2:  64 × 64 × 128
conv3 (4 × Conv3):    64 × 64 × 256    →  pool3:  32 × 32 × 256
conv4 (4 × Conv3):    32 × 32 × 512    →  pool4:  16 × 16 × 512
conv5 (4 × Conv3):    16 × 16 × 512    →  pool5:   8 × 8 × 512
(the final FC, FC, FC layers are omitted)

As can be seen, the second pooling layer's output size is 64 × 64 × 128, and the fifth pooling layer's is 8 × 8 × 512.

[Figure: VGG-19 Gram matrix]
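A sketch of the Gram-matrix texture loss, reusing vgg_features from the perceptual-loss sketch. The indices 0/5/10 for conv1_1/conv2_1/conv3_1 again follow torchvision's layer ordering; note that the paper matches textures over local patches, whereas this simplified sketch computes whole-image Gram matrices:

```python
import torch.nn.functional as F

def gram_matrix(feat):
    """G(.): channel-by-channel correlations of a feature map,
    normalized by its size; output shape (b, c, c)."""
    b, c, h, w = feat.shape
    f = feat.reshape(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

def texture_loss(i_est, i_hr, layers=(0, 5, 10)):
    """Eq. (3): MSE between Gram matrices at conv1_1, conv2_1, conv3_1."""
    return sum(F.mse_loss(gram_matrix(fe), gram_matrix(fh))
               for fe, fh in zip(vgg_features(i_est, layers),
                                 vgg_features(i_hr, layers)))
```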




(4): Adversarial training (this part needs more material to be fleshed out):

Quoting the paper directly:

Instead, the following learning strategy yields better results and a more stable training: we keep track of the average performance of the discriminator on true and generated images within the previous training batch and only train the discriminator in the subsequent step if its performance on either of those two samples is below a threshold.


A simple reading: given a discriminative model, let it evaluate the generated images {I_est} and the ground-truth images {I_HR} batch by batch, and keep track of its average performance on each of the two. Only when its performance on either of them falls below some threshold is the discriminator trained in the subsequent step; otherwise its update is skipped and only the generator trains (a sketch of this gating follows).
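A minimal sketch of that gating, assuming a binary real/fake discriminator; the 0.75 threshold and the exact bookkeeping are my assumptions, not values from the paper:

```python
import torch
import torch.nn.functional as F

class GatedDiscriminatorTrainer:
    """Trains the discriminator only when its accuracy on either the real
    or the generated images of the previous batch fell below a threshold."""
    def __init__(self, disc, opt, threshold=0.75):  # threshold is assumed
        self.disc, self.opt, self.threshold = disc, opt, threshold
        self.acc_real = self.acc_fake = 0.0  # previous-batch performance

    def step(self, i_hr, i_est):
        logits_real = self.disc(i_hr)
        logits_fake = self.disc(i_est.detach())  # no gradient to the generator
        if min(self.acc_real, self.acc_fake) < self.threshold:
            loss = (F.binary_cross_entropy_with_logits(
                        logits_real, torch.ones_like(logits_real)) +
                    F.binary_cross_entropy_with_logits(
                        logits_fake, torch.zeros_like(logits_fake)))
            self.opt.zero_grad()
            loss.backward()
            self.opt.step()
        # record this batch's accuracies to gate the next step
        self.acc_real = (logits_real > 0).float().mean().item()
        self.acc_fake = (logits_fake < 0).float().mean().item()
```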




Answers to the two questions raised at the very beginning:
(1). Quoting the paper's appendix:

For the weights, we chose the combination that produced the most realistically looking results. The exact values of the weights for the different losses are given in Table 2.

where Table 2 is:
[Table 2: the exact weight values for the different losses]

From it we can see that the final loss function has the following form (taking ENET-PAT as an example):

L_{PAT} = w_P^{pool2} L_P^{pool2} + w_P^{pool5} L_P^{pool5} + w_A L_A + w_T^{conv1.1} L_T^{conv1.1} + w_T^{conv2.1} L_T^{conv2.1} + w_T^{conv3.1} L_T^{conv3.1} \qquad (4)
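This also answers the first question from section 1: the terms are combined as a plain weighted sum. A sketch reusing the helpers above (for brevity the three per-layer texture weights are folded into one; the weight names and dict layout are mine, the real values come from Table 2):

```python
import torch.nn.functional as F

def enet_pat_loss(i_est, i_hr, adv_loss, w):
    """Eq. (4) as a weighted sum of perceptual, adversarial, and texture
    terms. `w` is e.g. {'pool2': ..., 'pool5': ..., 'adv': ..., 'tex': ...}."""
    lp2, lp5 = [F.mse_loss(fe, fh)
                for fe, fh in zip(vgg_features(i_est), vgg_features(i_hr))]
    return (w['pool2'] * lp2 + w['pool5'] * lp5
            + w['adv'] * adv_loss + w['tex'] * texture_loss(i_est, i_hr))
```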

(2). The answer to the second question has already been covered in the main text above.




References:

[1]. EnhanceNet: Single Image Super-Resolution Through Automated Texture Synthesis. In ICCV, 2017.
[2]. Deep residual learning for image recognition. In CVPR, 2016.
[3]. Accurate image super-resolution using very deep convolutional networks. In CVPR, 2016.
[4]. Learning a deep convolutional network for image super-resolution. In ECCV, 2014.
[5]. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
[6]. Deconvolution and checkerboard artifacts. http://distill.pub/2016/deconvcheckerboard/, 2016.