Optimization: Tricks for Deep Neural Networks [Notes]


Slides: http://lamda.nju.edu.cn/weixs/slide/CNNTricks_slide.pdf

Blog post: http://lamda.nju.edu.cn/weixs/project/CNNTricks/CNNTricks.html


  1) data augmentation;
  2) pre-processing on images;
  3) initializations of the networks;
  4) some tips during training;
  5) selections of activation functions;
  6) diverse regularizations;
  7) some insights found from figures; and finally
  8) methods of ensembling multiple deep networks.

Sec. 1: Data Augmentation

The training set is often limited; data augmentation can be used to enlarge it:

  • (1) Simple transformations: horizontal flipping, random crops, and color jittering.

  • (2) Combinations of the simple transformations in (1).

  • (3) Fancy PCA, proposed by Krizhevsky et al. [1]: alters the intensities of the RGB channels in training images (see the sketch below).
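A minimal sketch of the fancy PCA color jittering, assuming `pixels` is an (N, 3) array of RGB values collected from the training set and `img` is an (H, W, 3) float image to augment; the variable names are illustrative, and the 0.1 standard deviation follows [1]:

>>> cov = np.cov(pixels, rowvar=False) # 3x3 covariance of the RGB channels over the training set
>>> eigvals, eigvecs = np.linalg.eigh(cov) # eigen-decomposition of the RGB covariance
>>> alphas = np.random.normal(0, 0.1, 3) # per-image random scales, std 0.1 as in [1]
>>> img_aug = img + eigvecs.dot(alphas * eigvals) # shift every pixel along the principal color directions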

Sec. 2: Pre-Processing

(1) Zero-center + normalize:

Python implementation:

>>> import numpy as np
>>> X -= np.mean(X, axis = 0) # zero-center
>>> X /= np.std(X, axis = 0) # normalize

(2) PCA whitening: zero-center --> compute the covariance matrix (which captures the correlation structure of the data) --> decorrelate the data --> whitening.

Python implementation:

>>> X -= np.mean(X, axis = 0) # zero-center
>>> cov = np.dot(X.T, X) / X.shape[0] # compute the covariance matrix

Decorrelate the data: project the original (already zero-centered) data onto the eigenbasis.

>>> U,S,V = np.linalg.svd(cov) # compute the SVD factorization of the data covariance matrix
>>> Xrot = np.dot(X, U) # decorrelate the data

Whitening: divide every dimension in the eigenbasis by the square root of its eigenvalue to normalize the scale.

>>> Xwhite = Xrot / np.sqrt(S + 1e-5) # divide each dimension by the square root of its eigenvalue (1e-5 avoids division by zero)


Sec. 3: Initializations

(1) All-Zero Initialization

Idea: after proper data normalization, it is reasonable to expect that roughly half of the weights end up positive and half negative, so all zeros looks like the "expected" initialization.

Drawback: there is no source of asymmetry between neurons (every neuron computes the same output and receives the same gradient update).

(2) Initialization with Small Random Numbers

Advantage: symmetry breaking.

Idea: the neurons are all random and unique at the beginning, so they compute distinct updates and become different parts of the network.

e.g. 1: weights $\sim 0.001 \times N(0, 1)$, where $N(0, 1)$ is a zero-mean, unit-standard-deviation Gaussian.

e.g. 2: small numbers drawn from a uniform distribution.
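In the style of the snippets below, a minimal sketch of both examples; $n$ is the number of inputs to the layer, and the uniform range is an illustrative choice not specified above:

>>> w = 0.001 * np.random.randn(n) # e.g. 1: scaled zero-mean Gaussian
>>> w = np.random.uniform(-0.001, 0.001, n) # e.g. 2: small uniform numbers (range is illustrative)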

(3) Calibrating the Variances

Idea: normalize the variance of each neuron's output to 1; however, this calibration does not take ReLUs into account.

Python implementation:

>>> w = np.random.randn(n) / np.sqrt(n) # calibrating the variances with 1/sqrt(n)

(4) Current Recommendation

He et al. [4], focusing on ReLUs, recommend initializing with variance $2.0/n$:

Python implementation:

>>> w = np.random.randn(n) * np.sqrt(2.0/n) # current recommendation

Sec. 4: During Training

  • Filters and pooling size. Input image sizes: power-of-2; small filters (e.g., $3 \times 3$); small strides (e.g., 1) with zero-padding; pooling size: e.g., $2 \times 2$.

  • Learning rate. Tune it using a validation set; furthermore, as Ilya Sutskever [2] suggests, divide the gradients by the mini-batch size, so changing the batch size does not require re-tuning the learning rate (see the sketch after this list).

  • Fine-tune on pre-trained models. Consider: the size of the new dataset, and its similarity to the dataset on which the pre-trained model was trained.

  • (1) If your data is similar to the pre-training data, simply train a linear classifier on features extracted from the top layers of the pre-trained model.

  • (2) If you also have a lot of data, fine-tune the top layers of the pre-trained model with a small learning rate.

  • (3) If your dataset is very different from the pre-training dataset but you have many training images, most of the layers should be fine-tuned on your dataset with a small learning rate.

  • (4) If your dataset is small and also very different from the pre-training dataset, just train a linear classifier.

  • [Table: fine-tuning strategy as a function of dataset size and similarity to the pre-training data, summarizing cases (1)–(4) above.]
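A minimal sketch of the learning-rate tip above (dividing the summed gradients by the mini-batch size so the step scale does not depend on the batch size); w, grad_sum, batch_size, and learning_rate are illustrative names:

>>> grad = grad_sum / batch_size # average the summed per-example gradients over the mini-batch
>>> w -= learning_rate * grad # plain SGD step; its scale no longer grows with the batch size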

Sec. 5: Activation Functions (non-linearities)


Sigmoid

[Figure: the sigmoid function $\sigma(x) = 1/(1 + e^{-x})$]


It squashes a real-valued number into the range [0, 1]: large negative numbers become 0 and large positive numbers become 1.

  1. (Cons) Sigmoids saturate and kill gradients.

  2. (Cons) Sigmoid outputs are not zero-centered.



tanh(x)


It squashes a real-valued number to the range [-1, 1].

  1. (Cons) Its activations saturate.

  2. (Pros) Its output is zero-centered.

Rectified Linear Unit


$f(x) = \max(0, x)$

  1. (Pros) No expensive operations (exponentials, etc.) are needed; ReLU can be implemented by simply thresholding activations at zero.

  2. (Pros) ReLU does not suffer from saturation.

  3. (Pros) It greatly accelerates the convergence of stochastic gradient descent (e.g., by a factor of 6 in [1]), thanks to its linear, non-saturating form.

  4. (Cons) ReLU units can be fragile during training and can "die".


Leaky ReLU


An attempt to fix the "dying ReLU" problem.

$f(x) = \alpha x$ if $x < 0$ (where $\alpha$ is a small constant)

$f(x) = x$ if $x \geq 0$

(Cons) The results are not always consistent.


Parametric ReLU

[Figure: the ReLU family: ReLU, Leaky ReLU, PReLU, and RReLU]


In PReLU, $\alpha_i$ is learned from the data rather than pre-defined [4].

In Leaky ReLU, $\alpha_i$ is fixed.

In RReLU, $\alpha_{ji}$ is a random variable drawn from a given range during training and then fixed during testing [5]. (Pros) Its randomness can help reduce overfitting.


Randomized ReLU

In RReLU, $\alpha_{ji}$ is a random variable drawn from a given range during training, and is fixed during testing [5].
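A minimal sketch of these activations in the style of the earlier snippets; the 0.01 slope is an illustrative default, not a value prescribed above:

>>> sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x)) # squashes to (0, 1)
>>> tanh = np.tanh # squashes to (-1, 1), zero-centered
>>> relu = lambda x: np.maximum(0.0, x) # f(x) = max(0, x)
>>> leaky_relu = lambda x, a=0.01: np.where(x >= 0, x, a * x) # fixed small slope a for x < 0
>>> # PReLU has the same form but learns a from data; RReLU samples a from a range
>>> # during training and uses a fixed value during testing.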



Sec. 6: Regularizations

  • L2 regularization: add $\frac{1}{2}\lambda w^2$ to the objective for every weight $w$, where $\lambda$ is the regularization strength. (It heavily penalizes peaky weight vectors and prefers diffuse weight vectors.)

  • L1 regularization: add $\lambda |w|$ to the objective. The two can be combined: $\lambda_1 |w| + \lambda_2 w^2$ (elastic net regularization).

  • Max norm constraints. Enforce an absolute upper bound on the magnitude of the weight vector for every neuron, $\|\vec{w}\|_2 < c$ (with $c$ typically 3 or 4), and use projected gradient descent to enforce the constraint. Updates are always bounded, so the network cannot "explode" even with a high learning rate.

  • Dropout [6]: randomly drop neurons to sample a sub-network, and only update the parameters of that sampled network on the current input data.

[Figure: dropout illustration from [6]]

During training [6]: keep a neuron active with some probability $p$ (a hyper-parameter), or set it to zero otherwise.

During testing: no dropout is applied.

A dropout ratio of $p = 0.5$ is a reasonable default.
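A minimal sketch of the commonly used "inverted" dropout variant applied to a hidden activation h, in the style of the earlier snippets; h is an illustrative name, and because activations are rescaled by 1/p at training time, nothing needs to change at test time:

>>> p = 0.5 # probability of keeping a neuron active
>>> mask = (np.random.rand(*h.shape) < p) / p # sample the dropout mask and rescale by 1/p
>>> h_train = h * mask # training: drop neurons and rescale the survivors
>>> h_test = h # testing: use the activations unchanged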

Sec. 7: Insights from Figures


  • Learning rate: the shape of the training loss curve reflects whether the learning rate is too low, too high, or about right.

  • Loss curve: the "width" (amount of wiggle) of the curve is related to the batch size.

  • Accuracy curve: the gap between the training and validation accuracy curves indicates the amount of overfitting. [Figure: training/validation curves]

  

Sec. 8: Ensemble [8]


  • Same model, different initialization. Use cross-validation to determine the best hyperparameters, then train multiple models with those hyperparameters but different random initializations.

  • Top models discovered during cross-validation. Use cross-validation to determine the best hyperparameters, then pick the top few models to form the ensemble. (The risk is that the ensemble may include sub-par models.)

  • Different checkpoints of a single model. When training is very expensive, take checkpoints of a single network at different points in time and use them to form the ensemble. (This lacks diversity, but is cheap.)

  • Some practical examples. If your task involves high-level image semantics, you can use multiple deep models trained on different data sources to extract different, complementary deep representations.
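A common way to combine the ensemble members above is to average their class probabilities; a minimal sketch, assuming `models` is a list of trained networks exposing an illustrative `predict_proba(X)` that returns per-class probabilities:

>>> probs = np.mean([m.predict_proba(X) for m in models], axis=0) # average the members' class probabilities
>>> y_pred = np.argmax(probs, axis=1) # final ensemble prediction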

Miscellaneous

Problems:

Class-imbalanced data: some classes have a large number of training images/instances, while others have a very limited number.

Method 1: balance the training data by directly up-sampling and down-sampling the imbalanced classes [10].
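A minimal sketch of the up-sampling side, assuming integer labels `y` and data `X` (illustrative names): every class is re-sampled with replacement up to the size of the largest class.

>>> counts = np.bincount(y) # number of training instances per class
>>> idx = np.concatenate([np.random.choice(np.where(y == c)[0], counts.max(), replace=True)
...                       for c in range(len(counts))]) # over-sample each class to the largest class size
>>> X_bal, y_bal = X[idx], y[idx] # balanced (up-sampled) training set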

Method 2: special crops processing [7].

Method 3: adjust the fine-tuning strategy.
