Optimization: Tricks for Deep Neural Networks [Notes]


Slides: http://lamda.nju.edu.cn/weixs/slide/CNNTricks_slide.pdf

Blog post: http://lamda.nju.edu.cn/weixs/project/CNNTricks/CNNTricks.html


  1) data augmentation;
  2) pre-processing on images;
  3) initializations of the networks;
  4) some tips during training;
  5) selections of activation functions;
  6) diverse regularizations;
  7) some insights found from figures; and finally
  8) methods of ensembling multiple deep networks.

Sec. 1: Data Augmentation

The training set is often limited; data augmentation can be used to enlarge it:

  • (1) Simple transformations: horizontal flipping, random crops, and color jittering.

  • (2) Combinations of the simple transformations in (1).

  • (3) Fancy PCA, proposed by Krizhevsky et al. [1]: alters the intensities of the RGB channels in training images (see the sketch below).
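A minimal sketch of the fancy PCA color jittering, assuming `pixels` is an (N, 3) array of RGB values collected from the training set and `img` is an (H, W, 3) float image to augment; the variable names are illustrative, and the 0.1 standard deviation follows [1]:

>>> cov = np.cov(pixels, rowvar=False) # 3x3 covariance of the RGB channels over the training set
>>> eigvals, eigvecs = np.linalg.eigh(cov) # eigen-decomposition of the RGB covariance
>>> alphas = np.random.normal(0, 0.1, 3) # per-image random scales, std 0.1 as in [1]
>>> img_aug = img + eigvecs.dot(alphas * eigvals) # shift every pixel along the principal color directions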

Sec. 2: Pre-Processing

(1) Zero-center + normalize:

Python implementation:

>>> import numpy as np
>>> X -= np.mean(X, axis = 0) # zero-center
>>> X /= np.std(X, axis = 0) # normalize

(2) PCA whitening: zero-center --> compute the covariance matrix (which captures the correlation structure of the data) --> decorrelate the data --> whitening.

Python implementation:

>>> X -= np.mean(X, axis = 0) # zero-center
>>> cov = np.dot(X.T, X) / X.shape[0] # compute the covariance matrix

Decorrelate the data: project the original (already zero-centered) data onto the eigenbasis.

>>> U,S,V = np.linalg.svd(cov) # compute the SVD factorization of the data covariance matrix
>>> Xrot = np.dot(X, U) # decorrelate the data

Whitening: divide every dimension in the eigenbasis by the square root of its eigenvalue to normalize the scale.

>>> Xwhite = Xrot / np.sqrt(S + 1e-5) # divide each dimension by the square root of its eigenvalue (1e-5 avoids division by zero)


Sec. 3: Initializations

(1) All-Zero Initialization

Idea: after proper data normalization, it is reasonable to expect that roughly half of the weights end up positive and half negative, so all zeros looks like the "expected" initialization.

Drawback: there is no source of asymmetry between neurons (every neuron computes the same output and receives the same gradient update).

(2) Initialization with Small Random Numbers

Advantage: symmetry breaking.

Idea: the neurons are all random and unique at the beginning, so they compute distinct updates and become different parts of the network.

e.g. 1: weights $\sim 0.001 \times N(0, 1)$, where $N(0, 1)$ is a zero-mean, unit-standard-deviation Gaussian.

e.g. 2: small numbers drawn from a uniform distribution.
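In the style of the snippets below, a minimal sketch of both examples; $n$ is the number of inputs to the layer, and the uniform range is an illustrative choice not specified above:

>>> w = 0.001 * np.random.randn(n) # e.g. 1: scaled zero-mean Gaussian
>>> w = np.random.uniform(-0.001, 0.001, n) # e.g. 2: small uniform numbers (range is illustrative)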

(3) Calibrating the Variances

Idea: normalize the variance of each neuron's output to 1; however, this calibration does not take ReLUs into account.

Python implementation:

>>> w = np.random.randn(n) / np.sqrt(n) # calibrating the variances with 1/sqrt(n)

(4) Current Recommendation

He et al. [4], focusing on ReLUs, recommend initializing with variance $2.0/n$:

Python implementation:

>>> w = np.random.randn(n) * np.sqrt(2.0/n) # current recommendation

Sec. 4: During Training

  • Filters and pooling size. Input image sizes: power-of-2; small filters (e.g., $3 \times 3$); small strides (e.g., 1) with zero-padding; pooling size: e.g., $2 \times 2$.

  • Learning rate. Tune it using a validation set; furthermore, as Ilya Sutskever [2] suggests, divide the gradients by the mini-batch size, so changing the batch size does not require re-tuning the learning rate (see the sketch after this list).

  • Fine-tune on pre-trained models. Consider: the size of the new dataset, and its similarity to the dataset on which the pre-trained model was trained.

  • (1) If your data is similar to the pre-training data, simply train a linear classifier on features extracted from the top layers of the pre-trained model.

  • (2) If you also have a lot of data, fine-tune the top layers of the pre-trained model with a small learning rate.

  • (3) If your dataset is very different from the pre-training dataset but you have many training images, most of the layers should be fine-tuned on your dataset with a small learning rate.

  • (4) If your dataset is small and also very different from the pre-training dataset, just train a linear classifier.

  • [Table: fine-tuning strategy as a function of dataset size and similarity to the pre-training data, summarizing cases (1)–(4) above.]
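A minimal sketch of the learning-rate tip above (dividing the summed gradients by the mini-batch size so the step scale does not depend on the batch size); w, grad_sum, batch_size, and learning_rate are illustrative names:

>>> grad = grad_sum / batch_size # average the summed per-example gradients over the mini-batch
>>> w -= learning_rate * grad # plain SGD step; its scale no longer grows with the batch size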

Sec. 5: Activation Functions (non-linearities)


Sigmoid

[Figure: the sigmoid function $\sigma(x) = 1/(1 + e^{-x})$]


It squashes a real-valued number into the range [0, 1]: large negative numbers become 0 and large positive numbers become 1.

  1. (Cons) Sigmoids saturate and kill gradients.

  2. (Cons) Sigmoid outputs are not zero-centered.



tanh(x)


It squashes a real-valued number to the range [-1, 1].

  1. (Cons) Its activations saturate.

  2. (Pros) Its output is zero-centered.

Rectified Linear Unit


$f(x) = \max(0, x)$

  1. (Pros) No expensive operations (exponentials, etc.) are needed; ReLU can be implemented by simply thresholding activations at zero.

  2. (Pros) ReLU does not suffer from saturation.

  3. (Pros) It greatly accelerates the convergence of stochastic gradient descent (e.g., by a factor of 6 in [1]), thanks to its linear, non-saturating form.

  4. (Cons) ReLU units can be fragile during training and can "die".


Leaky ReLU


An attempt to fix the "dying ReLU" problem.

$f(x) = \alpha x$ if $x < 0$ (where $\alpha$ is a small constant)

$f(x) = x$ if $x \geq 0$

(Cons) The results are not always consistent.


Parametric ReLU

[Figure: the ReLU family: ReLU, Leaky ReLU, PReLU, and RReLU]


In PReLU, $\alpha_i$ is learned from the data rather than pre-defined [4].

In Leaky ReLU, $\alpha_i$ is fixed.

In RReLU, $\alpha_{ji}$ is a random variable drawn from a given range during training and then fixed during testing [5]. (Pros) Its randomness can help reduce overfitting.


Randomized ReLU

In RReLU, $\alpha_{ji}$ is a random variable drawn from a given range during training, and is fixed during testing [5].
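A minimal sketch of these activations in the style of the earlier snippets; the 0.01 slope is an illustrative default, not a value prescribed above:

>>> sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x)) # squashes to (0, 1)
>>> tanh = np.tanh # squashes to (-1, 1), zero-centered
>>> relu = lambda x: np.maximum(0.0, x) # f(x) = max(0, x)
>>> leaky_relu = lambda x, a=0.01: np.where(x >= 0, x, a * x) # fixed small slope a for x < 0
>>> # PReLU has the same form but learns a from data; RReLU samples a from a range
>>> # during training and uses a fixed value during testing.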



Sec. 6: Regularizations

  • L2 regularization: add $\frac{1}{2}\lambda w^2$ to the objective for every weight $w$, where $\lambda$ is the regularization strength. (It heavily penalizes peaky weight vectors and prefers diffuse weight vectors.)

  • L1 regularization: add $\lambda |w|$ to the objective. The two can be combined: $\lambda_1 |w| + \lambda_2 w^2$ (elastic net regularization).

  • Max norm constraints. Enforce an absolute upper bound on the magnitude of the weight vector for every neuron, $\|\vec{w}\|_2 < c$ (with $c$ typically 3 or 4), and use projected gradient descent to enforce the constraint. Updates are always bounded, so the network cannot "explode" even with a high learning rate.

  • Dropout [6]: randomly drop neurons to sample a sub-network, and only update the parameters of that sampled network on the current input data.

[Figure: dropout illustration from [6]]

During training [6]: keep a neuron active with some probability $p$ (a hyper-parameter), or set it to zero otherwise.

During testing: no dropout is applied.

A dropout ratio of $p = 0.5$ is a reasonable default.
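A minimal sketch of the commonly used "inverted" dropout variant applied to a hidden activation h, in the style of the earlier snippets; h is an illustrative name, and because activations are rescaled by 1/p at training time, nothing needs to change at test time:

>>> p = 0.5 # probability of keeping a neuron active
>>> mask = (np.random.rand(*h.shape) < p) / p # sample the dropout mask and rescale by 1/p
>>> h_train = h * mask # training: drop neurons and rescale the survivors
>>> h_test = h # testing: use the activations unchanged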

Sec. 7: Insights from Figures


  • Learning rate: the shape of the training loss curve reflects whether the learning rate is too low, too high, or about right.

  • Loss curve: the "width" (amount of wiggle) of the curve is related to the batch size.

  • Accuracy curve: the gap between the training and validation accuracy curves indicates the amount of overfitting. [Figure: training/validation curves]

  

Sec. 8: Ensemble [8]


  • Same model, different initialization. Use cross-validation to determine the best hyperparameters, then train multiple models with those hyperparameters but different random initializations.

  • Top models discovered during cross-validation. Use cross-validation to determine the best hyperparameters, then pick the top few models to form the ensemble. (The risk is that the ensemble may include sub-par models.)

  • Different checkpoints of a single model. When training is very expensive, take checkpoints of a single network at different points in time and use them to form the ensemble. (This lacks diversity, but is cheap.)

  • Some practical examples. If your task involves high-level image semantics, you can use multiple deep models trained on different data sources to extract different, complementary deep representations.
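A common way to combine the ensemble members above is to average their class probabilities; a minimal sketch, assuming `models` is a list of trained networks exposing an illustrative `predict_proba(X)` that returns per-class probabilities:

>>> probs = np.mean([m.predict_proba(X) for m in models], axis=0) # average the members' class probabilities
>>> y_pred = np.argmax(probs, axis=1) # final ensemble prediction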

Miscellaneous

Problems:

Class-imbalanced data: some classes have a large number of training images/instances, while others have a very limited number.

Method 1: balance the training data by directly up-sampling and down-sampling the imbalanced classes [10].
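A minimal sketch of the up-sampling side, assuming integer labels `y` and data `X` (illustrative names): every class is re-sampled with replacement up to the size of the largest class.

>>> counts = np.bincount(y) # number of training instances per class
>>> idx = np.concatenate([np.random.choice(np.where(y == c)[0], counts.max(), replace=True)
...                       for c in range(len(counts))]) # over-sample each class to the largest class size
>>> X_bal, y_bal = X[idx], y[idx] # balanced (up-sampled) training set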

Method 2: special crops processing [7].

Method 3: adjust the fine-tuning strategy.
