General Training Strategy in Caffe


First, here are some of the parameter definitions from caffe.proto that govern solver.prototxt:

  // The number of iterations for each test net.
  repeated int32 test_iter = 3;
  // The number of iterations between two testing phases.
  optional int32 test_interval = 4 [default = 0];
  optional bool test_compute_loss = 19 [default = false];
  // If true, run an initial test pass before the first iteration,
  // ensuring memory availability and printing the starting value of the loss.
  optional bool test_initialization = 32 [default = true];
  optional float base_lr = 5; // The base learning rate
  // the number of iterations between displaying info. If display = 0, no info
  // will be displayed.
  optional int32 display = 6;
  // Display the loss averaged over the last average_loss iterations
  optional int32 average_loss = 33 [default = 1];
  optional int32 max_iter = 7; // the maximum number of iterations
  // accumulate gradients over `iter_size` x `batch_size` instances
  optional int32 iter_size = 36 [default = 1];
  // The learning rate decay policy. The currently implemented learning rate
  // policies are as follows:
  //    - fixed: always return base_lr.
  //    - step: return base_lr * gamma ^ (floor(iter / step))
  //    - exp: return base_lr * gamma ^ iter
  //    - inv: return base_lr * (1 + gamma * iter) ^ (- power)
  //    - multistep: similar to step but it allows non uniform steps defined by
  //      stepvalue
  //    - poly: the effective learning rate follows a polynomial decay, to be
  //      zero by the max_iter. return base_lr (1 - iter/max_iter) ^ (power)
  //    - sigmoid: the effective learning rate follows a sigmod decay
  //      return base_lr ( 1/(1 + exp(-gamma * (iter - stepsize))))
  //
  // where base_lr, max_iter, gamma, step, stepvalue and power are defined
  // in the solver parameter protocol buffer, and iter is the current iteration.
  optional string lr_policy = 8;
  optional float gamma = 9; // The parameter to compute the learning rate.
  optional float power = 10; // The parameter to compute the learning rate.
  optional float momentum = 11; // The momentum value.
  optional float weight_decay = 12; // The weight decay.
  // regularization types supported: L1 and L2
  // controlled by weight_decay
  optional string regularization_type = 29 [default = "L2"];
  // the stepsize for learning rate policy "step"
  optional int32 stepsize = 13;
  // the stepsize for learning rate policy "multistep"
  repeated int32 stepvalue = 34;
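The comments above spell out each policy's formula. As a rough illustration only (a Python sketch of those formulas, not Caffe's actual C++ solver code; the default parameter values below are made up), the effective learning rate at a given iteration can be computed like this:

import math

def effective_lr(policy, it, base_lr, gamma=0.1, power=0.75,
                 stepsize=1000, stepvalue=(), max_iter=10000):
    """Illustrative re-implementation of the lr_policy formulas quoted above."""
    if policy == "fixed":
        return base_lr
    if policy == "step":
        return base_lr * gamma ** (it // stepsize)
    if policy == "exp":
        return base_lr * gamma ** it
    if policy == "inv":
        return base_lr * (1 + gamma * it) ** (-power)
    if policy == "multistep":
        # non-uniform steps: count how many stepvalue boundaries have been passed
        current_step = sum(1 for s in stepvalue if it >= s)
        return base_lr * gamma ** current_step
    if policy == "poly":
        return base_lr * (1 - it / max_iter) ** power
    if policy == "sigmoid":
        return base_lr * (1.0 / (1.0 + math.exp(-gamma * (it - stepsize))))
    raise ValueError("unknown lr_policy: " + policy)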

The learning-rate curves produced by the different lr_policy settings are shown at the end of this post. One thing to note: the following parameters apply globally, regardless of which lr_policy is chosen:

  optional float momentum = 11; // The momentum value.
  optional float weight_decay = 12; // The weight decay.
  // regularization types supported: L1 and L2
  // controlled by weight_decay
  optional string regularization_type = 29 [default = "L2"];

Let's look at the three parameters momentum, weight_decay, and regularization_type.

momentum takes the previous update step into account in order to accelerate convergence; in SGD it corresponds to μ in the update rule below (α is the learning rate):

V_{t+1} = μ·V_t − α·∇L(W_t)
W_{t+1} = W_t + V_{t+1}


Update equations quoted from http://www.cnblogs.com/denny402/p/5074212.html
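As a minimal sketch of these two equations (plain NumPy with made-up toy values, not Caffe's SGDSolver code), one momentum step looks like:

import numpy as np

def sgd_momentum_step(W, V, grad, lr, momentum):
    """V_{t+1} = momentum * V_t - lr * grad(W_t);  W_{t+1} = W_t + V_{t+1}."""
    V = momentum * V - lr * grad
    W = W + V
    return W, V

# toy usage: 5 weights, a constant gradient of 1, lr = 0.01, momentum = 0.9
W, V = np.zeros(5), np.zeros(5)
for _ in range(3):
    W, V = sgd_momentum_step(W, V, grad=np.ones(5), lr=0.01, momentum=0.9)

With momentum = 0 this reduces to vanilla SGD; values around 0.9 are commonly used.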

weight_decay is the weight of the global regularization term in the loss, and regularization_type selects which regularizer to use (L1 or L2); see the quoted explanation below (original link):

The weight_decay meta parameter governs the regularization term of the neural net.

During training a regularization term is added to the network’s loss to compute the backprop gradient. The weight_decay value determines how dominant this regularization term will be in the gradient computation.

As a rule of thumb, the more training examples you have, the weaker this term should be. The more parameters you have (i.e., deeper net, larger filters, larger InnerProduct layers etc.) the higher this term should be.

Caffe also allows you to choose between L2 regularization (default) and L1 regularization, by setting

regularization_type: "L1"

However, since in most cases weights are small numbers (i.e., -1 < w < 1), the L2 norm of the weights is significantly smaller than their L1 norm. Thus, if you choose to use regularization_type: "L1" you might need to tune weight_decay to a significantly smaller value.

While learning rate may (and usually does) change during training, the regularization weight is fixed throughout.
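To make the quoted explanation concrete, here is a sketch (assumed NumPy pseudocode, not Caffe's implementation) of how weight_decay and regularization_type enter the gradient before the solver update:

import numpy as np

def regularized_gradient(W, grad, weight_decay, regularization_type="L2"):
    """Add the regularizer's gradient to the data-loss gradient for one blob."""
    if regularization_type == "L2":
        # gradient of (weight_decay / 2) * ||W||^2 is weight_decay * W
        return grad + weight_decay * W
    if regularization_type == "L1":
        # gradient of weight_decay * ||W||_1 is weight_decay * sign(W)
        return grad + weight_decay * np.sign(W)
    raise ValueError("unsupported regularization_type: " + regularization_type)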

In network definitions you will often see the lr_mult and decay_mult parameters, for example:

layer {
  name: "conv1_1"
  type: "Convolution"
  bottom: "data"
  top: "conv1_1"
  param {
    lr_mult: 1
    decay_mult: 1
  }
  param {
    lr_mult: 2
    decay_mult: 0
  }
  convolution_param {
    num_output: 64
    pad: 1
    kernel_size: 3
  }
}

Since base_lr and weight_decay are the global learning rate and regularization weight, lr_mult and decay_mult exist to adjust them flexibly per layer (more precisely, per parameter blob). Concretely, for a given layer the effective learning rate and regularization weight are base_lr * lr_mult and weight_decay * decay_mult, respectively; a worked sketch follows the definitions below, which are taken from caffe.proto:

  // The multiplier on the global learning rate for this parameter.
  optional float lr_mult = 3 [default = 1.0];
  // The multiplier on the global weight decay for this parameter.
  optional float decay_mult = 4 [default = 1.0];
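Putting the pieces together, here is a small sketch of the effective per-blob settings for the conv1_1 layer above (the base_lr and weight_decay values are illustrative, not taken from this post):

base_lr = 0.01          # illustrative solver value (assumption)
weight_decay = 0.0005   # illustrative solver value (assumption)

# (name, lr_mult, decay_mult) for conv1_1's two param blobs: weights, then bias
param_specs = [("weights", 1, 1), ("bias", 2, 0)]

for name, lr_mult, decay_mult in param_specs:
    effective_lr = base_lr * lr_mult
    effective_decay = weight_decay * decay_mult
    print(f"{name}: lr = {effective_lr}, decay = {effective_decay}")
# weights: lr = 0.01, decay = 0.0005
# bias:    lr = 0.02, decay = 0.0   (the bias gets twice the learning rate and no decay)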

base_lr curves for the different lr_policy settings: see the plot at http://blog.csdn.net/langb2014/article/details/51274376 (source of the original figure).
