Implementing a Neural Network in Code


A CNN implementation can be broken down into the following major modules (a minimal sketch of how they fit together follows the list):

  • layers: the forward- and backward-propagation functions for every layer type
  • a CNN network class that builds the network architecture by stacking the layers implemented in layers
  • optimization methods, including SGD, SGD with momentum, and Adam
  • a Solver class that applies one of the optimization methods to the constructed CNN network and solves for its parameters
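
To make the division of labor concrete, a minimal end-to-end sketch might look like the following (assuming the CS231n directory layout, with ThreeLayerConvNet in cs231n/classifiers/cnn.py and Solver in cs231n/solver.py; the random data is only a stand-in for a real dataset):

import numpy as np
from cs231n.classifiers.cnn import ThreeLayerConvNet   # network class built by stacking layers
from cs231n.solver import Solver                       # training driver that uses the optim.py rules

# Toy data with the key names Solver expects.
data = {
  'X_train': np.random.randn(100, 3, 32, 32), 'y_train': np.random.randint(10, size=100),
  'X_val':   np.random.randn(20, 3, 32, 32),  'y_val':   np.random.randint(10, size=20),
}

model = ThreeLayerConvNet(weight_scale=1e-2, reg=1e-3)
solver = Solver(model, data,
                update_rule='adam',                    # any update rule defined in optim.py
                optim_config={'learning_rate': 1e-3},
                num_epochs=1, batch_size=50, print_every=10)
solver.train()                                         # runs the forward/backward passes and parameter updates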

The layers module

The layers module implements forward and backward propagation for the different kinds of network layers.
Each layer follows the same basic form (using the affine layer as an example):

def affine_forward(x, w, b):
    ...
    return out, cache

def affine_backward(dout, cache):
    ...
    return dx, dw, db

In the forward pass, besides returning the layer's output, we also store whatever data the backward pass will need; for the affine layer that cache is (x, w, b).
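
To see why the cache matters, a layer's backward pass can be checked numerically against its forward pass. Below is a minimal sketch using a hypothetical finite-difference helper (num_grad is not part of the CS231n code; affine_forward and affine_backward are the implementations listed further down):

import numpy as np

def num_grad(f, x, df, h=1e-5):
  """Hypothetical helper: centered finite-difference gradient of sum(f(x) * df) w.r.t. x."""
  grad = np.zeros_like(x)
  it = np.nditer(x, flags=['multi_index'])
  while not it.finished:
    ix = it.multi_index
    old = x[ix]
    x[ix] = old + h; pos = f(x).copy()
    x[ix] = old - h; neg = f(x).copy()
    x[ix] = old
    grad[ix] = np.sum((pos - neg) * df) / (2 * h)
    it.iternext()
  return grad

x = np.random.randn(4, 5)
w = np.random.randn(5, 3)
b = np.random.randn(3)
dout = np.random.randn(4, 3)

out, cache = affine_forward(x, w, b)        # cache holds (x, w, b) for the backward pass
dx, dw, db = affine_backward(dout, cache)   # analytic gradients

dx_num = num_grad(lambda v: affine_forward(v, w, b)[0], x, dout)
print(np.max(np.abs(dx - dx_num)))          # should be tiny, on the order of 1e-9
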
In addition, because conv and ReLU layers are usually used back to back, a combined conv-relu layer's forward and backward passes can be implemented directly, so callers never have to handle the intermediate data (the CS231n layers.py file is attached below):

import numpy as npimport copy def affine_forward(x, w, b):  """  Computes the forward pass for an affine (fully-connected) layer.  The input x has shape (N, d_1, ..., d_k) and contains a minibatch of N  examples, where each example x[i] has shape (d_1, ..., d_k). We will  reshape each input into a vector of dimension D = d_1 * ... * d_k, and  then transform it to an output vector of dimension M.  Inputs:  - x: A numpy array containing input data, of shape (N, d_1, ..., d_k)  - w: A numpy array of weights, of shape (D, M)  - b: A numpy array of biases, of shape (M,)  Returns a tuple of:  - out: output, of shape (N, M)  - cache: (x, w, b)  """  out = None  #############################################################################  # TODO: Implement the affine forward pass. Store the result in out. You     #  # will need to reshape the input into rows.                                 #  #############################################################################  v_x = x.reshape(x.shape[0], -1)#NXD  out = v_x.dot(w)+b.reshape(1, -1) #NXM  #############################################################################  #                             END OF YOUR CODE                              #  #############################################################################  cache = (x, w, b)  return out, cachedef affine_backward(dout, cache):  """  Computes the backward pass for an affine layer.  Inputs:  - dout: Upstream derivative, of shape (N, M)  - cache: Tuple of:    - x: Input data, of shape (N, d_1, ... d_k)    - w: Weights, of shape (D, M)  Returns a tuple of:  - dx: Gradient with respect to x, of shape (N, d1, ..., d_k)  - dw: Gradient with respect to w, of shape (D, M)  - db: Gradient with respect to b, of shape (M,)  """  x, w, b = cache  dx, dw, db = None, None, None  #############################################################################  # TODO: Implement the affine backward pass.                                 #  #############################################################################  dx = np.dot(dout, w.T).reshape(x.shape)  dw = np.dot(x.reshape(x.shape[0],-1).T, dout)  db = np.sum(dout, axis=0).reshape(b.shape)  #############################################################################  #                             END OF YOUR CODE                              #  #############################################################################  return dx, dw, dbdef relu_forward(x):  """  Computes the forward pass for a layer of rectified linear units (ReLUs).  Input:  - x: Inputs, of any shape  Returns a tuple of:  - out: Output, of the same shape as x  - cache: x  """  out = copy.deepcopy(x)  #############################################################################  # TODO: Implement the ReLU forward pass.                                    #  #############################################################################  out[x<0] = 0  #############################################################################  #                             END OF YOUR CODE                              #  #############################################################################  cache = x  return out, cachedef relu_backward(dout, cache):  """  Computes the backward pass for a layer of rectified linear units (ReLUs).  
Input:  - dout: Upstream derivatives, of any shape  - cache: Input x, of same shape as dout  Returns:  - dx: Gradient with respect to x  """  dx, x = None, cache  #############################################################################  # TODO: Implement the ReLU backward pass.                                   #  #############################################################################  relu_out = copy.deepcopy(x)  relu_out[relu_out<0] = 0  relu_out[relu_out>0] = 1  dx = dout*(relu_out)  #############################################################################  #                             END OF YOUR CODE                              #  #############################################################################  return dxdef batchnorm_forward(x, gamma, beta, bn_param):  """  Forward pass for batch normalization.  During training the sample mean and (uncorrected) sample variance are  computed from minibatch statistics and used to normalize the incoming data.  During training we also keep an exponentially decaying running mean of the mean  and variance of each feature, and these averages are used to normalize data  at test-time.  At each timestep we update the running averages for mean and variance using  an exponential decay based on the momentum parameter:  running_mean = momentum * running_mean + (1 - momentum) * sample_mean  running_var = momentum * running_var + (1 - momentum) * sample_var  Note that the batch normalization paper suggests a different test-time  behavior: they compute sample mean and variance for each feature using a  large number of training images rather than using a running average. For  this implementation we have chosen to use running averages instead since  they do not require an additional estimation step; the torch7 implementation  of batch normalization also uses running averages.  Input:  - x: Data of shape (N, D)  - gamma: Scale parameter of shape (D,)  - beta: Shift paremeter of shape (D,)  - bn_param: Dictionary with the following keys:    - mode: 'train' or 'test'; required    - eps: Constant for numeric stability    - momentum: Constant for running mean / variance.    - running_mean: Array of shape (D,) giving running mean of features    - running_var Array of shape (D,) giving running variance of features  Returns a tuple of:  - out: of shape (N, D)  - cache: A tuple of values needed in the backward pass  """  mode = bn_param['mode']  eps = bn_param.get('eps', 1e-5)  momentum = bn_param.get('momentum', 0.9)  N, D = x.shape  running_mean = bn_param.get('running_mean', np.zeros(D, dtype=x.dtype))  running_var = bn_param.get('running_var', np.zeros(D, dtype=x.dtype))  mean = np.mean(x, axis=0)  var = np.var(x, axis=0)  out, cache = None, None  if mode == 'train':    #############################################################################    # TODO: Implement the training-time forward pass for batch normalization.   #    # Use minibatch statistics to compute the mean and variance, use these      #    # statistics to normalize the incoming data, and scale and shift the        #    # normalized data using gamma and beta.                                     #    #                                                                           #    # You should store the output in the variable out. Any intermediates that   #    # you need for the backward pass should be stored in the cache variable.    
#    #                                                                           #    # You should also use your computed sample mean and variance together with  #    # the momentum variable to update the running mean and running variance,    #    # storing your result in the running_mean and running_var variables.        #    #############################################################################    x_hat = (x-mean)/np.sqrt(var + eps)    cache = x, x_hat, mean, var, eps, gamma    running_mean = momentum * running_mean + (1 - momentum) * mean    running_var = momentum * running_var + (1 - momentum) * var          #############################################################################    #                             END OF YOUR CODE                              #    ################################################################  elif mode == 'test':    #############################################################################    # TODO: Implement the test-time forward pass for batch normalization. Use   #    # the running mean and variance to normalize the incoming data, then scale  #    # and shift the normalized data using gamma and beta. Store the result in   #    # the out variable.                                                         #    #############################################################################    x_hat = (x-running_mean)/np.sqrt(running_var + eps)    cache = x, x_hat, running_mean, running_var, eps, gamma    #############################################################################    #                             END OF YOUR CODE                              #    #############################################################################  else:    raise ValueError('Invalid forward batchnorm mode "%s"' % mode)  out = gamma*x_hat + beta  # Store the updated running means back into bn_param  bn_param['running_mean'] = running_mean  bn_param['running_var'] = running_var     return out, cachedef batchnorm_backward(dout, cache):  """  Backward pass for batch normalization.  For this implementation, you should write out a computation graph for  batch normalization on paper and propagate gradients backward through  intermediate nodes.  Inputs:  - dout: Upstream derivatives, of shape (N, D)  - cache: Variable of intermediates from batchnorm_forward.  Returns a tuple of:  - dx: Gradient with respect to inputs x, of shape (N, D)  - dgamma: Gradient with respect to scale parameter gamma, of shape (D,)  - dbeta: Gradient with respect to shift parameter beta, of shape (D,)  """  dx, dgamma, dbeta = None, None, None  #############################################################################  # TODO: Implement the backward pass for batch normalization. Store the      #  # results in the dx, dgamma, and dbeta variables.                           
#  #############################################################################  x, x_hat, mean, var, eps, gamma = cache  N, D = x.shape  temp = 1.0/np.sqrt(var + eps)  dL_dxhat = gamma*dout   dL_dsigma = -0.5*np.sum(dL_dxhat*(x-mean), axis=0, keepdims=True)*(temp**3)  dL_mu = -1*np.sum(dL_dxhat*temp, axis=0,keepdims=True)-2.0*dL_dsigma*np.mean(x-mean, axis=0,keepdims=True)  dx = dL_dxhat*temp + dL_dsigma*2.0*(x-mean)/N+dL_mu*1.0/N  '''  sample_mean = mean  sample_var = var  dx_normalized = dout * gamma       # [N,D]  x_mu = x - sample_mean             # [N,D]  sample_std_inv = 1.0 / np.sqrt(sample_var + eps)    # [1,D]  dsample_var = -0.5 * np.sum(dx_normalized * x_mu, axis=0, keepdims=True) * sample_std_inv**3  dsample_mean = -1.0 * np.sum(dx_normalized * sample_std_inv, axis=0, keepdims=True)\  -2.0 * dsample_var * np.mean(x_mu, axis=0, keepdims=True)  dx1 = dx_normalized * sample_std_inv  dx2 = 2.0/N * dsample_var * x_mu  dx = dx1 + dx2 + 1.0/N * dsample_mean  dout_media = dout * gamma  dx = dout_media / np.sqrt(var + eps)  dmean = -np.sum(dout_media / np.sqrt(var+eps),axis = 0)  dstd = np.sum(-dout_media * (x - mean) / (var + eps),axis = 0)  dvar = 1./2./np.sqrt(var+eps) * dstd  dx_minus_mean_square = dvar / x.shape[0]  dx_minus_mean = 2 * (x-mean) * dx_minus_mean_square  dx += dx_minus_mean  dmean += np.sum(-dx_minus_mean,axis = 0)  dx += dmean / x.shape[0]    sample_mean = mean  sample_var = var    dx_1 = gamma * dout  dx_2_b = np.sum((x - sample_mean) * dx_1, axis=0)  dx_2_a = ((sample_var+ eps) ** -0.5) * dx_1  dx_3_b = (-0.5) * ((sample_var + eps) ** -1.5) * dx_2_b  dx_4_b = dx_3_b * 1  dx_5_b = np.ones_like(x) / N * dx_4_b  dx_6_b = 2 * (x - sample_mean) * dx_5_b  dx_7_a = dx_6_b * 1 + dx_2_a * 1  dx_7_b = dx_6_b * 1 + dx_2_a * 1  dx_8_b = -1 * np.sum(dx_7_b, axis=0)  dx_9_b = np.ones_like(x) / N * dx_8_b  dx_10 = dx_9_b + dx_7_a  dx = dx_10    '''  #dx = ((1 - 1.0/N)/np.sqrt(eps+var) - (x - mean)**2*(N-1)/(N**2)/(np.sqrt(eps+var)**3))*gamma*dout  dgamma = np.sum(x_hat*dout, axis=0)  dbeta = np.sum(dout, axis=0)    #############################################################################  #                             END OF YOUR CODE                              #  #############################################################################  return dx, dgamma, dbetadef batchnorm_backward_alt(dout, cache):  """  Alternative backward pass for batch normalization.  For this implementation you should work out the derivatives for the batch  normalizaton backward pass on paper and simplify as much as possible. You  should be able to derive a simple expression for the backward pass.  Note: This implementation should expect to receive the same cache variable  as batchnorm_backward, but might not use all of the values in the cache.  Inputs / outputs: Same as batchnorm_backward  """  dx, dgamma, dbeta = None, None, None  x, x_hat, mean, var, eps, gamma = cache  N, D = x.shape    #############################################################################  # TODO: Implement the backward pass for batch normalization. Store the      #  # results in the dx, dgamma, and dbeta variables.                           #  #                                                                           #  # After computing the gradient with respect to the centered inputs, you     #  # should be able to compute gradients with respect to the inputs in a       #  # single statement; our implementation fits on a single 80-character line.  
#  #############################################################################  dx = ((1 - 1.0/N)/np.sqrt(eps+var) - (x - mean)**2*(N-1)/(N**2)/(np.sqrt(eps+var)**3))*gamma*dout  dgamma = np.sum(x_hat*dout, axis=0)  dbeta = np.sum(dout, axis=0)    #############################################################################  #                             END OF YOUR CODE                              #  #############################################################################  return dx, dgamma, dbetadef dropout_forward(x, dropout_param):  """  Performs the forward pass for (inverted) dropout.  Inputs:  - x: Input data, of any shape  - dropout_param: A dictionary with the following keys:    - p: Dropout parameter. We drop each neuron output with probability p.    - mode: 'test' or 'train'. If the mode is train, then perform dropout;      if the mode is test, then just return the input.    - seed: Seed for the random number generator. Passing seed makes this      function deterministic, which is needed for gradient checking but not in      real networks.  Outputs:  - out: Array of the same shape as x.  - cache: A tuple (dropout_param, mask). In training mode, mask is the dropout    mask that was used to multiply the input; in test mode, mask is None.  """  p, mode = dropout_param['p'], dropout_param['mode']  if 'seed' in dropout_param:    np.random.seed(dropout_param['seed'])  mask = None  out = None  if mode == 'train':    ###########################################################################    # TODO: Implement the training phase forward pass for inverted dropout.   #    # Store the dropout mask in the mask variable.                            #    ###########################################################################    mask = (np.random.rand(*x.shape)>p)/p    out = x*mask    ###########################################################################    #                            END OF YOUR CODE                             #    ###########################################################################  elif mode == 'test':    ###########################################################################    # TODO: Implement the test phase forward pass for inverted dropout.       #    ###########################################################################    out = x.copy()    ###########################################################################    #                            END OF YOUR CODE                             #    ###########################################################################  cache = (dropout_param, mask)  out = out.astype(x.dtype, copy=False)  return out, cachedef dropout_backward(dout, cache):  """  Perform the backward pass for (inverted) dropout.  Inputs:  - dout: Upstream derivatives, of any shape  - cache: (dropout_param, mask) from dropout_forward.  """  dropout_param, mask = cache  mode = dropout_param['mode']  dx = None  if mode == 'train':    ###########################################################################    # TODO: Implement the training phase backward pass for inverted dropout.  
#    ###########################################################################    dx = dout*mask    ###########################################################################    #                            END OF YOUR CODE                             #    ###########################################################################  elif mode == 'test':    dx = dout  return dxdef conv_forward_naive(x, w, b, conv_param):  """  A naive implementation of the forward pass for a convolutional layer.  The input consists of N data points, each with C channels, height H and width  W. We convolve each input with F different filters, where each filter spans  all C channels and has height HH and width HH.  Input:  - x: Input data of shape (N, C, H, W)  - w: Filter weights of shape (F, C, HH, WW)  - b: Biases, of shape (F,)  - conv_param: A dictionary with the following keys:    - 'stride': The number of pixels between adjacent receptive fields in the      horizontal and vertical directions.    - 'pad': The number of pixels that will be used to zero-pad the input.  Returns a tuple of:  - out: Output data, of shape (N, F, H', W') where H' and W' are given by    H' = 1 + (H + 2 * pad - HH) / stride    W' = 1 + (W + 2 * pad - WW) / stride  - cache: (x, w, b, conv_param)  """  out = None  N, C, H, W = x.shape  F, _, HH, WW = w.shape  pad = conv_param['pad']  stride = conv_param['stride']    H_ = 1 + (H + 2 * pad - HH) / stride  W_ = 1 + (W + 2 * pad - WW) / stride    out = np.zeros((N, F, H_, W_))  #############################################################################  # TODO: Implement the convolutional forward pass.                           #  # Hint: you can use the function np.pad for padding.                        #  #############################################################################  for n in range(N):    for f in range(F):      for i, h in enumerate(range(0, H, stride)):        for j, w_ in enumerate(range(0, W, stride)):          for c in range(C):            padding_x = np.pad(x[n, c, :, :], pad_width=pad, mode = 'constant',constant_values=0)            out[n, f, i, j] += np.sum(padding_x[h:h+HH, w_:w_+WW]*w[f, c, :,:])          out[n, f, i, j] += b[f]  #############################################################################  #                             END OF YOUR CODE                              #  #############################################################################  cache = (x, w, b, conv_param)  return out, cachedef conv_backward_naive(dout, cache):  """  A naive implementation of the backward pass for a convolutional layer.  Inputs:  - dout: Upstream derivatives.  - cache: A tuple of (x, w, b, conv_param) as in conv_forward_naive  Returns a tuple of:  - dx: Gradient with respect to x  - dw: Gradient with respect to w  - db: Gradient with respect to b  """  dx, dw, db = None, None, None  #############################################################################  # TODO: Implement the convolutional backward pass.                          
#  #############################################################################  x, w, b, conv_param = cache  N, C, H, W = x.shape  F, _, HH, WW = w.shape  pad = conv_param['pad']  stride = conv_param['stride']    H_ = 1 + (H + 2 * pad - HH) / stride  W_ = 1 + (W + 2 * pad - WW) / stride    out = np.zeros((N, F, H_, W_))    dx_pad = np.zeros((N, C, H+2*pad, W+2*pad))  dw = np.zeros(w.shape)  db = np.zeros(b.shape)  for n in range(N):    for f in range(F):      for i, h in enumerate(range(0, H, stride)):        for j, w_ in enumerate(range(0, W, stride)):          db[f] += dout[n, f, i, j]          for c in range(C):              dx_pad[n, c, h:h+HH, w_:w_+WW] += w[f, c, :, :]*dout[n, f, i, j]            padding_x = np.pad(x[n, c, :, :], pad_width=pad, mode = 'constant',constant_values=0)            dw[f, c, :, :] += padding_x[h:h+HH, w_:w_+WW]*dout[n, f, i, j]  dx = dx_pad[:, :, 1:-1, 1:-1]  #############################################################################  #                             END OF YOUR CODE                              #  #############################################################################  return dx, dw, dbdef max_pool_forward_naive(x, pool_param):  """  A naive implementation of the forward pass for a max pooling layer.  Inputs:  - x: Input data, of shape (N, C, H, W)  - pool_param: dictionary with the following keys:    - 'pool_height': The height of each pooling region    - 'pool_width': The width of each pooling region    - 'stride': The distance between adjacent pooling regions  Returns a tuple of:  - out: Output data  - cache: (x, pool_param)  """  out = None  N, C, H, W = x.shape  PH, PW, stride = pool_param['pool_height'], \    pool_param['pool_width'], pool_param['stride']  H_ = 1 + (H - PH) / stride  W_ = 1 + (W - PW) / stride   out = np.zeros((N, C, H_, W_))  #############################################################################  # TODO: Implement the max pooling forward pass                              #  #############################################################################  for n in range(N):    for i, h in enumerate(range(0, H, stride)):      for j, w_ in enumerate(range(0, W, stride)):        for c in range(C):          out[n, c, i, j] = np.max(x[n, c, h:h+PH, w_:w_+PW])  #############################################################################  #                             END OF YOUR CODE                              #  #############################################################################  cache = (x, pool_param)  return out, cachedef max_pool_backward_naive(dout, cache):  """  A naive implementation of the backward pass for a max pooling layer.  Inputs:  - dout: Upstream derivatives  - cache: A tuple of (x, pool_param) as in the forward pass.  
Returns:  - dx: Gradient with respect to x  """  dx = None  x, pool_param = cache  N, C, H, W = x.shape  PH, PW, stride = pool_param['pool_height'], \  pool_param['pool_width'], pool_param['stride']  H_ = 1 + (H - PH) / stride  W_ = 1 + (W - PW) / stride    dx = np.zeros(x.shape)  #############################################################################  # TODO: Implement the max pooling backward pass                             #  #############################################################################  for n in range(N):    for i, h in enumerate(range(0, H, stride)):      for j, w_ in enumerate(range(0, W, stride)):        for c in range(C):          part_x = x[n,c,h:h+PH, w_:w_+PW]          dx[n, c, h:h+PH, w_:w_+PW] +=  (part_x== np.max(part_x))*dout[n,c,i,j]  #############################################################################  #                             END OF YOUR CODE                              #  #############################################################################  return dxdef spatial_batchnorm_forward(x, gamma, beta, bn_param):  """  Computes the forward pass for spatial batch normalization.  Inputs:  - x: Input data of shape (N, C, H, W)  - gamma: Scale parameter, of shape (C,)  - beta: Shift parameter, of shape (C,)  - bn_param: Dictionary with the following keys:    - mode: 'train' or 'test'; required    - eps: Constant for numeric stability    - momentum: Constant for running mean / variance. momentum=0 means that      old information is discarded completely at every time step, while      momentum=1 means that new information is never incorporated. The      default of momentum=0.9 should work well in most situations.    - running_mean: Array of shape (D,) giving running mean of features    - running_var Array of shape (D,) giving running variance of features  Returns a tuple of:  - out: Output data, of shape (N, C, H, W)  - cache: Values needed for the backward pass  """  out, cache = None, None  N, C, H, W = x.shape  #############################################################################  # TODO: Implement the forward pass for spatial batch normalization.         #  #                                                                           #  # HINT: You can implement spatial batch normalization using the vanilla     #  # version of batch normalization defined above. Your implementation should  #  # be very short; ours is less than five lines.                              #  #############################################################################  out, cache = batchnorm_forward(x.reshape(N*H*W, C), gamma, beta, bn_param)  out = out.reshape(x.shape)  #############################################################################  #                             END OF YOUR CODE                              #  #############################################################################  return out, cachedef spatial_batchnorm_backward(dout, cache):  """  Computes the backward pass for spatial batch normalization.  Inputs:  - dout: Upstream derivatives, of shape (N, C, H, W)  - cache: Values from the forward pass  Returns a tuple of:  - dx: Gradient with respect to inputs, of shape (N, C, H, W)  - dgamma: Gradient with respect to scale parameter, of shape (C,)  - dbeta: Gradient with respect to shift parameter, of shape (C,)  """  dx, dgamma, dbeta = None, None, None  N, C, H, W = dout.shape  #############################################################################  # TODO: Implement the backward pass for spatial batch normalization. 
       #  #                                                                           #  # HINT: You can implement spatial batch normalization using the vanilla     #  # version of batch normalization defined above. Your implementation should  #  # be very short; ours is less than five lines.                              #  #############################################################################  dx, dgamma, dbeta = batchnorm_backward(dout.reshape(N*H*W, C), cache)  dx = dx.reshape(dout.shape)  dgamma = dgamma.reshape(C)  dbeta = dbeta.reshape(C)  #############################################################################  #                             END OF YOUR CODE                              #  #############################################################################  return dx, dgamma, dbetadef svm_loss(x, y):  """  Computes the loss and gradient using for multiclass SVM classification.  Inputs:  - x: Input data, of shape (N, C) where x[i, j] is the score for the jth class    for the ith input.  - y: Vector of labels, of shape (N,) where y[i] is the label for x[i] and    0 <= y[i] < C  Returns a tuple of:  - loss: Scalar giving the loss  - dx: Gradient of the loss with respect to x  """  N = x.shape[0]  correct_class_scores = x[np.arange(N), y]  margins = np.maximum(0, x - correct_class_scores[:, np.newaxis] + 1.0)  margins[np.arange(N), y] = 0  loss = np.sum(margins) / N  num_pos = np.sum(margins > 0, axis=1)  dx = np.zeros_like(x)  dx[margins > 0] = 1  dx[np.arange(N), y] -= num_pos  dx /= N  return loss, dxdef softmax_loss(x, y):  """  Computes the loss and gradient for softmax classification.  Inputs:  - x: Input data, of shape (N, C) where x[i, j] is the score for the jth class    for the ith input.  - y: Vector of labels, of shape (N,) where y[i] is the label for x[i] and    0 <= y[i] < C  Returns a tuple of:  - loss: Scalar giving the loss  - dx: Gradient of the loss with respect to x  """  probs = np.exp(x - np.max(x, axis=1, keepdims=True))  probs /= np.sum(probs, axis=1, keepdims=True)  N = x.shape[0]  loss = -np.sum(np.log(probs[np.arange(N), y])) / N  dx = probs.copy()  dx[np.arange(N), y] -= 1  dx /= N  return loss, dx

Layers can also be combined into convenience layers to make network construction easier (the CS231n layer_utils.py implementation):

from cs231n.layers import *
from cs231n.fast_layers import *


def affine_relu_forward(x, w, b):
  """
  Convenience layer that performs an affine transform followed by a ReLU

  Inputs:
  - x: Input to the affine layer
  - w, b: Weights for the affine layer

  Returns a tuple of:
  - out: Output from the ReLU
  - cache: Object to give to the backward pass
  """
  a, fc_cache = affine_forward(x, w, b)
  out, relu_cache = relu_forward(a)
  cache = (fc_cache, relu_cache)
  return out, cache


def affine_relu_backward(dout, cache):
  """
  Backward pass for the affine-relu convenience layer
  """
  fc_cache, relu_cache = cache
  da = relu_backward(dout, relu_cache)
  dx, dw, db = affine_backward(da, fc_cache)
  return dx, dw, db


def conv_relu_forward(x, w, b, conv_param):
  """
  A convenience layer that performs a convolution followed by a ReLU.

  Inputs:
  - x: Input to the convolutional layer
  - w, b, conv_param: Weights and parameters for the convolutional layer

  Returns a tuple of:
  - out: Output from the ReLU
  - cache: Object to give to the backward pass
  """
  a, conv_cache = conv_forward_fast(x, w, b, conv_param)
  out, relu_cache = relu_forward(a)
  cache = (conv_cache, relu_cache)
  return out, cache


def conv_relu_backward(dout, cache):
  """
  Backward pass for the conv-relu convenience layer.
  """
  conv_cache, relu_cache = cache
  da = relu_backward(dout, relu_cache)
  dx, dw, db = conv_backward_fast(da, conv_cache)
  return dx, dw, db


def conv_relu_pool_forward(x, w, b, conv_param, pool_param):
  """
  Convenience layer that performs a convolution, a ReLU, and a pool.

  Inputs:
  - x: Input to the convolutional layer
  - w, b, conv_param: Weights and parameters for the convolutional layer
  - pool_param: Parameters for the pooling layer

  Returns a tuple of:
  - out: Output from the pooling layer
  - cache: Object to give to the backward pass
  """
  a, conv_cache = conv_forward_fast(x, w, b, conv_param)
  s, relu_cache = relu_forward(a)
  out, pool_cache = max_pool_forward_fast(s, pool_param)
  cache = (conv_cache, relu_cache, pool_cache)
  return out, cache


def conv_relu_pool_backward(dout, cache):
  """
  Backward pass for the conv-relu-pool convenience layer
  """
  conv_cache, relu_cache, pool_cache = cache
  ds = max_pool_backward_fast(dout, pool_cache)
  da = relu_backward(ds, relu_cache)
  dx, dw, db = conv_backward_fast(da, conv_cache)
  return dx, dw, db

Implementing the CNN network

With the layers module in place, building the CNN itself is straightforward. The network is usually implemented as a class whose constructor has the following form:

  def __init__(self, input_dim=(3, 32, 32), num_filters=32, filter_size=7,
               hidden_dim=100, num_classes=10, weight_scale=1e-3, reg=0.0,
               dtype=np.float32, Use_bn=False):

The class should also contain a loss method: if y is None it returns the scores, otherwise it returns the loss and the gradients:

  def loss(self, X, y=None):
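
The pattern inside loss is the same regardless of depth. The following schematic sketch (a hypothetical two-layer helper written as a plain function, using the layer functions from layers.py and layer_utils.py, not the actual CS231n three-layer class shown below) illustrates the y=None branching:

def two_layer_loss(params, X, y=None, reg=0.0):
  """Schematic loss for an affine-relu-affine-softmax net (hypothetical helper)."""
  W1, b1 = params['W1'], params['b1']
  W2, b2 = params['W2'], params['b2']

  # Forward pass: stack layer forward functions and keep every cache.
  h, cache1 = affine_relu_forward(X, W1, b1)
  scores, cache2 = affine_forward(h, W2, b2)

  if y is None:                 # test time: only the class scores are needed
    return scores

  # Training time: data loss plus L2 regularization, then backprop through the caches.
  loss, dscores = softmax_loss(scores, y)
  loss += 0.5 * reg * (np.sum(W1 ** 2) + np.sum(W2 ** 2))

  grads = {}
  dh, grads['W2'], grads['b2'] = affine_backward(dscores, cache2)
  _, grads['W1'], grads['b1'] = affine_relu_backward(dh, cache1)
  grads['W1'] += reg * W1
  grads['W2'] += reg * W2
  return loss, grads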

The three-layer network implemented in CS231n is as follows:

import numpy as npfrom cs231n.layers import *from cs231n.fast_layers import *from cs231n.layer_utils import *class ThreeLayerConvNet(object):  """  A three-layer convolutional network with the following architecture:  conv - relu - 2x2 max pool - affine - relu - affine - softmax  The network operates on minibatches of data that have shape (N, C, H, W)  consisting of N images, each with height H and width W and with C input  channels.  """  def __init__(self, input_dim=(3, 32, 32), num_filters=32, filter_size=7,               hidden_dim=100, num_classes=10, weight_scale=1e-3, reg=0.0,               dtype=np.float32, Use_bn = False):    """    Initialize a new network.    Inputs:    - input_dim: Tuple (C, H, W) giving size of input data    - num_filters: Number of filters to use in the convolutional layer    - filter_size: Size of filters to use in the convolutional layer    - hidden_dim: Number of units to use in the fully-connected hidden layer    - num_classes: Number of scores to produce from the final affine layer.    - weight_scale: Scalar giving standard deviation for random initialization      of weights.    - reg: Scalar giving L2 regularization strength    - dtype: numpy datatype to use for computation.    """    self.params = {}    self.reg = reg    self.dtype = dtype    self.Use_bn = Use_bn    C, H, W = input_dim    ############################################################################    # TODO: Initialize weights and biases for the three-layer convolutional    #    # network. Weights should be initialized from a Gaussian with standard     #    # deviation equal to weight_scale; biases should be initialized to zero.   #    # All weights and biases should be stored in the dictionary self.params.   #    # Store weights and biases for the convolutional layer using the keys 'W1' #    # and 'b1'; use keys 'W2' and 'b2' for the weights and biases of the       #    # hidden affine layer, and keys 'W3' and 'b3' for the weights and biases   #    # of the output affine layer.                                              #    ############################################################################    #conv - relu - 2x2 max pool - affine - relu - affine - softmax    self.params['W1'] = np.random.randn(num_filters, C, filter_size, filter_size)*weight_scale    self.params['W2'] = np.random.randn(H/2*W/2*num_filters, hidden_dim)*weight_scale    self.params['W3'] = np.random.randn(hidden_dim, num_classes)*weight_scale    self.params['b1'] = np.zeros(num_filters)    self.params['b2'] = np.zeros(hidden_dim)    self.params['b3'] = np.zeros(num_classes)    self.params['gamma1'] = np.ones(num_filters)    self.params['beta1'] = np.zeros(num_filters)    self.params['gamma2'] = np.ones(hidden_dim)    self.params['beta2'] = np.zeros(hidden_dim)         ############################################################################    #                             END OF YOUR CODE                             #    ############################################################################    for k, v in self.params.iteritems():      self.params[k] = v.astype(dtype)  def loss(self, X,y=None):    """    Evaluate loss and gradient for the three-layer convolutional network.    Input / output: Same API as TwoLayerNet in fc_net.py.    
"""    W1, b1 = self.params['W1'], self.params['b1']    W2, b2 = self.params['W2'], self.params['b2']    W3, b3 = self.params['W3'], self.params['b3']    gamma1, gamma2 = self.params['gamma1'], self.params['gamma2']    beta1, beta2 = self.params['beta1'], self.params['beta2']    # pass conv_param to the forward pass for the convolutional layer    filter_size = W1.shape[2]    conv_param = {'stride': 1, 'pad': (filter_size - 1) / 2}    # pass pool_param to the forward pass for the max-pooling layer    pool_param = {'pool_height': 2, 'pool_width': 2, 'stride': 2}    bn_params1, bn_params2 = [{'mode': 'train'} ,{'mode': 'train'} ]    scores = None    ############################################################################    # TODO: Implement the forward pass for the three-layer convolutional net,  #    # computing the class scores for X and storing them in the scores          #    # variable.                                                                #    ############################################################################    #conv - relu - 2x2 max pool - affine - relu - affine - softmax    out_conv, cache_conv = conv_forward_im2col(X, W1, b1, conv_param)    out_conv, cache_sbn = spatial_batchnorm_forward(out_conv, gamma1, beta1, bn_params1)    out_relu1, cache_relu1 = relu_forward(out_conv)    out_pool, cache_pool = max_pool_forward_fast(out_relu1, pool_param)    out_affine1 , cache_affine1 = affine_forward(out_pool, W2, b2)    out_affine1, cache_bn = batchnorm_forward(out_affine1, gamma2, beta2, bn_params2)    out_relu2, cache_relu2 = relu_forward(out_affine1)    scores , cache_affine2 = affine_forward(out_relu2, W3, b3)    ############################################################################    #                             END OF YOUR CODE                             #    ############################################################################    if y is None:      return scores    loss, grads = 0, {}    ############################################################################    # TODO: Implement the backward pass for the three-layer convolutional net, #    # storing the loss and gradients in the loss and grads variables. Compute  #    # data loss using softmax, and make sure that grads[k] holds the gradients #    # for self.params[k]. Don't forget to add L2 regularization!               #    ############################################################################    loss, dx = softmax_loss(scores, y)    dx, grads['W3'], grads['b3'] = affine_backward(dx, cache_affine2)    dx = relu_backward(dx, cache_relu2)    dx, grads['gamma2'], grads['beta2'] =  batchnorm_backward(dx, cache_bn)    dx, grads['W2'], grads['b2'] = affine_backward(dx, cache_affine1)    dx = max_pool_backward_fast(dx, cache_pool)    dx = relu_backward(dx, cache_relu1)    dx, grads['gamma1'], grads['beta1'] =  spatial_batchnorm_backward(dx, cache_sbn)    _, grads['W1'], grads['b1'] = conv_backward_im2col(dx, cache_conv)    loss += 0.5*self.reg*(np.sum(W1**2)+np.sum(W2**2)+np.sum(W3**2))    grads['W1'] += W1*self.reg;    grads['W2'] += W2*self.reg;    grads['W3'] += W3*self.reg;    ############################################################################    #                             END OF YOUR CODE                             #    ############################################################################    return loss, grads  pass

Optimization methods

There are many optimization methods, but they all share the same interface: each takes the weights w, the gradient dw, and a dictionary of configuration parameters config:

def sgd_momentum(w, dw, config=None):
    ...
    return next_w, config
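
What matters is that the returned config (which carries optimizer state such as the momentum velocity) is threaded back into the next call. A minimal usage sketch, assuming a model with a params dictionary and a loss method as described above (model, X_batch, y_batch and num_iterations are placeholders):

optim_configs = {p: {'learning_rate': 1e-3} for p in model.params}  # one state dict per parameter

for t in range(num_iterations):
  loss, grads = model.loss(X_batch, y_batch)
  for p, w in model.params.items():
    next_w, next_config = sgd_momentum(w, grads[p], optim_configs[p])
    model.params[p] = next_w         # updated weights
    optim_configs[p] = next_config   # updated state (e.g. 'velocity')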

In CS231n these update rules live in optim.py:

import numpy as np"""This file implements various first-order update rules that are commonly used fortraining neural networks. Each update rule accepts current weights and thegradient of the loss with respect to those weights and produces the next set ofweights. Each update rule has the same interface:def update(w, dw, config=None):Inputs:  - w: A numpy array giving the current weights.  - dw: A numpy array of the same shape as w giving the gradient of the    loss with respect to w.  - config: A dictionary containing hyperparameter values such as learning rate,    momentum, etc. If the update rule requires caching values over many    iterations, then config will also hold these cached values.Returns:  - next_w: The next point after the update.  - config: The config dictionary to be passed to the next iteration of the    update rule.NOTE: For most update rules, the default learning rate will probably not performwell; however the default values of the other hyperparameters should work wellfor a variety of different problems.For efficiency, update rules may perform in-place updates, mutating w andsetting next_w equal to w."""def sgd(w, dw, config=None):  """  Performs vanilla stochastic gradient descent.  config format:  - learning_rate: Scalar learning rate.  """  if config is None: config = {}  config.setdefault('learning_rate', 1e-2)  w -= config['learning_rate'] * dw  return w, configdef sgd_momentum(w, dw, config=None):  """  Performs stochastic gradient descent with momentum.  config format:  - learning_rate: Scalar learning rate.  - momentum: Scalar between 0 and 1 giving the momentum value.    Setting momentum = 0 reduces to sgd.  - velocity: A numpy array of the same shape as w and dw used to store a moving    average of the gradients.  """  if config is None: config = {}  config.setdefault('learning_rate', 1e-2)  config.setdefault('momentum', 0.9)  v = config.get('velocity', np.zeros_like(w))  next_w = None  #############################################################################  # TODO: Implement the momentum update formula. Store the updated value in   #  # the next_w variable. You should also use and update the velocity v.       #  #############################################################################  v = config['momentum']*v-dw*config['learning_rate']  #############################################################################  #                             END OF YOUR CODE                              #  #############################################################################  config['velocity'] = v  next_w = w + v  return next_w, configdef rmsprop(x, dx, config=None):  """  Uses the RMSProp update rule, which uses a moving average of squared gradient  values to set adaptive per-parameter learning rates.  config format:  - learning_rate: Scalar learning rate.  - decay_rate: Scalar between 0 and 1 giving the decay rate for the squared    gradient cache.  - epsilon: Small scalar used for smoothing to avoid dividing by zero.  - cache: Moving average of second moments of gradients.  """  if config is None: config = {}  config.setdefault('learning_rate', 1e-2)  config.setdefault('decay_rate', 0.99)  config.setdefault('epsilon', 1e-8)  config.setdefault('cache', np.zeros_like(x))  next_x = None  #############################################################################  # TODO: Implement the RMSprop update formula, storing the next value of x   #  # in the next_x variable. Don't forget to update cache value stored in      #    # config['cache'].                
                                          #  #############################################################################  cache = config['cache']  decay_rate = config['decay_rate']  eps = config['epsilon']  learning_rate = config['learning_rate']  cache = decay_rate * cache + (1 - decay_rate) * dx**2  next_x = x - learning_rate * dx / (np.sqrt(cache) + eps)    config['cache'] = cache  #############################################################################  #                             END OF YOUR CODE                              #  #############################################################################  return next_x, configdef adam(x, dx, config=None):  """  Uses the Adam update rule, which incorporates moving averages of both the  gradient and its square and a bias correction term.  config format:  - learning_rate: Scalar learning rate.  - beta1: Decay rate for moving average of first moment of gradient.  - beta2: Decay rate for moving average of second moment of gradient.  - epsilon: Small scalar used for smoothing to avoid dividing by zero.  - m: Moving average of gradient.  - v: Moving average of squared gradient.  - t: Iteration number.  """  if config is None: config = {}  config.setdefault('learning_rate', 1e-3)  config.setdefault('beta1', 0.9)  config.setdefault('beta2', 0.999)  config.setdefault('epsilon', 1e-8)  config.setdefault('m', np.zeros_like(x))  config.setdefault('v', np.zeros_like(x))  config.setdefault('t', 0)  next_x = None  ################################################################  # TODO: Implement the Adam update formula, storing the next value of x in   #  # the next_x variable. Don't forget to update the m, v, and t variables     #  # stored in config.                                                           ################################################################  m = config['m']  v = config['v']  beta1 = config['beta1']  beta2 = config['beta2']  learning_rate = config['learning_rate']  eps = config['epsilon']  t = config['t'] + 1  m = beta1*m + (1-beta1)*dx  v = beta2*v + (1-beta2)*(dx**2)     mb = m*1.0/(1-beta1**t)  vb = v*1.0/(1-beta2**t)  x = x - learning_rate * mb / (np.sqrt(vb) + eps)  next_x = x    config['m'] = m  config['v'] = v  config['t'] = t  ################################################################                           END OF YOUR CODE                                ################################################################  return next_x, config

Training the network

In CS231n, training is concentrated in the Solver class; its constructor and main methods are:

  def __init__(self, model, data, **kwargs):
  def _reset(self):
  def _step(self):
  def check_accuracy(self, X, y, num_samples=None, batch_size=100):
  def train(self):

The main idea: _reset reinitializes the bookkeeping variables (training loss, accuracy history, and so on), _step performs one iteration of the optimization, check_accuracy evaluates and records accuracy at the end of each epoch, and train wraps the entire training loop.
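
For example, after constructing and training a Solver (with model and data set up as in the constructor docstring below), the bookkeeping that _reset initializes can be inspected directly:

solver = Solver(model, data,
                update_rule='sgd_momentum',
                optim_config={'learning_rate': 1e-3},
                lr_decay=0.95, num_epochs=10, batch_size=100,
                print_every=100)
solver.train()

print(solver.best_val_acc)        # best validation accuracy reached at any epoch
print(solver.loss_history[:5])    # one entry per _step call (per iteration)
print(solver.train_acc_history)   # one entry per epoch, from check_accuracy
print(solver.val_acc_history)
# After train() returns, model.params holds the best parameters found on the validation set.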

import numpy as npfrom cs231n import optimclass Solver(object):  """  A Solver encapsulates all the logic necessary for training classification  models. The Solver performs stochastic gradient descent using different  update rules defined in optim.py.  The solver accepts both training and validataion data and labels so it can  periodically check classification accuracy on both training and validation  data to watch out for overfitting.  To train a model, you will first construct a Solver instance, passing the  model, dataset, and various optoins (learning rate, batch size, etc) to the  constructor. You will then call the train() method to run the optimization  procedure and train the model.  After the train() method returns, model.params will contain the parameters  that performed best on the validation set over the course of training.  In addition, the instance variable solver.loss_history will contain a list  of all losses encountered during training and the instance variables  solver.train_acc_history and solver.val_acc_history will be lists containing  the accuracies of the model on the training and validation set at each epoch.  Example usage might look something like this:  data = {    'X_train': # training data    'y_train': # training labels    'X_val': # validation data    'X_train': # validation labels  }  model = MyAwesomeModel(hidden_size=100, reg=10)  solver = Solver(model, data,                  update_rule='sgd',                  optim_config={                    'learning_rate': 1e-3,                  },                  lr_decay=0.95,                  num_epochs=10, batch_size=100,                  print_every=100)  solver.train()  A Solver works on a model object that must conform to the following API:  - model.params must be a dictionary mapping string parameter names to numpy    arrays containing parameter values.  - model.loss(X, y) must be a function that computes training-time loss and    gradients, and test-time classification scores, with the following inputs    and outputs:    Inputs:    - X: Array giving a minibatch of input data of shape (N, d_1, ..., d_k)    - y: Array of labels, of shape (N,) giving labels for X where y[i] is the      label for X[i].    Returns:    If y is None, run a test-time forward pass and return:    - scores: Array of shape (N, C) giving classification scores for X where      scores[i, c] gives the score of class c for X[i].    If y is not None, run a training time forward and backward pass and return    a tuple of:    - loss: Scalar giving the loss    - grads: Dictionary with the same keys as self.params mapping parameter      names to gradients of the loss with respect to those parameters.  """  def __init__(self, model, data, **kwargs):    """    Construct a new Solver instance.    Required arguments:    - model: A model object conforming to the API described above    - data: A dictionary of training and validation data with the following:      'X_train': Array of shape (N_train, d_1, ..., d_k) giving training images      'X_val': Array of shape (N_val, d_1, ..., d_k) giving validation images      'y_train': Array of shape (N_train,) giving labels for training images      'y_val': Array of shape (N_val,) giving labels for validation images    Optional arguments:    - update_rule: A string giving the name of an update rule in optim.py.      Default is 'sgd'.    - optim_config: A dictionary containing hyperparameters that will be      passed to the chosen update rule. 
Each update rule requires different      hyperparameters (see optim.py) but all update rules require a      'learning_rate' parameter so that should always be present.    - lr_decay: A scalar for learning rate decay; after each epoch the learning      rate is multiplied by this value.    - batch_size: Size of minibatches used to compute loss and gradient during      training.    - num_epochs: The number of epochs to run for during training.    - print_every: Integer; training losses will be printed every print_every      iterations.    - verbose: Boolean; if set to false then no output will be printed during      training.    """    self.model = model    self.X_train = data['X_train']    self.y_train = data['y_train']    self.X_val = data['X_val']    self.y_val = data['y_val']    # Unpack keyword arguments    self.update_rule = kwargs.pop('update_rule', 'sgd')    self.optim_config = kwargs.pop('optim_config', {})    self.lr_decay = kwargs.pop('lr_decay', 1.0)    self.batch_size = kwargs.pop('batch_size', 100)    self.num_epochs = kwargs.pop('num_epochs', 10)    self.print_every = kwargs.pop('print_every', 10)    self.verbose = kwargs.pop('verbose', True)    # Throw an error if there are extra keyword arguments    if len(kwargs) > 0:      extra = ', '.join('"%s"' % k for k in kwargs.keys())      raise ValueError('Unrecognized arguments %s' % extra)    # Make sure the update rule exists, then replace the string    # name with the actual function    if not hasattr(optim, self.update_rule):      raise ValueError('Invalid update_rule "%s"' % self.update_rule)    self.update_rule = getattr(optim, self.update_rule)    self._reset()  def _reset(self):    """    Set up some book-keeping variables for optimization. Don't call this    manually.    """    # Set up some variables for book-keeping    self.epoch = 0    self.best_val_acc = 0    self.best_params = {}    self.loss_history = []    self.train_acc_history = []    self.val_acc_history = []    # Make a deep copy of the optim_config for each parameter    self.optim_configs = {}    for p in self.model.params:      d = {k: v for k, v in self.optim_config.iteritems()}      self.optim_configs[p] = d  def _step(self):    """    Make a single gradient update. This is called by train() and should not    be called manually.    """    # Make a minibatch of training data    num_train = self.X_train.shape[0]    batch_mask = np.random.choice(num_train, self.batch_size)    X_batch = self.X_train[batch_mask]    y_batch = self.y_train[batch_mask]    # Compute loss and gradient    loss, grads = self.model.loss(X_batch, y_batch)    self.loss_history.append(loss)    # Perform a parameter update    for p, w in self.model.params.iteritems():      dw = grads[p]      config = self.optim_configs[p]      next_w, next_config = self.update_rule(w, dw, config)      self.model.params[p] = next_w      self.optim_configs[p] = next_config  def check_accuracy(self, X, y, num_samples=None, batch_size=100):    """    Check accuracy of the model on the provided data.    Inputs:    - X: Array of data, of shape (N, d_1, ..., d_k)    - y: Array of labels, of shape (N,)    - num_samples: If not None, subsample the data and only test the model      on num_samples datapoints.    - batch_size: Split X and y into batches of this size to avoid using too      much memory.    Returns:    - acc: Scalar giving the fraction of instances that were correctly      classified by the model.    
"""    # Maybe subsample the data    N = X.shape[0]    if num_samples is not None and N > num_samples:      mask = np.random.choice(N, num_samples)      N = num_samples      X = X[mask]      y = y[mask]    # Compute predictions in batches    num_batches = N / batch_size    if N % batch_size != 0:      num_batches += 1    y_pred = []    for i in xrange(num_batches):      start = i * batch_size      end = (i + 1) * batch_size      scores = self.model.loss(X[start:end])      y_pred.append(np.argmax(scores, axis=1))    y_pred = np.hstack(y_pred)    acc = np.mean(y_pred == y)    return acc  def train(self):    """    Run optimization to train the model.    """    num_train = self.X_train.shape[0]    iterations_per_epoch = max(num_train / self.batch_size, 1)    num_iterations = self.num_epochs * iterations_per_epoch    for t in xrange(num_iterations):      self._step()      # Maybe print training loss      if self.verbose and t % self.print_every == 0:        print '(Iteration %d / %d) loss: %f' % (               t + 1, num_iterations, self.loss_history[-1])      # At the end of every epoch, increment the epoch counter and decay the      # learning rate.      epoch_end = (t + 1) % iterations_per_epoch == 0      if epoch_end:        self.epoch += 1        for k in self.optim_configs:          self.optim_configs[k]['learning_rate'] *= self.lr_decay      # Check train and val accuracy on the first iteration, the last      # iteration, and at the end of each epoch.      first_it = (t == 0)      last_it = (t == num_iterations + 1)      if first_it or last_it or epoch_end:        train_acc = self.check_accuracy(self.X_train, self.y_train,                                        num_samples=1000)        val_acc = self.check_accuracy(self.X_val, self.y_val)        self.train_acc_history.append(train_acc)        self.val_acc_history.append(val_acc)        if self.verbose:          print '(Epoch %d / %d) train acc: %f; val_acc: %f' % (                 self.epoch, self.num_epochs, train_acc, val_acc)        # Keep track of the best model        if val_acc > self.best_val_acc:          self.best_val_acc = val_acc          self.best_params = {}          for k, v in self.model.params.iteritems():            self.best_params[k] = v.copy()    # At the end of training swap the best params into the model    self.model.params = self.best_params