CS231n (winter 2016) : Assignment2





Part 1: Deep fully-connected neural networks (Python programming assignment)


def layer_forward(x, w):
    """ Receive inputs x and weights w """
    # Do some computations ...
    z = # ... some intermediate value
    # Do some more computations ...
    out = # the output

    cache = (x, w, z, out) # Values we need to compute gradients

    return out, cache

The backward pass will receive upstream derivatives and the cache object, and will return gradients with respect to the inputs and weights, like this:

def layer_backward(dout, cache):
    """
    Receive derivative of loss with respect to outputs and cache,
    and compute derivative with respect to inputs.
    """
    # Unpack cache values
    x, w, z, out = cache

    # Use values in cache to compute derivatives
    dx = # Derivative of loss with respect to x
    dw = # Derivative of loss with respect to w

    return dx, dw
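As a concrete instance of this forward/backward pattern, here is a minimal sketch for a hypothetical elementwise scaling layer, out = x * w. The names `scale_forward` and `scale_backward` are made up for illustration and are not part of the assignment code.

```python
import numpy as np

def scale_forward(x, w):
    """Hypothetical layer: elementwise scaling out = x * w."""
    out = x * w
    cache = (x, w)  # values the backward pass will need
    return out, cache

def scale_backward(dout, cache):
    """Gradients of the loss w.r.t. x and w, given the upstream gradient dout."""
    x, w = cache
    dx = dout * w   # d(out)/dx = w
    dw = dout * x   # d(out)/dw = x
    return dx, dw

x = np.array([1.0, 2.0, 3.0])
w = np.array([0.5, -1.0, 2.0])
out, cache = scale_forward(x, w)
dx, dw = scale_backward(np.ones_like(out), cache)
```

With an upstream gradient of all ones, `dx` equals `w` and `dw` equals `x`, which is easy to verify by hand.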

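In addition, we will integrate all the parameter update strategies covered earlier into the modules, so that we can compare how the different update rules perform. We will also add Batch Normalization and Dropout to the modules in order to optimize deep networks more efficiently.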


1. Two-layer fully-connected neural network

--> the TwoLayerNet class in fc_net.py
--> the first four functions in layers.py
--> optim.py

---> fc_net.py

__coauthor__ = 'Deeplayer'
# 6.22.2016 #

import numpy as np
from layer_utils import *


class TwoLayerNet(object):
    """
    A two-layer fully-connected neural network with ReLU nonlinearity and
    softmax loss that uses a modular layer design. We assume an input dimension
    of D, a hidden dimension of H, and perform classification over C classes.

    The architecture should be affine - relu - affine - softmax.

    Note that this class does not implement gradient descent; instead, it
    will interact with a separate Solver object that is responsible for running
    optimization.

    The learnable parameters of the model are stored in the dictionary
    self.params that maps parameter names to numpy arrays.
    """

    def __init__(self, input_dim=3*32*32, hidden_dim=100, num_classes=10,
                 weight_scale=1e-3, reg=0.0):
        """
        Initialize a new network.

        Inputs:
        - input_dim: An integer giving the size of the input
        - hidden_dim: An integer giving the size of the hidden layer
        - num_classes: An integer giving the number of classes to classify
        - weight_scale: Scalar giving the standard deviation for random
          initialization of the weights.
        - reg: Scalar giving L2 regularization strength.
        """
        self.params = {}
        self.reg = reg
        self.params['W1'] = weight_scale * np.random.randn(input_dim, hidden_dim)
        self.params['b1'] = np.zeros((1, hidden_dim))
        self.params['W2'] = weight_scale * np.random.randn(hidden_dim, num_classes)
        self.params['b2'] = np.zeros((1, num_classes))

    def loss(self, X, y=None):
        """
        Compute loss and gradient for a minibatch of data.

        Inputs:
        - X: Array of input data of shape (N, d_1, ..., d_k)
        - y: Array of labels, of shape (N,). y[i] gives the label for X[i].

        Returns:
        If y is None, then run a test-time forward pass of the model and return:
        - scores: Array of shape (N, C) giving classification scores, where
          scores[i, c] is the classification score for X[i] and class c.

        If y is not None, then run a training-time forward and backward pass and
        return a tuple of:
        - loss: Scalar value giving the loss
        - grads: Dictionary with the same keys as self.params, mapping parameter
          names to gradients of the loss with respect to those parameters.
        """
        scores = None
        N = X.shape[0]
        # Unpack variables from the params dictionary
        W1, b1 = self.params['W1'], self.params['b1']
        W2, b2 = self.params['W2'], self.params['b2']
        h1, cache1 = affine_relu_forward(X, W1, b1)
        out, cache2 = affine_forward(h1, W2, b2)
        scores = out              # (N,C)

        # If y is None then we are in test mode so just return scores
        if y is None:
            return scores

        loss, grads = 0, {}
        data_loss, dscores = softmax_loss(scores, y)
        reg_loss = 0.5 * self.reg * np.sum(W1*W1) + 0.5 * self.reg * np.sum(W2*W2)
        loss = data_loss + reg_loss

        # Backward pass: compute gradients
        dh1, dW2, db2 = affine_backward(dscores, cache2)
        dX, dW1, db1 = affine_relu_backward(dh1, cache1)
        # Add the regularization gradient contribution
        dW2 += self.reg * W2
        dW1 += self.reg * W1
        grads['W1'] = dW1
        grads['b1'] = db1
        grads['W2'] = dW2
        grads['b2'] = db2

        return loss, grads

---> layers.py

__coauthor__ = 'Deeplayer'
# 6.22.2016 #

import numpy as np


def affine_forward(x, w, b):
    """
    Computes the forward pass for an affine (fully-connected) layer.

    The input x has shape (N, d_1, ..., d_k) and contains a minibatch of N
    examples, where each example x[i] has shape (d_1, ..., d_k). We will
    reshape each input into a vector of dimension D = d_1 * ... * d_k, and
    then transform it to an output vector of dimension M.

    Inputs:
    - x: A numpy array containing input data, of shape (N, d_1, ..., d_k)
    - w: A numpy array of weights, of shape (D, M)
    - b: A numpy array of biases, of shape (M,) or (1, M)

    Returns a tuple of:
    - out: output, of shape (N, M)
    - cache: (x, w, b)
    """
    out = None
    # Reshape x into rows
    N = x.shape[0]
    x_row = x.reshape(N, -1)         # (N,D)
    out = np.dot(x_row, w) + b       # (N,M)
    cache = (x, w, b)

    return out, cache


def affine_backward(dout, cache):
    """
    Computes the backward pass for an affine layer.

    Inputs:
    - dout: Upstream derivative, of shape (N, M)
    - cache: Tuple of:
      - x: Input data, of shape (N, d_1, ... d_k)
      - w: Weights, of shape (D, M)
      - b: Biases

    Returns a tuple of:
    - dx: Gradient with respect to x, of shape (N, d1, ..., d_k)
    - dw: Gradient with respect to w, of shape (D, M)
    - db: Gradient with respect to b, of shape (1, M), matching the (1, M)
      biases created in fc_net.py
    """
    x, w, b = cache
    dx, dw, db = None, None, None
    dx = np.dot(dout, w.T)                       # (N,D)
    dx = np.reshape(dx, x.shape)                 # (N,d1,...,d_k)
    x_row = x.reshape(x.shape[0], -1)            # (N,D)
    dw = np.dot(x_row.T, dout)                   # (D,M)
    db = np.sum(dout, axis=0, keepdims=True)     # (1,M)

    return dx, dw, db


def relu_forward(x):
    """
    Computes the forward pass for a layer of rectified linear units (ReLUs).

    Input:
    - x: Inputs, of any shape

    Returns a tuple of:
    - out: Output, of the same shape as x
    - cache: x
    """
    out = None
    out = ReLU(x)
    cache = x

    return out, cache


def relu_backward(dout, cache):
    """
    Computes the backward pass for a layer of rectified linear units (ReLUs).

    Input:
    - dout: Upstream derivatives, of any shape
    - cache: Input x, of same shape as dout

    Returns:
    - dx: Gradient with respect to x
    """
    dx, x = None, cache
    # Build a new array instead of zeroing entries of dout in place,
    # so the caller's upstream gradient is left untouched.
    dx = np.where(x > 0, dout, 0)

    return dx


def svm_loss(x, y):
    """
    Computes the loss and gradient for multiclass SVM classification.

    Inputs:
    - x: Input data, of shape (N, C) where x[i, j] is the score for the jth class
      for the ith input.
    - y: Vector of labels, of shape (N,) where y[i] is the label for x[i] and
      0 <= y[i] < C

    Returns a tuple of:
    - loss: Scalar giving the loss
    - dx: Gradient of the loss with respect to x
    """
    N = x.shape[0]
    correct_class_scores = x[np.arange(N), y]
    margins = np.maximum(0, x - correct_class_scores[:, np.newaxis] + 1.0)
    margins[np.arange(N), y] = 0
    loss = np.sum(margins) / N
    num_pos = np.sum(margins > 0, axis=1)
    dx = np.zeros_like(x)
    dx[margins > 0] = 1
    dx[np.arange(N), y] -= num_pos
    dx /= N

    return loss, dx


def softmax_loss(x, y):
    """
    Computes the loss and gradient for softmax classification.

    Inputs:
    - x: Input data, of shape (N, C) where x[i, j] is the score for the jth class
      for the ith input.
    - y: Vector of labels, of shape (N,) where y[i] is the label for x[i] and
      0 <= y[i] < C

    Returns a tuple of:
    - loss: Scalar giving the loss
    - dx: Gradient of the loss with respect to x
    """
    # Subtract the row-wise max before exponentiating for numerical stability
    probs = np.exp(x - np.max(x, axis=1, keepdims=True))
    probs /= np.sum(probs, axis=1, keepdims=True)
    N = x.shape[0]
    loss = -np.sum(np.log(probs[np.arange(N), y])) / N
    dx = probs.copy()
    dx[np.arange(N), y] -= 1
    dx /= N

    return loss, dx


def ReLU(x):
    """ReLU non-linearity."""
    return np.maximum(0, x)
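A common way to check layers like these is to compare the analytic gradients against centered finite differences. The sketch below is self-contained: it restates the affine layer in compact form (with a bias of shape (M,)), and the helper `numerical_gradient` is an illustrative name, not a function from the assignment.

```python
import numpy as np

def affine_forward(x, w, b):
    out = x.reshape(x.shape[0], -1).dot(w) + b
    return out, (x, w, b)

def affine_backward(dout, cache):
    x, w, b = cache
    dx = dout.dot(w.T).reshape(x.shape)
    dw = x.reshape(x.shape[0], -1).T.dot(dout)
    db = dout.sum(axis=0)
    return dx, dw, db

def numerical_gradient(f, x, df, h=1e-5):
    """Centered-difference gradient of sum(f(x) * df) with respect to x."""
    grad = np.zeros_like(x)
    for ix in np.ndindex(*x.shape):
        old = x[ix]
        x[ix] = old + h
        pos = f(x).copy()
        x[ix] = old - h
        neg = f(x).copy()
        x[ix] = old                      # restore the original value
        grad[ix] = np.sum((pos - neg) * df) / (2 * h)
    return grad

rng = np.random.RandomState(0)
x = rng.randn(4, 2, 3)                   # N=4 examples, D=2*3=6
w = rng.randn(6, 5)
b = rng.randn(5)
dout = rng.randn(4, 5)

_, cache = affine_forward(x, w, b)
dx, dw, db = affine_backward(dout, cache)
dx_num = numerical_gradient(lambda x_: affine_forward(x_, w, b)[0], x, dout)
dw_num = numerical_gradient(lambda w_: affine_forward(x, w_, b)[0], w, dout)
max_err = max(np.abs(dx - dx_num).max(), np.abs(dw - dw_num).max())
```

Because the affine layer is linear in each argument, `max_err` should be at the level of floating-point rounding; a large value usually points to a shape or transpose mistake.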

---> optim.py

__coauthor__ = 'Deeplayer'
# 6.22.2016

import numpy as np


def sgd(w, dw, config=None):
    """
    Performs vanilla stochastic gradient descent.

    config format:
    - learning_rate: Scalar learning rate.
    """
    if config is None: config = {}
    config.setdefault('learning_rate', 1e-2)
    w -= config['learning_rate'] * dw

    return w, config


def sgd_momentum(w, dw, config=None):
    """
    Performs stochastic gradient descent with momentum.

    config format:
    - learning_rate: Scalar learning rate.
    - momentum: Scalar between 0 and 1 giving the momentum value.
      Setting momentum = 0 reduces to sgd.
    - velocity: A numpy array of the same shape as w and dw used to store a
      moving average of the gradients.
    """
    if config is None: config = {}
    config.setdefault('learning_rate', 1e-2)
    config.setdefault('momentum', 0.9)
    v = config.get('velocity', np.zeros_like(w))
    next_w = None
    v = config['momentum'] * v - config['learning_rate'] * dw
    next_w = w + v
    config['velocity'] = v

    return next_w, config


def rmsprop(x, dx, config=None):
    """
    Uses the RMSProp update rule, which uses a moving average of squared gradient
    values to set adaptive per-parameter learning rates.

    config format:
    - learning_rate: Scalar learning rate.
    - decay_rate: Scalar between 0 and 1 giving the decay rate for the squared
      gradient cache.
    - epsilon: Small scalar used for smoothing to avoid dividing by zero.
    - cache: Moving average of second moments of gradients.
    """
    if config is None: config = {}
    config.setdefault('learning_rate', 1e-2)
    config.setdefault('decay_rate', 0.99)
    config.setdefault('epsilon', 1e-8)
    config.setdefault('cache', np.zeros_like(x))
    next_x = None
    cache = config['cache']
    decay_rate = config['decay_rate']
    learning_rate = config['learning_rate']
    epsilon = config['epsilon']
    cache = decay_rate * cache + (1 - decay_rate) * (dx**2)
    x += - learning_rate * dx / (np.sqrt(cache) + epsilon)
    config['cache'] = cache
    next_x = x

    return next_x, config


def adam(x, dx, config=None):
    """
    Uses the Adam update rule, which incorporates moving averages of both the
    gradient and its square and a bias correction term.

    config format:
    - learning_rate: Scalar learning rate.
    - beta1: Decay rate for moving average of first moment of gradient.
    - beta2: Decay rate for moving average of second moment of gradient.
    - epsilon: Small scalar used for smoothing to avoid dividing by zero.
    - m: Moving average of gradient.
    - v: Moving average of squared gradient.
    - t: Iteration number.
    """
    if config is None: config = {}
    config.setdefault('learning_rate', 1e-3)
    config.setdefault('beta1', 0.9)
    config.setdefault('beta2', 0.999)
    config.setdefault('epsilon', 1e-8)
    config.setdefault('m', np.zeros_like(x))
    config.setdefault('v', np.zeros_like(x))
    config.setdefault('t', 0)
    next_x = None
    m = config['m']
    v = config['v']
    beta1 = config['beta1']
    beta2 = config['beta2']
    learning_rate = config['learning_rate']
    epsilon = config['epsilon']
    t = config['t']
    t += 1
    m = beta1 * m + (1 - beta1) * dx
    v = beta2 * v + (1 - beta2) * (dx**2)
    m_bias = m / (1 - beta1**t)
    v_bias = v / (1 - beta2**t)
    x += - learning_rate * m_bias / (np.sqrt(v_bias) + epsilon)
    next_x = x
    config['m'] = m
    config['v'] = v
    config['t'] = t

    return next_x, config

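To see how an update rule with this `(w, dw, config)` interface is driven, here is a small sketch that restates the momentum update and runs it on a toy objective. The objective f(w) = 0.5 * ||w||^2 (so the gradient is simply w) and the hyperparameter values are assumptions chosen for illustration only.

```python
import numpy as np

def sgd_momentum(w, dw, config):
    """Momentum update: v = momentum * v - learning_rate * dw; w_next = w + v."""
    v = config.get('velocity', np.zeros_like(w))
    v = config['momentum'] * v - config['learning_rate'] * dw
    config['velocity'] = v
    return w + v, config

# Toy objective (assumption): f(w) = 0.5 * ||w||^2, whose gradient is w itself.
w = np.array([1.0, -2.0])
config = {'learning_rate': 0.1, 'momentum': 0.9}
for _ in range(200):
    dw = w                      # gradient of the toy objective at w
    w, config = sgd_momentum(w, dw, config)
final_norm = np.linalg.norm(w)
```

The velocity is carried between calls through `config`, which is why the caller must pass the returned `config` back in on the next step.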

---> two_layer_fc_net_start.py

__coauthor__ = 'Deeplayer'
# 6.22.2016

import numpy as np
import matplotlib.pyplot as plt
from fc_net import *
from data_utils import get_CIFAR10_data
from solver import Solver

data = get_CIFAR10_data()
model = TwoLayerNet(reg=0.9)
solver = Solver(model, data,
                lr_decay=0.95,
                print_every=100, num_epochs=40, batch_size=400,
                update_rule='sgd_momentum',
                optim_config={'learning_rate': 5e-4, 'momentum': 0.5})
solver.train()

plt.subplot(2, 1, 1)
plt.title('Training loss')
plt.plot(solver.loss_history, 'o')
plt.xlabel('Iteration')
plt.subplot(2, 1, 2)
plt.title('Accuracy')
plt.plot(solver.train_acc_history, '-o', label='train')
plt.plot(solver.val_acc_history, '-o', label='val')
plt.plot([0.5] * len(solver.val_acc_history), 'k--')
plt.xlabel('Epoch')
plt.legend(loc='lower right')
plt.gcf().set_size_inches(15, 12)
plt.show()

best_model = model
y_test_pred = np.argmax(best_model.loss(data['X_test']), axis=1)
y_val_pred = np.argmax(best_model.loss(data['X_val']), axis=1)
print('Validation set accuracy: %s' % (y_val_pred == data['y_val']).mean())
print('Test set accuracy: %s' % (y_test_pred == data['y_test']).mean())
# Validation set accuracy:  about 52.9%
# Test set accuracy:  about 54.7%

# Visualize the weights of the best network
from vis_utils import visualize_grid

def show_net_weights(net):
    W1 = net.params['W1']
    W1 = W1.reshape(3, 32, 32, -1).transpose(3, 1, 2, 0)
    plt.imshow(visualize_grid(W1, padding=3).astype('uint8'))
    plt.gca().axis('off')
    plt.show()

show_net_weights(best_model)

2. Multilayer fully-connected network + Batch Normalization

--> the FullyConnectedNet class in fc_net.py
--> the batchnorm_forward and batchnorm_backward functions in layers.py

---> fc_net.py

__coauthor__ = 'Deeplayer'
# 6.22.2016

import numpy as np
from layer_utils import *


class FullyConnectedNet(object):
    """
    A fully-connected neural network with an arbitrary number of hidden layers,
    ReLU nonlinearities, and a softmax loss function. This will also implement
    dropout and batch normalization as options. For a network with L layers,
    the architecture will be

    {affine - [batch norm] - relu - [dropout]} x (L - 1) - affine - softmax

    where batch normalization and dropout are optional, and the {...} block is
    repeated L - 1 times.

    Similar to the TwoLayerNet above, learnable parameters are stored in the
    self.params dictionary and will be learned using the Solver class.
    """

    def __init__(self, hidden_dims, input_dim=3*32*32, num_classes=10,
                 dropout=0, use_batchnorm=False, reg=0.0,
                 weight_scale=1e-2, dtype=np.float32, seed=None):
        self.use_batchnorm = use_batchnorm
        self.use_dropout = dropout > 0
        self.reg = reg
        self.num_layers = 1 + len(hidden_dims)
        self.dtype = dtype
        self.params = {}

        layers_dims = [input_dim] + hidden_dims + [num_classes]
        for i in range(self.num_layers):
            self.params['W' + str(i+1)] = weight_scale * np.random.randn(layers_dims[i], layers_dims[i+1])
            self.params['b' + str(i+1)] = np.zeros((1, layers_dims[i+1]))
            if self.use_batchnorm and i < len(hidden_dims):
                self.params['gamma' + str(i+1)] = np.ones((1, layers_dims[i+1]))
                self.params['beta' + str(i+1)] = np.zeros((1, layers_dims[i+1]))

        # When using dropout we need to pass a dropout_param dictionary to each
        # dropout layer so that the layer knows the dropout probability and the mode
        # (train / test). You can pass the same dropout_param to each dropout layer.
        self.dropout_param = {}
        if self.use_dropout:
            self.dropout_param = {'mode': 'train', 'p': dropout}
            if seed is not None:
                self.dropout_param['seed'] = seed

        # With batch normalization we need to keep track of running means and
        # variances, so we need to pass a special bn_param object to each batch
        # normalization layer. You should pass self.bn_params[0] to the forward pass
        # of the first batch normalization layer, self.bn_params[1] to the forward
        # pass of the second batch normalization layer, etc.
        self.bn_params = []
        if self.use_batchnorm:
            self.bn_params = [{'mode': 'train'} for i in range(self.num_layers - 1)]

        # Cast all parameters to the correct datatype
        for k, v in self.params.items():
            self.params[k] = v.astype(dtype)

    def loss(self, X, y=None):
        """
        Compute loss and gradient for the fully-connected net.

        Input / output: Same as TwoLayerNet above.
        """
        X = X.astype(self.dtype)
        mode = 'test' if y is None else 'train'

        # Set train/test mode for batchnorm params and dropout param since they
        # behave differently during training and testing.
        if self.dropout_param is not None:
            self.dropout_param['mode'] = mode
        if self.use_batchnorm:
            for bn_param in self.bn_params:
                bn_param['mode'] = mode

        scores = None
        h, cache1, cache2, cache3, bn, out = {}, {}, {}, {}, {}, {}
        out[0] = X

        # Forward pass: compute loss
        for i in range(self.num_layers-1):
            # Unpack variables from the params dictionary
            W, b = self.params['W' + str(i+1)], self.params['b' + str(i+1)]
            if self.use_batchnorm:
                gamma, beta = self.params['gamma' + str(i+1)], self.params['beta' + str(i+1)]
                h[i], cache1[i] = affine_forward(out[i], W, b)
                bn[i], cache2[i] = batchnorm_forward(h[i], gamma, beta, self.bn_params[i])
                out[i+1], cache3[i] = relu_forward(bn[i])
            else:
                out[i+1], cache3[i] = affine_relu_forward(out[i], W, b)

        W, b = self.params['W' + str(self.num_layers)], self.params['b' + str(self.num_layers)]
        scores, cache = affine_forward(out[self.num_layers-1], W, b)

        # If test mode return early
        if mode == 'test':
            return scores

        loss, reg_loss, grads = 0.0, 0.0, {}
        data_loss, dscores = softmax_loss(scores, y)
        for i in range(self.num_layers):
            reg_loss += 0.5 * self.reg * np.sum(self.params['W' + str(i+1)]*self.params['W' + str(i+1)])
        loss = data_loss + reg_loss

        # Backward pass: compute gradients
        dout, dbn, dh = {}, {}, {}
        t = self.num_layers-1
        dout[t], grads['W'+str(t+1)], grads['b'+str(t+1)] = affine_backward(dscores, cache)
        for i in range(t):
            if self.use_batchnorm:
                dbn[t-1-i] = relu_backward(dout[t-i], cache3[t-1-i])
                dh[t-1-i], grads['gamma'+str(t-i)], grads['beta'+str(t-i)] = batchnorm_backward(dbn[t-1-i], cache2[t-1-i])
                dout[t-1-i], grads['W'+str(t-i)], grads['b'+str(t-i)] = affine_backward(dh[t-1-i], cache1[t-1-i])
            else:
                dout[t-1-i], grads['W'+str(t-i)], grads['b'+str(t-i)] = affine_relu_backward(dout[t-i], cache3[t-1-i])

        # Add the regularization gradient contribution
        for i in range(self.num_layers):
            grads['W'+str(i+1)] += self.reg * self.params['W' + str(i+1)]

        return loss, grads

Before giving the code for batchnorm_forward and batchnorm_backward, here are the Batch Normalization algorithm and its backpropagation formulas:

[Figure: Batch Normalization, Algorithm 1]
[Figure: Backpropagating the gradient of the loss ℓ]
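For readers who cannot see the two figures, the forward transform and the backward formulas from the Batch Normalization paper (Ioffe & Szegedy, 2015) can be written out as follows. Here m is the mini-batch size (called N in the code), and the sums run over the examples i = 1, ..., m of one feature.

```latex
\begin{aligned}
\mu_B &= \frac{1}{m}\sum_{i=1}^{m} x_i, \qquad
\sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu_B)^2, \\
\hat{x}_i &= \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad
y_i = \gamma \hat{x}_i + \beta, \\[6pt]
\frac{\partial \ell}{\partial \hat{x}_i} &= \frac{\partial \ell}{\partial y_i}\,\gamma, \\
\frac{\partial \ell}{\partial \sigma_B^2} &= \sum_{i=1}^{m} \frac{\partial \ell}{\partial \hat{x}_i}\,(x_i - \mu_B)\cdot\left(-\tfrac{1}{2}\right)(\sigma_B^2 + \epsilon)^{-3/2}, \\
\frac{\partial \ell}{\partial \mu_B} &= \sum_{i=1}^{m} \frac{\partial \ell}{\partial \hat{x}_i}\cdot\frac{-1}{\sqrt{\sigma_B^2 + \epsilon}}
 + \frac{\partial \ell}{\partial \sigma_B^2}\cdot\frac{\sum_{i=1}^{m} -2\,(x_i - \mu_B)}{m}, \\
\frac{\partial \ell}{\partial x_i} &= \frac{\partial \ell}{\partial \hat{x}_i}\cdot\frac{1}{\sqrt{\sigma_B^2 + \epsilon}}
 + \frac{\partial \ell}{\partial \sigma_B^2}\cdot\frac{2\,(x_i - \mu_B)}{m}
 + \frac{\partial \ell}{\partial \mu_B}\cdot\frac{1}{m}, \\
\frac{\partial \ell}{\partial \gamma} &= \sum_{i=1}^{m} \frac{\partial \ell}{\partial y_i}\,\hat{x}_i, \qquad
\frac{\partial \ell}{\partial \beta} = \sum_{i=1}^{m} \frac{\partial \ell}{\partial y_i}.
\end{aligned}
```

In the code these map onto `dx_normalized`, `dsample_var`, `dsample_mean`, `dx`, `dgamma` and `dbeta` respectively.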

---> layers.py

__coauthor__ = 'Deeplayer'
# 6.22.2016

import numpy as np


def batchnorm_forward(x, gamma, beta, bn_param):
    mode = bn_param['mode']
    eps = bn_param.get('eps', 1e-5)
    momentum = bn_param.get('momentum', 0.9)
    N, D = x.shape
    running_mean = bn_param.get('running_mean', np.zeros(D, dtype=x.dtype))
    running_var = bn_param.get('running_var', np.zeros(D, dtype=x.dtype))
    out, cache = None, None

    if mode == 'train':
        sample_mean = np.mean(x, axis=0, keepdims=True)       # [1,D]
        sample_var = np.var(x, axis=0, keepdims=True)         # [1,D]
        x_normalized = (x - sample_mean) / np.sqrt(sample_var + eps)    # [N,D]
        out = gamma * x_normalized + beta
        cache = (x_normalized, gamma, beta, sample_mean, sample_var, x, eps)
        running_mean = momentum * running_mean + (1 - momentum) * sample_mean
        running_var = momentum * running_var + (1 - momentum) * sample_var
    elif mode == 'test':
        x_normalized = (x - running_mean) / np.sqrt(running_var + eps)
        out = gamma * x_normalized + beta
    else:
        raise ValueError('Invalid forward batchnorm mode "%s"' % mode)

    # Store the updated running means back into bn_param
    bn_param['running_mean'] = running_mean
    bn_param['running_var'] = running_var

    return out, cache


def batchnorm_backward(dout, cache):
    dx, dgamma, dbeta = None, None, None
    x_normalized, gamma, beta, sample_mean, sample_var, x, eps = cache
    N, D = x.shape
    dx_normalized = dout * gamma       # [N,D]
    x_mu = x - sample_mean             # [N,D]
    sample_std_inv = 1.0 / np.sqrt(sample_var + eps)    # [1,D]
    dsample_var = -0.5 * np.sum(dx_normalized * x_mu, axis=0, keepdims=True) * sample_std_inv**3
    dsample_mean = -1.0 * np.sum(dx_normalized * sample_std_inv, axis=0, keepdims=True) - \
                   2.0 * dsample_var * np.mean(x_mu, axis=0, keepdims=True)
    dx1 = dx_normalized * sample_std_inv
    dx2 = 2.0/N * dsample_var * x_mu
    dx = dx1 + dx2 + 1.0/N * dsample_mean
    dgamma = np.sum(dout * x_normalized, axis=0, keepdims=True)
    dbeta = np.sum(dout, axis=0, keepdims=True)

    return dx, dgamma, dbeta
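A quick sanity check for the train-mode forward pass is that each output feature ends up with mean close to beta and standard deviation close to gamma, whatever the input statistics were. The sketch below recomputes the transform inline on synthetic data; the input distribution and the gamma/beta values are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.RandomState(0)
x = 5.0 + 3.0 * rng.randn(200, 4)        # features with mean ~5 and std ~3
gamma = np.full(4, 2.0)
beta = np.full(4, -1.0)
eps = 1e-5

# Same computation as the 'train' branch of batchnorm_forward
sample_mean = x.mean(axis=0)
sample_var = x.var(axis=0)
x_normalized = (x - sample_mean) / np.sqrt(sample_var + eps)
out = gamma * x_normalized + beta

out_mean = out.mean(axis=0)
out_std = out.std(axis=0)
```

Here `out_mean` should be close to -1 and `out_std` close to 2 for every feature, up to the small effect of `eps`.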

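Once the code is written, we can use Batch Normalization.ipynb to check it for errors. Below I report the performance of a 6-layer network on CIFAR-10 with Batch Normalization. As one would expect, the 6-layer network should not do much better than the 2-layer network (because of problem 1 that I mentioned at the end of Assignment 1).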

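Before that, let us look at how well Batch Normalization alleviates vanishing gradients, and how this varies across different weight_scales. We test 6-layer networks with sigmoid and with ReLU as the activation function: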

---> batchnorm_and_weight_scales.py

__coauthor__ = 'Deeplayer'
# 6.22.2016 #

import numpy as np
from fc_net import *
from solver import *
import matplotlib.pyplot as plt
from data_utils import get_CIFAR10_data

# Load the (preprocessed) CIFAR10 data.
data = get_CIFAR10_data()
hidden_dims = [100, 100, 100, 100, 100]
num_train = 5000
small_data = {
    'X_train': data['X_train'][:num_train],
    'y_train': data['y_train'][:num_train],
    'X_val': data['X_val'],
    'y_val': data['y_val'],
}

bn_solvers = {}
solvers = {}
weight_scales = np.logspace(-4, 0, num=20)
for i, weight_scale in enumerate(weight_scales):
    print('Running weight scale %d / %d' % (i + 1, len(weight_scales)))
    bn_model = FullyConnectedNet(hidden_dims, weight_scale=weight_scale, use_batchnorm=True)
    model = FullyConnectedNet(hidden_dims, weight_scale=weight_scale, use_batchnorm=False)

    bn_solver = Solver(bn_model, small_data,
                       num_epochs=10, batch_size=100,
                       update_rule='adam',
                       optim_config={'learning_rate': 1e-3, },
                       verbose=False, print_every=1000)
    bn_solver.train()
    bn_solvers[weight_scale] = bn_solver

    solver = Solver(model, small_data,
                    num_epochs=10, batch_size=100,
                    update_rule='adam',
                    optim_config={'learning_rate': 1e-3, },
                    verbose=False, print_every=1000)
    solver.train()
    solvers[weight_scale] = solver

# Plot results of weight scale experiment
best_train_accs, bn_best_train_accs = [], []
best_val_accs, bn_best_val_accs = [], []
final_train_loss, bn_final_train_loss = [], []
for ws in weight_scales:
    best_train_accs.append(max(solvers[ws].train_acc_history))
    bn_best_train_accs.append(max(bn_solvers[ws].train_acc_history))
    best_val_accs.append(max(solvers[ws].val_acc_history))
    bn_best_val_accs.append(max(bn_solvers[ws].val_acc_history))
    final_train_loss.append(np.mean(solvers[ws].loss_history[-100:]))
    bn_final_train_loss.append(np.mean(bn_solvers[ws].loss_history[-100:]))

plt.subplot(3, 1, 1)
plt.title('Best val accuracy vs weight initialization scale')
plt.xlabel('Weight initialization scale')
plt.ylabel('Best val accuracy')
plt.semilogx(weight_scales, best_val_accs, '-o', label='baseline')
plt.semilogx(weight_scales, bn_best_val_accs, '-o', label='batchnorm')
plt.legend(ncol=2, loc='lower right')

plt.subplot(3, 1, 2)
plt.title('Best train accuracy vs weight initialization scale')
plt.xlabel('Weight initialization scale')
plt.ylabel('Best training accuracy')
plt.semilogx(weight_scales, best_train_accs, '-o', label='baseline')
plt.semilogx(weight_scales, bn_best_train_accs, '-o', label='batchnorm')
plt.legend(loc='upper left')

plt.subplot(3, 1, 3)
plt.title('Final training loss vs weight initialization scale')
plt.xlabel('Weight initialization scale')
plt.ylabel('Final training loss')
plt.semilogx(weight_scales, final_train_loss, '-o', label='baseline')
plt.semilogx(weight_scales, bn_final_train_loss, '-o', label='batchnorm')
plt.legend(loc='upper left')

plt.gcf().set_size_inches(10, 15)
plt.show()
[Figure: Activation function: sigmoid]
[Figure: Activation function: ReLU]


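1) Batch Normalization largely alleviates the saturation problem of sigmoid (the vanishing gradient problem) that troubled the research community for more than a decade, bravo! If the results above do not feel direct enough, here are the weight gradient values for each layer: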

[Figure: Left: without Batch Normalization --- Right: with Batch Normalization]

3) If weight_scale is chosen well, Batch Normalization improves accuracy only slightly when the activation function is ReLU.

· Validation set accuracy: 0.554
· Test set accuracy: 0.54

3. Dropout

--> modify fc_net.py to add dropout
--> the dropout_forward and dropout_backward functions in layers.py


[Figure: from CS231n Convolutional Neural Networks for Visual Recognition]



__coauthor__ = 'Deeplayer'
# 6.22.2016 #

    def loss(self, X, y=None):
        """
        Compute loss and gradient for the fully-connected net.

        Input / output: Same as TwoLayerNet above.
        """
        X = X.astype(self.dtype)
        mode = 'test' if y is None else 'train'

        # Set train/test mode for batchnorm params and dropout param since they
        # behave differently during training and testing.
        if self.dropout_param is not None:
            self.dropout_param['mode'] = mode
        if self.use_batchnorm:
            for bn_param in self.bn_params:
                bn_param['mode'] = mode

        scores = None
        h, cache1, cache2, cache3, cache4, bn, out = {}, {}, {}, {}, {}, {}, {}
        out[0] = X

        # Forward pass: compute loss
        for i in range(self.num_layers-1):
            # Unpack variables from the params dictionary
            W, b = self.params['W' + str(i+1)], self.params['b' + str(i+1)]
            if self.use_batchnorm:
                gamma, beta = self.params['gamma' + str(i+1)], self.params['beta' + str(i+1)]
                h[i], cache1[i] = affine_forward(out[i], W, b)
                bn[i], cache2[i] = batchnorm_forward(h[i], gamma, beta, self.bn_params[i])
                out[i+1], cache3[i] = relu_forward(bn[i])
            else:
                out[i+1], cache3[i] = affine_relu_forward(out[i], W, b)
            # Dropout follows the ReLU in both branches
            if self.use_dropout:
                out[i+1], cache4[i] = dropout_forward(out[i+1], self.dropout_param)

        W, b = self.params['W' + str(self.num_layers)], self.params['b' + str(self.num_layers)]
        scores, cache = affine_forward(out[self.num_layers-1], W, b)

        # If test mode return early
        if mode == 'test':
            return scores

        loss, reg_loss, grads = 0.0, 0.0, {}
        data_loss, dscores = softmax_loss(scores, y)
        for i in range(self.num_layers):
            reg_loss += 0.5 * self.reg * np.sum(self.params['W' + str(i+1)]*self.params['W' + str(i+1)])
        loss = data_loss + reg_loss

        # Backward pass: compute gradients
        dout, dbn, dh, ddrop = {}, {}, {}, {}
        t = self.num_layers-1
        dout[t], grads['W'+str(t+1)], grads['b'+str(t+1)] = affine_backward(dscores, cache)
        for i in range(t):
            # Undo dropout first, since it was the last op of the block in the forward pass
            if self.use_dropout:
                ddrop[t-1-i] = dropout_backward(dout[t-i], cache4[t-1-i])
                dout[t-i] = ddrop[t-1-i]
            if self.use_batchnorm:
                dbn[t-1-i] = relu_backward(dout[t-i], cache3[t-1-i])
                dh[t-1-i], grads['gamma'+str(t-i)], grads['beta'+str(t-i)] = batchnorm_backward(dbn[t-1-i], cache2[t-1-i])
                dout[t-1-i], grads['W'+str(t-i)], grads['b'+str(t-i)] = affine_backward(dh[t-1-i], cache1[t-1-i])
            else:
                dout[t-1-i], grads['W'+str(t-i)], grads['b'+str(t-i)] = affine_relu_backward(dout[t-i], cache3[t-1-i])

        # Add the regularization gradient contribution
        for i in range(self.num_layers):
            grads['W'+str(i+1)] += self.reg * self.params['W' + str(i+1)]

        return loss, grads

---> the dropout_forward and dropout_backward functions in layers.py

__coauthor__ = 'Deeplayer'
# 6.22.2016 #

def dropout_forward(x, dropout_param):
    # Note: in this implementation p is used as the probability of KEEPING a unit.
    p, mode = dropout_param['p'], dropout_param['mode']
    if 'seed' in dropout_param:
        np.random.seed(dropout_param['seed'])

    mask = None
    out = None
    if mode == 'train':
        # Inverted dropout: scale kept units by 1/p so test time needs no rescaling
        mask = (np.random.rand(*x.shape) < p) / p
        out = x * mask
    elif mode == 'test':
        out = x

    cache = (dropout_param, mask)
    out = out.astype(x.dtype, copy=False)

    return out, cache


def dropout_backward(dout, cache):
    dropout_param, mask = cache
    mode = dropout_param['mode']
    dx = None
    if mode == 'train':
        dx = dout * mask
    elif mode == 'test':
        dx = dout

    return dx
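The point of dividing the mask by p (inverted dropout) is that the expected activation stays the same in train mode, so the test-time pass can leave the input untouched. The sketch below checks this on an all-ones input, treating p as the keep probability as the code above does; the array size and seed are arbitrary.

```python
import numpy as np

rng = np.random.RandomState(0)
p = 0.5                                   # keep probability, as in dropout_forward
x = np.ones((1000, 100))

mask = (rng.rand(*x.shape) < p) / p       # kept units are scaled by 1/p
out = x * mask

kept_fraction = (mask > 0).mean()         # should be close to p
mean_activation = out.mean()              # should be close to x.mean() == 1.0
```

About half of the units are zeroed, while the mean of `out` stays near 1, which is the property that lets test mode simply return `x`.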


[Figure: Dropout vs overfitting]

Part 2: Convolutional Neural Networks (CNNs)

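We now turn to the core of this course: convolutional neural networks. For visual recognition tasks, CNNs are without doubt the standout approach. Compared with the fully-connected networks discussed earlier, where do CNNs have the edge? I would list the following points: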

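1) Weight sharing and local connectivity (receptive fields) make CNNs more similar to biological neural networks: neurons in the visual cortex receive information locally (that is, they respond only to stimuli within particular receptive fields);
2) When images are fairly large (e.g. 96x96, 224x224, 384x384, 512x512), a fully-connected network has to train an enormous number of parameters (weights and biases), which makes computation very slow and also leads to more severe overfitting. Weight sharing and local connectivity cut the number of trainable parameters in a CNN drastically (by orders of magnitude);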
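The parameter savings in point 2) can be made concrete with a small count. The sketch below compares one fully-connected layer with one convolutional layer on a 224x224x3 input; the layer sizes (100 hidden units, 64 filters of size 3x3) and the helper names are assumptions picked for illustration.

```python
def fc_params(height, width, channels, hidden_units):
    """Weights plus biases of a fully-connected layer on a flattened image."""
    return height * width * channels * hidden_units + hidden_units

def conv_params(filter_size, in_channels, num_filters):
    """Weights plus biases of a conv layer; independent of the image size."""
    return filter_size * filter_size * in_channels * num_filters + num_filters

fc = fc_params(224, 224, 3, 100)      # 15,052,900 parameters
conv = conv_params(3, 3, 64)          # 1,792 parameters
ratio = fc / float(conv)
```

Under these assumptions the fully-connected layer needs several thousand times more parameters, and the gap grows with the image size because the conv layer's count does not depend on it.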


CS231n Convolutional Neural Networks for Visual Recognition.png

1. Convolutional Layer


CS231n Convolutional Neural Networks for Visual Recognition.gif

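In the animation, you will notice that a ring of zeros has been added around the image and that the filter moves with a stride of 2. This padding operation is called zero-padding. Let p be the number of rings of zeros and s the stride of the filter; the side length of the output convolved feature (also called the activation map) is then L = (input_dim - k + 2p)/s + 1, and the output feature has dimensions LxLxn/c. Zero-padding is used so that the filter slides across the input from start to end with an exact fit, i.e. so that the division in the formula above comes out even. p, s and n are three hyperparameters that we have to set in advance. As for the stride s: the smaller it is, the richer the extracted information but the higher the computational cost; the larger it is, the lower the cost but the less information is extracted. The usual choice is s = 1.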
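
The output-size formula can be checked with a few lines of code. This is a minimal sketch, assuming k is the side length of the filter; `conv_output_size` is a hypothetical helper name:

```python
def conv_output_size(input_dim, k, p, s):
    """Side length L = (input_dim - k + 2p)/s + 1 of the activation map."""
    numerator = input_dim - k + 2 * p
    if numerator % s != 0:
        # The filter would not fit an exact number of times across the input.
        raise ValueError("stride does not divide (input_dim - k + 2p) evenly")
    return numerator // s + 1

# 7x7 input, 3x3 filter, padding 1, stride 2 (the setup used in Part 3)
print(conv_output_size(7, 3, 1, 2))
# 32x32 input, 7x7 filter, padding 3, stride 1 keeps the spatial size
print(conv_output_size(32, 7, 3, 1))
```

With stride 1 and p = (k - 1)/2 the output keeps the input's spatial size, which is the padding used by the ThreeLayerConvNet later in this post.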

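---> PS: Why does convolution work?

2. Pooling Layer

The layer after a convolutional layer is a pooling layer, but note that the output of the convolutional layer first passes through an activation function (such as ReLU) before it enters the pooling layer. The pooling layer further reduces the dimensionality of the conv layer's output, which cuts the number of parameters downstream and the amount of computation. Concretely, the conv output is split into non-overlapping sub-regions, and for each sub-region we take the maximum, the average, or the L2 norm. Taking max pooling as the example (max pooling tends to work better, so it is the usual choice), here is a diagram: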

CS231n Convolutional Neural Networks for Visual Recognition.png
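
The same operation in code: a minimal sketch of 2x2 max pooling with stride 2 on a single 4x4 channel. The input values are made up for illustration, and the reshape trick only applies to non-overlapping windows that tile the input exactly:

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2 on an (H, W) array with even H and W."""
    h, w = x.shape
    # Axis 1 and axis 3 index the position inside each 2x2 block.
    blocks = x.reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

x = np.array([[1, 1, 2, 4],
              [5, 6, 7, 8],
              [3, 2, 1, 0],
              [1, 2, 3, 4]])
pooled = max_pool_2x2(x)
print(pooled)
# [[6 8]
#  [3 4]]
```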


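Some people argue that pooling layers are not necessary, e.g. Striving for Simplicity: The All Convolutional Net. Others have found that removing pooling layers matters for generative models such as variational autoencoders (VAEs) and generative adversarial networks (GANs). Pooling layers may well become less common, or disappear, in future architectures.

3. Fully-connected Layer


4. CNN Architectures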


INPUT --> [[CONV --> RELU]*N --> POOL?]*M --> [FC --> RELU]*K --> FC(OUTPUT)


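Now let's compare three stacked 3x3 conv layers with a single 7x7 conv layer. As the figure below shows, both end up with an activation map of the same size, but the three 3x3 layers are clearly better (fewer parameters, and three nonlinearities instead of one):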

3_3x3 VS 1_7x7.png
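
A small sketch that makes the comparison concrete. It assumes stride 1 everywhere and that every layer maps C channels to C channels (C = 64 is a made-up value); the function names are hypothetical:

```python
def stacked_receptive_field(kernel_sizes):
    """Receptive field of a stack of stride-1 convolutions."""
    rf = 1
    for k in kernel_sizes:
        rf += k - 1   # each layer widens the field by k - 1 pixels
    return rf

def stacked_weight_count(kernel_sizes, channels):
    """Weights (biases ignored) when every layer maps `channels` to `channels`."""
    return sum(k * k * channels * channels for k in kernel_sizes)

C = 64  # assumed channel count, for illustration only
print(stacked_receptive_field([3, 3, 3]), stacked_receptive_field([7]))
print(stacked_weight_count([3, 3, 3], C), stacked_weight_count([7], C))
```

Both options see a 7x7 region of the input, but the stack needs 27*C*C weights against 49*C*C for the single large filter.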


A simple CNNs architecture.png


· INPUT --> FC/OUT      (this is really just a linear classifier)
· INPUT --> CONV --> RELU --> FC/OUT
· INPUT --> [CONV --> RELU --> POOL]*2 --> FC --> RELU --> FC/OUT
· INPUT --> [CONV --> RELU --> CONV --> RELU --> POOL]*3 --> [FC --> RELU]*2 --> FC/OUT
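
The patterns above are all instances of the general template INPUT --> [[CONV --> RELU]*N --> POOL?]*M --> [FC --> RELU]*K --> FC. A minimal sketch that expands the template into a layer list (`expand_architecture` is a hypothetical helper written only for illustration):

```python
def expand_architecture(n, m, k, pool=True):
    """Expand INPUT -> [[CONV -> RELU]*n -> POOL?]*m -> [FC -> RELU]*k -> FC/OUT."""
    layers = ["INPUT"]
    for _ in range(m):
        layers += ["CONV", "RELU"] * n
        if pool:
            layers.append("POOL")
    layers += ["FC", "RELU"] * k
    layers.append("FC/OUT")
    return layers

print(" --> ".join(expand_architecture(n=0, m=0, k=0)))              # linear classifier
print(" --> ".join(expand_architecture(n=1, m=1, k=0, pool=False)))
print(" --> ".join(expand_architecture(n=1, m=2, k=1)))
print(" --> ".join(expand_architecture(n=2, m=3, k=2)))
```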

---> PS:

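2. In real projects we have to estimate memory use first and then choose reasonable settings to fit it. For example, with a 224x224x3 input image and 64 filters of size 3x3 with zero-padding 1, each image needs about 72MB of memory (this 72MB covers the image together with the corresponding parameters, gradients and activations). On a GPU that may be too much (GPU memory is much smaller than CPU memory), so the settings need adjusting, for example a 7x7 filter with stride 2 (ZF Net), or an 11x11 filter with stride 4 (AlexNet).


  • the large number of activations and intermediate gradients;
  • the parameters, their gradients during backpropagation, and the caches kept by momentum, Adagrad or RMSProp all take up storage, so when estimating the memory used by parameters you should generally multiply by at least 3;
  • the data batches and other miscellaneous or bookkeeping information also consume some memory.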
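
A rough back-of-the-envelope sketch of these estimates, assuming float32 values (4 bytes each). The text does not say how many layers the 72MB figure covers; the "three stacked conv layers, each storing activations and gradients" below is my assumption, chosen because it lands in the same range:

```python
MIB = 1024 ** 2

def activation_bytes(h, w, channels, bytes_per_value=4):
    """Memory for one activation volume, assuming float32 by default."""
    return h * w * channels * bytes_per_value

def parameter_state_bytes(num_params, copies=3, bytes_per_value=4):
    """Parameters, their gradients and one optimizer cache: roughly 3 copies."""
    return num_params * copies * bytes_per_value

# One 224x224x64 output volume (64 filters, 3x3, padding 1, stride 1)
one_volume = activation_bytes(224, 224, 64)
print(one_volume / MIB)            # 12.25

# Assumption: three such layers, each keeping activations and gradients
stack = 3 * 2 * one_volume
print(stack / MIB)                 # 73.5

# Parameters of the first layer are tiny by comparison
first_layer_params = 3 * 3 * 3 * 64 + 64
print(parameter_state_bytes(first_layer_params))   # 21504 bytes
```

In the early layers the activations dominate, which is why a larger stride in the first layer cuts memory so effectively.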

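· LeNet: the earliest successfully applied convolutional network, proposed by Yann LeCun in the LeNet paper.
· AlexNet: the convolutional network that beat the runner-up by a wide margin in ILSVRC 2012 and set off the deep learning wave.
· ZF Net: winner of ILSVRC 2013; it tuned AlexNet's architecture hyperparameters and enlarged the middle conv layers.
· GoogLeNet: winner of ILSVRC 2014; it drastically reduced the number of parameters (from 60M to 4M).
· VGGNet: ILSVRC 2014; it showed that the depth of a CNN is critical to its final performance.
· ResNet: winner of ILSVRC 2015 and the state-of-the-art model as of May 10, 2016. Kaiming He et al. recently proposed an improved version, Identity Mappings in Deep Residual Networks.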

From Kaiming He's ICML16 tutorial

Part 3: Python Programming Task (3-layer CNNs)

---> conv_forward_naive
---> conv_backward_naive
---> max_pool_forward_naive
---> max_pool_backward_naive

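Before giving the code for the conv layer, let's first understand how its forward and backward passes are actually computed. To keep things simple, suppose the first image in a batch is x[0, :, :, :], with three RGB channels of size 7x7 each, padding 1 and stride 2, so that after padding x[0, :, :, :] has size 3x9x9 (1x3x9x9 with the batch axis kept). Suppose also that there are 3 filters of size 3x3 each, and let w hold the weights of all filters (for example, the first channel of the first filter is w[0, 0, :, :]); the bias b has size 1x3; and the activation maps are denoted out, with size 3x4x4 (for example, the first map is out[0, :, :]).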
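
The setup just described can be reproduced directly. This is a minimal sketch with random values standing in for the image and filters; the bias is stored as a flat vector of length 3 here rather than 1x3, purely for convenient indexing:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((1, 3, 7, 7))   # one RGB image, 7x7 per channel
w = rng.standard_normal((3, 3, 3, 3))   # 3 filters, each 3 channels x 3x3
b = np.zeros(3)                         # one bias per filter (flattened 1x3)
pad, stride = 1, 2

x_padded = np.pad(x, ((0, 0), (0, 0), (pad, pad), (pad, pad)), mode="constant")
print(x_padded.shape)                   # (1, 3, 9, 9)

out = np.zeros((3, 4, 4))               # 3 activation maps of size 4x4
for f in range(3):
    for j in range(4):
        for k in range(4):
            # 3x3x3 window of the padded image seen by output pixel (j, k)
            window = x_padded[0, :, j*stride:j*stride+3, k*stride:k*stride+3]
            out[f, j, k] = np.sum(window * w[f]) + b[f]
print(out.shape)                        # (3, 4, 4)
```

Each output value is a dot product between one filter and one window across all three channels, plus that filter's bias.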




__coauthor__ = 'Deeplayer'
# 6.25.2016 #

import numpy as np

def conv_forward_naive(x, w, b, conv_param):
    stride, pad = conv_param['stride'], conv_param['pad']
    N, C, H, W = x.shape
    F, C, HH, WW = w.shape
    x_padded = np.pad(x, ((0, 0), (0, 0), (pad, pad), (pad, pad)), mode='constant')
    # Integer division keeps the output sizes as ints under Python 3 too
    H_new = 1 + (H + 2 * pad - HH) // stride
    W_new = 1 + (W + 2 * pad - WW) // stride
    s = stride
    out = np.zeros((N, F, H_new, W_new))
    for i in range(N):          # ith image
        for f in range(F):      # fth filter
            for j in range(H_new):
                for k in range(W_new):
                    out[i, f, j, k] = np.sum(x_padded[i, :, j*s:HH+j*s, k*s:WW+k*s] * w[f]) + b[f]
    cache = (x, w, b, conv_param)
    return out, cache

def conv_backward_naive(dout, cache):
    x, w, b, conv_param = cache
    pad = conv_param['pad']
    stride = conv_param['stride']
    F, C, HH, WW = w.shape
    N, C, H, W = x.shape
    H_new = 1 + (H + 2 * pad - HH) // stride
    W_new = 1 + (W + 2 * pad - WW) // stride
    dx = np.zeros_like(x)
    dw = np.zeros_like(w)
    db = np.zeros_like(b)
    s = stride
    x_padded = np.pad(x, ((0, 0), (0, 0), (pad, pad), (pad, pad)), 'constant')
    dx_padded = np.pad(dx, ((0, 0), (0, 0), (pad, pad), (pad, pad)), 'constant')
    for i in range(N):          # ith image
        for f in range(F):      # fth filter
            for j in range(H_new):
                for k in range(W_new):
                    window = x_padded[i, :, j*s:HH+j*s, k*s:WW+k*s]
                    db[f] += dout[i, f, j, k]
                    dw[f] += window * dout[i, f, j, k]
                    dx_padded[i, :, j*s:HH+j*s, k*s:WW+k*s] += w[f] * dout[i, f, j, k]
    # Unpad
    dx = dx_padded[:, :, pad:pad+H, pad:pad+W]
    return dx, dw, db



__coauthor__ = 'Deeplayer'
# 6.25.2016 #

import numpy as np

def max_pool_forward_naive(x, pool_param):
    HH, WW = pool_param['pool_height'], pool_param['pool_width']
    s = pool_param['stride']
    N, C, H, W = x.shape
    # Integer division keeps the output sizes as ints under Python 3 too
    H_new = 1 + (H - HH) // s
    W_new = 1 + (W - WW) // s
    out = np.zeros((N, C, H_new, W_new))
    for i in range(N):
        for j in range(C):
            for k in range(H_new):
                for l in range(W_new):
                    window = x[i, j, k*s:HH+k*s, l*s:WW+l*s]
                    out[i, j, k, l] = np.max(window)
    cache = (x, pool_param)
    return out, cache

def max_pool_backward_naive(dout, cache):
    x, pool_param = cache
    HH, WW = pool_param['pool_height'], pool_param['pool_width']
    s = pool_param['stride']
    N, C, H, W = x.shape
    H_new = 1 + (H - HH) // s
    W_new = 1 + (W - WW) // s
    dx = np.zeros_like(x)
    for i in range(N):
        for j in range(C):
            for k in range(H_new):
                for l in range(W_new):
                    window = x[i, j, k*s:HH+k*s, l*s:WW+l*s]
                    m = np.max(window)
                    # Accumulate so overlapping windows (stride < pool size) are not overwritten
                    dx[i, j, k*s:HH+k*s, l*s:WW+l*s] += (window == m) * dout[i, j, k, l]
    return dx



Naive vs Fast.png
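
One common way to speed up a convolution is the im2col idea: gather every receptive field into a row and let a single matrix multiply handle all images and filters at once. The sketch below is a simplified NumPy illustration of that idea under my own naming; it is not necessarily the implementation behind the figure above, and it still loops over output positions:

```python
import numpy as np

def conv_forward_loops(x, w, b, stride, pad):
    """Reference version with explicit loops over images and filters."""
    N, C, H, W = x.shape
    F, _, HH, WW = w.shape
    xp = np.pad(x, ((0, 0), (0, 0), (pad, pad), (pad, pad)), mode="constant")
    H_out = 1 + (H + 2 * pad - HH) // stride
    W_out = 1 + (W + 2 * pad - WW) // stride
    out = np.zeros((N, F, H_out, W_out))
    for n in range(N):
        for f in range(F):
            for i in range(H_out):
                for j in range(W_out):
                    win = xp[n, :, i*stride:i*stride+HH, j*stride:j*stride+WW]
                    out[n, f, i, j] = np.sum(win * w[f]) + b[f]
    return out

def conv_forward_im2col(x, w, b, stride, pad):
    """Same result, with images and filters handled by one matrix multiply."""
    N, C, H, W = x.shape
    F, _, HH, WW = w.shape
    xp = np.pad(x, ((0, 0), (0, 0), (pad, pad), (pad, pad)), mode="constant")
    H_out = 1 + (H + 2 * pad - HH) // stride
    W_out = 1 + (W + 2 * pad - WW) // stride
    # cols[n, i, j, :] is the flattened receptive field of output pixel (i, j)
    cols = np.zeros((N, H_out, W_out, C * HH * WW))
    for i in range(H_out):
        for j in range(W_out):
            win = xp[:, :, i*stride:i*stride+HH, j*stride:j*stride+WW]
            cols[:, i, j, :] = win.reshape(N, -1)
    out = cols.dot(w.reshape(F, -1).T) + b      # shape (N, H_out, W_out, F)
    return out.transpose(0, 3, 1, 2)            # back to (N, F, H_out, W_out)

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 3, 7, 7))
w = rng.standard_normal((4, 3, 3, 3))
b = rng.standard_normal(4)
slow = conv_forward_loops(x, w, b, stride=2, pad=1)
fast = conv_forward_im2col(x, w, b, stride=2, pad=1)
print(slow.shape, np.allclose(slow, fast))
```

The gain comes from moving the inner work into optimized matrix routines instead of Python loops.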


__coauthor__ = 'Deeplayer'
# 6.25.2016 #

import numpy as np
from layer_utils import *

class ThreeLayerConvNet(object):
    """
    A three-layer convolutional network with the following architecture:
        conv - relu - 2x2 max pool - affine - relu - affine - softmax
    """
    def __init__(self, input_dim=(3, 32, 32), num_filters=32, filter_size=7,
                 hidden_dim=100, num_classes=10, weight_scale=1e-3, reg=0.0,
                 dtype=np.float32):
        self.params = {}
        self.reg = reg
        self.dtype = dtype
        # Initialize weights and biases
        C, H, W = input_dim
        self.params['W1'] = weight_scale * np.random.randn(num_filters, C, filter_size, filter_size)
        self.params['b1'] = np.zeros((1, num_filters))
        # The conv layer keeps H x W; the 2x2 pool halves both, hence H*W//4
        self.params['W2'] = weight_scale * np.random.randn(num_filters*H*W//4, hidden_dim)
        self.params['b2'] = np.zeros((1, hidden_dim))
        self.params['W3'] = weight_scale * np.random.randn(hidden_dim, num_classes)
        self.params['b3'] = np.zeros((1, num_classes))
        for k, v in self.params.items():
            self.params[k] = v.astype(dtype)

    def loss(self, X, y=None):
        W1, b1 = self.params['W1'], self.params['b1']
        W2, b2 = self.params['W2'], self.params['b2']
        W3, b3 = self.params['W3'], self.params['b3']
        # pass conv_param to the forward pass for the convolutional layer
        filter_size = W1.shape[2]
        conv_param = {'stride': 1, 'pad': (filter_size - 1) // 2}
        # pass pool_param to the forward pass for the max-pooling layer
        pool_param = {'pool_height': 2, 'pool_width': 2, 'stride': 2}
        # compute the forward pass
        a1, cache1 = conv_relu_pool_forward(X, W1, b1, conv_param, pool_param)
        a2, cache2 = affine_relu_forward(a1, W2, b2)
        scores, cache3 = affine_forward(a2, W3, b3)
        if y is None:
            return scores
        # compute the backward pass
        data_loss, dscores = softmax_loss(scores, y)
        da2, dW3, db3 = affine_backward(dscores, cache3)
        da1, dW2, db2 = affine_relu_backward(da2, cache2)
        dX, dW1, db1 = conv_relu_pool_backward(da1, cache1)
        # Add regularization
        dW1 += self.reg * W1
        dW2 += self.reg * W2
        dW3 += self.reg * W3
        reg_loss = 0.5 * self.reg * sum(np.sum(W * W) for W in [W1, W2, W3])
        loss = data_loss + reg_loss
        grads = {'W1': dW1, 'b1': db1, 'W2': dW2, 'b2': db2, 'W3': dW3, 'b3': db3}
        return loss, grads


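3) The spatial_batchnorm_forward and spatial_batchnorm_backward functions in layers.py. Before giving the code, here is a figure to help understand how Batch Normalization in CNNs computes the mean and the standard deviation (std) for a convolutional layer: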

ConvNet Batch Normalization.png
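
In other words, each channel gets one mean and one std, computed over the batch and both spatial axes. A minimal sketch on random data (the shapes are made up for illustration) showing that this matches the transpose-and-reshape trick used in the code below:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 3, 5, 5))    # hypothetical (N, C, H, W) activations
N, C, H, W = x.shape

# Option 1: statistics taken directly over batch and spatial axes
mean_direct = x.mean(axis=(0, 2, 3))
std_direct = x.std(axis=(0, 2, 3))

# Option 2: move channels last and flatten to (N*H*W, C), then reuse the
# ordinary per-feature statistics of fully-connected batch normalization
x_flat = x.transpose(0, 2, 3, 1).reshape(N * H * W, C)
mean_flat = x_flat.mean(axis=0)
std_flat = x_flat.std(axis=0)

print(np.allclose(mean_direct, mean_flat), np.allclose(std_direct, std_flat))

# Normalizing with these per-channel statistics (gamma=1, beta=0 assumed)
eps = 1e-5
x_hat = (x - mean_direct.reshape(1, C, 1, 1)) / np.sqrt(std_direct.reshape(1, C, 1, 1) ** 2 + eps)
```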


__coauthor__ = 'Deeplayer'
# 6.25.2016 #

def spatial_batchnorm_forward(x, gamma, beta, bn_param):
    N, C, H, W = x.shape
    # Move channels last and flatten so each channel becomes one feature column
    x_new = x.transpose(0, 2, 3, 1).reshape(N*H*W, C)
    out, cache = batchnorm_forward(x_new, gamma, beta, bn_param)
    out = out.reshape(N, H, W, C).transpose(0, 3, 1, 2)
    return out, cache

def spatial_batchnorm_backward(dout, cache):
    N, C, H, W = dout.shape
    dout_new = dout.transpose(0, 2, 3, 1).reshape(N*H*W, C)
    dx, dgamma, dbeta = batchnorm_backward(dout_new, cache)
    dx = dx.reshape(N, H, W, C).transpose(0, 3, 1, 2)
    return dx, dgamma, dbeta


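Taking the ThreeLayerConvNet completed above as the example, let's compare convergence with and without Batch Normalization. The results in the figure below show that Batch Normalization clearly speeds up convergence, so training is much faster (fewer epochs are needed):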

with BN --vs-- without BN.png

---> PS:
1. Data Augmentation
1) Horizontal flips

Horizontal flips.png

2) Random crops/scales
Random crops/scales.png

3) Color jitter
Randomly jitter contrast.png

4) Get creative
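
A minimal NumPy sketch of the first three augmentations, assuming images stored as (N, C, H, W) arrays. The function names, the crop size of 28 and the contrast range 0.8 to 1.2 are all assumptions chosen for illustration:

```python
import numpy as np

def random_horizontal_flip(batch, rng, prob=0.5):
    """Flip each (C, H, W) image left-right with probability `prob`."""
    flip = rng.random(batch.shape[0]) < prob
    out = batch.copy()
    out[flip] = out[flip][:, :, :, ::-1]
    return out

def random_crop(batch, crop_size, rng):
    """Cut a random crop_size x crop_size patch out of every image."""
    N, C, H, W = batch.shape
    out = np.empty((N, C, crop_size, crop_size), dtype=batch.dtype)
    for n in range(N):
        top = rng.integers(0, H - crop_size + 1)
        left = rng.integers(0, W - crop_size + 1)
        out[n] = batch[n, :, top:top+crop_size, left:left+crop_size]
    return out

def jitter_contrast(batch, rng, low=0.8, high=1.2):
    """Scale each image's deviation from its own mean by a random factor."""
    factors = rng.uniform(low, high, size=(batch.shape[0], 1, 1, 1))
    means = batch.mean(axis=(1, 2, 3), keepdims=True)
    return (batch - means) * factors + means

rng = np.random.default_rng(0)
batch = rng.random((8, 3, 32, 32))       # stand-in for CIFAR-10 images

flipped = random_horizontal_flip(batch, rng)
cropped = random_crop(batch, 28, rng)
jittered = jitter_contrast(batch, rng)

# Doubling a training set by adding the mirrored copy of every image
doubled = np.concatenate([batch, batch[:, :, :, ::-1]], axis=0)
print(flipped.shape, cropped.shape, jittered.shape, doubled.shape)
```

The last two lines show the "x2" style of augmentation, where every image is kept together with its mirror image.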

Below is a CNN model tested on CIFAR-10 (with simple horizontal flips for data augmentation); training set: 49000x2, validation set: 1000, test set: 10000. The layer structure is:

           [[conv - relu]x3 - pool]x3 - affine - relu - affine - softmax

· Validation set accuracy: 0.904
· Test set accuracy: 0.892

Training loss & Accuracy
CONV layer 1: filters

Part 4: Visualizing Convolutional Neural Networks


1. Visualizing weights and activations


CONV layer 1: filters(left) and activations(right)
CONV layer 2: filters(left) and activations(right)
CONV layer 3: activations
CONV layer 4: activations
CONV layer 5: activations
Fully-connected layer 1 & 2
Output layer

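2. Retrieving images that maximally activate a neuron

We can feed a large number of images through the network and keep track of the ones that maximally activate a given neuron. Visualizing those images helps us understand what the neuron is looking for within its receptive field in order to classify images correctly. The figure below is from the fifth pooling layer of AlexNet (bald heads caught in the crossfire, O__O):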

AlexNet: pooling layer 5
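
The retrieval step itself is just a sort. A minimal sketch, assuming the activations have already been collected into a (num_images, num_neurons) array; the random array and the neuron index 42 below are placeholders, not real network outputs:

```python
import numpy as np

def top_activating_images(activations, neuron, k=5):
    """Indices of the k images with the largest activation for one neuron."""
    order = np.argsort(activations[:, neuron])[::-1]   # largest first
    return order[:k]

rng = np.random.default_rng(0)
activations = rng.random((1000, 256))   # placeholder for recorded activations

top = top_activating_images(activations, neuron=42, k=5)
print(top, activations[top, 42])
```

The returned indices point back into the image set, so the corresponding images can be displayed side by side.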

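3. Visualizing images with t-SNE and CNN feature vectors

A CNN can be viewed as transforming the input image layer by layer until it produces a representation that a linear classifier can classify. That final representation is the CNN code (for example, the 4096-dimensional vector fed to the classifier in AlexNet), i.e. the feature vector.

t-SNE is one of the best methods for reducing high-dimensional data to a few dimensions for visualization, and its results look great. We can feed the CNN codes into t-SNE to get a two-dimensional vector for every image (each image corresponds to one feature vector), and then plot the result as below (the closer two images are, the more similar they look to the CNN):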

t-SNE visualization of CNN codes

4. Occluding parts of the image


Occluding parts of the image

Part 5: Transfer Learning


CS231n Convolutional Neural Networks for Visual Recognition.png


---> CS231n: Assignment 1
---> CS231n: Assignment 3
