Deep Learning: Getting Started

Source: Internet | Published: 游戏优化学什么 | Editor: 程序博客网 | Time: 2024/04/25 13:11

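This tutorial does not try to replace an undergraduate or graduate machine learning course, but we do quickly review some important concepts and notation to make sure everyone understands what follows. You will also need to download the datasets mentioned in this chapter in order to run the examples in later chapters.

Download

Each learning algorithm's page lets you download the corresponding files. If you want to download everything at once, you can clone the tutorial's Git repository: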

git clone https://github.com/lisa-lab/DeepLearningTutorials.git

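Datasets

MNIST dataset (mnist.pkl.gz)

The MNIST dataset consists of images of handwritten digits, divided into a training set of 60,000 examples and a test set of 10,000 examples. In many papers, as well as in this tutorial, the official training set of 60,000 examples is further divided into an actual training set of 50,000 examples and a validation set of 10,000 examples (used to select hyper-parameters such as the learning rate and the size of the model). All digit images have been size-normalized and centered in a fixed-size image of 28 x 28 pixels. In the original dataset each pixel is represented by a value between 0 and 255, where 0 is black, 255 is white, and anything in between is a shade of grey.

For convenience, we pickled the dataset to make it easier to use from Python. It is available for download at http://deeplearning.net/data/mnist/mnist.pkl.gz

The pickled file represents a tuple of three lists: (training set, validation set, test set). Each of the three is a pair formed from a list of images and a list of class labels, one per image. An image is represented as a 1-dimensional numpy array of 784 (28 x 28) float values between 0 and 1 (0 stands for black, 1 for white). The labels are numbers between 0 and 9 indicating which digit the image represents. The following code shows how to load the dataset: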

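    import cPickle, gzip, numpy

    # Load the dataset (Python 2; under Python 3 use the pickle module and
    # pickle.load(f, encoding='latin1') instead of cPickle.load(f))
    f = gzip.open('mnist.pkl.gz', 'rb')
    train_set, valid_set, test_set = cPickle.load(f)
    f.close()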
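The relationship between the raw 0-255 pixel values and the 0-1 floats stored in the pickled file can be sketched as follows. This is an illustration only: it assumes the preprocessing is a plain division by 255, which the text does not state, and it uses a hypothetical 2 x 2 image instead of a real 28 x 28 digit.

```python
# Hypothetical 2x2 "image" with raw 8-bit intensities
# (a real MNIST image is 28x28 and flattens to 784 values).
raw_image = [[0, 128],
             [255, 64]]

def flatten_and_scale(image):
    """Flatten a 2-D list of 0-255 pixels into a 1-D list of floats in [0, 1].

    Assumption: scaling is a plain division by 255.
    """
    return [pixel / 255.0 for row in image for pixel in row]

scaled = flatten_and_scale(raw_image)
print(len(scaled), scaled)
```

With this convention 0 stays black (0.0) and 255 becomes white (1.0), matching the description of the pickled arrays.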

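When using the dataset, we usually divide it into minibatches. We encourage you to store the dataset in shared variables and access it by minibatch index, given a fixed and known batch size. The reason for using shared variables is related to the GPU. Copying data into GPU memory carries a large overhead. If you copied each minibatch to the GPU on demand, which is what happens without shared variables, the GPU code would not be much faster than the CPU code, and might even be slower. If you keep the data in Theano shared variables instead, Theano can copy the entire dataset to the GPU in a single call when the shared variables are constructed. Afterwards the GPU can access any minibatch by taking a slice of the shared variables, without copying anything from CPU memory, which avoids the overhead. Because the data points and their labels are usually of a different nature (labels are usually integers while data points are real numbers), we suggest using separate variables for data and labels. We also recommend separate variables for the training set, validation set and test set to make the code more readable.

Since the data is now in one variable, and a minibatch is defined as a slice of that variable, it is natural to define a minibatch by its index and its size. In our setup the batch size stays constant throughout the execution of the code, so a function only needs the index to identify which data points to work on. The code below shows how to store the data and how to access a minibatch: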

    import numpy
    import theano
    import theano.tensor as T

    def shared_dataset(data_xy):
        """ Function that loads the dataset into shared variables

        The reason we store our dataset in shared variables is to allow
        Theano to copy it into the GPU memory (when code is run on GPU).
        Since copying data into the GPU is slow, copying a minibatch everytime
        is needed (the default behaviour if the data is not in a shared
        variable) would lead to a large decrease in performance.
        """
        data_x, data_y = data_xy
        shared_x = theano.shared(numpy.asarray(data_x, dtype=theano.config.floatX))
        shared_y = theano.shared(numpy.asarray(data_y, dtype=theano.config.floatX))
        # When storing data on the GPU it has to be stored as floats
        # therefore we will store the labels as ``floatX`` as well
        # (``shared_y`` does exactly that). But during our computations
        # we need them as ints (we use labels as index, and if they are
        # floats it doesn't make sense) therefore instead of returning
        # ``shared_y`` we will have to cast it to int. This little hack
        # lets us get around this issue
        return shared_x, T.cast(shared_y, 'int32')

    test_set_x, test_set_y = shared_dataset(test_set)
    valid_set_x, valid_set_y = shared_dataset(valid_set)
    train_set_x, train_set_y = shared_dataset(train_set)

    batch_size = 500    # size of the minibatch

    # accessing the third minibatch of the training set
    data  = train_set_x[2 * batch_size: 3 * batch_size]
    label = train_set_y[2 * batch_size: 3 * batch_size]

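The data has to be stored as floats on the GPU (the right dtype for storing on the GPU is given by theano.config.floatX). For the labels, we store them as floats and then cast them to int.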
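The index-based slicing itself does not depend on Theano. Below is a minimal sketch using plain Python lists; the helper `get_minibatch` and the toy data are hypothetical names introduced only for illustration, and dropping an incomplete trailing batch is an assumption of this sketch.

```python
def get_minibatch(data, labels, index, batch_size):
    """Return the minibatch number `index` (0-based) for a fixed batch size."""
    start = index * batch_size
    stop = (index + 1) * batch_size
    return data[start:stop], labels[start:stop]

# Toy stand-in for a dataset: 10 one-dimensional "examples" and their labels.
toy_x = [[float(i)] for i in range(10)]
toy_y = [i % 2 for i in range(10)]

batch_size = 4
# Assumption of this sketch: an incomplete trailing batch is simply dropped.
n_batches = len(toy_x) // batch_size

x_batch, y_batch = get_minibatch(toy_x, toy_y, 1, batch_size)
print(n_batches, x_batch, y_batch)
```

Because the batch size is fixed, the index alone identifies the slice, which is why the training functions only need the index as input.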

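A Primer on Supervised Optimization for Deep Learning

What makes deep learning exciting is largely the unsupervised learning of deep networks. But supervised learning plays an important role too. The usefulness of unsupervised pre-training is usually judged by the performance that can be reached after supervised fine-tuning. This chapter reviews the basics of supervised learning for classification models, and covers the minibatch stochastic gradient descent algorithm used to fine-tune many of the models in the Deep Learning Tutorials. For more basics on optimizing a training criterion using the gradient, see Introduction to Gradient-Based Learning.

Learning a Classifier

Zero-One Loss

The models presented in these deep learning tutorials are mostly used for classification. The objective in training a classifier is to minimize the number of errors (zero-one loss) on unseen examples. If the prediction function is: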

$f: R^D \rightarrow\{0,...,L\}$

then the loss can be written as:

$\ell_{0,1} = \sum_{i=0}^{|\mathcal{D}|} I_{f(x^{(i)}) \neq y^{(i)}}$

where the indicator function $I$ and the prediction function $f$ are defined as:

$I_x = \left\{\begin{array}{ccc} 1&\mbox{ if $x$ is True} \\ 0&\mbox{ otherwise}\end{array}\right.$

$f(x) = {\rm argmax}_k P(Y=k | x, \theta)$

In Python, using Theano, this can be written as:

    # zero_one_loss is a Theano variable representing a symbolic
    # expression of the zero one loss ; to get the actual value this
    # symbolic expression has to be compiled into a Theano function (see
    # the Theano tutorial for more details)
    # p_y_given_x is assumed to hold one row of class probabilities per
    # example, so the argmax is taken along axis 1
    zero_one_loss = T.sum(T.neq(T.argmax(p_y_given_x, axis=1), y))
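For readers without Theano at hand, the same quantity can be sketched in plain Python. The probability table and labels below are made-up illustrative values, and `argmax` and `zero_one_loss` are helper names chosen for this sketch.

```python
def argmax(values):
    """Index of the largest entry in a list."""
    return max(range(len(values)), key=lambda k: values[k])

def zero_one_loss(p_y_given_x, y):
    """Count the examples whose most probable class differs from the label."""
    predictions = [argmax(row) for row in p_y_given_x]
    return sum(1 for pred, target in zip(predictions, y) if pred != target)

# Hypothetical class probabilities: one row per example, one column per class.
probs = [[0.7, 0.2, 0.1],
         [0.1, 0.3, 0.6],
         [0.4, 0.5, 0.1]]
labels = [0, 2, 0]

loss = zero_one_loss(probs, labels)
print(loss)  # the third example is predicted as class 1, so one error
```

The result changes only when a prediction flips, which is why this loss gives no gradient to follow.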

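Negative Log-Likelihood Loss

Since the zero-one loss is not differentiable, optimizing it for large models (thousands or millions of parameters) is prohibitively expensive computationally. We therefore maximize the log-likelihood of our classifier instead:

$\mathcal{L}(\theta, \mathcal{D}) = \sum_{i=0}^{|\mathcal{D}|} \log P(Y=y^{(i)} | x^{(i)}, \theta)$

The likelihood of the correct class is not the same as the number of correct predictions, but from the point of view of a randomly initialized classifier the two are quite similar. Remember that likelihood and zero-one loss are different objectives: you should see that they are correlated on the validation set, but sometimes one will rise while the other falls.

Since we usually speak of minimizing a loss function, learning will attempt to minimize the negative log-likelihood (NLL), defined as: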

$NLL(\theta, \mathcal{D}) = - \sum_{i=0}^{|\mathcal{D}|} \log P(Y=y^{(i)} | x^{(i)}, \theta)$

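The NLL of our classifier is a differentiable surrogate for the zero-one loss, and we use the gradient of this function over our training data as the supervised learning signal for deep learning of a classifier.

This can be computed with the following code: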

    # NLL is a symbolic variable ; to get the actual value of NLL, this symbolic
    # expression has to be compiled into a Theano function (see the Theano
    # tutorial for more details)
    NLL = -T.sum(T.log(p_y_given_x)[T.arange(y.shape[0]), y])
    # note on syntax: T.arange(y.shape[0]) is a vector of integers [0,1,2,...,len(y)-1].
    # Indexing a matrix M by the two vectors [0,1,...,K], [a,b,...,k] returns the
    # elements M[0,a], M[1,b], ..., M[K,k] as a vector.  Here, we use this
    # syntax to retrieve the log-probability of the correct labels, y.
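The indexing trick picks, for each example, the log-probability assigned to its correct label. A plain-Python sketch of the same computation is shown below; the probabilities and labels are made-up illustrative values and `negative_log_likelihood` is a name chosen for this sketch.

```python
import math

def negative_log_likelihood(p_y_given_x, y):
    """Sum of -log P(Y = y_i | x_i) over all examples."""
    return -sum(math.log(row[target]) for row, target in zip(p_y_given_x, y))

# Hypothetical class probabilities: one row per example, one column per class.
probs = [[0.7, 0.2, 0.1],
         [0.1, 0.3, 0.6],
         [0.4, 0.5, 0.1]]
labels = [0, 2, 0]

nll = negative_log_likelihood(probs, labels)
print(nll)  # equals -log(0.7 * 0.6 * 0.4)
```

Unlike the error count, this value moves smoothly whenever the probability of a correct label changes, even if no prediction flips.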

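Stochastic Gradient Descent

What is ordinary gradient descent? It is a simple algorithm in which we repeatedly take small steps downward on an error surface defined by a loss function of some parameters. For the purpose of ordinary gradient descent we consider that the training data is rolled into the loss function. The pseudocode of the algorithm can then be described as: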

    # GRADIENT DESCENT

    while True:
        loss = f(params)
        d_loss_wrt_params = ... # compute gradient
        params -= learning_rate * d_loss_wrt_params
        if <stopping condition is met>:
            return params
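To make the loop above concrete, here is a minimal runnable sketch on a one-parameter quadratic. The loss, its hand-written gradient, the learning rate, and the stopping rule (gradient magnitude below a tolerance) are all assumptions chosen for illustration, not part of the tutorial.

```python
def loss(param):
    """Toy loss with its minimum at param = 3."""
    return (param - 3.0) ** 2

def grad(param):
    """Analytic derivative of the toy loss."""
    return 2.0 * (param - 3.0)

def gradient_descent(param, learning_rate=0.1, tolerance=1e-8, max_steps=10000):
    for _ in range(max_steps):
        d_loss_wrt_param = grad(param)
        if abs(d_loss_wrt_param) < tolerance:  # assumed stopping condition
            break
        param -= learning_rate * d_loss_wrt_param
    return param

p_star = gradient_descent(0.0)
print(p_star, loss(p_star))
```

Each step shrinks the distance to the minimum by a constant factor here, so the loop ends well before `max_steps`.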

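Stochastic gradient descent (SGD) works according to the same principles as ordinary gradient descent, but proceeds more quickly by estimating the gradient from just a few examples at a time instead of the entire training set. In its purest form, the gradient is estimated from a single example at a time: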

    # STOCHASTIC GRADIENT DESCENT
    for (x_i,y_i) in training_set:
                                # imagine an infinite generator
                                # that may repeat examples (if there is only a finite training set)
        loss = f(params, x_i, y_i)
        d_loss_wrt_params = ... # compute gradient
        params -= learning_rate * d_loss_wrt_params
        if <stopping condition is met>:
            return params

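The variant that we recommend for deep learning is a further twist on stochastic gradient descent using so-called "minibatches". Minibatch SGD (MSGD) works identically to SGD, except that we use more than one training example to make each estimate of the gradient. This technique reduces variance in the estimate of the gradient, and often makes better use of the hierarchical memory organization in modern computers.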

    for (x_batch,y_batch) in train_batches:
                                # imagine an infinite generator
                                # that may repeat examples
        loss = f(params, x_batch, y_batch)
        d_loss_wrt_params = ... # compute gradient using theano
        params -= learning_rate * d_loss_wrt_params
        if <stopping condition is met>:
            return params
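The minibatch loop can be tried without Theano on a toy problem. The sketch below fits a single weight `w` so that `w * x` matches noise-free targets `y = 2x`; the model, the squared-error loss, the hand-derived gradient, and the fixed number of epochs used as stopping rule are all assumptions made for this illustration.

```python
import random

def minibatch_sgd(xs, ys, batch_size, learning_rate=0.05, n_epochs=50, seed=0):
    """Fit y ~ w * x by minibatch SGD on the loss 0.5 * (w * x - y) ** 2."""
    rng = random.Random(seed)
    w = 0.0
    indices = list(range(len(xs)))
    for _ in range(n_epochs):
        rng.shuffle(indices)  # visit the examples in a new order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # gradient of the loss, averaged over the examples in the minibatch
            d_loss_wrt_w = sum((w * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)
            w -= learning_rate * d_loss_wrt_w
    return w

xs = [i / 10.0 for i in range(1, 21)]   # 0.1, 0.2, ..., 2.0
ys = [2.0 * x for x in xs]              # targets generated with true weight 2

w = minibatch_sgd(xs, ys, batch_size=4)
print(w)
```

Setting `batch_size=1` turns this into the single-example SGD loop, and `batch_size=len(xs)` into ordinary gradient descent.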

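There is a trade-off in the choice of the minibatch size B. Variance reduction and the use of SIMD instructions help most when B increases from 1 to 2, but the marginal improvement fades rapidly. With large B, time is spent reducing the variance of the gradient estimator that would be better spent on additional gradient steps. An optimal B is model-, dataset-, and hardware-dependent, and can be anywhere from 1 to several hundred. In this tutorial we set it to 20, a choice that is largely arbitrary.

If you train for a fixed number of epochs, keep in mind that 10 epochs with B=1 and the same 10 epochs with B=20 give completely different results, because the batch size controls how many updates are made to the parameters. So when you choose B, be prepared to adjust the other hyper-parameters as well.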
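The effect of B on the number of parameter updates is simple arithmetic. The sketch below works it out for the 50,000-example MNIST training split described earlier; `n_updates` is a helper name introduced here, and it assumes one update per full minibatch.

```python
def n_updates(n_examples, batch_size, n_epochs):
    """Number of parameter updates when each full minibatch triggers one update."""
    n_batches_per_epoch = n_examples // batch_size
    return n_batches_per_epoch * n_epochs

n_train = 50000  # size of the MNIST training split used in this tutorial

updates_b1 = n_updates(n_train, batch_size=1, n_epochs=10)
updates_b20 = n_updates(n_train, batch_size=20, n_epochs=10)
print(updates_b1, updates_b20)  # 500000 25000
```

Ten epochs with B=1 make twenty times as many updates as ten epochs with B=20, which is why a learning rate tuned for one setting rarely carries over to the other.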

The code blocks above are pseudocode. In Theano, minibatch SGD can be implemented as follows:

    # Minibatch Stochastic Gradient Descent

    # assume loss is a symbolic description of the loss function given
    # the symbolic variables params (shared variable), x_batch, y_batch;

    # compute gradient of loss with respect to params
    d_loss_wrt_params = T.grad(loss, params)

    # compile the MSGD step into a theano function
    updates = [(params, params - learning_rate * d_loss_wrt_params)]
    MSGD = theano.function([x_batch,y_batch], loss, updates=updates)

    for (x_batch, y_batch) in train_batches:
        # here x_batch and y_batch are elements of train_batches and
        # therefore numpy arrays; function MSGD also updates the params
        print('Current loss is ', MSGD(x_batch, y_batch))
        if stopping_condition_is_met:
            return params

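Regularization

There is more to machine learning than optimization. When we train a model, our main goal is to do well on new, unseen examples rather than on the ones we have already seen. The MSGD training loop above does not take this into account and may overfit the training examples. One way to combat overfitting is regularization. There are several regularization techniques; here we cover L1/L2 regularization and early stopping.

L1 and L2 Regularization

L1 and L2 regularization add an extra term to the loss function that penalizes certain parameter configurations. If the loss function is: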

$NLL(\theta, \mathcal{D}) = - \sum_{i=0}^{|\mathcal{D}|} \log P(Y=y^{(i)} | x^{(i)}, \theta)$

then the regularized loss is:

$E(\theta, \mathcal{D}) = NLL(\theta, \mathcal{D}) + \lambda R(\theta)$

or, in our case:

$E(\theta, \mathcal{D}) = NLL(\theta, \mathcal{D}) + \lambda||\theta||_p^p$

where:

$||\theta||_p = \left(\sum_{j=0}^{|\theta|}{|\theta_j|^p}\right)^{\frac{1}{p}}$

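Here $||\theta||_p$ is the $L_p$ norm of $\theta$, and $\lambda$ is a hyper-parameter that controls the relative importance of the regularization term. Commonly used values for p are 1 and 2, hence the L1/L2 nomenclature. If p=2, the regularizer is also called "weight decay".

In principle, adding a regularization term to the loss encourages smooth network mappings by penalizing large parameter values, which reduces the amount of nonlinearity the network models. Intuitively, the NLL term pushes the model to fit the data well, while the regularization term pushes it toward a simpler, smoother solution. Minimizing the sum of the two is therefore, in theory, the best trade-off between them.

Note that a simple solution does not necessarily generalize well. Empirically, regularization has been found to help generalization in neural networks, especially on small datasets. The code below shows how to compute the regularized loss in Python when it contains both an L1 term (weighted by lambda_1) and a squared L2 term (weighted by lambda_2):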

    # symbolic Theano variable that represents the L1 regularization term
    L1  = T.sum(abs(param))

    # symbolic Theano variable that represents the squared L2 term
    L2 = T.sum(param ** 2)

    # the loss
    loss = NLL + lambda_1 * L1 + lambda_2 * L2
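The same arithmetic can be checked by hand on a tiny parameter vector. In the sketch below the parameter values, the NLL value, and the two lambda weights are made-up numbers, and the function names are chosen for this illustration.

```python
def l1_term(params):
    """Sum of absolute values, i.e. ||theta||_1."""
    return sum(abs(p) for p in params)

def l2_sqr_term(params):
    """Sum of squares, i.e. ||theta||_2 squared."""
    return sum(p * p for p in params)

def regularized_loss(nll, params, lambda_1, lambda_2):
    return nll + lambda_1 * l1_term(params) + lambda_2 * l2_sqr_term(params)

params = [0.5, -1.5, 2.0]   # hypothetical flattened parameter vector
nll = 1.0                   # hypothetical unregularized loss

total = regularized_loss(nll, params, lambda_1=0.1, lambda_2=0.01)
print(l1_term(params), l2_sqr_term(params), total)
```

Here the L1 term is 4.0 and the squared L2 term is 6.5, so the penalties add 0.4 and 0.065 to the loss. Larger weights are penalized more strongly by the squared term.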

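Early Stopping

Early stopping combats overfitting by monitoring the model's performance on a validation set. A validation set is a set of examples that we never use for gradient descent, but which is also not part of the test set. The validation examples are treated as representative of future test examples, and we can use them during training precisely because they are not part of the test set. If the model's performance ceases to improve sufficiently on the validation set, or even degrades with further optimization, we give up on further optimization.

Choosing when to stop is a judgement call based on experience; this tutorial uses a strategy based on a geometrically increasing amount of patience.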

    # early-stopping parameters
    patience = 5000  # look at this many examples regardless
    patience_increase = 2     # wait this much longer when a new best is
                                  # found
    improvement_threshold = 0.995  # a relative improvement of this much is
                                   # considered significant
    validation_frequency = min(n_train_batches, patience/2)
                                  # go through this many
                                  # minibatches before checking the network
                                  # on the validation set; in this case we
                                  # check every epoch

    best_params = None
    best_validation_loss = numpy.inf
    test_score = 0.
    start_time = time.clock()

    done_looping = False
    epoch = 0
    while (epoch < n_epochs) and (not done_looping):
        # Report "1" for first epoch, "n_epochs" for last epoch
        epoch = epoch + 1
        for minibatch_index in range(n_train_batches):

            d_loss_wrt_params = ... # compute gradient
            params -= learning_rate * d_loss_wrt_params # gradient descent

            # iteration number. We want it to start at 0.
            iter = (epoch - 1) * n_train_batches + minibatch_index
            # note that if we do `iter % validation_frequency` it will be
            # true for iter = 0 which we do not want. We want it true for
            # iter = validation_frequency - 1.
            if (iter + 1) % validation_frequency == 0:

                this_validation_loss = ... # compute zero-one loss on validation set

                if this_validation_loss < best_validation_loss:

                    # improve patience if loss improvement is good enough
                    if this_validation_loss < best_validation_loss * improvement_threshold:

                        patience = max(patience, iter * patience_increase)
                    best_params = copy.deepcopy(params)
                    best_validation_loss = this_validation_loss

            if patience <= iter:
                done_looping = True
                break

    # POSTCONDITION:
    # best_params refers to the best out-of-sample parameters observed during the optimization

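If we run out of batches of training data before running out of patience, we simply go back to the beginning of the training set and repeat.

Note that validation_frequency should always be smaller than patience: the code should check how the model performs at least twice before running out of patience. This is why we use validation_frequency = min(value, patience/2).

Note also that this algorithm could possibly be improved by using a test of statistical significance, rather than a simple comparison, when deciding whether to increase the patience.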
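The patience logic can be traced without training anything. The sketch below replays a scripted sequence of validation losses through the same rules; the function `replay_early_stopping`, the loss values, and the small patience and frequency settings are hypothetical and chosen only to keep the trace short.

```python
def replay_early_stopping(validation_losses, validation_frequency, patience,
                          patience_increase=2, improvement_threshold=0.995):
    """Replay scripted validation losses (one per validation check).

    Returns (best_loss, best_iteration, stop_iteration).
    """
    best_loss = float("inf")
    best_iteration = None
    checks = iter(validation_losses)
    iteration = 0
    while True:
        # (a real training loop would apply one minibatch update here)
        if (iteration + 1) % validation_frequency == 0:
            this_loss = next(checks, None)
            if this_loss is None:  # scripted losses exhausted
                break
            if this_loss < best_loss:
                # extend patience only for a significant improvement
                if this_loss < best_loss * improvement_threshold:
                    patience = max(patience, iteration * patience_increase)
                best_loss = this_loss
                best_iteration = iteration
        if patience <= iteration:
            break
        iteration += 1
    return best_loss, best_iteration, iteration

scripted = [0.5, 0.4, 0.39, 0.41, 0.42, 0.43, 0.44, 0.45]
result = replay_early_stopping(scripted, validation_frequency=5, patience=10)
print(result)  # (0.39, 14, 28)
```

The best loss is seen at iteration 14, which raises the patience to 28. Because no later check improves on it, the loop stops at iteration 28.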

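Testing

After the loop exits, the best_params variable refers to the model that performed best on the validation set. If we repeat this procedure for another model class, or even for another random initialization, we should use the same train/validation/test split of the data and obtain other best-performing models. If we have to choose the best model class or the best initialization, we compare the best_validation_loss of each model. Once we have finally chosen the model we think is best (on validation data), we report that model's performance on the test set, which it has never seen.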
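The selection step amounts to taking a minimum over validation losses. In the sketch below the model names and loss values are invented for illustration, and `select_model` is a helper name introduced here.

```python
def select_model(validation_losses):
    """Return the name of the candidate with the lowest validation loss."""
    return min(validation_losses, key=validation_losses.get)

# Hypothetical best_validation_loss values, one per trained candidate,
# all obtained on the same train/validation/test split.
best_validation_loss = {
    "model_a": 0.031,
    "model_b": 0.027,
    "model_c": 0.029,
}

chosen = select_model(best_validation_loss)
print(chosen)
```

Only the chosen candidate is then evaluated on the test set, and only once, so that the test score stays an honest estimate of performance on unseen data.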

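Recap

That is it for the optimization section. The technique of early stopping requires us to partition the set of examples into three sets (training, validation, test). The training set is used for minibatch stochastic gradient descent on the differentiable approximation of the objective function. As we perform this gradient descent, we periodically check the model's performance on the validation set. When we see a good model on the validation set we save it, and when a long time has passed without improvement, we return to the best model and evaluate it on the test set.

Theano/Python Tips

Loading and Saving Models

Finding good parameters by gradient descent often takes hours or even days. Once you have found good weights, be sure to save them, and consider saving your current best estimate as the search progresses.

Pickle the numpy ndarrays from your shared variables

The best way to save or archive your model's parameters is to pickle or deepcopy the ndarray objects. For example, if your parameters are in the shared variables w, v, u, your save command might look like: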

    >>> import cPickle
    >>> save_file = open('path', 'wb')  # this will overwrite current contents
    >>> cPickle.dump(w.get_value(borrow=True), save_file, -1)  # the -1 is for HIGHEST_PROTOCOL
    >>> cPickle.dump(v.get_value(borrow=True), save_file, -1)  # .. and it triggers much more efficient
    >>> cPickle.dump(u.get_value(borrow=True), save_file, -1)  # .. storage than numpy's default
    >>> save_file.close()

Then later, you can load the data back like this:

    >>> save_file = open('path', 'rb')  # binary mode, matching the 'wb' used when saving
    >>> w.set_value(cPickle.load(save_file), borrow=True)
    >>> v.set_value(cPickle.load(save_file), borrow=True)
    >>> u.set_value(cPickle.load(save_file), borrow=True)

This technique is a bit verbose, but it is tried and true.
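The save/load round trip can be tried with the standard library alone. The sketch below uses Python 3's `pickle` module and plain nested lists as stand-ins for the ndarrays returned by `get_value()`; the parameter names, values, and file name are hypothetical.

```python
import os
import pickle
import tempfile

# Hypothetical parameter values standing in for w.get_value(), v.get_value(), ...
params = {"w": [[0.1, -0.2], [0.3, 0.4]], "b": [0.0, 0.5]}

path = os.path.join(tempfile.mkdtemp(), "best_params.pkl")

# Save: binary mode, highest protocol for compact storage.
with open(path, "wb") as save_file:
    pickle.dump(params, save_file, pickle.HIGHEST_PROTOCOL)

# Load: binary mode again, matching how the file was written.
with open(path, "rb") as save_file:
    restored = pickle.load(save_file)

print(restored == params)
```

Storing all parameters in one dictionary is a design choice of this sketch: it removes the need to load the values back in exactly the order they were dumped.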

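Do not pickle your training or test functions for long-term storage

Theano functions are compatible with Python's deepcopy and pickle mechanisms, but you should not necessarily pickle a Theano function. If you update your Theano folder and one of the internals changes, you may be unable to unpickle your model. Theano is still in active development, and its internal APIs are subject to change. To be on the safe side, do not pickle your entire training or testing functions for long-term storage. The pickle mechanism is aimed at short-term storage, such as a temp file or a copy sent to another machine in a distributed job.

Read more in Loading and Saving, or in 12.1. pickle - Python object serialization - Python 3.6.1 documentation.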

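Plotting Intermediate Results

Visualizations can be very powerful tools for understanding what your model or training algorithm is doing. You might be tempted to insert matplotlib plotting commands, or PIL image-rendering commands, into your model-training script. However, later you will notice something interesting in one of those pre-rendered images and want to investigate it further. At that point you will wish you had saved the original model.

If you have enough disk space, your training script should save intermediate models, and a visualization script should process those saved models.

Some libraries you may find useful are the Python Imaging Library (PIL) and matplotlib.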
