Theano Deep Learning Tutorials Notes: Getting Started

Source: Internet · Editor: 程序博客网 · Time: 2024/05/20 18:43

Tutorial: http://www.deeplearning.net/tutorial/gettingstarted.html

Datasets

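(1) MNIST handwritten digits: each image is a 784-dimensional vector (28*28) of floats in the range 0 to 1, and each represents a digit from 0 to 9. There are 50,000 images in the training set, 10,000 in the validation set (used to choose hyperparameters such as the learning rate and model size), and 10,000 in the testing set.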

For convenience we pickled the dataset to make it easier to use in python.

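import cPickle, gzip, numpy

# Load the dataset
f = gzip.open('mnist.pkl.gz', 'rb')
train_set, valid_set, test_set = cPickle.load(f)
f.close()

Note: cPickle offers almost exactly the same functions and usage as pickle, but it is written in C and is therefore much faster. (In Python 3 there is no separate cPickle module; pickle uses the C implementation automatically.)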

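(2) We encourage you to store the dataset into shared variables and access it based on the minibatch index, given a fixed and known batch size (batch_size = 500 in the code).

The reason: when using a GPU, repeatedly copying data to the GPU is inefficient, so use Theano shared variables wherever possible to improve performance. The suggestion is to create six separate shared variables: three for the data (training set, validation set, testing set) and three for the labels.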

import numpy
import theano
import theano.tensor as T

def shared_dataset(data_xy):
    # Function that loads the dataset into shared variables
    data_x, data_y = data_xy
    shared_x = theano.shared(numpy.asarray(data_x, dtype=theano.config.floatX))
    shared_y = theano.shared(numpy.asarray(data_y, dtype=theano.config.floatX))
    # Data on the GPU is stored as floats, but the labels y should be
    # ints, so cast them to int32 on return.
    return shared_x, T.cast(shared_y, 'int32')

test_set_x, test_set_y = shared_dataset(test_set)
valid_set_x, valid_set_y = shared_dataset(valid_set)
train_set_x, train_set_y = shared_dataset(train_set)

batch_size = 500    # size of the minibatch

# accessing the third minibatch of the training set
data  = train_set_x[2 * batch_size: 3 * batch_size]
label = train_set_y[2 * batch_size: 3 * batch_size]
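
The slicing used for the third minibatch generalizes to any minibatch index. A minimal sketch in plain Python, where ordinary lists stand in for the shared variables and `get_minibatch` is a hypothetical helper (not part of Theano):

```python
def get_minibatch(data, labels, index, batch_size):
    """Return the slice of data/labels for minibatch number `index` (0-based)."""
    start = index * batch_size
    stop = (index + 1) * batch_size
    return data[start:stop], labels[start:stop]

# Toy data: 10 examples with batch size 4 give 3 minibatches,
# the last of which is shorter than the others.
toy_x = list(range(10))
toy_y = [v % 2 for v in toy_x]
third_x, third_y = get_minibatch(toy_x, toy_y, 2, 4)
```

Because Python slices are clipped at the end of the sequence, the last minibatch simply comes out shorter when the dataset size is not a multiple of the batch size.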

If the dataset is too large to fit in memory:

You can store a sufficiently small chunk of your data (several minibatches) in a shared variable and use that during training. Once you get through the chunk, update the values it stores.
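
The chunking idea can be sketched in plain Python. This is only an illustration under simplifying assumptions: a local list stands in for the shared variable, and `iterate_in_chunks` and `batches_per_chunk` are hypothetical names. In real Theano code the chunk would be written into the shared variable with `set_value`.

```python
def iterate_in_chunks(dataset, batch_size, batches_per_chunk):
    """Yield minibatches while keeping only one chunk 'resident' at a time."""
    chunk_size = batch_size * batches_per_chunk
    for chunk_start in range(0, len(dataset), chunk_size):
        # Stand-in for refreshing the shared variable with the next chunk.
        resident_chunk = dataset[chunk_start:chunk_start + chunk_size]
        # Train on every minibatch of the resident chunk before moving on.
        for batch_start in range(0, len(resident_chunk), batch_size):
            yield resident_chunk[batch_start:batch_start + batch_size]

batches = list(iterate_in_chunks(list(range(10)), batch_size=2, batches_per_chunk=2))
```

Only `batch_size * batches_per_chunk` examples need to be held at once, at the price of one transfer per chunk instead of one for the whole dataset.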

Learning a Classifier

Zero-One Loss

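A correctly predicted example contributes a loss of 0 and an incorrectly predicted one a loss of 1; the losses are summed over all examples.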

If $f: R^D \rightarrow\{0,...,L\}$ is the prediction function, then this loss can be written as:

$\ell_{0,1} = \sum_{i=0}^{|\mathcal{D}|} I_{f(x^{(i)}) \neq y^{(i)}}$

where either $\mathcal{D}$ is the training set (during training) or $\mathcal{D} \cap \mathcal{D}_{train} = \emptyset$ (to avoid biasing the evaluation of validation or test error). $I$ is the indicator function defined as:

$I_x = \left\{\begin{array}{ccc} 1&\mbox{ if $x$ is True} \\ 0&\mbox{ otherwise}\end{array}\right.$

In this tutorial, $f$ is defined as:

$f(x) = {\rm argmax}_k P(Y=k | x, \theta)$

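# zero_one_loss is a Theano variable representing a symbolic
# expression of the zero one loss ; to get the actual value this
# symbolic expression has to be compiled into a Theano function (see
# the Theano tutorial for more details)
# axis=1 takes the argmax over the classes of each example, assuming
# p_y_given_x has one row per example; without it the argmax is taken
# over the flattened matrix.
zero_one_loss = T.sum(T.neq(T.argmax(p_y_given_x, axis=1), y))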
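
The same count can be checked by hand without Theano. A minimal sketch, assuming the class probabilities are given as one list per example; the toy numbers are made up for illustration:

```python
def zero_one_loss(prob_rows, labels):
    """Count the examples whose most probable class differs from the label."""
    errors = 0
    for probs, label in zip(prob_rows, labels):
        # argmax_k P(Y=k | x): index of the largest probability in the row
        predicted = max(range(len(probs)), key=lambda k: probs[k])
        errors += int(predicted != label)
    return errors

toy_probs = [[0.7, 0.2, 0.1],   # predicts class 0
             [0.1, 0.8, 0.1],   # predicts class 1
             [0.3, 0.3, 0.4]]   # predicts class 2
toy_labels = [0, 2, 2]
toy_errors = zero_one_loss(toy_probs, toy_labels)
```

Only the second example is misclassified here, so the loss is 1. The loss is piecewise constant in the probabilities, which is why it cannot be optimized by gradient methods directly.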

Negative Log-Likelihood Loss

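The principle is that of maximum likelihood estimation: maximizing the log-likelihood is the same as minimizing its negative.

We minimize the negative log-likelihood (NLL), defined as: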

$NLL(\theta, \mathcal{D}) = - \sum_{i=0}^{|\mathcal{D}|} \log P(Y=y^{(i)} | x^{(i)}, \theta)$

# NLL is a symbolic variable ; to get the actual value of NLL, this symbolic
# expression has to be compiled into a Theano function (see the Theano
# tutorial for more details)
NLL = -T.sum(T.log(p_y_given_x)[T.arange(y.shape[0]), y])
# note on syntax: T.arange(y.shape[0]) is a vector of integers [0,1,2,...,len(y)-1].
# Indexing a matrix M by the two vectors [0,1,...,K], [a,b,...,k] returns the
# elements M[0,a], M[1,b], ..., M[K,k] as a vector.  Here, we use this
# syntax to retrieve the log-probability of the correct labels, y.
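
The indexing trick amounts to picking, for each example, the probability assigned to its correct label. A plain-Python sketch of the same sum, with made-up probabilities:

```python
import math

def negative_log_likelihood(prob_rows, labels):
    """Sum of -log P(Y = y_i | x_i) over all examples."""
    return -sum(math.log(probs[label]) for probs, label in zip(prob_rows, labels))

toy_probs = [[0.5, 0.5],
             [0.25, 0.75]]
toy_labels = [0, 1]
# Picks 0.5 from the first row and 0.75 from the second.
toy_nll = negative_log_likelihood(toy_probs, toy_labels)
```

Unlike the zero-one loss, this quantity changes smoothly with the predicted probabilities, and it reaches 0 only when every correct label is given probability 1.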

Stochastic Gradient Descent

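Stochastic gradient descent is a refinement of ordinary gradient descent:

Ordinary gradient descent averages the loss over all examples, so every iteration has to process the whole training set, which is computationally expensive and slow to converge. Instead, a small subset of examples (a minibatch) is drawn, and the parameters are adjusted using the mean loss over that minibatch.

Choosing the minibatch size: both large and small sizes have their own advantages and drawbacks.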

An optimal $B$ is model-, dataset-, and hardware-dependent, and can be anywhere from 1 to maybe several hundreds. In the tutorial we set it to 20, but this choice is almost arbitrary (though harmless).

If you are training for a fixed number of epochs, the minibatch size becomes important because it controls the number of updates done to your parameters. Training the same model for 10 epochs using a batch size of 1 yields completely different results compared to training for the same 10 epochs but with a batch size of 20.
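
The effect on the number of updates is simple arithmetic. A small sketch, assuming one parameter update per minibatch, a possibly shorter final minibatch that still counts as an update, and the 50,000 training examples of MNIST mentioned earlier; `num_updates` is a hypothetical helper:

```python
def num_updates(n_examples, batch_size, n_epochs):
    """Total parameter updates: one per minibatch, ceil(n/B) minibatches per epoch."""
    batches_per_epoch = (n_examples + batch_size - 1) // batch_size
    return n_epochs * batches_per_epoch

updates_b1 = num_updates(50000, 1, 10)    # batch size 1
updates_b20 = num_updates(50000, 20, 10)  # batch size 20
```

With a batch size of 1 the parameters are updated 500,000 times in 10 epochs, against 25,000 times with a batch size of 20, a factor of 20 for the same number of passes over the data.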

# Minibatch Stochastic Gradient Descent

# assume loss is a symbolic description of the loss function given
# the symbolic variables params (shared variable), x_batch, y_batch;

# compute gradient of loss with respect to params
d_loss_wrt_params = T.grad(loss, params)

# compile the MSGD step into a theano function
updates = [(params, params - learning_rate * d_loss_wrt_params)]
MSGD = theano.function([x_batch,y_batch], loss, updates=updates)

for (x_batch, y_batch) in train_batches:
    # here x_batch and y_batch are elements of train_batches and
    # therefore numpy arrays; function MSGD also updates the params
    print('Current loss is ', MSGD(x_batch, y_batch))
    if stopping_condition_is_met:
        break  # params now holds the trained values
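
To see the update rule at work without Theano, here is a minimal sketch on a toy problem that is not from the tutorial: fitting a single weight `w` in y = w * x with a mean squared error loss. The gradient is written out by hand instead of using `T.grad`, and the minibatches are visited in a fixed order rather than shuffled.

```python
def minibatch_sgd(xs, ys, batch_size, learning_rate, n_epochs):
    """Fit y ~ w * x by minibatch SGD on the mean squared error."""
    w = 0.0
    for _ in range(n_epochs):
        for start in range(0, len(xs), batch_size):
            x_batch = xs[start:start + batch_size]
            y_batch = ys[start:start + batch_size]
            # d/dw of mean((w*x - y)^2) over the minibatch
            grad = sum(2.0 * (w * x - y) * x
                       for x, y in zip(x_batch, y_batch)) / len(x_batch)
            # same rule as in the updates list: params - learning_rate * gradient
            w -= learning_rate * grad
    return w

toy_xs = [1.0, 2.0, 3.0, 4.0]
toy_ys = [3.0 * x for x in toy_xs]   # generated with true weight 3
w_fit = minibatch_sgd(toy_xs, toy_ys, batch_size=2, learning_rate=0.05, n_epochs=200)
```

On this noise-free data the estimate converges to the true weight 3. A learning rate that is too large for the scale of the inputs would make the same loop diverge.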

Regularization

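Regularization appears everywhere in machine learning; its main purpose is to prevent overfitting.

Intuitively: a norm of the model parameters is added to the loss function, so the optimization also tries to keep the parameters small (close to 0). This keeps the model as simple as the original objective allows, and in machine learning theory a simpler model is less prone to overfitting. (Simplicity should not be pursued blindly, though: a simple model does not necessarily generalize well.)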

L1 and L2 regularization

This means adding the L1 norm or the L2 norm of the parameter vector to the loss function.

Formally, if our loss function is:

$NLL(\theta, \mathcal{D}) = - \sum_{i=0}^{|\mathcal{D}|} \log P(Y=y^{(i)} | x^{(i)}, \theta)$

then the regularized loss will be:

$E(\theta, \mathcal{D}) = NLL(\theta, \mathcal{D}) + \lambda R(\theta)$

or, in our case

$E(\theta, \mathcal{D}) = NLL(\theta, \mathcal{D}) + \lambda||\theta||_p^p$

where

$||\theta||_p = \left(\sum_{j=0}^{|\theta|}{|\theta_j|^p}\right)^{\frac{1}{p}}$ with $p = 1$ or $p = 2$.

More detail on regularization:

In principle, adding a regularization term to the loss will encourage smooth network mappings in a neural network (by penalizing large values of the parameters, which decreases the amount of nonlinearity that the network models). More intuitively, the two terms (NLL and $R(\theta)$) correspond to modelling the data well (NLL) and having “simple” or “smooth” solutions ($R(\theta)$). Thus, minimizing the sum of both will, in theory, correspond to finding the right trade-off between the fit to the training data and the “generality” of the solution that is found. To follow Occam’s razor principle, this minimization should find us the simplest solution (as measured by our simplicity criterion) that fits the training data.

Note that the fact that a solution is “simple” does not mean that it will generalize well. Empirically, it was found that performing such regularization in the context of neural networks helps with generalization, especially on small datasets. The code block below shows how to compute the loss in Python when it contains both an L1 regularization term weighted by $\lambda_1$ and an L2 regularization term weighted by $\lambda_2$.

# symbolic Theano variable that represents the L1 regularization term
L1  = T.sum(abs(param))

# symbolic Theano variable that represents the squared L2 term
L2 = T.sum(param ** 2)

# the loss
loss = NLL + lambda_1 * L1 + lambda_2 * L2
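
A worked numeric instance of the same formula in plain Python may help. The NLL value, the parameter values and both weights below are made up for illustration, and `regularized_loss` is a hypothetical helper:

```python
def regularized_loss(nll, params, lambda_1, lambda_2):
    """NLL plus weighted L1 and squared-L2 penalties on the parameters."""
    l1 = sum(abs(p) for p in params)        # |0.5| + |-1.5| + |2.0| = 4.0
    l2_sqr = sum(p ** 2 for p in params)    # 0.25 + 2.25 + 4.0 = 6.5
    return nll + lambda_1 * l1 + lambda_2 * l2_sqr

toy_loss = regularized_loss(nll=1.0, params=[0.5, -1.5, 2.0],
                            lambda_1=0.1, lambda_2=0.01)
```

Here the penalties add 0.1 * 4.0 + 0.01 * 6.5 = 0.465 to the NLL of 1.0. Note that the L2 term is the squared norm, matching $||\theta||_p^p$ with $p = 2$, so large parameters are penalized far more heavily than small ones.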

Early-Stopping

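Early-stopping combats overfitting by monitoring the model's performance on a validation set. When the performance on the validation set stops improving significantly, or even degrades, the optimization is stopped.

The choice of when to stop is a judgement call and a few heuristics exist, but these tutorials will make use of a strategy based on a geometrically increasing amount of patience (a simulated level of patience decides when to stop).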

import copy
import time
import numpy

# early-stopping parameters
patience = 5000  # look at this many examples regardless
patience_increase = 2     # wait this much longer when a new best is
                          # found
improvement_threshold = 0.995  # a relative improvement of this much is
                               # considered significant
validation_frequency = min(n_train_batches, patience/2)
                              # go through this many
                              # minibatches before checking the network
                              # on the validation set; in this case we
                              # check every epoch: n_train_batches is
                              # smaller than patience/2, so validating
                              # every n_train_batches minibatches means
                              # validating once per epoch

best_params = None
best_validation_loss = numpy.inf
test_score = 0.
start_time = time.clock()

done_looping = False
epoch = 0
while (epoch < n_epochs) and (not done_looping):
    # Report "1" for first epoch, "n_epochs" for last epoch
    epoch = epoch + 1
    for minibatch_index in xrange(n_train_batches):

        d_loss_wrt_params = ... # compute gradient
        params -= learning_rate * d_loss_wrt_params # gradient descent

        # iteration number. We want it to start at 0.
        iter = (epoch - 1) * n_train_batches + minibatch_index
        # note that if we do `iter % validation_frequency` it will be
        # true for iter = 0 which we do not want. We want it true for
        # iter = validation_frequency - 1.
        if (iter + 1) % validation_frequency == 0:

            this_validation_loss = ... # compute zero-one loss on validation set

            if this_validation_loss < best_validation_loss:

                # improve patience if loss improvement is good enough
                if this_validation_loss < best_validation_loss * improvement_threshold:
                    patience = max(patience, iter * patience_increase)

                best_params = copy.deepcopy(params)
                best_validation_loss = this_validation_loss

        if patience <= iter:
            done_looping = True
            break

# POSTCONDITION:
# best_params refers to the best out-of-sample parameters observed during the optimization

If we run out of batches of training data before running out of patience, then we just go back to the beginning of the training set and repeat.

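The code proceeds as follows:

(1) The parameters are updated continually, and iter keeps growing.

(2) Every validation_frequency iterations, the model is evaluated on the validation set.

(3) If the loss on the validation set has dropped significantly and iter * patience_increase > patience, then patience grows: patience = max(patience, iter * patience_increase). Note that patience_increase is 2, so the larger iter is, the more patience grows.

(4) iter and patience both keep growing; training stops as soon as iter >= patience.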
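
The interplay between the iteration counter and patience is easier to follow on a replay. The sketch below applies the same patience rule to a precomputed list of validation losses instead of a real training loop; the loss values are made up, `run_early_stopping` is a hypothetical helper, and the counter is named `iteration` to avoid shadowing Python's built-in `iter`.

```python
def run_early_stopping(validation_losses, patience, patience_increase=2,
                       improvement_threshold=0.995, validation_frequency=1):
    """Replay the patience rule on a list of validation losses.

    validation_losses[k] is the loss observed at the k-th validation.
    Returns (best loss seen, iteration at which the loop stopped).
    """
    best_loss = float('inf')
    checks = iter(validation_losses)
    iteration = 0
    while True:
        if (iteration + 1) % validation_frequency == 0:
            loss = next(checks, None)
            if loss is None:
                break  # ran out of recorded losses
            if loss < best_loss:
                # extend patience only for a significant improvement
                if loss < best_loss * improvement_threshold:
                    patience = max(patience, iteration * patience_increase)
                best_loss = loss
        if patience <= iteration:
            break
        iteration += 1
    return best_loss, iteration

# A late significant improvement (0.5 at iteration 4) doubles the patience to 8.
with_late_gain = run_early_stopping(
    [1.0, 0.8, 0.6, 0.599, 0.5, 0.6, 0.6, 0.6, 0.6, 0.6], patience=4)
# Without it, training stops as soon as the initial patience of 4 is used up.
without_late_gain = run_early_stopping(
    [1.0, 0.8, 0.6, 0.599, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6], patience=4)
```

In both runs the loss 0.599 becomes the new best but is not below 0.995 * 0.6, so it does not extend the patience; only the drop to 0.5 does.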

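Note: validation_frequency = min(n_train_batches, patience/2)

This line guarantees that the model is validated at least twice in every case: even if patience never grows, one validation happens by the time iter reaches patience/2 and another by the time iter reaches patience, so there are at least two.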

Note: This algorithm could possibly be improved by using a test of statistical significance rather than the simple comparison, when deciding whether to increase the patience.

Theano/Python Tips

Loading and Saving Models

After a long stretch of training and testing, the best parameters found need to be saved. MATLAB makes saving very easy; in Python, use cPickle.

Read more about serialization in Theano, or Python’s pickling.

Pickle the numpy ndarrays from your shared variables

If your parameters are in shared variables w, v, u, then your save command should look something like:

import cPickle
save_file = open('path', 'wb')  # this will overwrite current contents
cPickle.dump(w.get_value(borrow=True), save_file, -1)  # the -1 is for HIGHEST_PROTOCOL
cPickle.dump(v.get_value(borrow=True), save_file, -1)  # .. and it triggers much more efficient
cPickle.dump(u.get_value(borrow=True), save_file, -1)  # .. storage than numpy's default
save_file.close()

Then later, you can load your data back like this:

save_file = open('path', 'rb')  # binary mode, to match the 'wb' used when saving
w.set_value(cPickle.load(save_file), borrow=True)
v.set_value(cPickle.load(save_file), borrow=True)
u.set_value(cPickle.load(save_file), borrow=True)
save_file.close()
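
The pattern of several dumps into one file, followed by the same number of loads in the same order, can be tried without Theano. A minimal sketch, assuming Python 3 (where `pickle` replaces `cPickle`) and plain lists in place of the arrays returned by `get_value`; `save_params`, `load_params` and the file name are hypothetical:

```python
import os
import pickle
import tempfile

def save_params(path, params):
    """Dump each parameter value, in order, into a single file."""
    with open(path, 'wb') as save_file:
        for value in params:
            pickle.dump(value, save_file, pickle.HIGHEST_PROTOCOL)

def load_params(path, count):
    """Load `count` objects back in the order they were dumped."""
    with open(path, 'rb') as save_file:
        return [pickle.load(save_file) for _ in range(count)]

params_path = os.path.join(tempfile.mkdtemp(), 'params.pkl')
save_params(params_path, [[0.1, 0.2], [0.3], [0.4, 0.5, 0.6]])
restored = load_params(params_path, 3)
```

The file carries no names, only a sequence of objects, so the loading code has to know how many values were stored and in which order.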

Do not pickle your training or test functions for long-term storage

Theano functions are compatible with Python’s deepcopy and pickle mechanisms, but you should not necessarily pickle a Theano function. If you update your Theano folder and one of the internals changes, then you may not be able to un-pickle your model.

Plotting Intermediate Results

Visualization is done with two libraries, PIL and matplotlib.

0 0