Theano Learning Guide: Restricted Boltzmann Machines (RBM) (Translation)
Source: Internet · Editor: 程序博客网 · Time: 2024/05/29 08:29
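Feel free to fork my GitHub repository: https://github.com/zhaoyu611/DeepLearningTutorialForChinese
I have been learning Git recently, so this is a good chance to put it into practice. After reading the theory behind deep learning I had a general picture, but for the Theano code, typing it out myself leaves a deeper impression. So I rewrote the code from deeplearning.net; the comments are translations of the original text plus my own understanding. Anyone interested is welcome to help finish this work. If you have questions, contact me. Email: zhaoyuafeu@gmail.com QQ: 3062984605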
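Energy-Based Models (EBM)
Energy-based models associate a scalar energy with each configuration of the variables of interest. Learning modifies the energy function so that it has desirable properties; for example, we want plausible configurations to have low energy. An energy-based probabilistic model defines a probability distribution through an energy function as follows:
$$p(x) = \frac{e^{-E(x)}}{Z} \tag{1}$$
The normalizing factor $Z$ is called the partition function:
$$Z = \sum_x e^{-E(x)}$$
An energy-based model can be trained by (stochastic) gradient descent on the negative log-likelihood of the training data. As with logistic regression, we first define the log-likelihood and then take the loss to be the negative log-likelihood:
$$\mathcal{L}(\theta, \mathcal{D}) = \frac{1}{N} \sum_{x^{(i)} \in \mathcal{D}} \log p(x^{(i)}), \qquad \ell(\theta, \mathcal{D}) = -\mathcal{L}(\theta, \mathcal{D})$$
The stochastic gradient is $-\frac{\partial \log p(x^{(i)})}{\partial \theta}$, where $\theta$ denotes the parameters of the model.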
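As a small illustration of Eq. (1), the sketch below enumerates a toy state space, computes the partition function by brute force, and evaluates the mean negative log-likelihood of a few data points. The energies and the data are hypothetical values chosen only for the example, and plain Python is used instead of Theano so that it runs without dependencies.

```python
import math


def boltzmann(energies):
    """Map {configuration: energy} to probabilities p(x) = exp(-E(x)) / Z."""
    z = sum(math.exp(-e) for e in energies.values())  # partition function Z
    return {x: math.exp(-e) / z for x, e in energies.items()}


def mean_nll(data, p):
    """Empirical negative log-likelihood of the data under the distribution p."""
    return -sum(math.log(p[x]) for x in data) / len(data)


# Hypothetical energies for the four configurations of two binary variables.
energies = {(0, 0): 2.0, (0, 1): 0.5, (1, 0): 0.5, (1, 1): 3.0}
p = boltzmann(energies)

# Hypothetical training set; low-energy configurations get high probability.
data = [(0, 1), (1, 0), (0, 1)]
nll = mean_nll(data, p)
print(p, nll)
```

Brute-force enumeration of $Z$ is only feasible for tiny state spaces like this one, which is why sampling is needed for real models.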
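EBMs with Hidden Units
In many cases of interest we do not observe the example $x$ fully, or we want to introduce non-observed variables to increase the expressive power of the model. We therefore consider an observed part (still denoted $x$) and a hidden part $h$, and write:
$$p(x) = \sum_h p(x, h) = \sum_h \frac{e^{-E(x, h)}}{Z} \tag{2}$$
To bring this into a form similar to Eq. (1), we introduce the notion of free energy (inspired by physics), defined as follows:
$$\mathcal{F}(x) = -\log \sum_h e^{-E(x, h)} \tag{3}$$
This allows us to write:
$$p(x) = \frac{e^{-\mathcal{F}(x)}}{Z}, \qquad Z = \sum_x e^{-\mathcal{F}(x)}$$
The gradient of the negative log-likelihood of the data then takes a particularly interesting form:
$$-\frac{\partial \log p(x)}{\partial \theta} = \frac{\partial \mathcal{F}(x)}{\partial \theta} - \sum_{\tilde{x}} p(\tilde{x}) \frac{\partial \mathcal{F}(\tilde{x})}{\partial \theta} \tag{4}$$
Notice that this gradient contains two terms, referred to as the positive phase and the negative phase. "Positive" and "negative" do not refer to the sign of each term in the equation but to their effect on the probability density defined by the model. The first term increases the probability of the training data (by reducing the corresponding free energy), while the second term decreases the probability of samples generated by the model.
It is usually difficult to determine this gradient analytically, because it involves computing $E_p\left[\frac{\partial \mathcal{F}(x)}{\partial \theta}\right]$, an expectation over all possible configurations of the input $x$ under the distribution $p$ defined by the model.
The first step toward making this computation tractable is to estimate the expectation using a fixed number of model samples. The samples used to estimate the negative-phase gradient are called negative particles and are denoted $\mathcal{N}$. The gradient can then be written as:
$$-\frac{\partial \log p(x)}{\partial \theta} \approx \frac{\partial \mathcal{F}(x)}{\partial \theta} - \frac{1}{|\mathcal{N}|} \sum_{\tilde{x} \in \mathcal{N}} \frac{\partial \mathcal{F}(\tilde{x})}{\partial \theta} \tag{5}$$
Ideally the elements $\tilde{x}$ of $\mathcal{N}$ are sampled according to $p$ (that is, we are doing Monte Carlo). With this formula we almost have a practical stochastic algorithm for learning an EBM; the only missing ingredient is how to obtain the negative particles $\mathcal{N}$.
While the statistical literature abounds with sampling methods, Markov chain Monte Carlo methods are especially well suited to models such as the restricted Boltzmann machine (RBM), a specific type of EBM.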
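The identity in Eq. (4) can be checked numerically on a model small enough to enumerate. The sketch below is an illustration only: the toy energy function, the parameter value, and the helper names are all hypothetical, and derivatives are taken by central finite differences rather than symbolically.

```python
import itertools
import math

XS = list(itertools.product([0, 1], repeat=2))  # all observed configurations
HS = [0, 1]                                      # a single binary hidden unit


def energy(x, h, theta):
    # Hypothetical toy energy; theta couples x[0] with the hidden unit.
    return -(theta * x[0] * h + 0.5 * x[1] * h + 0.3 * x[0])


def free_energy(x, theta):
    # Eq. (3): F(x) = -log sum_h exp(-E(x, h))
    return -math.log(sum(math.exp(-energy(x, h, theta)) for h in HS))


def neg_log_p(x, theta):
    # -log p(x) = F(x) + log Z, with Z = sum_x exp(-F(x))
    log_z = math.log(sum(math.exp(-free_energy(xt, theta)) for xt in XS))
    return free_energy(x, theta) + log_z


def ddtheta(f, theta, eps=1e-6):
    # Central finite-difference derivative with respect to theta.
    return (f(theta + eps) - f(theta - eps)) / (2 * eps)


theta, x = 0.7, (1, 0)
lhs = ddtheta(lambda t: neg_log_p(x, t), theta)           # left side of Eq. (4)
positive = ddtheta(lambda t: free_energy(x, t), theta)     # positive phase
p = {xt: math.exp(-neg_log_p(xt, theta)) for xt in XS}     # exact model distribution
negative = sum(p[xt] * ddtheta(lambda t, xt=xt: free_energy(xt, t), theta)
               for xt in XS)                               # negative phase
rhs = positive - negative
print(lhs, rhs)
```

The negative phase here is an exact sum over all four configurations; Eq. (5) replaces that sum with an average over sampled negative particles.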
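Restricted Boltzmann Machines (RBM)
Boltzmann machines (BMs) are a particular form of log-linear Markov random field (MRF), that is, one whose energy function is linear in its free parameters. To make them powerful enough to represent complicated distributions (going from the limited parametric setting to a non-parametric one), we consider that some of the variables are never observed (they are called hidden). By adding more hidden variables (also called hidden units), we can increase the modeling capacity of the Boltzmann machine. Restricted Boltzmann machines further restrict BMs to those without visible-visible and hidden-hidden connections.
[Figure: graphical depiction of an RBM]
The energy function $E(v, h)$ of an RBM is defined as:
$$E(v, h) = -b'v - c'h - h'Wv \tag{6}$$
where $W$ represents the weights connecting hidden and visible units, and $b$, $c$ are the offsets (biases) of the visible and hidden layers respectively.
This translates directly to the following free energy formula:
$$\mathcal{F}(v) = -b'v - \sum_i \log \sum_{h_i} e^{h_i (c_i + W_i v)}$$
Because of the specific structure of RBMs, the visible and hidden units are conditionally independent given one another. Using this property, we can write:
$$p(h \mid v) = \prod_i p(h_i \mid v), \qquad p(v \mid h) = \prod_j p(v_j \mid h)$$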
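RBMs with Binary Units
In the commonly studied case of binary units (where $v_j, h_i \in \{0, 1\}$), we obtain from Eqs. (6) and (2) a probabilistic version of the usual neuron activation function:
$$p(h_i = 1 \mid v) = \mathrm{sigm}(c_i + W_i v) \tag{7}$$
$$p(v_j = 1 \mid h) = \mathrm{sigm}(b_j + W'_j h) \tag{8}$$
The free energy of an RBM with binary units simplifies further to:
$$\mathcal{F}(v) = -b'v - \sum_i \log\left(1 + e^{c_i + W_i v}\right) \tag{9}$$
Update Equations with Binary Units
Combining Eqs. (5) and (9), we obtain the following log-likelihood gradients for an RBM with binary units, where $v$ is a training example and the expectation is taken over $\tilde{v}$ drawn from the model:
$$-\frac{\partial \log p(v)}{\partial W_{ij}} = E_{\tilde{v}}\left[p(h_i = 1 \mid \tilde{v}) \cdot \tilde{v}_j\right] - v_j \cdot \mathrm{sigm}(W_i v + c_i)$$
$$-\frac{\partial \log p(v)}{\partial c_i} = E_{\tilde{v}}\left[p(h_i = 1 \mid \tilde{v})\right] - \mathrm{sigm}(W_i v + c_i)$$
$$-\frac{\partial \log p(v)}{\partial b_j} = E_{\tilde{v}}\left[\tilde{v}_j\right] - v_j \tag{10}$$
For a more detailed derivation of these equations, see the web page referenced in the original tutorial, or Section 5 of Learning Deep Architectures for AI. We will not use these formulas directly; instead we obtain the gradient from Eq. (4) using Theano's T.grad.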
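To make the simplification in Eq. (9) concrete, the following sketch compares the closed-form free energy of a tiny binary RBM against a brute-force sum over all hidden configurations, and likewise checks the conditional of Eq. (7). The weights and biases are hypothetical, and the weight matrix is stored as `W[j][i]` (visible index first), which is an assumption made to match the layout used by the tutorial's code.

```python
import itertools
import math

N_VIS, N_HID = 3, 2
# Hypothetical parameters of a tiny RBM; W[j][i] links visible j and hidden i.
W = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]]
VBIAS = [0.1, -0.1, 0.2]   # b
HBIAS = [0.0, 0.3]         # c


def pre_sigmoid(v):
    """c_i + sum_j v_j * W[j][i] for every hidden unit i."""
    return [HBIAS[i] + sum(v[j] * W[j][i] for j in range(N_VIS))
            for i in range(N_HID)]


def energy(v, h):
    """Eq. (6): E(v, h) = -b'v - c'h - h'Wv."""
    vis = sum(b * vj for b, vj in zip(VBIAS, v))
    hid = sum(a * hi for a, hi in zip(pre_sigmoid(v), h))  # c'h + h'Wv
    return -vis - hid


def free_energy_brute(v):
    """Eq. (3): explicit sum over all 2**N_HID hidden configurations."""
    return -math.log(sum(math.exp(-energy(v, h))
                         for h in itertools.product([0, 1], repeat=N_HID)))


def free_energy(v):
    """Eq. (9): closed form for binary hidden units."""
    vis = sum(b * vj for b, vj in zip(VBIAS, v))
    return -vis - sum(math.log1p(math.exp(a)) for a in pre_sigmoid(v))


def p_h_given_v(v):
    """Eq. (7): p(h_i = 1 | v) for every hidden unit."""
    return [1.0 / (1.0 + math.exp(-a)) for a in pre_sigmoid(v)]


for v in itertools.product([0, 1], repeat=N_VIS):
    print(v, free_energy(v), free_energy_brute(v))
```

The closed form costs one pass over the hidden units, while the brute-force sum grows exponentially with their number.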
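Sampling in an RBM
For RBMs, the variables to sample are the visible and hidden units. Because these are conditionally independent given one another, we can perform block Gibbs sampling: all visible units are sampled simultaneously given fixed values of the hidden units, and all hidden units are sampled simultaneously given the visible units. One step of the Markov chain is therefore:
$$h^{(n+1)} \sim \mathrm{sigm}(W v^{(n)} + c), \qquad v^{(n+1)} \sim \mathrm{sigm}(W' h^{(n+1)} + b)$$
where $h^{(n)}$ denotes the set of all hidden units at the $n$-th step of the Markov chain. This means, for example, that $h_i^{(n+1)}$ is set to 1 (rather than 0) with probability $\mathrm{sigm}(W_i v^{(n)} + c_i)$, and similarly $v_j^{(n+1)}$ is set to 1 with probability $\mathrm{sigm}(W'_j h^{(n+1)} + b_j)$.
[Figure: illustration of the block Gibbs sampling chain]
As $t \to \infty$, the samples $(v^{(t)}, h^{(t)})$ are guaranteed to be accurate samples of $p(v, h)$.
In theory, each parameter update during learning would require running such a chain to convergence, which would be prohibitively expensive. Several algorithms have therefore been devised for RBMs to sample efficiently from $p(v, h)$ during learning.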
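The block Gibbs step described above can be sketched in plain Python as follows. This is an illustrative stand-in for the Theano methods shown later in the tutorial, not the tutorial's implementation; the parameters are hypothetical and `W[j][i]` again stores the weight between visible unit `j` and hidden unit `i`.

```python
import math
import random


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def sample_h_given_v(v, W, hbias, rng):
    """Sample all hidden units at once given the visible units."""
    probs = [sigmoid(hbias[i] + sum(v[j] * W[j][i] for j in range(len(v))))
             for i in range(len(hbias))]
    return [1 if rng.random() < p else 0 for p in probs]


def sample_v_given_h(h, W, vbias, rng):
    """Sample all visible units at once given the hidden units."""
    probs = [sigmoid(vbias[j] + sum(W[j][i] * h[i] for i in range(len(h))))
             for j in range(len(vbias))]
    return [1 if rng.random() < p else 0 for p in probs]


def gibbs_vhv(v, W, vbias, hbias, rng):
    """One block Gibbs step: v -> h -> v."""
    h = sample_h_given_v(v, W, hbias, rng)
    return h, sample_v_given_h(h, W, vbias, rng)


# Hypothetical tiny RBM with 3 visible and 2 hidden units.
W = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]]
vbias = [0.1, -0.1, 0.2]
hbias = [0.0, 0.3]

rng = random.Random(0)
v = [1, 0, 1]
for _ in range(100):           # run the chain for 100 steps
    h, v = gibbs_vhv(v, W, vbias, hbias, rng)
print(h, v)
```

Each step touches every unit once, so the cost per step is linear in the number of weights; the expensive part is the number of steps needed for the chain to converge.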
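Contrastive Divergence (CD-k)
Contrastive divergence uses two tricks to speed up the sampling process:
- Since we eventually want $p(v) \approx p_{train}(v)$ (the true underlying distribution of the data), we initialize the Markov chain with a training example, that is, from a distribution expected to be close to $p$, so that the chain is already close to having converged to its final distribution $p$.
- CD does not wait for the chain to converge. Samples are obtained after only $k$ steps of Gibbs sampling. In practice, $k = 1$ has been shown to work surprisingly well.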
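The two tricks can be put together in a short sketch of one CD-k update for a single training vector. This is a plain-Python illustration under stated assumptions: the function name `cd_k_step`, the in-place update, and the toy data are hypothetical, and the update uses the gradient of $\mathcal{F}(v_0) - \mathcal{F}(v_k)$ for binary units with the chain end treated as a constant.

```python
import math
import random


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def hidden_probs(v, W, hbias):
    return [sigmoid(hbias[i] + sum(v[j] * W[j][i] for j in range(len(v))))
            for i in range(len(hbias))]


def visible_probs(h, W, vbias):
    return [sigmoid(vbias[j] + sum(W[j][i] * h[i] for i in range(len(h))))
            for j in range(len(vbias))]


def bernoulli(probs, rng):
    return [1 if rng.random() < p else 0 for p in probs]


def cd_k_step(v0, W, vbias, hbias, rng, k=1, lr=0.1):
    """One CD-k update for training vector v0; parameters are changed in place."""
    vk = list(v0)                      # trick 1: start the chain at the data
    for _ in range(k):                 # trick 2: run only k Gibbs steps
        h = bernoulli(hidden_probs(vk, W, hbias), rng)
        vk = bernoulli(visible_probs(h, W, vbias), rng)
    ph0 = hidden_probs(v0, W, hbias)   # positive-phase statistics
    phk = hidden_probs(vk, W, hbias)   # negative-phase statistics
    for j in range(len(v0)):
        for i in range(len(hbias)):
            W[j][i] += lr * (v0[j] * ph0[i] - vk[j] * phk[i])
        vbias[j] += lr * (v0[j] - vk[j])
    for i in range(len(hbias)):
        hbias[i] += lr * (ph0[i] - phk[i])
    return vk


# Hypothetical usage: 3 visible units, 2 hidden units, one repeated pattern.
W = [[0.0, 0.0] for _ in range(3)]
vbias, hbias = [0.0] * 3, [0.0] * 2
rng = random.Random(0)
for _ in range(50):
    vk = cd_k_step([1, 1, 0], W, vbias, hbias, rng, k=1, lr=0.1)
print(vbias, hbias)
```

When the chain end equals the training vector, the positive and negative statistics cancel and the update is zero, which is the fixed point CD is driving toward.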
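Persistent CD
Persistent CD [Tieleman08] uses another approximation for sampling from $p(v, h)$. It relies on a single Markov chain with a persistent state (that is, the chain is not restarted for each observed example). For each parameter update, new samples are obtained by simply running the chain for $k$ steps, and the state of the chain is then preserved for subsequent updates.
The intuition is that if parameter updates are small enough compared to the mixing rate of the chain, the Markov chain should be able to "catch up" with changes in the model.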
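Implementation
We construct an RBM class. The parameters of the network can either be initialized by the constructor or passed in as arguments. This option is useful when an RBM is used as a building block of a deep network, in which case the weight matrix and the hidden-layer bias are shared with the corresponding sigmoid layer of an MLP network.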
import numpy
import theano
import theano.tensor as T
from theano.tensor.shared_randomstreams import RandomStreams


class RBM(object):
    """Restricted Boltzmann Machine (RBM)  """
    def __init__(
        self,
        input=None,
        n_visible=784,
        n_hidden=500,
        W=None,
        hbias=None,
        vbias=None,
        numpy_rng=None,
        theano_rng=None
    ):
        """
        RBM constructor. Defines the parameters of the model along with
        basic operations for inferring hidden from visible (and vice-versa),
        as well as for performing CD updates.

        :param input: None for standalone RBMs or symbolic variable if RBM is
        part of a larger graph.

        :param n_visible: number of visible units

        :param n_hidden: number of hidden units

        :param W: None for standalone RBMs or symbolic variable pointing to a
        shared weight matrix in case RBM is part of a DBN network; in a DBN,
        the weights are shared between RBMs and layers of a MLP

        :param hbias: None for standalone RBMs or symbolic variable pointing
        to a shared hidden units bias vector in case RBM is part of a
        different network

        :param vbias: None for standalone RBMs or a symbolic variable
        pointing to a shared visible units bias
        """

        self.n_visible = n_visible
        self.n_hidden = n_hidden

        if numpy_rng is None:
            # create a number generator
            numpy_rng = numpy.random.RandomState(1234)

        if theano_rng is None:
            theano_rng = RandomStreams(numpy_rng.randint(2 ** 30))

        if W is None:
            # W is initialized with `initial_W` which is uniformly
            # sampled from -4*sqrt(6./(n_visible+n_hidden)) and
            # 4*sqrt(6./(n_hidden+n_visible)); the output of uniform is
            # converted using asarray to dtype theano.config.floatX so
            # that the code is runnable on GPU
            initial_W = numpy.asarray(
                numpy_rng.uniform(
                    low=-4 * numpy.sqrt(6. / (n_hidden + n_visible)),
                    high=4 * numpy.sqrt(6. / (n_hidden + n_visible)),
                    size=(n_visible, n_hidden)
                ),
                dtype=theano.config.floatX
            )
            # theano shared variables for weights and biases
            W = theano.shared(value=initial_W, name='W', borrow=True)

        if hbias is None:
            # create shared variable for hidden units bias
            hbias = theano.shared(
                value=numpy.zeros(
                    n_hidden,
                    dtype=theano.config.floatX
                ),
                name='hbias',
                borrow=True
            )

        if vbias is None:
            # create shared variable for visible units bias
            vbias = theano.shared(
                value=numpy.zeros(
                    n_visible,
                    dtype=theano.config.floatX
                ),
                name='vbias',
                borrow=True
            )

        # initialize input layer for standalone RBM or layer0 of DBN
        self.input = input
        if not input:
            self.input = T.matrix('input')

        self.W = W
        self.hbias = hbias
        self.vbias = vbias
        self.theano_rng = theano_rng
        # **** WARNING: It is not a good idea to put things in this list
        # other than shared variables created in this function.
        self.params = [self.W, self.hbias, self.vbias]
The next step is to define the functions that build the symbolic graph for Eqs. (7)-(8). The code is as follows:
    def propup(self, vis):
        '''This function propagates the visible units activation upwards to
        the hidden units

        Note that we return also the pre-sigmoid activation of the
        layer. As it will turn out later, due to how Theano deals with
        optimizations, this symbolic variable will be needed to write
        down a more stable computational graph (see details in the
        reconstruction cost function)

        '''
        pre_sigmoid_activation = T.dot(vis, self.W) + self.hbias
        return [pre_sigmoid_activation, T.nnet.sigmoid(pre_sigmoid_activation)]

    def sample_h_given_v(self, v0_sample):
        ''' This function infers state of hidden units given visible units '''
        # compute the activation of the hidden units given a sample of
        # the visibles
        pre_sigmoid_h1, h1_mean = self.propup(v0_sample)
        # get a sample of the hiddens given their activation
        # Note that theano_rng.binomial returns a symbolic sample of dtype
        # int64 by default. If we want to keep our computations in floatX
        # for the GPU we need to specify to return the dtype floatX
        h1_sample = self.theano_rng.binomial(size=h1_mean.shape,
                                             n=1, p=h1_mean,
                                             dtype=theano.config.floatX)
        return [pre_sigmoid_h1, h1_mean, h1_sample]

    def propdown(self, hid):
        '''This function propagates the hidden units activation downwards to
        the visible units

        Note that we return also the pre_sigmoid_activation of the
        layer. As it will turn out later, due to how Theano deals with
        optimizations, this symbolic variable will be needed to write
        down a more stable computational graph (see details in the
        reconstruction cost function)

        '''
        pre_sigmoid_activation = T.dot(hid, self.W.T) + self.vbias
        return [pre_sigmoid_activation, T.nnet.sigmoid(pre_sigmoid_activation)]

    def sample_v_given_h(self, h0_sample):
        ''' This function infers state of visible units given hidden units '''
        # compute the activation of the visible given the hidden sample
        pre_sigmoid_v1, v1_mean = self.propdown(h0_sample)
        # get a sample of the visible given their activation
        # Note that theano_rng.binomial returns a symbolic sample of dtype
        # int64 by default. If we want to keep our computations in floatX
        # for the GPU we need to specify to return the dtype floatX
        v1_sample = self.theano_rng.binomial(size=v1_mean.shape,
                                             n=1, p=v1_mean,
                                             dtype=theano.config.floatX)
        return [pre_sigmoid_v1, v1_mean, v1_sample]
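With these functions we can describe one step of Gibbs sampling. We define two functions:
- gibbs_vhv performs one step of Gibbs sampling starting from the visible units. It is useful for sampling from the RBM.
- gibbs_hvh performs one step of Gibbs sampling starting from the hidden units. It is useful for performing CD and PCD updates.
The code is as follows: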
    def gibbs_hvh(self, h0_sample):
        ''' This function implements one step of Gibbs sampling,
            starting from the hidden state'''
        pre_sigmoid_v1, v1_mean, v1_sample = self.sample_v_given_h(h0_sample)
        pre_sigmoid_h1, h1_mean, h1_sample = self.sample_h_given_v(v1_sample)
        return [pre_sigmoid_v1, v1_mean, v1_sample,
                pre_sigmoid_h1, h1_mean, h1_sample]

    def gibbs_vhv(self, v0_sample):
        ''' This function implements one step of Gibbs sampling,
            starting from the visible state'''
        pre_sigmoid_h1, h1_mean, h1_sample = self.sample_h_given_v(v0_sample)
        pre_sigmoid_v1, v1_mean, v1_sample = self.sample_v_given_h(h1_sample)
        return [pre_sigmoid_h1, h1_mean, h1_sample,
                pre_sigmoid_v1, v1_mean, v1_sample]
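Note that these functions also return the pre-sigmoid activation. To see why, one needs to know a little about how Theano works. Whenever a Theano function is compiled, the computational graph passed as input is optimized for speed and stability, which is done by replacing parts of the graph with equivalent subgraphs. One such optimization rewrites terms of the form log(sigmoid(x)) in terms of softplus. We need this optimization for the cross-entropy, because the sigmoid of numbers larger than about 30 rounds to 1 and the sigmoid of numbers smaller than about -30 rounds to 0, which forces Theano to compute log(0) and yields a cost of -inf or NaN. When the value is expressed through softplus, this undesirable behaviour does not occur. The optimization usually works fine, but here we have a special case: the sigmoid is applied inside the scan op, while the log is applied outside. Theano therefore only sees log(scan(...)) instead of log(sigmoid(...)) and does not apply the optimization. We cannot replace the sigmoid inside scan with something else either, because the rewrite is only needed on the last step. The easiest and most efficient solution is to also return the pre-sigmoid activation from scan and to apply both the log and the sigmoid outside scan, where Theano can catch and optimize the expression.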
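The numerical issue behind this choice can be reproduced outside Theano. The sketch below is plain Python and only illustrates the arithmetic, not Theano's graph optimization itself: it contrasts a naive log(sigmoid(x)) with the softplus-based form, using the identity $\log \mathrm{sigm}(x) = -\mathrm{softplus}(-x)$. The function names are hypothetical.

```python
import math


def log_sigmoid_naive(x):
    # Computes the sigmoid first, then its log; breaks for very negative x.
    return math.log(1.0 / (1.0 + math.exp(-x)))


def log_sigmoid_stable(x):
    # log(sigmoid(x)) = -softplus(-x) = min(x, 0) - log(1 + exp(-|x|))
    return min(x, 0.0) - math.log1p(math.exp(-abs(x)))


for x in (-800.0, -40.0, 0.0, 3.0, 40.0):
    try:
        naive = log_sigmoid_naive(x)
    except OverflowError:
        naive = None  # math.exp overflows before the log is even taken
    print(x, naive, log_sigmoid_stable(x))
```

The stable form never exponentiates a positive number, so it cannot overflow, and for large positive x it keeps the small negative value that the naive form rounds to exactly 0.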
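The RBM class also has a function that computes the free energy of the model, which is needed to compute the gradient of the parameters (see Eq. (4)).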
    def free_energy(self, v_sample):
        ''' Function to compute the free energy '''
        wx_b = T.dot(v_sample, self.W) + self.hbias
        vbias_term = T.dot(v_sample, self.vbias)
        hidden_term = T.sum(T.log(1 + T.exp(wx_b)), axis=1)
        return -hidden_term - vbias_term
We then add a get_cost_updates method, whose purpose is to generate the symbolic gradients for CD-k and PCD-k updates.
    def get_cost_updates(self, lr=0.1, persistent=None, k=1):
        """This functions implements one step of CD-k or PCD-k

        :param lr: learning rate used to train the RBM

        :param persistent: None for CD. For PCD, shared variable
            containing old state of Gibbs chain. This must be a shared
            variable of size (batch size, number of hidden units).

        :param k: number of Gibbs steps to do in CD-k/PCD-k

        Returns a proxy for the cost and the updates dictionary. The
        dictionary contains the update rules for weights and biases but
        also an update of the shared variable used to store the persistent
        chain, if one is used.

        """

        # compute positive phase
        pre_sigmoid_ph, ph_mean, ph_sample = self.sample_h_given_v(self.input)

        # decide how to initialize persistent chain:
        # for CD, we use the newly generate hidden sample
        # for PCD, we initialize from the old state of the chain
        if persistent is None:
            chain_start = ph_sample
        else:
            chain_start = persistent
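Note that get_cost_updates takes an argument called persistent. This lets us use the same code for both CD and PCD. To use PCD, persistent should be a shared variable containing the state of the Gibbs chain from the previous iteration.
If persistent is None, the Gibbs chain is initialized with the hidden sample generated during the positive phase, which implements CD. Once the starting point of the chain is established, we can compute the sample at the end of the Gibbs chain, which is the sample needed for the gradient (see Eq. (4)). To do so we use the scan op provided by Theano; readers are encouraged to consult the Theano documentation on scan (http://deeplearning.net/software/theano/library/scan.html).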
        # perform actual negative phase
        # in order to implement CD-k/PCD-k we need to scan over the
        # function that implements one gibbs step k times.
        # Read Theano tutorial on scan for more information :
        # http://deeplearning.net/software/theano/library/scan.html
        # the scan will return the entire Gibbs chain
        (
            [
                pre_sigmoid_nvs,
                nv_means,
                nv_samples,
                pre_sigmoid_nhs,
                nh_means,
                nh_samples
            ],
            updates
        ) = theano.scan(
            self.gibbs_hvh,
            # the None are place holders, saying that
            # chain_start is the initial state corresponding to the
            # 6th output
            outputs_info=[None, None, None, None, None, chain_start],
            n_steps=k,
            name="gibbs_hvh"
        )
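For readers unfamiliar with scan, the recurrence expressed by `outputs_info=[None, None, None, None, None, chain_start]` can be mimicked by an ordinary loop. The sketch below is a plain-Python analogy under stated assumptions, not Theano's implementation: `scan_last_recurrent` and `toy_step` are hypothetical names, and the loop ignores the random-state updates that the real scan also returns.

```python
def scan_last_recurrent(step, init, n_steps):
    """Call `step` n_steps times, feeding its last output back as the input.

    Returns one list per output of `step`, each holding that output's value
    at every step (mirroring how scan returns the whole chain).
    Assumes n_steps >= 1.
    """
    outputs = None
    state = init
    for _ in range(n_steps):
        results = step(state)
        if outputs is None:
            outputs = [[] for _ in results]
        for seq, value in zip(outputs, results):
            seq.append(value)
        state = results[-1]   # only the 6th output is recurrent
    return outputs


def toy_step(h):
    # Hypothetical stand-in for gibbs_hvh: six outputs, the last is the new state.
    return [h - 0.5, h + 0.5, h * 10, h, h + 0.25, h + 1]


outs = scan_last_recurrent(toy_step, 0, 3)
chain_end = outs[2][-1]       # analogous to nv_samples[-1]
print(outs[5], chain_end)
```

Indexing an output with `[-1]` selects its value at the final step, which is how the tutorial code picks out the end of the Gibbs chain.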
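Once the chain has been generated, we take the sample at the end of the chain to compute the free energy of the negative phase. Note that chain_end is a symbolic Theano variable expressed in terms of the model parameters. If we applied T.grad naively, it would try to differentiate through the Gibbs chain to obtain the gradient. This is not what we want (it would distort the gradient), so we tell T.grad that chain_end is a constant through its consider_constant argument.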
        # determine gradients on RBM parameters
        # note that we only need the sample at the end of the chain
        chain_end = nv_samples[-1]

        cost = T.mean(self.free_energy(self.input)) - T.mean(
            self.free_energy(chain_end))
        # We must not compute the gradient through the gibbs sampling
        gparams = T.grad(cost, self.params, consider_constant=[chain_end])
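Finally, we add the parameter updates to the updates dictionary returned by scan (which already contains the update rules for the random states of theano_rng). For PCD, the dictionary must also update the shared variable that holds the state of the Gibbs chain.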
        # constructs the update dictionary
        for gparam, param in zip(gparams, self.params):
            # make sure that the learning rate is of the right dtype
            updates[param] = param - gparam * T.cast(
                lr,
                dtype=theano.config.floatX
            )
        if persistent:
            # Note that this works only if persistent is a shared variable
            updates[persistent] = nh_samples[-1]
            # pseudo-likelihood is a better proxy for PCD
            monitoring_cost = self.get_pseudo_likelihood_cost(updates)
        else:
            # reconstruction cross-entropy is a better proxy for CD
            monitoring_cost = self.get_reconstruction_cost(updates,
                                                           pre_sigmoid_nvs[-1])

        return monitoring_cost, updates
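Tracking Progress
RBMs are particularly tricky to train. Because of the partition function $Z$ in Eq. (1), we cannot estimate the log-likelihood $\log p(x)$ during training, so we have no direct metric for choosing the best hyperparameters. Several options are available.
Inspection of Negative Samples
Negative samples obtained during training can be visualized. As training progresses, the model defined by the RBM moves closer to the true underlying distribution $p_{train}(x)$, so negative samples should come to look like samples from the training set. Obviously bad hyperparameters can be discarded this way.
Visual Inspection of Filters
The filters learned by the model can be visualized by plotting the weights of each unit as a grayscale image (after reshaping to a square matrix). Filters should pick out strong features in the data. It is not clear in advance what these features should look like for an arbitrary dataset, but training on MNIST usually yields filters that act as stroke detectors, while training on natural images leads to Gabor-like filters when combined with a sparsity criterion.
Proxies to Likelihood
Other, more tractable functions can be used as a proxy for the likelihood. When training an RBM with PCD, one can use the pseudo-likelihood. Pseudo-likelihood (PL) is much cheaper to compute, because it assumes that all bits are independent. Therefore:
$$PL(x) = \prod_i p(x_i \mid x_{-i}), \qquad \log PL(x) = \sum_i \log p(x_i \mid x_{-i})$$
Here $x_{-i}$ denotes all bits of $x$ except bit $i$. The log-PL is thus the sum of the log-probabilities of each bit $x_i$ conditioned on the state of all other bits. For MNIST this means summing over 784 input dimensions, which remains rather expensive, so we use the following stochastic approximation:
$$g = N \cdot \log p(x_i \mid x_{-i}), \quad i \sim U(0, N), \qquad E[g] = \log PL(x)$$
where the expectation is over the uniform random choice of index $i$ and $N$ is the number of visible units. For binary units, let $\tilde{x}_i$ denote $x$ with bit $i$ flipped (1 to 0, 0 to 1). The log-PL of an RBM with binary units is then:
$$\log PL(x) \approx N \cdot \log \frac{e^{-\mathcal{F}(x)}}{e^{-\mathcal{F}(x)} + e^{-\mathcal{F}(\tilde{x}_i)}} = N \cdot \log\left[\mathrm{sigm}\left(\mathcal{F}(\tilde{x}_i) - \mathcal{F}(x)\right)\right]$$
The get_cost_updates function of the RBM class returns this cost together with the RBM updates. Note that the updates dictionary is modified to increment the index of bit $i$, so that $i$ cycles through all values $\{0, 1, \ldots, N-1\}$ from one update to the next.
For CD training, the cross-entropy between the input and its reconstruction (the same cost as used for the denoising autoencoder) is a more reliable proxy than the pseudo-log-likelihood. The code that computes the pseudo-likelihood is: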
    def get_pseudo_likelihood_cost(self, updates):
        """Stochastic approximation to the pseudo-likelihood"""

        # index of bit i in expression p(x_i | x_{\i})
        bit_i_idx = theano.shared(value=0, name='bit_i_idx')

        # binarize the input image by rounding to nearest integer
        xi = T.round(self.input)

        # calculate free energy for the given bit configuration
        fe_xi = self.free_energy(xi)

        # flip bit x_i of matrix xi and preserve all other bits x_{\i}
        # Equivalent to xi[:,bit_i_idx] = 1-xi[:, bit_i_idx], but assigns
        # the result to xi_flip, instead of working in place on xi.
        xi_flip = T.set_subtensor(xi[:, bit_i_idx], 1 - xi[:, bit_i_idx])

        # calculate free energy with bit flipped
        fe_xi_flip = self.free_energy(xi_flip)

        # equivalent to e^(-FE(x_i)) / (e^(-FE(x_i)) + e^(-FE(x_{\i})))
        cost = T.mean(self.n_visible * T.log(T.nnet.sigmoid(fe_xi_flip -
                                                            fe_xi)))

        # increment bit_i_idx % number as part of updates
        updates[bit_i_idx] = (bit_i_idx + 1) % self.n_visible

        return cost
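The claim that the single-bit estimate averages to the log-pseudo-likelihood can be verified on a tiny RBM where everything is enumerable. The sketch below is plain Python with hypothetical parameters and helper names; it compares the average of the per-bit estimate $N \log \mathrm{sigm}(\mathcal{F}(\tilde{x}_i) - \mathcal{F}(x))$ over all bits with a brute-force log-PL computed from the exact marginal $p(v)$.

```python
import itertools
import math

# Hypothetical tiny binary RBM: 3 visible units, 2 hidden units.
W = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]]   # W[j][i]: visible j, hidden i
VBIAS = [0.1, -0.1, 0.2]
HBIAS = [0.0, 0.3]


def free_energy(v):
    """Closed-form free energy of a binary RBM."""
    vis = sum(b * vj for b, vj in zip(VBIAS, v))
    pre = [HBIAS[i] + sum(v[j] * W[j][i] for j in range(len(v)))
           for i in range(len(HBIAS))]
    return -vis - sum(math.log1p(math.exp(a)) for a in pre)


def flip(v, i):
    """Return v with bit i flipped and all other bits unchanged."""
    return tuple(1 - b if k == i else b for k, b in enumerate(v))


def log_sigmoid(a):
    return min(a, 0.0) - math.log1p(math.exp(-abs(a)))


def g(v, i):
    """Estimate of log-PL that uses bit i only."""
    return len(v) * log_sigmoid(free_energy(flip(v, i)) - free_energy(v))


def log_pl_exact(v):
    """Brute-force log-PL from the exact marginal p(v)."""
    states = list(itertools.product([0, 1], repeat=len(v)))
    z = sum(math.exp(-free_energy(u)) for u in states)

    def p(u):
        return math.exp(-free_energy(u)) / z

    return sum(math.log(p(v) / (p(v) + p(flip(v, i)))) for i in range(len(v)))


v = (1, 0, 1)
estimate = sum(g(v, i) for i in range(len(v))) / len(v)   # E[g] over uniform i
print(estimate, log_pl_exact(v))
```

Cycling the bit index deterministically, as the tutorial code does, visits every bit equally often over N updates, so its long-run average matches this uniform expectation.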
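Main Loop
We now have all the elements needed to train the network.
Before training, readers should become familiar with the function tile_raster_images (see Plotting Samples and Filters). Since RBMs are generative models, we can sample from them and plot the generated samples. We can also plot the weights of the RBM to get a deeper understanding of how it works. Bear in mind that these plots do not tell the whole story, because the biases are ignored and the weights are plotted up to a multiplicative constant (they are rescaled to values between 0 and 1).
With these utility functions we can start training the RBM, saving a plot of the filters after each training epoch. We train the RBM with PCD, since it has been shown to lead to a better generative model ([Tieleman08]).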
    # it is ok for a theano function to have no output
    # the purpose of train_rbm is solely to update the RBM parameters
    train_rbm = theano.function(
        [index],
        cost,
        updates=updates,
        givens={
            x: train_set_x[index * batch_size: (index + 1) * batch_size]
        },
        name='train_rbm'
    )

    plotting_time = 0.
    start_time = timeit.default_timer()

    # go through training epochs
    for epoch in range(training_epochs):

        # go through the training set
        mean_cost = []
        for batch_index in range(n_train_batches):
            mean_cost += [train_rbm(batch_index)]

        print('Training epoch %d, cost is ' % epoch, numpy.mean(mean_cost))

        # Plot filters after each training epoch
        plotting_start = timeit.default_timer()
        # Construct image from the weight matrix
        image = Image.fromarray(
            tile_raster_images(
                X=rbm.W.get_value(borrow=True).T,
                img_shape=(28, 28),
                tile_shape=(10, 10),
                tile_spacing=(1, 1)
            )
        )
        image.save('filters_at_epoch_%i.png' % epoch)
        plotting_stop = timeit.default_timer()
        plotting_time += (plotting_stop - plotting_start)

    end_time = timeit.default_timer()

    pretraining_time = (end_time - start_time) - plotting_time

    print('Training took %f minutes' % (pretraining_time / 60.))
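Once the RBM is trained, we use the gibbs_vhv function to perform Gibbs sampling. Instead of initializing the chain randomly, we initialize the Gibbs chain from test examples (training examples would work as well), which speeds up convergence. We use Theano's scan to run 1000 Gibbs steps before each plot.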
    #################################
    #     Sampling from the RBM     #
    #################################
    # find out the number of test samples
    number_of_test_samples = test_set_x.get_value(borrow=True).shape[0]

    # pick random test examples, with which to initialize the persistent chain
    test_idx = rng.randint(number_of_test_samples - n_chains)
    persistent_vis_chain = theano.shared(
        numpy.asarray(
            test_set_x.get_value(borrow=True)[test_idx:test_idx + n_chains],
            dtype=theano.config.floatX
        )
    )
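Next we create the 20 persistent chains in parallel to obtain our samples. To do so, we compile a Theano function that performs Gibbs sampling and updates the state of the persistent chain with the new visible sample. We apply this function repeatedly, plotting the samples every 1000 Gibbs steps.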
    plot_every = 1000
    # define one step of Gibbs sampling (mf = mean-field) define a
    # function that does `plot_every` steps before returning the
    # sample for plotting
    (
        [
            presig_hids,
            hid_mfs,
            hid_samples,
            presig_vis,
            vis_mfs,
            vis_samples
        ],
        updates
    ) = theano.scan(
        rbm.gibbs_vhv,
        outputs_info=[None, None, None, None, None, persistent_vis_chain],
        n_steps=plot_every,
        name="gibbs_vhv"
    )

    # add to updates the shared variable that takes care of our persistent
    # chain :.
    updates.update({persistent_vis_chain: vis_samples[-1]})
    # construct the function that implements our persistent chain.
    # we generate the "mean field" activations for plotting and the actual
    # samples for reinitializing the state of our persistent chain
    sample_fn = theano.function(
        [],
        [
            vis_mfs[-1],
            vis_samples[-1]
        ],
        updates=updates,
        name='sample_fn'
    )

    # create a space to store the image for plotting ( we need to leave
    # room for the tile_spacing as well)
    image_data = numpy.zeros(
        (29 * n_samples + 1, 29 * n_chains - 1),
        dtype='uint8'
    )
    for idx in range(n_samples):
        # generate `plot_every` intermediate samples that we discard,
        # because successive samples in the chain are too correlated
        vis_mf, vis_sample = sample_fn()
        print(' ... plotting sample %d' % idx)
        image_data[29 * idx:29 * idx + 28, :] = tile_raster_images(
            X=vis_mf,
            img_shape=(28, 28),
            tile_shape=(1, n_chains),
            tile_spacing=(1, 1)
        )

    # construct image
    image = Image.fromarray(image_data)
    image.save('samples.png')
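Results
Settings: PCD-15, learning rate 0.1, batch size 20, 15 training epochs. Training the model took 122.466 minutes on an Intel Xeon E5430 @ 2.66GHz CPU with single-threaded GotoBLAS.
The output was as follows: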
... loading data
Training epoch 0, cost is -90.6507246003
Training epoch 1, cost is -81.235857373
Training epoch 2, cost is -74.9120966945
Training epoch 3, cost is -73.0213216101
Training epoch 4, cost is -68.4098570497
Training epoch 5, cost is -63.2693021647
Training epoch 6, cost is -65.99578971
Training epoch 7, cost is -68.1236650015
Training epoch 8, cost is -68.3207365087
Training epoch 9, cost is -64.2949797113
Training epoch 10, cost is -61.5194867893
Training epoch 11, cost is -61.6539369402
Training epoch 12, cost is -63.5465278086
Training epoch 13, cost is -63.3787093527
Training epoch 14, cost is -62.755739271
Training took 122.466000 minutes
 ... plotting sample 0
 ... plotting sample 1
 ... plotting sample 2
 ... plotting sample 3
 ... plotting sample 4
 ... plotting sample 5
 ... plotting sample 6
 ... plotting sample 7
 ... plotting sample 8
 ... plotting sample 9
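[Figure: filters obtained after 15 epochs]
[Figure: samples generated by the RBM after training. Each row shows a mini-batch of negative particles (samples from independent Gibbs chains); 1000 steps of Gibbs sampling were taken between consecutive rows.]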