Notes on the Theano Deep Learning Tutorials: LSTM Networks for Sentiment Analysis


Tutorial: http://deeplearning.net/tutorial/lstm.html

A blog post with many helpful diagrams, worth reading alongside the tutorial: http://colah.github.io/posts/2015-08-Understanding-LSTMs/

An introduction written by a UCSD PhD student: http://blog.terminal.com/demistifying-long-short-term-memory-lstm-recurrent-neural-networks/

Lecture 7 of Geoffrey Hinton's Coursera course Neural Networks for Machine Learning covers RNNs and LSTMs:

https://class.coursera.org/neuralnets-2012-001/lecture

A blog post analyzing the source code for this section: http://www.cnblogs.com/neopenx/p/4806006.html

 

Large Movie Review Dataset: movie reviews collected by crawling IMDB, divided into two classes according to their ratings. See the tutorial for the download and preprocessing scripts (the imdb.py provided with this section downloads the preprocessed dataset from the web automatically).
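
A minimal sketch of loading the data with that script, assuming imdb.py is in the working directory; the keyword arguments shown (n_words, valid_portion, maxlen) follow the lstm.py I used and may differ between versions of the tutorial code:

import imdb  # the tutorial's imdb.py, which downloads imdb.pkl on first use

# Keep the 10000 most frequent words, hold out 5% of the training data for
# validation, and drop reviews longer than 100 words.
train, valid, test = imdb.load_data(n_words=10000, valid_portion=0.05,
                                    maxlen=100)

train_x, train_y = train  # train_x: lists of word indices, train_y: 0/1 labels
print(len(train_x), len(valid[0]), len(test[0]))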

 

Model

In a traditional recurrent neural network, during the gradient back-propagation phase, the gradient signal can end up being multiplied a large number of times (as many as the number of timesteps) by the weight matrix associated with the connections between the neurons of the recurrent hidden layer. This means that the magnitude of weights in the transition matrix can have a strong impact on the learning process.

Traditional RNNs are trained with Backpropagation Through Time (BPTT), i.e. backpropagation across timesteps, which means the gradient gets multiplied by the recurrent weight matrix over and over again (once for every timestep spanned). The magnitude of those recurrent weights therefore has a very strong effect on the learning process.

If the weights in this matrix are small (or, more formally, if the leading eigenvalue of the weight matrix is smaller than 1.0), it can lead to a situation called vanishing gradients where the gradient signal gets so small that learning either becomes very slow or stops working altogether. It can also make more difficult the task of learning long-term dependencies in the data. Conversely, if the weights in this matrix are large (or, again, more formally, if the leading eigenvalue of the weight matrix is larger than 1.0), it can lead to a situation where the gradient signal is so large that it can cause learning to diverge. This is often referred to as exploding gradients.

If the weights are small, the repeated multiplications make the gradient smaller and smaller, so learning becomes very slow or stops altogether; it also makes it harder to learn long-range dependencies in the data. This is called vanishing gradients.

If the weights are large, the gradient becomes so large that learning diverges. This is called exploding gradients.

Either way, the magnitude of the recurrent weights can seriously harm the learning process.
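
The effect is easy to reproduce numerically: repeatedly multiplying a vector by the same matrix shrinks it towards zero when the leading eigenvalue is below 1.0 and blows it up when it is above 1.0. A small illustrative numpy sketch (my own example, not part of the tutorial):

import numpy as np

def repeated_product_norm(scale, timesteps=50, dim=4, seed=0):
    """Norm of a gradient-like vector after `timesteps` multiplications by the
    same random transition matrix, rescaled so its leading eigenvalue has
    magnitude `scale`."""
    rng = np.random.RandomState(seed)
    W = rng.randn(dim, dim)
    W *= scale / np.max(np.abs(np.linalg.eigvals(W)))  # set the spectral radius
    v = np.ones(dim)
    for _ in range(timesteps):
        v = W @ v
    return np.linalg.norm(v)

print(repeated_product_norm(0.9))   # shrinks towards 0  -> vanishing gradients
print(repeated_product_norm(1.1))   # grows very large   -> exploding gradients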

 

These problems are what motivated the LSTM and its key building block, the memory cell:

Figure 1: Illustration of an LSTM memory cell (lstm_memorycell.png).

 

A memory cell is composed of four parts: an input gate, a neuron with a self-recurrent connection (a connection to itself), a forget gate and an output gate.

The self-recurrent connection has a weight of 1.0 and, barring any outside interference, ensures that the state of a memory cell can remain constant from one timestep to another.

The gates regulate the interactions between the memory cell itself and its environment.

The input gate can allow incoming signal to alter the state of the memory cell or block it.

The output gate can allow the state of the memory cell to have an effect on other neurons or prevent it.

The forget gate can modulate the memory cell's self-recurrent connection, allowing the cell to remember or forget its previous state, as needed.

 

At every timestep t, a layer of memory cells is updated according to the equations below, where:

  • x_t is the input to the memory cell layer at time t
  • W_i, W_f, W_c, W_o, U_i, U_f, U_c, U_o and V_o are weight matrices
  • b_i, b_f, b_c and b_o are bias vectors

     

    First, we compute the values for i_t, the input gate, and \widetilde{C_t}, the candidate value for the states of the memory cells at time t:

    i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)       (1)

    \widetilde{C_t} = tanh(W_c x_t + U_c h_{t-1} + b_c)      (2)

    Second, we compute the value for f_t, the activation of the memory cells' forget gates at time t:

    f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)      (3)

    Given the value of the input gate activation i_t, the forget gate activation f_t and the candidate state value \widetilde{C_t}, we can compute C_t, the memory cells' new state at time t:

    C_t = i_t * \widetilde{C_t} + f_t * C_{t-1}      (4)

    With the new state of the memory cells, we can compute the value of their output gates and, subsequently, their outputs :

    o_t = \sigma(W_o x_t + U_o h_{t-1} + V_o C_t + b_o)     (5)

    h_t = o_t * tanh(C_t)       (6)
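
    To make the update rule concrete, here is a small numpy sketch of a single timestep that follows equations (1)-(6) literally. It is only an illustration of the math (the tutorial's actual implementation is in Theano); the parameter dictionary and shapes are my own choices.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x_t, h_prev, C_prev, p):
        """One LSTM timestep following equations (1)-(6).
        p is a dict with the weight matrices W_*, U_*, V_o and bias vectors b_*."""
        i_t = sigmoid(p['W_i'] @ x_t + p['U_i'] @ h_prev + p['b_i'])      # (1) input gate
        C_tilde = np.tanh(p['W_c'] @ x_t + p['U_c'] @ h_prev + p['b_c'])  # (2) candidate state
        f_t = sigmoid(p['W_f'] @ x_t + p['U_f'] @ h_prev + p['b_f'])      # (3) forget gate
        C_t = i_t * C_tilde + f_t * C_prev                                # (4) new cell state
        o_t = sigmoid(p['W_o'] @ x_t + p['U_o'] @ h_prev
                      + p['V_o'] @ C_t + p['b_o'])                        # (5) output gate
        h_t = o_t * np.tanh(C_t)                                          # (6) cell output
        return h_t, C_t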

     

    Our model

    For computational efficiency, the tutorial's implementation makes a slight change to the standard LSTM: the activation of a cell's output gate does not depend on the memory cell's state C_t.

    As a result there is no matrix V_o, and equation (5) is replaced by equation (7):

    o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)       (7)

     

    The model structure: a single LSTM layer, followed by mean pooling over time, followed by a logistic regression classifier.

    [Figure: lstm.png, the model used in this tutorial]

    From the input sequence x_0, x_1, x_2, ..., x_n (indexed by timestep), the LSTM layer produces a sequence of representations h_0, h_1, h_2, ..., h_n. This representation sequence is then averaged over all timesteps, resulting in a single representation h, which is fed to the logistic regression classifier. Each x in the figure is a 128-dimensional vector: the word embedding of one word of the sentence (the embedding dimension is set to 128 in the tutorial code). So x_0, ..., x_n are the n+1 words of the sentence, and their order in the sentence corresponds to the order of the timesteps. Each h is also a 128-dimensional vector, as is the mean-pooled h that is finally fed to the logistic regression classifier.
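
    A hedged numpy sketch of this pooling-and-classification stage (illustrative only; the names are mine, the output here is a single sigmoid unit whereas the tutorial uses a softmax over the two classes, and the tutorial additionally applies a mask so that padded timesteps are excluded from the average):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def classify_review(h_seq, U_lr, b_lr):
        """h_seq: (n_timesteps, 128) LSTM outputs for one review.
        U_lr: (128,) weights and b_lr: scalar bias of the logistic regression."""
        h = h_seq.mean(axis=0)           # mean pooling over all timesteps -> (128,)
        return sigmoid(U_lr @ h + b_lr)  # probability that the review is positive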

     

    Implementation note: equations (1), (2), (3) and (7) do not depend on one another, so they can be computed in parallel for efficiency.

    The four W_* matrices (W_i, W_f, W_c, W_o) are concatenated into a single matrix W, the four U_* matrices into a single U, and the four b_* vectors into a single b.

    Equations (1), (2), (3) and (7) can then be combined into a single computation (this is why (5) was replaced by (7)):

    z_t = W x_t + U h_{t-1} + b

    The vector z_t is then split into four slices, one each for i_t, f_t, \widetilde{C_t} and o_t, and the appropriate nonlinearity (\sigma for the three gates, tanh for \widetilde{C_t}) is applied to each slice.
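
    A numpy sketch of this trick (the tutorial's lstm.py does the same thing with a _slice helper on the Theano side; the slice ordering i, f, c, o below is an assumption made for illustration):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step_fused(x_t, h_prev, C_prev, W, U, b, dim=128):
        """One timestep with all four pre-activations computed in one shot.
        W: (4*dim, input_dim), U: (4*dim, dim), b: (4*dim,)."""
        z = W @ x_t + U @ h_prev + b        # all four pre-activations at once
        i_t = sigmoid(z[0*dim:1*dim])       # (1) input gate
        f_t = sigmoid(z[1*dim:2*dim])       # (3) forget gate
        C_tilde = np.tanh(z[2*dim:3*dim])   # (2) candidate state
        o_t = sigmoid(z[3*dim:4*dim])       # (7) output gate, no V_o term
        C_t = i_t * C_tilde + f_t * C_prev  # (4) new cell state
        h_t = o_t * np.tanh(C_t)            # (6) cell output
        return h_t, C_t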

     

    Download the tutorial code and run it:

    The error rate on the training set is very low, and the accuracy on the test set is about 80%.
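
    If you prefer to start training from your own Python session rather than running the script directly, something along these lines should work (a hedged sketch; train_lstm and its keyword arguments are taken from the lstm.py I used and may differ in other versions):

    from lstm import train_lstm  # the tutorial's lstm.py

    # Short run as a sanity check; the script's own defaults train for much longer.
    train_lstm(max_epochs=10, test_size=500)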

     

    I modified the tutorial code so that arbitrary strings can be fed in for testing:

    I tested it with two incomplete movie-review sentences containing various odd characters.

     

    Note: the code includes three optimization methods: Stochastic Gradient Descent (SGD), AdaDelta and RMSProp (RMSProp is covered in lecture 6 of Neural Networks for Machine Learning).
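
    As a reminder of what RMSProp does, here is a minimal numpy sketch of the standard update rule (the general idea only; the tutorial's Theano implementation differs in detail, and the hyperparameter values below are my own):

    import numpy as np

    def rmsprop_update(param, grad, cache, lr=0.001, rho=0.9, eps=1e-6):
        """One RMSProp step: scale the gradient by a running RMS of past gradients.
        cache holds the running average of squared gradients for this parameter."""
        cache = rho * cache + (1.0 - rho) * grad ** 2
        param = param - lr * grad / (np.sqrt(cache) + eps)
        return param, cache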
