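Sentiment analysis is a task in natural language processing (NLP). It is also known as polarity analysis, opinion extraction, opinion mining, sentiment mining, or subjectivity analysis. It is the process of analyzing, processing, summarizing, and reasoning over subjective text that carries emotional coloring. Examples include determining from a movie review whether the user's opinion of the film is positive or negative, or determining from product reviews how users feel about attributes such as price, size, weight, and ease of use.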






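This article walks step by step through three deep learning approaches to sentiment analysis on IMDB reviews: an MLP, a BiRNN (LSTM and GRU), and a BiGRU with attention. The IMDB dataset must be downloaded separately; the code below reads labeledTrainData.tsv. The deep learning framework is Keras with a TensorFlow backend, running on a GPU server with a TITAN X.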





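import os
import re

import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils.np_utils import to_categorical
from keras.layers import Input, Dense, Embedding, GlobalMaxPooling1D, LSTM, GRU, Bidirectional
from keras.models import Model

# Constants assumed to be defined beforehand: MAX_SEQUENCE_LENGTH = 1000 and
# EMBEDDING_DIM = 100 (both stated in the text), VALIDATION_SPLIT = 0.2 (implied
# by the 20000/5000 split in the logs), and MAX_NB_WORDS (value not given).


def clean_str(string):
    """
    String cleaning for the dataset: removes backslashes and quotes,
    strips surrounding whitespace, and lower-cases the text.
    """
    string = re.sub(r"\\", "", string)
    string = re.sub(r"\'", "", string)
    string = re.sub(r"\"", "", string)
    return string.strip().lower()


data_train = pd.read_csv('/data/mpk/IMDB/labeledTrainData.tsv', sep='\t')
print(data_train.shape)

texts = []
labels = []
for idx in range(data_train.review.shape[0]):
    # Strip HTML markup, then drop non-ASCII characters.
    text = BeautifulSoup(data_train.review[idx], "lxml")
    texts.append(clean_str(text.get_text().encode('ascii', 'ignore').decode('ascii')))
    labels.append(data_train.sentiment[idx])

# One-hot encode the 0/1 sentiment labels.
labels = to_categorical(np.asarray(labels))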


tokenizer = Tokenizer(nb_words=MAX_NB_WORDS)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)
word_index = tokenizer.word_index
data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)


# Shuffle the samples, then hold out the last VALIDATION_SPLIT fraction for validation.
indices = np.arange(data.shape[0])
np.random.shuffle(indices)
data = data[indices]
labels = labels[indices]
nb_validation_samples = int(VALIDATION_SPLIT * data.shape[0])

x_train = data[:-nb_validation_samples]
y_train = labels[:-nb_validation_samples]
x_val = data[-nb_validation_samples:]
y_val = labels[-nb_validation_samples:]

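After serialization, every review becomes a fixed-length (1000) sequence of indices, each index corresponding to one word. Next, each index is mapped to that word's embedding (word vector). Here we use glove.6B.100d, so every word is represented by a 100-dimensional vector; the GloVe vectors must be downloaded separately. Out-of-vocabulary (OOV) words get randomly initialized vectors, and the embeddings are not trainable.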
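
To make the idea of a fixed-length index sequence concrete, here is a minimal pure-Python sketch of what tokenization plus padding does. The helper names (`build_word_index`, `texts_to_padded`) and the toy sentences are made up for illustration and are not part of Keras. The sketch assumes the Keras defaults of padding and truncating at the front of the sequence, with index 0 reserved for padding; it does not reproduce the `nb_words` vocabulary cap.

```python
from collections import Counter


def build_word_index(texts):
    """Map each word to an integer index, most frequent first, starting at 1."""
    counts = Counter(word for text in texts for word in text.split())
    # Sort by descending frequency, then alphabetically so ties are deterministic.
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: i + 1 for i, word in enumerate(ordered)}


def texts_to_padded(texts, word_index, maxlen):
    """Convert texts to index lists, then left-truncate and left-pad to maxlen."""
    padded = []
    for text in texts:
        seq = [word_index[w] for w in text.split() if w in word_index]
        seq = seq[-maxlen:]                              # keep the last maxlen tokens
        padded.append([0] * (maxlen - len(seq)) + seq)   # 0 is the padding index
    return padded


toy_texts = ["good movie", "bad bad movie indeed"]
toy_index = build_word_index(toy_texts)
toy_data = texts_to_padded(toy_texts, toy_index, maxlen=3)
print(toy_index)
print(toy_data)  # [[0, 3, 2], [1, 2, 4]]
```

The short review is padded with a leading 0, while the long one loses its first token, which is the same shape of result `pad_sequences` produces with `maxlen=1000` on the real data.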

GLOVE_DIR = "/data/mpk"
embeddings_index = {}
# Each line of the GloVe file is: word followed by its vector components.
with open(os.path.join(GLOVE_DIR, 'glove.6B.100d.txt')) as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

print('Total %s word vectors.' % len(embeddings_index))

# Start from random vectors so that OOV words keep a random initialization.
embedding_matrix = np.random.random((len(word_index) + 1, EMBEDDING_DIM))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # words found in the GloVe index overwrite their random row;
        # words not found keep the random initialization.
        embedding_matrix[i] = embedding_vector

print('Length of embedding_matrix:', embedding_matrix.shape[0])

embedding_layer = Embedding(len(word_index) + 1,
                            EMBEDDING_DIM,
                            weights=[embedding_matrix],
                            mask_zero=False,
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=False)
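
Conceptually, a frozen embedding layer is a row lookup into the embedding matrix. The NumPy sketch below illustrates this with toy sizes; the names `toy_matrix`, `pretrained`, and `batch` are invented for the example, and it is only an approximation of what the Keras layer does internally.

```python
import numpy as np

rng = np.random.RandomState(0)
vocab_size, embedding_dim = 5, 4           # toy sizes, far smaller than the real matrix
toy_matrix = rng.random_sample((vocab_size, embedding_dim))

# Pretend only word 2 has a pretrained vector; the other rows stay random (OOV case).
pretrained = {2: np.ones(embedding_dim, dtype="float32")}
for idx, vec in pretrained.items():
    toy_matrix[idx] = vec

batch = np.array([[0, 3, 2], [1, 2, 4]])   # (samples, steps) of word indices
embedded = toy_matrix[batch]               # (samples, steps, embedding_dim)
print(embedded.shape)                      # (2, 3, 4)
```

With the real data the same lookup turns a `(None, 1000)` batch of indices into a `(None, 1000, 100)` tensor, which is the output shape of `embedding_1` in the model summaries.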




sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedded_sequences = embedding_layer(sequence_input)
# The Dense layer is applied to every time step; max pooling then collapses the sequence.
dense_1 = Dense(100, activation='tanh')(embedded_sequences)
max_pooling = GlobalMaxPooling1D()(dense_1)
dense_2 = Dense(2, activation='softmax')(max_pooling)

model = Model(sequence_input, dense_2)
model.compile(loss='categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['acc'])
model.summary()
model.fit(x_train, y_train, validation_data=(x_val, y_val),
          nb_epoch=10, batch_size=50)

Layer (type)                     Output Shape          Param #     Connected to                     
input_1 (InputLayer)             (None, 1000)          0                                            
embedding_1 (Embedding)          (None, 1000, 100)     8054700     input_1[0][0]                    
dense_1 (Dense)                  (None, 1000, 100)     10100       embedding_1[0][0]                
globalmaxpooling1d_1 (GlobalMaxP (None, 100)           0           dense_1[0][0]                    
dense_2 (Dense)                  (None, 2)             202         globalmaxpooling1d_1[0][0]       
Total params: 8,065,002
Trainable params: 10,302
Non-trainable params: 8,054,700
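
The parameter counts in this summary can be checked by hand. The short sketch below redoes the arithmetic; `dense_params` is a helper written for this example, and the embedding count is copied from the summary rather than recomputed.

```python
def dense_params(input_dim, units):
    """Weights plus biases of a fully connected layer."""
    return input_dim * units + units


embedding_params = 8054700                 # frozen, copied from the summary
dense_1 = dense_params(100, 100)           # applied to each 100-d word vector
dense_2 = dense_params(100, 2)             # softmax over the two classes
trainable = dense_1 + dense_2
total = embedding_params + trainable
print(dense_1, dense_2, trainable, total)  # 10100 202 10302 8065002
```

Only the two Dense layers are trained, which is why each epoch of the MLP takes just a few seconds.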


Train on 20000 samples, validate on 5000 samples
Epoch 1/10
20000/20000 [==============================] - 3s - loss: 0.5431 - acc: 0.7467 - val_loss: 0.4496 - val_acc: 0.8108
Epoch 2/10
20000/20000 [==============================] - 2s - loss: 0.3993 - acc: 0.8329 - val_loss: 0.3752 - val_acc: 0.8436
Epoch 3/10
20000/20000 [==============================] - 2s - loss: 0.3511 - acc: 0.8541 - val_loss: 0.3527 - val_acc: 0.8552
Epoch 4/10
20000/20000 [==============================] - 2s - loss: 0.3183 - acc: 0.8707 - val_loss: 0.3393 - val_acc: 0.8628
Epoch 5/10
20000/20000 [==============================] - 2s - loss: 0.2958 - acc: 0.8801 - val_loss: 0.3325 - val_acc: 0.8616
Epoch 6/10
20000/20000 [==============================] - 2s - loss: 0.2765 - acc: 0.8901 - val_loss: 0.3256 - val_acc: 0.8654
Epoch 7/10
20000/20000 [==============================] - 2s - loss: 0.2612 - acc: 0.8973 - val_loss: 0.3358 - val_acc: 0.8628
Epoch 8/10
20000/20000 [==============================] - 2s - loss: 0.2466 - acc: 0.9034 - val_loss: 0.3195 - val_acc: 0.8680
Epoch 9/10
20000/20000 [==============================] - 2s - loss: 0.2330 - acc: 0.9110 - val_loss: 0.3260 - val_acc: 0.8648
Epoch 10/10
20000/20000 [==============================] - 2s - loss: 0.2220 - acc: 0.9161 - val_loss: 0.3192 - val_acc: 0.8650






sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedded_sequences = embedding_layer(sequence_input)
# return_sequences=False: keep only the final state of each direction.
l_lstm = Bidirectional(LSTM(100, return_sequences=False))(embedded_sequences)
dense_1 = Dense(100, activation='tanh')(l_lstm)
dense_2 = Dense(2, activation='softmax')(dense_1)

model = Model(sequence_input, dense_2)
model.compile(loss='categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['acc'])
model.summary()
model.fit(x_train, y_train, validation_data=(x_val, y_val),
          nb_epoch=10, batch_size=50)


Layer (type)                     Output Shape          Param #     Connected to                     
input_1 (InputLayer)             (None, 1000)          0                                            
embedding_1 (Embedding)          (None, 1000, 100)     8054700     input_1[0][0]                    
bidirectional_1 (Bidirectional)  (None, 200)           160800      embedding_1[0][0]                
dense_1 (Dense)                  (None, 100)           20100       bidirectional_1[0][0]            
dense_2 (Dense)                  (None, 2)             202         dense_1[0][0]                    
Total params: 8,235,802
Trainable params: 181,102
Non-trainable params: 8,054,700


Train on 20000 samples, validate on 5000 samples
Epoch 1/10
20000/20000 [==============================] - 1225s - loss: 0.6145 - acc: 0.6607 - val_loss: 0.4628 - val_acc: 0.7974
Epoch 2/10
20000/20000 [==============================] - 1225s - loss: 0.4354 - acc: 0.8004 - val_loss: 0.3667 - val_acc: 0.8418
Epoch 3/10
20000/20000 [==============================] - 1228s - loss: 0.3561 - acc: 0.8446 - val_loss: 0.3283 - val_acc: 0.8566
Epoch 4/10
20000/20000 [==============================] - 1227s - loss: 0.3161 - acc: 0.8683 - val_loss: 0.3147 - val_acc: 0.8652
Epoch 5/10
20000/20000 [==============================] - 1230s - loss: 0.2863 - acc: 0.8816 - val_loss: 0.3059 - val_acc: 0.8760
Epoch 6/10
20000/20000 [==============================] - 1234s - loss: 0.2603 - acc: 0.8952 - val_loss: 0.2988 - val_acc: 0.8756
Epoch 7/10
20000/20000 [==============================] - 1230s - loss: 0.2377 - acc: 0.9042 - val_loss: 0.2947 - val_acc: 0.8782
Epoch 8/10
20000/20000 [==============================] - 1224s - loss: 0.2143 - acc: 0.9142 - val_loss: 0.3108 - val_acc: 0.8736
Epoch 9/10
20000/20000 [==============================] - 1231s - loss: 0.1895 - acc: 0.9255 - val_loss: 0.3183 - val_acc: 0.8748
Epoch 10/10
20000/20000 [==============================] - 1227s - loss: 0.1631 - acc: 0.9367 - val_loss: 0.3362 - val_acc: 0.8726




sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedded_sequences = embedding_layer(sequence_input)
# Same architecture as the BiLSTM model, with GRU cells instead of LSTM cells.
l_gru = Bidirectional(GRU(100, return_sequences=False))(embedded_sequences)
dense_1 = Dense(100, activation='tanh')(l_gru)
dense_2 = Dense(2, activation='softmax')(dense_1)

model = Model(sequence_input, dense_2)
model.compile(loss='categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['acc'])
model.summary()
model.fit(x_train, y_train, validation_data=(x_val, y_val),
          nb_epoch=10, batch_size=50)


Layer (type)                     Output Shape          Param #     Connected to                     
input_1 (InputLayer)             (None, 1000)          0                                            
embedding_1 (Embedding)          (None, 1000, 100)     8054700     input_1[0][0]                    
bidirectional_1 (Bidirectional)  (None, 200)           120600      embedding_1[0][0]                
dense_1 (Dense)                  (None, 100)           20100       bidirectional_1[0][0]            
dense_2 (Dense)                  (None, 2)             202         dense_1[0][0]                    
Total params: 8,195,602
Trainable params: 140,902
Non-trainable params: 8,054,700
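
The `Bidirectional` rows in the summaries (160800 parameters for LSTM, 120600 for GRU) follow from the number of gates in each cell. The sketch below reproduces them; `rnn_params` is a helper written for this example, and the formula assumes the classic cell layout with one bias vector per gate, which matches the counts reported here.

```python
def rnn_params(input_dim, units, gates):
    """Per gate: input weights, recurrent weights, and one bias vector."""
    return gates * (input_dim * units + units * units + units)


# Two directions, 100-d input vectors, 100 hidden units per direction.
bilstm = 2 * rnn_params(100, 100, gates=4)   # LSTM: input, forget, output, candidate
bigru = 2 * rnn_params(100, 100, gates=3)    # GRU: update, reset, candidate
print(bilstm, bigru)                         # 160800 120600
```

The GRU has three quarters of the LSTM's recurrent parameters, which is consistent with its shorter epoch times in the logs.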


Train on 20000 samples, validate on 5000 samples
Epoch 1/10
20000/20000 [==============================] - 998s - loss: 0.5436 - acc: 0.7137 - val_loss: 0.3707 - val_acc: 0.8386
Epoch 2/10
20000/20000 [==============================] - 987s - loss: 0.3565 - acc: 0.8460 - val_loss: 0.3208 - val_acc: 0.8680
Epoch 3/10
20000/20000 [==============================] - 976s - loss: 0.3060 - acc: 0.8695 - val_loss: 0.2951 - val_acc: 0.8774
Epoch 4/10
20000/20000 [==============================] - 977s - loss: 0.2692 - acc: 0.8894 - val_loss: 0.3196 - val_acc: 0.8614
Epoch 5/10
20000/20000 [==============================] - 969s - loss: 0.2380 - acc: 0.9039 - val_loss: 0.2882 - val_acc: 0.8790
Epoch 6/10
20000/20000 [==============================] - 981s - loss: 0.2084 - acc: 0.9177 - val_loss: 0.2800 - val_acc: 0.8884
Epoch 7/10
20000/20000 [==============================] - 976s - loss: 0.1780 - acc: 0.9293 - val_loss: 0.2961 - val_acc: 0.8842
Epoch 8/10
20000/20000 [==============================] - 971s - loss: 0.1473 - acc: 0.9450 - val_loss: 0.3280 - val_acc: 0.8848
Epoch 9/10
20000/20000 [==============================] - 913s - loss: 0.1160 - acc: 0.9564 - val_loss: 0.4539 - val_acc: 0.8580
Epoch 10/10
20000/20000 [==============================] - 917s - loss: 0.0964 - acc: 0.9651 - val_loss: 0.3915 - val_acc: 0.8740




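The attention model was first proposed for image recognition: mimicking human attention, it assigns different weights to different regions of an image. In natural language processing it was first used in machine translation. Here we add an attention model on top of the BiLSTM: we learn a weight for the hidden vector at each time step, so that different words in the review receive different weights when the sentence representation is built, and the sentence vector is the weighted sum of these per-word vectors.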
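
The weighted-sum idea can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the Keras layer used in this article: the function name `attention_pool` is made up, `W`, `b`, and the context vector `u` are random stand-ins for parameters that would normally be learned, and the scoring function (a tanh projection followed by a dot product with `u`) is one common choice among several.

```python
import numpy as np


def softmax(z):
    z = z - z.max()                      # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()


def attention_pool(H, W, b, u):
    """H: (steps, features) hidden states of one sentence.

    Returns the attention weights (steps,) and the sentence vector (features,).
    """
    scores = np.tanh(H @ W + b) @ u      # one scalar score per time step
    alpha = softmax(scores)              # weights are positive and sum to 1
    return alpha, alpha @ H              # weighted sum of the hidden states


rng = np.random.RandomState(0)
steps, features = 6, 4                   # toy sizes
H = rng.randn(steps, features)           # stand-in for BiLSTM outputs
W = rng.randn(features, features)
b = np.zeros(features)
u = rng.randn(features)

alpha, sentence_vec = attention_pool(H, W, b, u)
print(alpha.round(3), sentence_vec.shape)
```

Words whose hidden states score higher contribute more to `sentence_vec`, which then replaces the final RNN state as the input to the classifier.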



from keras import backend as K
from keras.engine.topology import Layer
from keras import initializations, regularizers, constraints


class Attention_layer(Layer):
    """
    Attention operation for temporal data. Supports masking.
    Adapted from Yang et al. [https://www.cs.cmu.edu/~diyiy/docs/naacl16.pdf]
    "Hierarchical Attention Networks for Document Classification".
    Note: this variant has no separate context vector; the weights are
    computed per feature and normalized over the time steps.
    # Input shape
        3D tensor with shape: `(samples, steps, features)`.
    # Output shape
        2D tensor with shape: `(samples, features)`.
    :param kwargs:
    Just put it on top of an RNN Layer (GRU/LSTM/SimpleRNN) with return_sequences=True.
    The dimensions are inferred based on the output shape of the RNN.
    Example:
        model.add(LSTM(64, return_sequences=True))
        model.add(Attention_layer())
    """

    def __init__(self,
                 W_regularizer=None, b_regularizer=None,
                 W_constraint=None, b_constraint=None,
                 bias=True, **kwargs):
        self.supports_masking = True
        self.init = initializations.get('glorot_uniform')

        self.W_regularizer = regularizers.get(W_regularizer)
        self.b_regularizer = regularizers.get(b_regularizer)

        self.W_constraint = constraints.get(W_constraint)
        self.b_constraint = constraints.get(b_constraint)

        self.bias = bias
        super(Attention_layer, self).__init__(**kwargs)

    def build(self, input_shape):
        assert len(input_shape) == 3

        self.W = self.add_weight((input_shape[-1], input_shape[-1],),
                                 initializer=self.init,
                                 name='{}_W'.format(self.name),
                                 regularizer=self.W_regularizer,
                                 constraint=self.W_constraint)
        if self.bias:
            self.b = self.add_weight((input_shape[-1],),
                                     initializer='zero',
                                     name='{}_b'.format(self.name),
                                     regularizer=self.b_regularizer,
                                     constraint=self.b_constraint)

        super(Attention_layer, self).build(input_shape)

    def compute_mask(self, input, input_mask=None):
        # do not pass the mask to the next layers
        return None

    def call(self, x, mask=None):
        uit = K.dot(x, self.W)

        if self.bias:
            uit += self.b

        uit = K.tanh(uit)

        a = K.exp(uit)

        # apply mask after the exp. will be re-normalized next
        if mask is not None:
            # Cast the mask to floatX to avoid float64 upcasting in theano,
            # and add a trailing axis so it broadcasts over the features.
            a *= K.expand_dims(K.cast(mask, K.floatx()))

        # in some cases especially in the early stages of training the sum may be almost zero
        # and this results in NaN's. A workaround is to add a very small positive number to the sum.
        # a /= K.cast(K.sum(a, axis=1, keepdims=True), K.floatx())
        a /= K.cast(K.sum(a, axis=1, keepdims=True) + K.epsilon(), K.floatx())

        # Weight every time step, then sum over the time axis.
        weighted_input = x * a
        return K.sum(weighted_input, axis=1)

    def get_output_shape_for(self, input_shape):
        return input_shape[0], input_shape[-1]


sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedded_sequences = embedding_layer(sequence_input)
# return_sequences=True: the attention layer needs the hidden state of every time step.
l_lstm = Bidirectional(LSTM(100, return_sequences=True))(embedded_sequences)
l_att = Attention_layer()(l_lstm)
dense_1 = Dense(100, activation='tanh')(l_att)
dense_2 = Dense(2, activation='softmax')(dense_1)

model = Model(sequence_input, dense_2)
model.compile(loss='categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['acc'])
model.summary()
model.fit(x_train, y_train, validation_data=(x_val, y_val),
          nb_epoch=10, batch_size=50)


model fitting - attention GRU network
Layer (type)                     Output Shape          Param #     Connected to                     
input_1 (InputLayer)             (None, 1000)          0                                            
embedding_1 (Embedding)          (None, 1000, 100)     8054700     input_1[0][0]                    
bidirectional_1 (Bidirectional)  (None, 1000, 200)     160800      embedding_1[0][0]                
attention_layer_1 (Attention_lay (None, 200)           40200       bidirectional_1[0][0]            
dense_1 (Dense)                  (None, 100)           20100       attention_layer_1[0][0]          
dense_2 (Dense)                  (None, 2)             202         dense_1[0][0]                    
Total params: 8,276,002
Trainable params: 221,302
Non-trainable params: 8,054,700


Train on 20000 samples, validate on 5000 samples
Epoch 1/10
20000/20000 [==============================] - 1190s - loss: 0.4394 - acc: 0.7988 - val_loss: 0.3183 - val_acc: 0.8714
Epoch 2/10
20000/20000 [==============================] - 1191s - loss: 0.2919 - acc: 0.8807 - val_loss: 0.2717 - val_acc: 0.8928
Epoch 3/10
20000/20000 [==============================] - 1182s - loss: 0.2234 - acc: 0.9109 - val_loss: 0.2462 - val_acc: 0.8984
Epoch 4/10
20000/20000 [==============================] - 1111s - loss: 0.1714 - acc: 0.9359 - val_loss: 0.2430 - val_acc: 0.9054
Epoch 5/10
20000/20000 [==============================] - 1098s - loss: 0.1304 - acc: 0.9538 - val_loss: 0.2568 - val_acc: 0.9018
Epoch 6/10
20000/20000 [==============================] - 1101s - loss: 0.0942 - acc: 0.9665 - val_loss: 0.2876 - val_acc: 0.9030
Epoch 7/10
20000/20000 [==============================] - 1101s - loss: 0.0618 - acc: 0.9801 - val_loss: 0.3566 - val_acc: 0.8990
Epoch 8/10
20000/20000 [==============================] - 1104s - loss: 0.0441 - acc: 0.9868 - val_loss: 0.3851 - val_acc: 0.8960
Epoch 9/10
20000/20000 [==============================] - 1099s - loss: 0.0298 - acc: 0.9905 - val_loss: 0.4063 - val_acc: 0.8972
Epoch 10/10
20000/20000 [==============================] - 1107s - loss: 0.0208 - acc: 0.9934 - val_loss: 0.5198 - val_acc: 0.8834



