How to do text classification with word vectors (embedding + CNN)
Source: Internet. Editor: 程序博客网. Time: 2024/06/07 20:30
1. Data overview
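This article uses the well-known "20 Newsgroup dataset", which contains newsgroup posts from 20 categories; we will implement a text classification task on it. For a description of the dataset and the download, see (http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.html). We use GloVe word vectors. GloVe is short for "Global Vectors for Word Representation", a word embedding method based on factorizing a word co-occurrence matrix. The GloVe vectors used here were trained on the 2014 English Wikipedia dump (together with Gigaword 5), cover 400k distinct words, and represent each word as a 100-dimensional vector. Link: (http://nlp.stanford.edu/data/glove.6B.zip) (Note: the word vector archive is about 822 MB.)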
2. Data preprocessing
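We first walk through every folder under the corpus directory to collect the news texts of each category together with their class labels. Note that the loop stops after the first two categories, so the rest of the article trains a binary classifier on two newsgroups. The code is as follows: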
    import os

    texts = []          # list of text samples
    labels_index = {}   # dictionary mapping label name to numeric id
    labels = []         # list of label ids

    TEXT_DATA_DIR = 'e:/textm/20_newsgroup'
    for name in sorted(os.listdir(TEXT_DATA_DIR)):
        path = os.path.join(TEXT_DATA_DIR, name)
        if os.path.isdir(path):
            label_id = len(labels_index)
            if label_id == 2:
                # keep only the first two newsgroups (binary classification)
                break
            labels_index[name] = label_id
            for fname in sorted(os.listdir(path)):
                if fname.isdigit():
                    fpath = os.path.join(path, fname)
                    # latin-1 can decode any byte sequence, so no file is rejected
                    with open(fpath, 'r', encoding='latin-1') as f:
                        texts.append(f.read().strip())
                    labels.append(label_id)

    print('Found %s texts.' % len(texts))
    print(texts[0])
    print(labels)
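Next, we convert the news samples into tensors that the neural network can be trained on. The Keras utilities used are keras.preprocessing.text.Tokenizer and keras.preprocessing.sequence.pad_sequences. The code is as follows: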
    from keras.preprocessing.text import Tokenizer
    from keras.preprocessing.sequence import pad_sequences
    import numpy as np

    # map every word to an integer index and every text to a list of indices
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(texts)
    sequences = tokenizer.texts_to_sequences(texts)

    word_index = tokenizer.word_index
    print('Found %s unique tokens.' % len(word_index))

    # without maxlen, every sequence is padded to the length of the longest text
    data = pad_sequences(sequences)
    print('Shape of data tensor:', data.shape)

    # shuffle samples and labels with the same permutation
    indices = np.arange(data.shape[0])
    np.random.shuffle(indices)
    data = data[indices]
    labels_new = np.asarray(labels)[indices]

    # split the data into a training set and a validation set,
    # holding out 20% of the samples for validation
    nb_validation_samples = int(0.2 * data.shape[0])

    x_train = data[:-nb_validation_samples]
    y_train = labels_new[:-nb_validation_samples]
    x_val = data[-nb_validation_samples:]
    y_val = labels_new[-nb_validation_samples:]
    print(x_train[0])
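To make the two preprocessing steps concrete, here is a minimal pure-Python sketch of what they do on a toy corpus. It is an illustration only, not the Keras implementation: the helper names (`build_word_index`, `texts_to_sequences`, `pad_left`) are hypothetical, words are split on whitespace without punctuation filtering, and frequency ties are broken alphabetically, which is an assumption of this sketch.

```python
from collections import Counter


def build_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Assumption of this sketch: ties are broken alphabetically.
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: rank for rank, word in enumerate(ranked, start=1)}


def texts_to_sequences(texts, word_index):
    """Replace every known word by its integer index."""
    return [[word_index[w] for w in text.lower().split() if w in word_index]
            for text in texts]


def pad_left(sequences, value=0):
    """Left-pad every sequence with `value` up to the longest length."""
    maxlen = max(len(s) for s in sequences)
    return [[value] * (maxlen - len(s)) + s for s in sequences]


toy_texts = ["the cat sat", "the dog sat on the mat"]
toy_index = build_word_index(toy_texts)
toy_padded = pad_left(texts_to_sequences(toy_texts, toy_index))
print(toy_index)
print(toy_padded)
```

Index 0 is never assigned to a word, which is why it can serve as the padding value and why the embedding matrix built later has `len(word_index) + 1` rows.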
Next, we parse the GloVe file to obtain each word and its corresponding word vector, and store them in a dictionary:
    # read the word vectors: each line is a word followed by its 100 coefficients
    embeddings_index = {}
    with open(os.path.join('E:\\textm', 'glove.6B.100d.txt'), 'r', encoding='utf-8') as f:
        for line in f:
            values = line.split()
            word = values[0]
            coefs = np.asarray(values[1:], dtype='float32')
            embeddings_index[word] = coefs

    print('Found %s word vectors.' % len(embeddings_index))
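The file format assumed by the parsing loop is one word per line followed by its space-separated coefficients. The sketch below parses a tiny in-memory sample in that format; the sample lines and the helper name `load_vectors` are made up for illustration, and plain Python lists stand in for NumPy arrays.

```python
import io


def load_vectors(handle):
    """Parse lines of the form '<word> <c1> <c2> ...' into a dict."""
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(' ')
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


# Hypothetical 3-dimensional sample; the real file has 100 values per line.
sample = io.StringIO("the 0.1 -0.2 0.3\ncat 0.5 0.0 -1.5\n")
toy_vectors = load_vectors(sample)
print(len(toy_vectors), toy_vectors["cat"])
```

Iterating over the file handle line by line avoids holding the whole file in memory at once, which matters for a file of this size.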
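With this dictionary we can now build the embedding matrix, in which row i holds the vector of the word whose index in word_index is i: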
    # build the embedding matrix from the dictionary; row 0 is left for padding
    embedding_matrix = np.zeros((len(word_index) + 1, 100))
    for word, i in word_index.items():
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            # words not found in embedding index will be all-zeros.
            embedding_matrix[i] = embedding_vector
    print(embedding_matrix)

    # load the matrix into an Embedding layer; note that trainable=False
    # freezes the layer so the pre-trained vectors are not updated in training
    from keras.layers import Embedding

    embedding_layer = Embedding(len(word_index) + 1,
                                100,
                                weights=[embedding_matrix],
                                input_length=data.shape[1],  # padded length, 10036 for this data
                                trainable=False)
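As a rough illustration of what the matrix and the frozen Embedding layer amount to, the sketch below builds a tiny matrix with plain Python lists and looks up a padded sequence row by row. The toy vocabulary, the 2-dimensional vectors, and the helper names `build_matrix` and `embed` are assumptions made for this example.

```python
def build_matrix(word_index, vectors, dim):
    """Row i holds the vector of the word with index i; row 0 stays zero."""
    matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]
    for word, i in word_index.items():
        vec = vectors.get(word)
        if vec is not None:
            matrix[i] = list(vec)
    return matrix


def embed(sequence, matrix):
    """A frozen embedding layer is a plain row lookup."""
    return [matrix[i] for i in sequence]


toy_word_index = {"the": 1, "cat": 2, "zzz": 3}          # "zzz" has no vector
toy_vectors = {"the": [0.1, 0.2], "cat": [0.3, 0.4]}
toy_matrix = build_matrix(toy_word_index, toy_vectors, dim=2)
print(embed([0, 1, 3], toy_matrix))
```

Both the padding index 0 and out-of-vocabulary words map to all-zero rows, so the network cannot tell them apart through the embedding alone.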
3. Training the model
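The model stacks three 1D convolution layers, each followed by max pooling, then a dense layer and a single sigmoid output unit.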
    from keras.models import Model
    from keras.layers import Input, Conv1D, MaxPooling1D, Flatten, Dense

    sequence_input = Input(shape=(data.shape[1],), dtype='int32')  # 10036 for this data
    embedded_sequences = embedding_layer(sequence_input)
    x = Conv1D(128, 5, activation='relu')(embedded_sequences)
    x = MaxPooling1D(5)(x)
    x = Conv1D(128, 5, activation='relu')(x)
    x = MaxPooling1D(5)(x)
    x = Conv1D(128, 5, activation='relu')(x)
    # window of 35: leaves 11 time steps for an input length of 10036,
    # so this is not a global max pooling here
    x = MaxPooling1D(35)(x)
    x = Flatten()(x)
    x = Dense(128, activation='relu')(x)
    # one sigmoid unit: probability of class 1 (only two newsgroups are used)
    preds = Dense(1, activation='sigmoid')(x)

    model = Model(sequence_input, preds)
    model.compile(loss='binary_crossentropy',
                  optimizer='rmsprop',
                  metrics=['acc'])

    # happy learning!
    model.fit(x_train, np.asarray(y_train),
              validation_data=(x_val, np.asarray(y_val)),
              epochs=4, batch_size=128)
    model.save('e:/mymodel.h5')
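The number of time steps that reach the Flatten layer can be checked by hand. The sketch below assumes the Keras defaults for these layers ('valid' padding, convolution stride 1, pooling stride equal to the pooling window); the helper names are hypothetical.

```python
def conv1d_len(n, kernel):
    """Output length of a 'valid' 1D convolution with stride 1."""
    return n - kernel + 1


def pool1d_len(n, window):
    """Output length of 'valid' max pooling with stride equal to the window."""
    return n // window


def final_steps(n):
    """Time steps left after the three conv/pool stages of the model above."""
    n = pool1d_len(conv1d_len(n, 5), 5)
    n = pool1d_len(conv1d_len(n, 5), 5)
    return pool1d_len(conv1d_len(n, 5), 35)


print(final_steps(1000), final_steps(10036))
```

A final window of 35 collapses the sequence to a single step only for an input of length 1000. With the padded length of 10036 used here, 11 steps remain, so Flatten produces 11 × 128 = 1408 features. If a true global pooling is wanted regardless of input length, Keras provides a GlobalMaxPooling1D layer for that purpose.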
4. References
http://keras-cn.readthedocs.io/en/latest/blog/word_embedding/