TensorFlow Exercise 1: Classifying Reviews
Source: Internet | Editor: 程序博客网 | Date: 2024/06/07 12:53
# -*- coding: utf-8 -*-
import random
from collections import Counter

import numpy as np
import tensorflow as tf  # written against the TensorFlow 1.x API

import nltk
from nltk.tokenize import word_tokenize
# word_tokenize example: "I'm super man" -> ['I', "'m", 'super', 'man']
from nltk.stem import WordNetLemmatizer
# Lemmatization reduces any inflected form of an English word to its base form.
# It differs from stemming, which strips a word down to its stem.

nltk.download('wordnet')

pos_file = 'pos.txt'
neg_file = 'neg.txt'


# Build the vocabulary
def create_lexicon(pos_file, neg_file):
    # Read one file and return all of its tokens
    def process_file(path):
        tokens = []
        with open(path, 'r') as f:  # fixed: open the file that was passed in, not always pos_file
            lines = f.readlines()
            for line in lines:
                try:
                    tokens += word_tokenize(line.lower())  # split the line into word tokens
                except Exception:
                    pass  # skip lines that cannot be tokenized
        return tokens

    lex = []
    lex += process_file(pos_file)
    lex += process_file(neg_file)  # lex now holds every token from both files

    lemmatizer = WordNetLemmatizer()
    nlex = []
    for word in lex:
        try:
            # Reduce each word to its base form (cats -> cat).
            # Fixed: append the lemma; `+=` with a string would add it one character at a time.
            nlex.append(lemmatizer.lemmatize(word))
        except Exception:
            pass
    lex = nlex

    word_count = Counter(lex)  # count how often each word occurs
    # {'.': 13944, ',': 10536, 'the': 10120, 'a': 9444, 'and': 7108, 'of': 6624, 'it': 4748, 'to': 3940, ...}
    # Drop very common words (the, a, and, ...) as well as rare ones;
    # they contribute little to deciding whether a review is positive or negative.
    lex = []  # keep only words inside the chosen frequency range
    for word in word_count:
        if 20 < word_count[word] < 2000:  # hard-coded thresholds; a percentage would probably work too
            lex.append(word)
    # Zipf's law: verifying the Zipf distribution of text with Python http://blog.topspeedsnail.com/archives/9546
    return lex


lex = create_lexicon(pos_file, neg_file)  # lex holds the words kept from the text


# Convert each review into a vector. How the conversion works:
# suppose lex is ['woman', 'great', 'feel', 'actually', 'looking', 'latest', 'seen', 'is'] (the real one is far larger).
# The review 'i think this movie is great' becomes [0,1,0,0,0,0,0,1]:
# words of the review that appear in lex are marked 1, all other positions stay 0.
def normalize_dataset(lex):
    dataset = []

    # lex: vocabulary; review: one review; clf: its class, [0,1] = negative review, [1,0] = positive review
    # Turn one line of the file (one review) into a feature vector
    def string_to_vector(lex, review, clf):
        words = word_tokenize(review.lower())
        lemmatizer = WordNetLemmatizer()
        nwords = []
        for word in words:
            try:
                nwords.append(lemmatizer.lemmatize(word))  # fixed: append, not +=
            except Exception:
                pass
        words = nwords

        features = np.zeros(len(lex))
        for word in words:
            if word in lex:
                # A word may occur twice in a sentence; += 1 would also work, with little difference
                features[lex.index(word)] = 1
        return [features, clf]

    with open(pos_file, 'r') as f:
        lines = f.readlines()
        for line in lines:
            try:
                one_sample = string_to_vector(lex, line, [1, 0])  # [array([0., 1., 0., ..., 0., 0., 0.]), [1, 0]]
                dataset.append(one_sample)
            except Exception:
                pass

    with open(neg_file, 'r') as f:
        lines = f.readlines()
        for line in lines:
            try:
                one_sample = string_to_vector(lex, line, [0, 1])  # [array([0., 0., 0., ..., 0., 0., 0.]), [0, 1]]
                dataset.append(one_sample)
            except Exception:
                pass

    return dataset


dataset = normalize_dataset(lex)  # every review line, represented as a vector
random.shuffle(dataset)  # shuffle the list in place

# Hold out 10% of the samples as test data
test_size = int(len(dataset) * 0.1)
# Layout: [[X1, Y1], [X2, Y2], ...]; dtype=object because X and Y have different lengths
dataset = np.array(dataset, dtype=object)

train_dataset = dataset[:-test_size]
test_dataset = dataset[-test_size:]

# Feed-forward neural network
# Number of "neurons" in each layer
n_input_layer = len(lex)  # input layer
n_layer_1 = 1000  # hidden layer
n_layer_2 = 1000  # hidden layer: sounds mysterious, but it is just a layer between input and output
n_output_layer = 2  # output layer


# Define the network to be trained
def neural_network(data):
    # Weights and biases of the first layer; w: n_input_layer x n_layer_1, b: n_layer_1
    layer_1_w_b = {'w_': tf.Variable(tf.random_normal([n_input_layer, n_layer_1])),
                   'b_': tf.Variable(tf.random_normal([n_layer_1]))}
    # Weights and biases of the second layer
    layer_2_w_b = {'w_': tf.Variable(tf.random_normal([n_layer_1, n_layer_2])),
                   'b_': tf.Variable(tf.random_normal([n_layer_2]))}
    # Weights and biases of the output layer
    layer_output_w_b = {'w_': tf.Variable(tf.random_normal([n_layer_2, n_output_layer])),
                        'b_': tf.Variable(tf.random_normal([n_output_layer]))}

    # w·x + b
    layer_1 = tf.add(tf.matmul(data, layer_1_w_b['w_']), layer_1_w_b['b_'])
    layer_1 = tf.nn.relu(layer_1)  # activation function: negative elements become 0
    layer_2 = tf.add(tf.matmul(layer_1, layer_2_w_b['w_']), layer_2_w_b['b_'])
    layer_2 = tf.nn.relu(layer_2)  # activation function
    layer_output = tf.add(tf.matmul(layer_2, layer_output_w_b['w_']), layer_output_w_b['b_'])

    return layer_output


# Train on 50 samples per batch
batch_size = 50

# [None, n_input_layer] is the (rows, columns) shape of the input matrix. Declaring it makes
# TensorFlow raise an error when the fed data has a different shape; it can also be left unspecified.
X = tf.placeholder('float', [None, len(train_dataset[0][0])])
Y = tf.placeholder('float')


# Train the neural network on the data
def train_neural_network(X, Y):
    predict = neural_network(X)  # output (logits) of the network we are training
    # labels must be the targets Y and logits the network output, not the other way round
    cost_func = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=Y, logits=predict))
    optimizer = tf.train.AdamOptimizer().minimize(cost_func)  # learning rate defaults to 0.001

    epochs = 20
    with tf.Session() as session:
        session.run(tf.global_variables_initializer())  # replaces the deprecated initialize_all_variables()
        for epoch in range(epochs):
            # Shuffle rows in place; random.shuffle can duplicate rows of a 2-D NumPy array
            np.random.shuffle(train_dataset)
            # train_dataset layout: [[X1, Y1], [X2, Y2], ...]
            train_x = train_dataset[:, 0]
            train_y = train_dataset[:, 1]

            i = 0
            epoch_loss = 0
            while i < len(train_x):
                start = i
                end = i + batch_size
                batch_x = train_x[start:end]
                batch_y = train_y[start:end]
                _, c = session.run([optimizer, cost_func],
                                   feed_dict={X: list(batch_x), Y: list(batch_y)})
                epoch_loss += c
                i += batch_size
            print(epoch, ' : ', epoch_loss)  # loss summed over all batches of the epoch

        test_x = test_dataset[:, 0]
        test_y = test_dataset[:, 1]
        # tf.argmax(predict, 1) returns, for each row ([0,1] or [1,0]), the index of the
        # largest value along axis 1, i.e. which position is "on".
        # predict and Y both have the form [[0,1], [1,0], ...]
        correct = tf.equal(tf.argmax(predict, 1), tf.argmax(Y, 1))  # predict is X run through neural_network
        accuracy = tf.reduce_mean(tf.cast(correct, 'float'))
        print('Accuracy: ', accuracy.eval({X: list(test_x), Y: list(test_y)}))


train_neural_network(X, Y)
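The review-to-vector rule described in the script's comments can be tried on its own. What follows is only a minimal sketch: it assumes plain whitespace splitting in place of NLTK tokenization and lemmatization, and the helper name `to_multi_hot` is hypothetical, not part of the script.

```python
def to_multi_hot(lex, review):
    """Mark every lexicon word that occurs in the review with 1, all others with 0."""
    # Assumption: whitespace splitting stands in for word_tokenize + lemmatization.
    index = {word: i for i, word in enumerate(lex)}  # dict lookup instead of lex.index()
    features = [0] * len(lex)
    for word in review.lower().split():
        if word in index:
            features[index[word]] = 1
    return features


lex = ['woman', 'great', 'feel', 'actually', 'looking', 'latest', 'seen', 'is']
vector = to_multi_hot(lex, 'i think this movie is great')
print(vector)  # [0, 1, 0, 0, 0, 0, 0, 1]
```

Only 'great' and 'is' occur in the lexicon, so positions 1 and 7 are set. The dictionary lookup is a design choice for this sketch; on a lexicon with thousands of words it avoids scanning the whole list for every token.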
Dataset download links:
neg.txt: 5331 negative movie reviews (http://blog.topspeedsnail.com/wp-content/uploads/2016/11/neg.txt)
pos.txt: 5331 positive movie reviews (http://blog.topspeedsnail.com/wp-content/uploads/2016/11/pos.txt)
Output of my training run (summed loss per epoch, then accuracy on the test set):
(0, ' : ', 9017.1933832168579)
(1, ' : ', 4066.41361951828)
(2, ' : ', 4068.9778349399567)
(3, ' : ', 3021.8649171590805)
(4, ' : ', 2755.1698242425919)
(5, ' : ', 3285.6446791887283)
(6, ' : ', 2699.1985047459602)
(7, ' : ', 3897.1043330430984)
(8, ' : ', 3554.3585470914841)
(9, ' : ', 2428.1245975494385)
(10, ' : ', 3937.8592742085457)
(11, ' : ', 2276.5575492374992)
(12, ' : ', 2440.6624698638916)
(13, ' : ', 1014.4160533390241)
(14, ' : ', 1225.6515423953533)
(15, ' : ', 892.63348872936263)
(16, ' : ', 822.74370710202493)
(17, ' : ', 454.85909063366125)
(18, ' : ', 165.94665694236755)
(19, ' : ', 3.1606742052643568)
('Accuracy: ', 0.51361501)
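So an accuracy of about 51% on a two-class problem is only a hair better than blind guessing? I could cry. My guess is that the problem lies in the dataset, though.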
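A training loss that falls to almost zero while test accuracy stays near 50% can also come from the way the features are built, not only from the data. One Python behaviour worth ruling out is shown below. It is a general sketch and not a confirmed diagnosis of the run above: adding a string to a list with `+=` extends the list with single characters instead of the word.

```python
# `+=` on a list iterates over its right-hand side, so a string
# is added one character at a time.
tokens_extended = []
tokens_extended += "cat"

# append() adds the string as a single element.
tokens_appended = []
tokens_appended.append("cat")

print(tokens_extended)  # ['c', 'a', 't']
print(tokens_appended)  # ['cat']
```

If a vocabulary is built with the first form, it ends up holding letters and punctuation marks, which carry almost no signal about sentiment.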
Original article: http://blog.topspeedsnail.com/archives/10399
The code in the original article contains quite a few errors, however, so I have made some corrections here.