NLP Basics - BOW - Movie Review Classification
Source: Internet · Published by: 飞哥软件 · Editor: 程序博客网 · Date: 2024/05/16 14:46
The method in this post comes from the CMU course *Neural Networks for NLP*.
1. Model: BOW (bag of words)
2. Language: Python
3. Framework: DyNet
P.S. Apologies if the English comments in the code are not well written.
Walkthrough
- Part 1: reading the data
  - Use `collections.defaultdict` to build auto-incrementing dictionaries `w2i` (word to index) and `t2i` (tag to index)
  - When reading the file, return each example from the loop with `yield`, which keeps memory usage low
```python
from collections import defaultdict
import time
import random
import dynet as dy
import numpy as np

# Functions to read in the corpus.
# Initialize a dictionary whose default value is its current length:
# e.g. the first lookup w2i["a"] automatically assigns 0, and the next
# new word, w2i["b"], is assigned 1.
w2i = defaultdict(lambda: len(w2i))
t2i = defaultdict(lambda: len(t2i))
UNK = w2i["<unk>"]

# read data
def read_dataset(filename):
    with open(filename, "r") as f:
        for line in f:
            tag, words = line.lower().strip().split(" ||| ")
            # yield makes this a generator that returns the next value
            # on demand, cutting down the memory cost;
            # it is like an iterative return
            yield ([w2i[x] for x in words.split(" ")], t2i[tag])
```
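The auto-incrementing dictionary trick can be seen in isolation with plain Python (a minimal sketch using a made-up word, independent of the corpus files above):

```python
from collections import defaultdict

# A dictionary whose default value is its current size: every new key
# is automatically assigned the next unused integer index.
w2i = defaultdict(lambda: len(w2i))

UNK = w2i["<unk>"]    # first lookup -> index 0
first = w2i["movie"]  # new word -> index 1
again = w2i["movie"]  # existing word keeps its index

print(UNK, first, again)  # 0 1 1
```

Note that the factory is evaluated at lookup time, so `len(w2i)` always reflects the size of the dictionary just before the new key is inserted.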
- Part 2: initializing the data and building the model
  - Build the `train` and `dev` lists with the reader from Part 1
  - Since we only know the words in train.txt, `w2i` must stop growing (unseen words map to `UNK`) before we read the dev set
  - BOW weights are built from the word count and tag count, i.e. how strongly each word is associated with each tag, so record `nwords` and `ntags`
  - Create the model and trainer
  - The weight and bias shapes are determined by `nwords` and `ntags`
```python
train = list(read_dataset("../data/classes/train.txt"))
# Because we only know the words from train.txt, we need to stop w2i
# from growing before we read test.txt: unseen words fall back to UNK.
w2i = defaultdict(lambda: UNK, w2i)
dev = list(read_dataset("../data/classes/test.txt"))

nwords = len(w2i)  # number of words
ntags = len(t2i)   # number of tags (kinds of labels)

# Start DyNet and define the trainer
model = dy.Model()
trainer = dy.AdamTrainer(model)

# Define the model
W_sm = model.add_lookup_parameters((nwords, ntags))  # Word weights
b_sm = model.add_parameters((ntags))                 # Softmax bias
```
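Why the re-wrapping of `w2i` matters can be checked with a toy vocabulary (hypothetical words, not the actual corpus):

```python
from collections import defaultdict

w2i = defaultdict(lambda: len(w2i))
UNK = w2i["<unk>"]  # index 0
w2i["good"]         # index 1
w2i["film"]         # index 2

# Freeze: unseen words now map to UNK instead of growing the vocabulary.
w2i = defaultdict(lambda: UNK, w2i)

print(w2i["good"])  # 1 (known word keeps its index)
print(w2i["zzz"])   # 0 (unseen word falls back to <unk>)
```

One subtlety: a `defaultdict` lookup still *inserts* the missing key (here `"zzz" -> 0`), so `len(w2i)` keeps growing, but every unseen word shares the single `UNK` index and the model's parameter shapes stay fixed.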
- Part 3: computing scores
  - Sum each word's weight vector `dy.lookup(W_sm, x)` (this is what "bag of words" means: word order is ignored)
  - Then add the per-tag bias to get a score for each tag, i.e. how well this sentence scores under each label
```python
# A function to calculate scores for one example
def calc_scores(words):
    # Create a computation graph, and add parameters
    dy.renew_cg()
    b_sm_exp = dy.parameter(b_sm)
    # Take the sum of all the embedding vectors for each word
    score = dy.esum([dy.lookup(W_sm, x) for x in words])
    # Add the bias vector and return
    return score + b_sm_exp
```
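The scoring step is just a sum of per-word weight rows plus a bias. A NumPy-only sketch (toy sizes and random weights, standing in for the DyNet version) makes the order-invariance of BOW explicit:

```python
import numpy as np

nwords, ntags = 5, 3  # hypothetical vocabulary / label sizes
rng = np.random.default_rng(0)
W_sm = rng.normal(size=(nwords, ntags))  # one weight row per word
b_sm = np.zeros(ntags)                   # softmax bias

def calc_scores(words):
    # "bag of words": order is ignored, only the sum of word rows matters
    return W_sm[words].sum(axis=0) + b_sm

s1 = calc_scores([0, 2, 1])
s2 = calc_scores([1, 0, 2])  # same words, different order
print(np.allclose(s1, s2))   # True: BOW scores are order-invariant
```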
- Part 4: the training loop
  - Train for 100 iterations; the per-example loss is `-log(dy.softmax(e1))` picked at the gold label: the smaller the gold label's probability, the larger its negative log, so that value serves as the loss
  - `backward()` computes gradients and `trainer.update()` lowers the loss
  - For each sentence in `dev`, compute its scores; the label with the highest score is the prediction, from which we compute accuracy
```python
for ITER in range(100):
    # Perform training
    random.shuffle(train)
    train_loss = 0.0
    start = time.time()
    for words, tag in train:
        # pick(-log(dy.softmax(scores)), tag)
        my_loss = dy.pickneglogsoftmax(calc_scores(words), tag)
        train_loss += my_loss.value()
        my_loss.backward()
        trainer.update()
    print("iter %r: train loss/sent=%.4f, time=%.2fs" %
          (ITER, train_loss / len(train), time.time() - start))
    # Perform testing
    test_correct = 0.0
    for words, tag in dev:
        scores = calc_scores(words).npvalue()
        predict = np.argmax(scores)
        if predict == tag:
            test_correct += 1
    print("iter %r: test acc=%.4f" % (ITER, test_correct / len(dev)))
```
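The loss `pick(-log(softmax(scores)), tag)` can be reproduced with NumPy to see what one training example contributes (a sketch of the math, not the DyNet call itself):

```python
import numpy as np

def pickneglogsoftmax(scores, tag):
    # Numerically stable -log(softmax), then pick the gold tag's entry.
    shifted = scores - scores.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[tag]

scores = np.array([2.0, 0.5, -1.0])
loss_good = pickneglogsoftmax(scores, 0)  # gold tag has the highest score
loss_bad = pickneglogsoftmax(scores, 2)   # gold tag has the lowest score
print(loss_good < loss_bad)               # True: low gold probability -> large loss
```

This is exactly the "probability smaller, negative log larger" behavior described above: training pushes the gold tag's score up relative to the others, driving its loss toward zero.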