NLP Basics - BOW - Movie Review Classification


The method used in this post comes from the CMU course "Neural Networks for NLP".


1. Model: BOW (bag of words)

2. Language: Python

3. Framework: DyNet

P.S. Apologies for the rough English comments in the code.


Diagram
(figure: bag-of-words model diagram)

  • Part 1: Reading the data
    1. Use collections.defaultdict to build the auto-incrementing dictionaries w2i (word to index) and t2i (tag to index)
    2. In the file-reading code, return each example from the loop with yield to keep memory usage low
from collections import defaultdict
import time
import random
import dynet as dy
import numpy as np

# Functions to read in the corpus
# defaultdict(lambda: len(d)) initializes each new key with the dictionary's current length:
# for example, the first lookup w2i["a"] assigns it 0, and the next new word w2i["b"] gets 1
w2i = defaultdict(lambda: len(w2i))
t2i = defaultdict(lambda: len(t2i))
UNK = w2i["<unk>"]

# Read the data
def read_dataset(filename):
  with open(filename, "r") as f:
    for line in f:
      tag, words = line.lower().strip().split(" ||| ")
      # yield turns this into a generator that returns one example at a time,
      # which cuts down the memory cost compared to building the whole list at once
      yield ([w2i[x] for x in words.split(" ")], t2i[tag])
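To make the auto-incrementing dictionary and the input format concrete, here is a small standalone sketch (my own illustration, not part of the original script); the example sentence is made up, and the dataset is assumed to have one example per line in the form "tag ||| sentence", which is what read_dataset splits on.

from collections import defaultdict

# Illustration only: a fresh dictionary, separate from the w2i used by the script
demo_w2i = defaultdict(lambda: len(demo_w2i))
print(demo_w2i["<unk>"])   # 0 -- the first key seen gets index 0
print(demo_w2i["movie"])   # 1 -- each new key gets the next free index
print(demo_w2i["movie"])   # 1 -- repeated keys keep their original index

# read_dataset expects lines like the following (hypothetical example):
#   3 ||| a deliciously funny film
# which becomes ([demo_w2i["a"], demo_w2i["deliciously"], ...], t2i["3"])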
  • Part 2: Initializing the data and building the model
    1. Build the train list and dev list with the reader from Part 1
    2. Since we only know the words that appear in train.txt, w2i must stop growing before the dev set is read
    3. In BOW, the word weights are sized by the number of words and the number of tags, i.e. each weight measures how strongly a word is associated with a tag, so we record nwords and ntags
    4. Build the model and the trainer
    5. The weights and the bias are created from nwords and ntags
train = list(read_dataset("../data/classes/train.txt"))
# Because we only know the words from train.txt, we freeze w2i before reading test.txt:
# from now on, any unseen word maps to UNK instead of getting a new index
w2i = defaultdict(lambda: UNK, w2i)
dev = list(read_dataset("../data/classes/test.txt"))

# Number of words
nwords = len(w2i)
# Number of tags (kinds of labels)
ntags = len(t2i)

# Start DyNet and define trainer
model = dy.Model()
trainer = dy.AdamTrainer(model)

# Define the model
W_sm = model.add_lookup_parameters((nwords, ntags)) # Word weights: one ntags-long row per word
b_sm = model.add_parameters((ntags))                # Softmax bias
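A small sketch of what the "freeze" step does (my own illustration, with made-up words): once the default factory is switched to return UNK, looking up an unseen dev word no longer creates a new index.

from collections import defaultdict

vocab = defaultdict(lambda: len(vocab))
UNK = vocab["<unk>"]              # index 0
_ = vocab["good"]                 # index 1, a word seen during training

vocab = defaultdict(lambda: UNK, vocab)   # freeze: unknown words now fall back to UNK
print(vocab["good"])              # 1 -- known words keep their index
print(vocab["unseenword"])        # 0 -- a word not in train.txt maps to <unk>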
  • Part 3: Computing scores
    1. Sum the weight vectors dy.lookup(W_sm, x) of all the words in the sentence (this is the "bag of words")
    2. Add the per-tag bias to get a score for every tag, i.e. how many points the sentence scores under each label
# A function to calculate the scores for one example
def calc_scores(words):
  # Create a new computation graph, and add the parameters to it
  dy.renew_cg()
  b_sm_exp = dy.parameter(b_sm)
  # Take the sum of the embedding (weight) vectors of all the words in the sentence
  score = dy.esum([dy.lookup(W_sm, x) for x in words])
  # Add the bias vector and return
  return score + b_sm_exp
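For intuition, the score computed above is score[t] = b[t] + the sum over words i of W[word_i, t]. The following plain-NumPy sketch (my own illustration, with made-up toy sizes) spells out the same arithmetic without DyNet.

import numpy as np

# Toy sizes for illustration only: a 5-word vocabulary and 2 tags
W = np.random.randn(5, 2)   # plays the role of W_sm: one ntags-long weight row per word
b = np.random.randn(2)      # plays the role of b_sm

def calc_scores_np(words):
    # Sum the weight rows of the words in the sentence, then add the bias
    return b + sum(W[x] for x in words)

print(calc_scores_np([0, 3, 3, 1]))   # one score per tag; higher means a better match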
  • Part 4: Iterative training
    1. Train for 100 iterations; the loss is pick(-log(dy.softmax(scores)), tag): the smaller the probability of the gold label, the larger its negative log, and the entry picked for that label is used as the loss
    2. Backpropagate and update to reduce the loss
    3. For each sentence in dev, compute its scores; the label with the highest score is the prediction, and from that compute the accuracy
for ITER in range(100):
  # Perform training
  random.shuffle(train)
  train_loss = 0.0
  start = time.time()
  for words, tag in train:
    # Equivalent to pick(-log(dy.softmax(scores)), tag)
    my_loss = dy.pickneglogsoftmax(calc_scores(words), tag)
    train_loss += my_loss.value()
    my_loss.backward()
    trainer.update()
  print("iter %r: train loss/sent=%.4f, time=%.2fs" % (ITER, train_loss/len(train), time.time()-start))
  # Perform testing
  test_correct = 0.0
  for words, tag in dev:
    scores = calc_scores(words).npvalue()
    predict = np.argmax(scores)
    if predict == tag:
      test_correct += 1
  print("iter %r: test acc=%.4f" % (ITER, test_correct/len(dev)))
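To spell out the loss, here is a NumPy sketch of what dy.pickneglogsoftmax does for a single example (my own illustration; the scores below are made up).

import numpy as np

def pick_neg_log_softmax(scores, tag):
    # Softmax over the per-tag scores, then the negative log-probability of the gold tag
    probs = np.exp(scores - np.max(scores))   # subtract the max for numerical stability
    probs /= probs.sum()
    return -np.log(probs[tag])

# Made-up scores for 5 sentiment classes; the loss is small when the gold tag scores highest
print(pick_neg_log_softmax(np.array([0.1, 2.0, -1.0, 0.5, 0.0]), tag=1))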