NLP Basics - BOW - Movie Review Classification


The method used in this post comes from the CMU course "Neural Networks for NLP".


1. Model: BOW (bag of words)

2. Language: Python

3. Framework: DyNet

P.S. Apologies for the rough English comments in the code.


Diagram
(figure: bag-of-words model diagram)

  • Part 1: Reading the data
    1. Use collections.defaultdict to build the auto-incrementing dictionaries w2i (word to index) and t2i (tag to index)
    2. In the file-reading code, return each example from the loop with yield to keep memory usage low
from collections import defaultdict
import time
import random
import dynet as dy
import numpy as np

# Functions to read in the corpus
# defaultdict(lambda: len(d)) initializes each new key with the dictionary's current length:
# for example, the first lookup w2i["a"] assigns it 0, and the next new word w2i["b"] gets 1
w2i = defaultdict(lambda: len(w2i))
t2i = defaultdict(lambda: len(t2i))
UNK = w2i["<unk>"]

# Read the data
def read_dataset(filename):
  with open(filename, "r") as f:
    for line in f:
      tag, words = line.lower().strip().split(" ||| ")
      # yield turns this into a generator that returns one example at a time,
      # which cuts down the memory cost compared to building the whole list at once
      yield ([w2i[x] for x in words.split(" ")], t2i[tag])
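To make the auto-incrementing dictionary and the input format concrete, here is a small standalone sketch (my own illustration, not part of the original script); the example sentence is made up, and the dataset is assumed to have one example per line in the form "tag ||| sentence", which is what read_dataset splits on.

from collections import defaultdict

# Illustration only: a fresh dictionary, separate from the w2i used by the script
demo_w2i = defaultdict(lambda: len(demo_w2i))
print(demo_w2i["<unk>"])   # 0 -- the first key seen gets index 0
print(demo_w2i["movie"])   # 1 -- each new key gets the next free index
print(demo_w2i["movie"])   # 1 -- repeated keys keep their original index

# read_dataset expects lines like the following (hypothetical example):
#   3 ||| a deliciously funny film
# which becomes ([demo_w2i["a"], demo_w2i["deliciously"], ...], t2i["3"])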
  • Part 2: Initializing the data and building the model
    1. Build the train list and dev list with the reader from Part 1
    2. Since we only know the words that appear in train.txt, w2i must stop growing before the dev set is read
    3. In BOW, the word weights are sized by the number of words and the number of tags, i.e. each weight measures how strongly a word is associated with a tag, so we record nwords and ntags
    4. Build the model and the trainer
    5. The weights and the bias are created from nwords and ntags
train = list(read_dataset("../data/classes/train.txt"))
# Because we only know the words from train.txt, we freeze w2i before reading test.txt:
# from now on, any unseen word maps to UNK instead of getting a new index
w2i = defaultdict(lambda: UNK, w2i)
dev = list(read_dataset("../data/classes/test.txt"))

# Number of words
nwords = len(w2i)
# Number of tags (kinds of labels)
ntags = len(t2i)

# Start DyNet and define trainer
model = dy.Model()
trainer = dy.AdamTrainer(model)

# Define the model
W_sm = model.add_lookup_parameters((nwords, ntags)) # Word weights: one ntags-long row per word
b_sm = model.add_parameters((ntags))                # Softmax bias
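A small sketch of what the "freeze" step does (my own illustration, with made-up words): once the default factory is switched to return UNK, looking up an unseen dev word no longer creates a new index.

from collections import defaultdict

vocab = defaultdict(lambda: len(vocab))
UNK = vocab["<unk>"]              # index 0
_ = vocab["good"]                 # index 1, a word seen during training

vocab = defaultdict(lambda: UNK, vocab)   # freeze: unknown words now fall back to UNK
print(vocab["good"])              # 1 -- known words keep their index
print(vocab["unseenword"])        # 0 -- a word not in train.txt maps to <unk>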
  • Part 3: Computing scores
    1. Sum the weight vectors dy.lookup(W_sm, x) of all the words in the sentence (this is the "bag of words")
    2. Add the per-tag bias to get a score for every tag, i.e. how many points the sentence scores under each label
# A function to calculate the scores for one example
def calc_scores(words):
  # Create a new computation graph, and add the parameters to it
  dy.renew_cg()
  b_sm_exp = dy.parameter(b_sm)
  # Take the sum of the embedding (weight) vectors of all the words in the sentence
  score = dy.esum([dy.lookup(W_sm, x) for x in words])
  # Add the bias vector and return
  return score + b_sm_exp
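For intuition, the score computed above is score[t] = b[t] + the sum over words i of W[word_i, t]. The following plain-NumPy sketch (my own illustration, with made-up toy sizes) spells out the same arithmetic without DyNet.

import numpy as np

# Toy sizes for illustration only: a 5-word vocabulary and 2 tags
W = np.random.randn(5, 2)   # plays the role of W_sm: one ntags-long weight row per word
b = np.random.randn(2)      # plays the role of b_sm

def calc_scores_np(words):
    # Sum the weight rows of the words in the sentence, then add the bias
    return b + sum(W[x] for x in words)

print(calc_scores_np([0, 3, 3, 1]))   # one score per tag; higher means a better match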
  • Part 4: Iterative training
    1. Train for 100 iterations; the loss is pick(-log(dy.softmax(scores)), tag): the smaller the probability of the gold label, the larger its negative log, and the entry picked for that label is used as the loss
    2. Backpropagate and update to reduce the loss
    3. For each sentence in dev, compute its scores; the label with the highest score is the prediction, and from that compute the accuracy
for ITER in range(100):
  # Perform training
  random.shuffle(train)
  train_loss = 0.0
  start = time.time()
  for words, tag in train:
    # Equivalent to pick(-log(dy.softmax(scores)), tag)
    my_loss = dy.pickneglogsoftmax(calc_scores(words), tag)
    train_loss += my_loss.value()
    my_loss.backward()
    trainer.update()
  print("iter %r: train loss/sent=%.4f, time=%.2fs" % (ITER, train_loss/len(train), time.time()-start))
  # Perform testing
  test_correct = 0.0
  for words, tag in dev:
    scores = calc_scores(words).npvalue()
    predict = np.argmax(scores)
    if predict == tag:
      test_correct += 1
  print("iter %r: test acc=%.4f" % (ITER, test_correct/len(dev)))
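To spell out the loss, here is a NumPy sketch of what dy.pickneglogsoftmax does for a single example (my own illustration; the scores below are made up).

import numpy as np

def pick_neg_log_softmax(scores, tag):
    # Softmax over the per-tag scores, then the negative log-probability of the gold tag
    probs = np.exp(scores - np.max(scores))   # subtract the max for numerical stability
    probs /= probs.sum()
    return -np.log(probs[tag])

# Made-up scores for 5 sentiment classes; the loss is small when the gold tag scores highest
print(pick_neg_log_softmax(np.array([0.1, 2.0, -1.0, 0.5, 0.0]), tag=1))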