NLP: language model, LSTM, attention model

Source: Internet  Published by: Linux隐藏服务器lp  Editor: 程序博客网  Time: 2024/05/16 17:22

Reference code: https://github.com/yunjey/pytorch-tutorial/tree/master/tutorials/02-intermediate/language_model

A brief introduction to LSTM: https://zhuanlan.zhihu.com/p/24720659

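1. Language model

A language model estimates how likely a sequence of words is to be a natural sentence, i.e. it assigns a probability to the sentence.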
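To make the idea concrete, here is a minimal sketch of a count-based bigram model, a much simpler stand-in for the LSTM model used in this post. The toy corpus, the add-one smoothing, and the function name log_prob are illustrative assumptions, not part of the reference code.

```python
import math
from collections import Counter

# Toy corpus (hypothetical).
sentences = [
    "the cat sat on the mat",
    "the dog sat on the rug",
]

unigrams = Counter()
bigrams = Counter()
for s in sentences:
    words = ['<sos>'] + s.split() + ['<eos>']
    unigrams.update(words[:-1])                   # count each context word
    bigrams.update(zip(words[:-1], words[1:]))    # count (previous, next) pairs

vocab = {w for s in sentences for w in s.split()} | {'<sos>', '<eos>'}


def log_prob(sentence):
    """Log-probability of a sentence under the add-one smoothed bigram model."""
    words = ['<sos>'] + sentence.split() + ['<eos>']
    total = 0.0
    for prev, nxt in zip(words[:-1], words[1:]):
        p = (bigrams[(prev, nxt)] + 1) / (unigrams[prev] + len(vocab))
        total += math.log(p)
    return total


natural = log_prob("the cat sat on the rug")
scrambled = log_prob("rug the on sat cat the")
print(natural > scrambled)   # True: the natural word order scores higher
```

The same words in a scrambled order receive a lower score, which is the behaviour a language model is expected to show.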

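Note: in this model the short segments do not overlap.

batch_size: the number of samples in each training batch

seq_len: the sequence length (chosen by hand, typically 30), i.e. the default segment length

corpus: the text collection (corpus) together with its dictionary of words.

Step 1: tokenize all the text and add the words to the corpus dictionary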

# -*- coding: utf-8 -*-
import os

import torch


class Dictionary(object):
    """Two-way mapping between words and integer indices."""

    def __init__(self):
        self.word2idx = {}
        self.idx2word = {}
        self.idx = 0

    def add_word(self, word):
        if word not in self.word2idx:
            self.word2idx[word] = self.idx
            self.idx2word[self.idx] = word
            self.idx += 1

    def __len__(self):
        return len(self.word2idx)


class Corpus(object):
    def __init__(self, path='./data'):
        self.dictionary = Dictionary()
        self.train = os.path.join(path, 'train.txt')
        self.test = os.path.join(path, 'test.txt')

    def get_data(self, path, batch_size=20):
        # First pass: add every word to the dictionary and count the tokens
        with open(path, 'r') as f:
            tokens = 0
            for line in f:
                words = line.split() + ['<eos>']
                tokens += len(words)
                for word in words:
                    self.dictionary.add_word(word)

        # Second pass: tokenize the file content.
        # ids is a long tensor holding the idx of every word in the text.
        ids = torch.LongTensor(tokens)
        token = 0
        with open(path, 'r') as f:
            for line in f:
                words = line.split() + ['<eos>']
                for word in words:
                    ids[token] = self.dictionary.word2idx[word]
                    token += 1

        # Drop the tail that does not fill a full row, then reshape
        # into batch_size rows
        num_batches = ids.size(0) // batch_size
        ids = ids[:num_batches * batch_size]
        return ids.view(batch_size, -1)

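# Load the Penn Treebank dataset
batch_size = 20    # same value as the default in get_data
seq_length = 30    # sequence length, see seq_len above
train_path = './data/train.txt'
sample_path = './sample.txt'
corpus = Corpus()
ids = corpus.get_data(train_path, batch_size)   # dictionary indices of the words, in text order
vocab_size = len(corpus.dictionary)             # number of distinct words in the dictionary
num_batches = ids.size(1) // seq_length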

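Step 2: build the training data matrices from the text corpus, without overlap

Each row of the data matrix is one segment of text (fixed length 30), end-of-sentence markers included. Each value is the id of the word in the dictionary (usually written idx).

Each row of the target matrix is the corresponding row of the data matrix shifted by one position, so every target entry is the word that follows the matching data entry; its length is also 30.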

datas = Variable(ids[:, i:i+seq_length])
targets = Variable(ids[:, (i+1):(i+1)+seq_length].contiguous())

Note: ids has size [batch_size, seq_length * t], so it should be processed as t groups of data (like a loader).
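The slicing can be checked on a toy tensor. The sketch below assumes a small made-up ids matrix and seq_length = 5 instead of 30; stepping the loop by seq_length is what keeps the windows from overlapping.

```python
import torch

batch_size, seq_length = 2, 5                   # small illustrative values
ids = torch.arange(22).view(batch_size, -1)     # stand-in for the corpus ids, shape [2, 11]

pairs = []
# Step by seq_length so consecutive windows do not overlap, and stop early
# enough that the shifted target window still fits.
for i in range(0, ids.size(1) - seq_length, seq_length):
    data = ids[:, i:i + seq_length]
    target = ids[:, i + 1:i + 1 + seq_length]
    pairs.append((data, target))

for data, target in pairs:
    print(data.tolist(), '->', target.tolist())
```

With 11 columns and seq_length = 5 the loop yields two groups, and each target row is its data row moved one token forward.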

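Step 3: build the model

The model takes the data matrix above as input and is trained to output the same text shifted by one position, i.e. the next word at every position.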
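A minimal sketch of such a model, assuming the usual embedding, LSTM and linear layer stack. The class name RNNLM and all sizes are illustrative assumptions, and random ids stand in for the real data and target matrices.

```python
import torch
import torch.nn as nn


class RNNLM(nn.Module):
    """Embedding -> LSTM -> linear layer over the vocabulary."""

    def __init__(self, vocab_size, embed_size, hidden_size, num_layers):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, num_layers, batch_first=True)
        self.linear = nn.Linear(hidden_size, vocab_size)

    def forward(self, x, state):
        emb = self.embed(x)                    # [batch, seq_len, embed_size]
        out, state = self.lstm(emb, state)     # [batch, seq_len, hidden_size]
        out = out.reshape(-1, out.size(2))     # [batch * seq_len, hidden_size]
        return self.linear(out), state         # logits: [batch * seq_len, vocab_size]


# Hypothetical sizes, chosen only to keep the example small.
vocab_size, embed_size, hidden_size, num_layers = 100, 16, 32, 1
batch_size, seq_length = 4, 30

model = RNNLM(vocab_size, embed_size, hidden_size, num_layers)
criterion = nn.CrossEntropyLoss()

data = torch.randint(0, vocab_size, (batch_size, seq_length))      # stand-in for datas
targets = torch.randint(0, vocab_size, (batch_size, seq_length))   # stand-in for targets
state = (torch.zeros(num_layers, batch_size, hidden_size),
         torch.zeros(num_layers, batch_size, hidden_size))

logits, state = model(data, state)
loss = criterion(logits, targets.reshape(-1))   # one prediction per position
print(logits.shape, loss.item())
```

The loss compares the prediction at every position with the next word, which is why the targets are flattened to match the [batch * seq_len, vocab_size] logits.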

Step 4: inspect the language model's predictions
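One common way to inspect the model is to sample text from it one word at a time, feeding each sampled word back in as the next input. The sketch below uses an untrained stand-in model with hypothetical sizes, so the ids it produces are meaningless; with a trained model, the dictionary's idx2word mapping would turn them back into words.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

vocab_size, embed_size, hidden_size = 50, 8, 16   # hypothetical sizes

# Untrained stand-in for the language model.
embed = nn.Embedding(vocab_size, embed_size)
lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
linear = nn.Linear(hidden_size, vocab_size)

num_samples = 10
state = (torch.zeros(1, 1, hidden_size), torch.zeros(1, 1, hidden_size))
word = torch.randint(0, vocab_size, (1, 1))       # random first word id

sampled = []
with torch.no_grad():
    for _ in range(num_samples):
        out, state = lstm(embed(word), state)             # out: [1, 1, hidden_size]
        logits = linear(out.squeeze(1))                   # [1, vocab_size]
        probs = torch.softmax(logits, dim=1)
        word = torch.multinomial(probs, num_samples=1)    # sampled id, shape [1, 1]
        sampled.append(word.item())

print(sampled)
```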

2. Attention model

Reference: https://zhuanlan.zhihu.com/p/22081325 (the four formulas there are a must-read)

PyTorch tutorial (seq2seq translation): http://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html

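The overall structure of the attention-based translation model is as follows:

The PyTorch tutorial program processes the translation pairs one at a time, in order.

For example:

French: "vous etes fort ingenieuse .", converted to indices, is [8,1150,97,32,59,1]; the indices refer to the French vocabulary

The corresponding English: "you are very clever .", converted to indices, is [77,1150,4,8,11,1]; the indices refer to the English vocabulary

Note: there are two separate vocabularies here, and 1 stands for <EOS>, the end of the sentence.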
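The conversion from sentence to indices can be sketched with one small vocabulary class per language. The class below is modeled on the idea of a per-language vocabulary; the names Lang, add_sentence and indexes are assumptions, and because each toy vocabulary contains a single sentence, the resulting numbers differ from the ones quoted above.

```python
SOS_token = 0
EOS_token = 1


class Lang:
    """One vocabulary per language, mapping words to indices and back."""

    def __init__(self, name):
        self.name = name
        self.word2index = {}
        self.index2word = {SOS_token: "SOS", EOS_token: "EOS"}
        self.n_words = 2   # SOS and EOS are already counted

    def add_sentence(self, sentence):
        for word in sentence.split(' '):
            if word not in self.word2index:
                self.word2index[word] = self.n_words
                self.index2word[self.n_words] = word
                self.n_words += 1

    def indexes(self, sentence):
        """Word indices of a sentence, followed by EOS."""
        return [self.word2index[w] for w in sentence.split(' ')] + [EOS_token]


fra, eng = Lang('fra'), Lang('eng')     # two independent vocabularies
fra.add_sentence("vous etes fort ingenieuse .")
eng.add_sentence("you are very clever .")

fra_ids = fra.indexes("vous etes fort ingenieuse .")
eng_ids = eng.indexes("you are very clever .")
print(fra_ids)   # [2, 3, 4, 5, 6, 1]
print(eng_ids)   # [2, 3, 4, 5, 6, 1]
```

The same index can mean different words in the two vocabularies, which is why the encoder and the decoder each use their own.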

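In the translation model, sentence 1 and sentence 2 may differ in length; a global variable MAX_LENGTH is set, and the unused positions are filled with 0.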
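In the train function this padding takes the form of a zero-filled encoder_outputs buffer with MAX_LENGTH rows. A small sketch of that buffer, with hypothetical sizes and a tensor of ones standing in for real encoder outputs:

```python
import torch

MAX_LENGTH = 10      # hypothetical value
hidden_size = 4      # hypothetical value

input_length = 6     # e.g. five words plus <EOS>
# Stand-in for the per-word encoder outputs of one sentence.
per_word_outputs = torch.ones(input_length, hidden_size)

# Zero-filled buffer with a fixed number of rows, independent of sentence length.
encoder_outputs = torch.zeros(MAX_LENGTH, hidden_size)
for ei in range(input_length):
    encoder_outputs[ei] = per_word_outputs[ei]

print(encoder_outputs.shape)                        # torch.Size([10, 4])
print(encoder_outputs[input_length:].sum().item())  # 0.0, the unused rows stay zero
```

A fixed-size buffer lets the attention layer use the same weight shapes for every sentence, whatever its actual length.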

The four important formulas
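For reference, the attention mechanism is commonly written with the four equations below. This is the standard formulation from the literature and an assumption about which formulas are meant here. The symbols are also assumptions: h_j is the encoder output at source position j, s_i the decoder hidden state at step i, y_i the i-th output word, T_x the source length, a a small scoring network, and f the recurrent unit.

```latex
\begin{align}
e_{ij}      &= a(s_{i-1}, h_j) \\
\alpha_{ij} &= \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})} \\
c_i         &= \sum_{j=1}^{T_x} \alpha_{ij} h_j \\
s_i         &= f(s_{i-1}, y_{i-1}, c_i)
\end{align}
```

Read in order: score every encoder position, normalise the scores into weights, take the weighted sum of encoder outputs as the context vector, and update the decoder state with that context.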

The train function is the part executed in a single epoch; here each epoch processes one sentence pair.

import random

import torch
from torch.autograd import Variable

# MAX_LENGTH, SOS_token, EOS_token, use_cuda and teacher_forcing_ratio are
# globals defined elsewhere in the tutorial.


# Executed once per epoch; each epoch processes one sentence pair.
def train(input_variable, target_variable, encoder, decoder,
          encoder_optimizer, decoder_optimizer, criterion,
          max_length=MAX_LENGTH):
    # criterion: negative log-likelihood loss
    encoder_hidden = encoder.initHidden()   # initialise the hidden state
    encoder_optimizer.zero_grad()           # clear the accumulated gradients
    decoder_optimizer.zero_grad()

    input_length = input_variable.size()[0]     # length of the input sentence
    target_length = target_variable.size()[0]   # length of the reference translation

    # Placeholder so that the encoder outputs always have max_length rows
    encoder_outputs = Variable(torch.zeros(max_length, encoder.hidden_size))
    encoder_outputs = encoder_outputs.cuda() if use_cuda else encoder_outputs

    loss = 0

    ############################## Encoder ##############################
    # Collect the encoder output for every word of the sentence
    for ei in range(input_length):
        encoder_output, encoder_hidden = encoder(input_variable[ei], encoder_hidden)
        encoder_outputs[ei] = encoder_output[0][0]

    ########################## Attention Decoder ##########################
    # The decoder processes the words of the sequence one step at a time
    decoder_input = Variable(torch.LongTensor([[SOS_token]]))
    decoder_input = decoder_input.cuda() if use_cuda else decoder_input
    decoder_hidden = encoder_hidden

    # Decide at random whether this sentence uses teacher forcing
    use_teacher_forcing = random.random() < teacher_forcing_ratio

    if use_teacher_forcing:
        # Teacher forcing: feed the target as the next input
        for di in range(target_length):
            decoder_output, decoder_hidden, decoder_attention = decoder(
                decoder_input, decoder_hidden, encoder_output, encoder_outputs)
            loss += criterion(decoder_output, target_variable[di])  # accumulate the loss
            decoder_input = target_variable[di]  # teacher forcing
    else:
        # Without teacher forcing: use its own predictions as the next input
        for di in range(target_length):
            decoder_output, decoder_hidden, decoder_attention = decoder(
                decoder_input, decoder_hidden, encoder_output, encoder_outputs)
            topv, topi = decoder_output.data.topk(1)
            ni = topi[0][0]
            decoder_input = Variable(torch.LongTensor([[ni]]))
            decoder_input = decoder_input.cuda() if use_cuda else decoder_input
            loss += criterion(decoder_output, target_variable[di])  # accumulate the loss
            if ni == EOS_token:
                break

    loss.backward()
    encoder_optimizer.step()
    decoder_optimizer.step()

    return loss.data[0] / target_length  # on PyTorch >= 0.4 use loss.item()

Encoder

Note: the input is the idx value of a word. The program feeds the words into the encoder one at a time and keeps the final hidden state to prepare the decoder.

decoder_hidden = encoder_hidden
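A minimal sketch of such an encoder, modeled on the GRU-based encoder of the PyTorch tutorial and written for current PyTorch without Variable. The vocabulary size, hidden size and example indices are assumptions made for illustration.

```python
import torch
import torch.nn as nn


class EncoderRNN(nn.Module):
    """Embeds one word index and advances a GRU by one step."""

    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.hidden_size = hidden_size
        self.embedding = nn.Embedding(input_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size)

    def forward(self, input, hidden):
        embedded = self.embedding(input).view(1, 1, -1)   # [seq=1, batch=1, hidden]
        output, hidden = self.gru(embedded, hidden)
        return output, hidden

    def initHidden(self):
        return torch.zeros(1, 1, self.hidden_size)


# Hypothetical sizes and an example index sequence ending in <EOS> = 1.
encoder = EncoderRNN(input_size=2000, hidden_size=16)
sentence = torch.tensor([8, 1150, 97, 32, 59, 1]).view(-1, 1)   # one word per row

encoder_hidden = encoder.initHidden()
for ei in range(sentence.size(0)):
    encoder_output, encoder_hidden = encoder(sentence[ei], encoder_hidden)

decoder_hidden = encoder_hidden    # the final hidden state seeds the decoder
print(decoder_hidden.shape)        # torch.Size([1, 1, 16])
```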

Simple Decoder

Note: the decoder is trained in a loop, one word at a time,

and every step takes the output of the previous step as its input.
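The loop can be sketched as follows, with greedy decoding and an untrained decoder modeled on the tutorial's simple GRU decoder. The sizes are hypothetical, and a zero tensor stands in for the encoder's final hidden state.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

SOS_token, EOS_token = 0, 1


class DecoderRNN(nn.Module):
    """Predicts the next word from the previous word and the hidden state."""

    def __init__(self, hidden_size, output_size):
        super().__init__()
        self.embedding = nn.Embedding(output_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size)
        self.out = nn.Linear(hidden_size, output_size)

    def forward(self, input, hidden):
        embedded = F.relu(self.embedding(input).view(1, 1, -1))
        output, hidden = self.gru(embedded, hidden)
        log_probs = F.log_softmax(self.out(output[0]), dim=1)   # [1, output_size]
        return log_probs, hidden


torch.manual_seed(0)
hidden_size, output_size, max_length = 16, 50, 10   # hypothetical sizes
decoder = DecoderRNN(hidden_size, output_size)

decoder_input = torch.tensor([[SOS_token]])          # decoding starts from <SOS>
decoder_hidden = torch.zeros(1, 1, hidden_size)      # stand-in for encoder_hidden

decoded = []
with torch.no_grad():
    for di in range(max_length):
        log_probs, decoder_hidden = decoder(decoder_input, decoder_hidden)
        topv, topi = log_probs.topk(1)               # greedy choice of the next word
        ni = topi.item()
        decoded.append(ni)
        if ni == EOS_token:
            break
        decoder_input = torch.tensor([[ni]])         # previous output becomes the next input

print(decoded)
```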

Attention Decoder

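Notes:

The input here is a single word; the very first word is <SOS>, the start of the sentence.

encoder_outputs: all the outputs of the encoder module.

bmm: performs a batch matrix-matrix product of the matrices stored in the two batches batch1 and batch2.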
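In the attention decoder, bmm is what applies the attention weights to the encoder outputs. A small sketch with hypothetical sizes and random values, checking the result against an explicit weighted sum:

```python
import torch

torch.manual_seed(0)
max_length, hidden_size = 5, 3            # hypothetical sizes

# One attention weight per encoder position, normalised to sum to 1.
attn_weights = torch.softmax(torch.randn(1, max_length), dim=1)   # [1, max_length]
encoder_outputs = torch.randn(max_length, hidden_size)            # [max_length, hidden]

# bmm multiplies matching pairs of matrices: [b, n, m] x [b, m, p] -> [b, n, p].
attn_applied = torch.bmm(attn_weights.unsqueeze(0),       # [1, 1, max_length]
                         encoder_outputs.unsqueeze(0))    # [1, max_length, hidden]

# The same result written as an explicit weighted sum of the encoder outputs.
manual = (attn_weights.t() * encoder_outputs).sum(dim=0)

print(attn_applied.shape)                                 # torch.Size([1, 1, 3])
print(torch.allclose(attn_applied[0, 0], manual))
```

With a batch of one, bmm reduces to a single matrix product; the batch dimension is only there because the function expects 3-D inputs.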
