MXNet Official Documentation Tutorial (3): A Character-Level Language Model Based on a Multi-Layer LSTM


This is another advanced example from the MXNet documentation, following the handwritten-digit recognition tutorial we covered previously; this one uses an LSTM model. Since the translator has no deep background in natural language processing, this is a fairly literal translation; for the precise terminology, see the related literature and the original page: Character-level language models.


This tutorial shows how to train a character-level language model with a multi-layer recurrent neural network. Specifically, we will train a multi-layer LSTM network on speeches by US President Obama and then use it to generate new text in the same style.

 

Data Preparation

First, we download the dataset and print the first portion of it.

import os
import urllib
import zipfile

# Download and unpack the dataset if it is not already present
if not os.path.exists("char_lstm.zip"):
    urllib.urlretrieve("http://data.mxnet.io/data/char_lstm.zip", "char_lstm.zip")
with zipfile.ZipFile("char_lstm.zip", "r") as f:
    f.extractall("./")

# Print the first 1000 characters
with open('obama.txt', 'r') as f:
    print f.read()[0:1000]

Output:

Call to Renewal Keynote Address Call to Renewal Pt 1Call to Renewal Part 2 TOPIC: Our Past, Our Future & Vision for America June 
28, 2006 Call to Renewal' Keynote Address Complete Text Good morning. I appreciate the opportunity to speak here at the Call to R
enewal's Building a Covenant for a New America conference. I've had the opportunity to take a look at your Covenant for a New Ame
rica. It is filled with outstanding policies and prescriptions for much of what ails this country. So I'd like to congratulate yo
u all on the thoughtful presentations you've given so far about poverty and justice in America, and for putting fire under the fe
et of the political leadership here in Washington.But today I'd like to talk about the connection between religion and politics a
nd perhaps offer some thoughts about how we can sort through some of the often bitter arguments that we've been seeing over the l
ast several years.I do so because, as you all know, we can affirm the importance of povert

Next, we define a few functions to preprocess the data.

def read_content(path):
    with open(path) as ins:
        return ins.read()

# Return a dict that maps each char to a unique int id
def build_vocab(path):
    content = list(read_content(path))
    idx = 1  # 0 is reserved for zero-padding
    the_vocab = {}
    for word in content:
        if len(word) == 0:
            continue
        if not word in the_vocab:
            the_vocab[word] = idx
            idx += 1
    return the_vocab

# Encode a sentence as a list of int ids
def text2id(sentence, the_vocab):
    words = list(sentence)
    return [the_vocab[w] for w in words if len(w) > 0]

# Build the char vocabulary from the input
vocab = build_vocab("./obama.txt")
print('vocab size = %d' % (len(vocab)))

Output:

vocab size = 83
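
As a quick sanity check (using a string whose characters we know appear in the corpus, such as "Good morning." from the excerpt above), we can encode a sample with text2id:

# Encode a short string; the id values depend on first-occurrence order in the corpus
sample_ids = text2id("Good morning.", vocab)
print(sample_ids)       # one int id per character, 13 ids in total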

Building the LSTM Model

Now we build a multi-layer LSTM model. The LSTM cell definition is implemented in lstm.py:

import lstm

# Each line contains at most 129 chars.
seq_len = 129
# Embedding dimension: maps a character to a 256-dimensional vector
num_embed = 256
# Number of stacked LSTM layers
num_lstm_layer = 3
# Number of hidden units in each LSTM cell
num_hidden = 512

symbol = lstm.lstm_unroll(
    num_lstm_layer,
    seq_len,
    len(vocab) + 1,
    num_hidden=num_hidden,
    num_embed=num_embed,
    num_label=len(vocab) + 1,
    dropout=0.2)
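
lstm.lstm_unroll builds the full unrolled symbolic graph; for intuition, here is a minimal NumPy sketch of what a single LSTM cell computes at one time step (these are the standard gate equations, not the actual lstm.py code):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# One LSTM time step; W maps the concatenated [x, h_prev] to the four stacked gates
def lstm_step(x, h_prev, c_prev, W, b):
    z = np.dot(np.concatenate([x, h_prev]), W) + b
    i, f, o, g = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)  # input, forget, output gates
    g = np.tanh(g)                                # candidate cell update
    c = f * c_prev + i * g                        # new cell state
    h = o * np.tanh(c)                            # new hidden state
    return h, c

The init_c and init_h states created in the next section are exactly the per-layer c and h vectors above, zero-initialized for each batch.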

 

Training

First, we create a data iterator:

import bucket_io

# The batch size for training
batch_size = 32

# Initialize the states (c and h) for each LSTM layer
init_c = [('l%d_init_c' % l, (batch_size, num_hidden)) for l in range(num_lstm_layer)]
init_h = [('l%d_init_h' % l, (batch_size, num_hidden)) for l in range(num_lstm_layer)]
init_states = init_c + init_h

# Even though BucketSentenceIter supports variable-length examples,
# we simply use the fixed-length version here
data_train = bucket_io.BucketSentenceIter(
    "./obama.txt",
    vocab,
    [seq_len],
    batch_size,
    init_states,
    seperate_char='\n',
    text2id=text2id,
    read_content=read_content)

Output:

Summary of dataset ==================
bucket of len 129 : 8290 samples
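
This also shows why id 0 was reserved in build_vocab: lines shorter than the bucket length are padded with zeros. A small illustration (pad_to_bucket is a hypothetical helper written for this example, not part of bucket_io):

# Zero-pad a list of char ids up to the bucket length
def pad_to_bucket(ids, bucket_len):
    return ids + [0] * (bucket_len - len(ids))

padded = pad_to_bucket(text2id("Good morning.", vocab), seq_len)
print(len(padded))  # 129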

Then we train the model with the standard model.fit implementation:

import mxnet as mx
import numpy as np
import logging
logging.getLogger().setLevel(logging.DEBUG)

# We show a quick demo with only 1 epoch; in practice, set it to something like 100
num_epoch = 1
# Learning rate
learning_rate = 0.01

# Evaluation metric: per-character perplexity
def Perplexity(label, pred):
    loss = 0.
    for i in range(pred.shape[0]):
        loss += -np.log(max(1e-10, pred[i][int(label[i])]))
    return np.exp(loss / label.size)

model = mx.model.FeedForward(
    ctx=mx.gpu(0),
    symbol=symbol,
    num_epoch=num_epoch,
    learning_rate=learning_rate,
    momentum=0,
    wd=0.0001,
    initializer=mx.init.Xavier(factor_type="in", magnitude=2.34))

model.fit(X=data_train,
          eval_metric=mx.metric.np(Perplexity),
          batch_end_callback=mx.callback.Speedometer(batch_size, 20),
          epoch_end_callback=mx.callback.do_checkpoint("obama"))

Output:

INFO:root:Start training with [gpu(0)]
INFO:root:Epoch[0] Batch [20]   Speed: 36.09 samples/sec    Train-Perplexity=38.167996
INFO:root:Epoch[0] Batch [40]   Speed: 34.29 samples/sec    Train-Perplexity=24.568035
INFO:root:Epoch[0] Batch [60]   Speed: 34.32 samples/sec    Train-Perplexity=23.439121
INFO:root:Epoch[0] Batch [80]   Speed: 34.26 samples/sec    Train-Perplexity=23.209663
INFO:root:Epoch[0] Batch [100]  Speed: 34.28 samples/sec    Train-Perplexity=22.835044
INFO:root:Epoch[0] Batch [120]  Speed: 34.29 samples/sec    Train-Perplexity=22.745794
INFO:root:Epoch[0] Batch [140]  Speed: 34.29 samples/sec    Train-Perplexity=22.500408
INFO:root:Epoch[0] Batch [160]  Speed: 34.23 samples/sec    Train-Perplexity=22.543436
INFO:root:Epoch[0] Batch [180]  Speed: 34.24 samples/sec    Train-Perplexity=22.566656
INFO:root:Epoch[0] Batch [200]  Speed: 34.30 samples/sec    Train-Perplexity=22.378215
INFO:root:Epoch[0] Batch [220]  Speed: 34.31 samples/sec    Train-Perplexity=22.458195
INFO:root:Epoch[0] Batch [240]  Speed: 34.30 samples/sec    Train-Perplexity=22.655659
INFO:root:Epoch[0] Resetting Data Iterator
INFO:root:Epoch[0] Time cost=241.197
INFO:root:Saved checkpoint to "obama-0001.params"
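
As a sanity check on the Perplexity metric defined above, here is a tiny worked example with toy numbers (a perfectly confident model would score 1.0):

# Two time steps, three classes; the labels pick out probabilities 0.7 and 0.8
label = np.array([0., 2.])
pred = np.array([[0.7, 0.2, 0.1],
                 [0.1, 0.1, 0.8]])
print(Perplexity(label, pred))  # exp(-(ln 0.7 + ln 0.8) / 2) ~= 1.336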

Inference

We first define some utility functions to help with inference:

import bisect
import random

from rnn_model import LSTMInferenceModel


# Helper structure for prediction: invert the vocab (id -> char)
def MakeRevertVocab(vocab):
    dic = {}
    for k, v in vocab.items():
        dic[v] = k
    return dic

# Write the id of a char into the model's input array
def MakeInput(char, vocab, arr):
    idx = vocab[char]
    tmp = np.zeros((1,))
    tmp[0] = idx
    arr[:] = tmp

# Helper functions for random sampling
def _cdf(weights):
    total = sum(weights)
    result = []
    cumsum = 0
    for w in weights:
        cumsum += w
        result.append(cumsum / total)
    return result

def _choice(population, weights):
    assert len(population) == len(weights)
    cdf_vals = _cdf(weights)
    x = random.random()
    idx = bisect.bisect(cdf_vals, x)
    return population[idx]

# Produce the next char: either the one with the largest probability,
# or a random sample from the temperature-scaled output distribution
def MakeOutput(prob, vocab, sample=False, temperature=1.):
    if sample == False:
        idx = np.argmax(prob, axis=1)[0]
    else:
        fix_dict = [""] + [vocab[i] for i in range(1, len(vocab) + 1)]
        scale_prob = np.clip(prob, 1e-6, 1 - 1e-6)
        rescale = np.exp(np.log(scale_prob) / temperature)
        rescale[:] /= rescale.sum()
        return _choice(fix_dict, rescale[0, :])
    try:
        char = vocab[idx]
    except:
        char = ''
    return char
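
To convince yourself that the sampling helper behaves sensibly, you can draw from a toy distribution (the population and weights here are made-up examples):

# Draw 1000 samples; counts should be roughly proportional to the weights
toy_counts = {'a': 0, 'b': 0, 'c': 0}
for _ in range(1000):
    toy_counts[_choice(['a', 'b', 'c'], [0.1, 0.7, 0.2])] += 1
print(toy_counts)  # roughly {'a': 100, 'b': 700, 'c': 200}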

Then we can build the inference model:

import rnn_model

# Load from the checkpoint. This assumes a 75-epoch training run;
# with the 1-epoch demo above, load epoch 1 ("obama-0001.params") instead.
_, arg_params, __ = mx.model.load_checkpoint("obama", 75)

# Build an inference model
model = rnn_model.LSTMInferenceModel(
    num_lstm_layer,
    len(vocab) + 1,
    num_hidden=num_hidden,
    num_embed=num_embed,
    num_label=len(vocab) + 1,
    arg_params=arg_params,
    ctx=mx.gpu(),
    dropout=0.2)

Now we can generate a 600-character sequence that starts with "The United States":

seq_length = 600
input_ndarray = mx.nd.zeros((1,))
revert_vocab = MakeRevertVocab(vocab)
# Feel free to change the starter sentence
output = 'The United States'
random_sample = False
new_sentence = True

ignore_length = len(output)

for i in range(seq_length):
    if i <= ignore_length - 1:
        MakeInput(output[i], vocab, input_ndarray)
    else:
        MakeInput(output[-1], vocab, input_ndarray)
    prob = model.forward(input_ndarray, new_sentence)
    new_sentence = False
    next_char = MakeOutput(prob, revert_vocab, random_sample)
    if next_char == '':
        new_sentence = True
    if i >= ignore_length - 1:
        output += next_char
print(output)

Output:

The United States of America. That's why I'm running for President.The first place we can do better than that they can afford to get the that they can afford to differ on the part of the political settlement. The second part of the problem is that the consequences would have to see the chance to starthe country that we can start by the challenges of the American people. The American people have been talking about how to compete with the streets of San Antonio who are serious about the courage to come together as one people. That the American people have been trying to get there. And they say
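
The loop above decodes greedily (argmax), which is why phrases tend to repeat. To get more varied text, sample from the distribution instead; a minimal change (0.7 is an arbitrary example temperature):

# Flip the flag before the loop to sample with the default temperature of 1.0
random_sample = True

# Or pass an explicit temperature inside the loop; values below 1 are more
# conservative, values above 1 more adventurous:
#     next_char = MakeOutput(prob, revert_vocab, sample=True, temperature=0.7)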

 
