A Standard Code Recipe for Simple Deep NLP Models

Source: Internet. Editor: 程序博客网. Time: 2024/06/06 10:24

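I have recently been studying four models:

textCNN, LSTM (RNN, GRU), HAN, and charCNN.

I implemented each of them by following other people's blog posts. So that I can write my own code quickly later on, this post summarizes the code for these models. Apart from the underlying principles, the overall code flow is much the same for all of them.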

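When I use one model for a text task, I usually split the TensorFlow code into four Python files.

data_helper.py (data processing)
model.py (the model)
train.py (the training loop)
eval.py (model evaluation): I usually skip this one. While I am still learning, I run some simple validation inside train.py. eval.py can instead be used to evaluate on the full validation set.

Of the four, I consider model.py and data_helper.py the most important, although data_helper.py sometimes takes up most of the time.

Below I summarize what goes into each file, to arrive at a workflow I can reuse and remember.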

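1. data_helper.py

data_helper.py handles the data: it takes the raw text all the way to the form the model consumes directly. (Note: deep NLP models generally start with an embedding layer, so preprocessing only needs to go as far as representing each sample as a vector of the indices of its words in the dictionary vocab.)

Once each sample is a vector of word indices, the embedding layer reads the index vectors through tf.nn.embedding_lookup(embedding_matrix, input_x). Every word of every text is then represented by its initial word vector and passed on to the model, such as a CNN or an LSTM. (Note: the embedding matrix can be randomly initialized or loaded from pre-trained word2vec vectors.)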
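The index representation and the lookup can be sketched as follows. This is only a minimal illustration: the helpers build_vocab and encode, the <PAD> and <UNK> tokens, and the toy sentences are all hypothetical, and plain NumPy row indexing stands in for tf.nn.embedding_lookup.

```python
import numpy as np

PAD, UNK = "<PAD>", "<UNK>"  # hypothetical special tokens


def build_vocab(docs):
    """Map every word to an integer index; 0 and 1 are reserved."""
    vocab = {PAD: 0, UNK: 1}
    for doc in docs:
        for word in doc.split():
            vocab.setdefault(word, len(vocab))
    return vocab


def encode(doc, vocab, max_len):
    """Turn one document into a fixed-length vector of word indices."""
    ids = [vocab.get(w, vocab[UNK]) for w in doc.split()][:max_len]
    return ids + [vocab[PAD]] * (max_len - len(ids))


docs = ["the cat sat", "the dog sat down"]  # toy corpus
vocab = build_vocab(docs)
input_x = np.array([encode(d, vocab, max_len=5) for d in docs])

# Randomly initialized embedding matrix: one row per vocabulary entry.
rng = np.random.default_rng(0)
embedding_dim = 8
embedding = rng.uniform(-1.0, 1.0, size=(len(vocab), embedding_dim))

# Row gather; plays the role of tf.nn.embedding_lookup(embedding, input_x).
embedded = embedding[input_x]
print(input_x.shape, embedded.shape)
```

Each sample goes from a row of max_len indices to a max_len x embedding_dim block of vectors, which is the tensor the CNN or LSTM then consumes.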

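That was the groundwork. Now for what data_helper.py actually does.

(1) In the model code I have read, the most common approach is a handful of functions that each process part of the raw data. How you process it does not matter as long as you end up with the data format you need.

(2) Another approach is to write a class with several methods and keep the important data objects and parameters as attributes of that class. (I find this style a little tedious to read, but it is probably the better code style and looks more like the work of an experienced programmer.)

(3) I will not go into how to process the data. The main point here is how to build batches. I am used to dennybritz's code style, since his code is what I learned from.

I generally reuse this part of the code as is rather than rewriting it, which saves time.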

batches = data_helper.batch_iter(list(zip(x_train, y_train)), FLAGS.batch_size, FLAGS.num_epochs, shuffle=True)

# As the line above shows, the data argument of batch_iter is a list whose
# elements are (sample x, label y) pairs; the function returns a generator.

import numpy as np

def batch_iter(data, batch_size, num_epochs, shuffle=True):
    # Keep the (x, y) pairs in a plain list: np.array(data) can fail on
    # recent NumPy versions when x and y have different lengths.
    data = list(data)
    data_size = len(data)
    num_batches_per_epoch = (data_size - 1) // batch_size + 1  # ceiling division
    for epoch in range(num_epochs):
        if shuffle:
            # Reshuffle at the start of every epoch.
            shuffle_indices = np.random.permutation(data_size)
            shuffled_data = [data[i] for i in shuffle_indices]
        else:
            shuffled_data = data
        for batch_num in range(num_batches_per_epoch):
            start_index = batch_num * batch_size
            end_index = min(batch_size * (batch_num + 1), data_size)
            yield shuffled_data[start_index:end_index]
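To make the slicing arithmetic in batch_iter concrete, here is a small sketch that only computes the slice bounds of one epoch. The helper name batch_bounds and the sizes are made up for illustration.

```python
def batch_bounds(data_size, batch_size):
    """Return the (start, end) slice bounds of every batch in one epoch."""
    num_batches = (data_size - 1) // batch_size + 1  # ceiling division
    return [(i * batch_size, min((i + 1) * batch_size, data_size))
            for i in range(num_batches)]


print(batch_bounds(10, 4))  # [(0, 4), (4, 8), (8, 10)]
```

The last batch is simply shorter when batch_size does not divide the data size, so no sample is dropped.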



Below is the code that consumes the batches during training.
# The whole training process: train on each batch, validating and saving at fixed intervals
for batch in batches:  # [(x1, y1), (x2, y2), ..., (x64, y64)]
    x_batch, y_batch = zip(*batch)  # [x1, x2, ..., x64]  [y1, y2, ..., y64]
    train_step(x_batch, y_batch)
    current_step = tf.train.global_step(sess, global_step)
    if current_step % FLAGS.evaluation_every == 0:
        print("\nEvaluation....")
        dev_step(x_dev, y_dev)
        print('\n')
    if current_step % FLAGS.checkpoint_every == 0:
        path = saver.save(sess, checkpoint_prefix, global_step=current_step)
        print("Saved model checkpoint to {}\n".format(path))

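Summary: when I write data_helper.py, I first process the raw data into a data_x matrix, in which each sample doc is a vector of the dictionary indices of its words, and a data_y matrix, in which each y is a label vector. I will not cover how to build data_y in detail; a simple and convenient way is np.eye(). One more point worth noting: shuffle the data both when building data_x and when building the batches.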
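As a rough sketch of the np.eye() trick, indexing the identity matrix with integer class ids yields one-hot rows. The label values, the class count, and the placeholder data_x below are invented for the example.

```python
import numpy as np

labels = np.array([0, 2, 1, 2])  # hypothetical integer class ids
num_classes = 3

# Row i of the identity matrix is the one-hot vector for class i.
data_y = np.eye(num_classes, dtype=np.float32)[labels]

# Shuffle x and y with the same permutation so the pairs stay aligned.
data_x = np.arange(8).reshape(4, 2)  # placeholder index vectors
perm = np.random.permutation(len(labels))
data_x, data_y = data_x[perm], data_y[perm]
print(data_y)
```

Using one permutation for both arrays is the important part: shuffling them separately would break the pairing between samples and labels.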

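2. model.py

This is the most important of the files and the one you learn the most from. It calls a large number of functions.

Again there are two styles.

(1) When the model is simple, with a single CNN layer (or other network) or a stack of identical layers, just write the code step by step, following what each tensor operation requires.

(2) The other style wraps each data-processing step in its own function. This is also easy to follow.

Here is a commonly used flow.

The code generally takes the form below. I copied these snippets from elsewhere; just follow the flow and write each step according to the shapes of your own tensors.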

(1) with tf.name_scope('placeholder'):

Reserve slots (placeholders) for the input data the model needs.

(2) # Keeping track of L2 regularization loss (optional)

l2_losses = tf.constant(0.0)

(3) # Embedding layer

with tf.device('/cpu:0'), tf.name_scope('embedding'):

(4) for i, conv in enumerate(conv_layers): (multiple layers ...)

Apply the nonlinearity, then go straight to pooling:

# Apply nonlinearity
h = tf.nn.relu(tf.nn.bias_add(conv, b), name="relu")
# Shape after convolution: [64,48,1,128] / [64,47,1,128] / [64,46,1,128]

# Max-pooling over the outputs
pooled = tf.nn.max_pool(
    h,
    ksize=[1, sequence_length - filter_size + 1, 1, 1],
    strides=[1, 1, 1, 1],
    padding='VALID',
    name="pool")
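The shapes in the comment above can be checked with simple arithmetic. The sketch below assumes a batch size of 64, a sequence length of 50, filter sizes 3, 4 and 5, and 128 filters; these values are inferred from the shapes in the comment, not stated in the original code, and conv_pool_shapes is a hypothetical helper.

```python
def conv_pool_shapes(batch_size, sequence_length, filter_size, num_filters):
    """Shapes after a VALID conv spanning the full embedding width, then max-pool."""
    conv_len = sequence_length - filter_size + 1
    conv_shape = [batch_size, conv_len, 1, num_filters]
    # The pooling window covers all conv_len positions, so one value
    # per filter is left for each sample.
    pool_shape = [batch_size, 1, 1, num_filters]
    return conv_shape, pool_shape


for filter_size in (3, 4, 5):  # assumed filter sizes
    print(filter_size, conv_pool_shapes(64, 50, filter_size, 128))
```

This is why ksize uses sequence_length - filter_size + 1: the window height matches the convolution output exactly, whatever the filter size.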

(5) # Add dropout

with tf.name_scope("dropout"):
    self.h_drop = tf.nn.dropout(self.h_pool_flat, self.dropout_keep_prob)
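For intuition about what the dropout step does to the tensor, here is a NumPy sketch of inverted dropout: each unit is kept with probability keep_prob and the survivors are scaled by 1 / keep_prob. The function and the input values are illustrative only, not the TensorFlow implementation.

```python
import numpy as np


def dropout(x, keep_prob, rng):
    """Inverted dropout: zero units at random, rescale the survivors."""
    mask = rng.random(x.shape) < keep_prob
    return np.where(mask, x / keep_prob, 0.0)


rng = np.random.default_rng(0)
h_pool_flat = np.ones((2, 6))  # stand-in for the pooled features
h_drop = dropout(h_pool_flat, keep_prob=0.5, rng=rng)
print(h_drop)
```

With keep_prob set to 1.0 the input passes through unchanged, which is the value normally fed during evaluation.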

(6) Output layer

with tf.name_scope('output'):
    outlayer_weights = tf.Variable(tf.truncated_normal(...))
    outlayer_b = tf.Variable(tf.constant(...))
    l2_losses += tf.nn.l2_loss(outlayer_weights)
    l2_losses += tf.nn.l2_loss(outlayer_b)
    self.y_pred = tf.nn.xw_plus_b(...)
    self.predictions = tf.argmax(...)

(7) # Compute the mean cross-entropy

with tf.name_scope('loss'):
    losses = tf.nn.softmax_cross_entropy_with_logits(...)
    self.loss = tf.reduce_mean(losses) + l2_reg_lambda * l2_losses
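The loss in step (7) is the mean softmax cross-entropy plus an L2 penalty. A NumPy sketch of the same arithmetic is shown below; the logits, labels, weights and the l2_reg_lambda value are made up, and l2_loss follows the definition of tf.nn.l2_loss, sum(t ** 2) / 2.

```python
import numpy as np


def softmax_cross_entropy(logits, labels):
    """Per-example cross-entropy between one-hot labels and softmax(logits)."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -(labels * log_probs).sum(axis=1)


def l2_loss(t):
    """Half the sum of squares, as in tf.nn.l2_loss."""
    return np.sum(t ** 2) / 2.0


logits = np.array([[2.0, 0.0], [0.0, 0.0]])
labels = np.array([[1.0, 0.0], [0.0, 1.0]])
weights = np.array([[1.0, -1.0], [0.5, 0.5]])
l2_reg_lambda = 0.1  # hypothetical regularization strength

losses = softmax_cross_entropy(logits, labels)
loss = losses.mean() + l2_reg_lambda * l2_loss(weights)
print(losses, loss)
```

Setting l2_reg_lambda to 0.0 reduces the expression to the plain mean cross-entropy.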

(8) Accuracy

with tf.name_scope('accuracy'):
    self.correct_predictions = tf.equal(...)
    self.accuracy = tf.reduce_mean(tf.cast(...))
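The accuracy step compares the argmax of the scores with the argmax of the one-hot labels and averages the matches. A NumPy sketch with invented scores and labels:

```python
import numpy as np

scores = np.array([[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]])  # made-up model outputs
input_y = np.array([[0, 1], [1, 0], [1, 0]])             # made-up one-hot labels

predictions = scores.argmax(axis=1)
correct_predictions = predictions == input_y.argmax(axis=1)
# Cast booleans to floats before averaging, as tf.cast does in the graph.
accuracy = correct_predictions.astype(np.float32).mean()
print(predictions, accuracy)
```

Two of the three predictions match here, so the accuracy comes out at about 0.667.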

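3. train.py (this file is much the same for every model)

(1) If there are parameters, you can start by printing them.

FLAGS = tf.flags.FLAGS
# _parse_flags() and __flags are private members of the flags object in older
# TensorFlow 1.x releases and may be missing in later ones.
FLAGS._parse_flags()
print('all related parameters in :')
for attr, value in sorted(FLAGS.__flags.items()):
    print('{}={}'.format(attr.upper(), value))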

(2) Next, load the data to get train_x, train_y, dev_x and dev_y.

(3) Then define the Session and load the model.

with tf.Graph().as_default():
    session_config = tf.ConfigProto(allow_soft_placement=True, log_device_placement=False)
    sess = tf.Session(config=session_config)
    with sess.as_default():
        charcnn = charCNN(config.l0, config.num_classes, config.model.conv_layers, config.model.fc_layers, l2_reg_lambda=0.0)

(4) Define these four important variables.

global_step = tf.Variable(initial_value=0, trainable=False, name='global_step')
optimizer = tf.train.AdamOptimizer(config.model.learning_rate)
grads_and_vars = optimizer.compute_gradients(charcnn.loss)
# apply_gradients also increments global_step by one on every call.
train_op = optimizer.apply_gradients(grads_and_vars, global_step)

(5) Record summaries (this part logs scalars, the graph, histograms and so on during computation).

This is boilerplate code that is basically the same everywhere.

# Keep track of gradient values and sparsity
grad_summaries = []
for g, v in grads_and_vars:
    if g is not None:
        grad_hist_summary = tf.summary.histogram('{}/grad/hist'.format(v.name), g)
        sparsity_summary = tf.summary.scalar('{}/grad/sparsity'.format(v.name), tf.nn.zero_fraction(g))
        grad_summaries.append(grad_hist_summary)
        grad_summaries.append(sparsity_summary)
grad_summaries_merged = tf.summary.merge(grad_summaries)

# Summaries for loss and accuracy
loss_summary = tf.summary.scalar('loss', charcnn.loss)
accuracy_summary = tf.summary.scalar('accuracy', charcnn.accuracy)

# Train summaries
train_summary_op = tf.summary.merge([loss_summary, accuracy_summary, grad_summaries_merged])
train_summary_dir = os.path.join(out_dir, 'summaries', 'train')
train_summary_writer = tf.summary.FileWriter(train_summary_dir, sess.graph)

# Dev summaries
dev_summary_op = tf.summary.merge([loss_summary, accuracy_summary])
dev_summary_dir = os.path.join(out_dir, 'summaries', 'dev')
dev_summary_writer = tf.summary.FileWriter(dev_summary_dir, sess.graph)
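The sparsity summary logs tf.nn.zero_fraction(g), the share of gradient entries that are exactly zero. A NumPy equivalent, as a rough sketch with a made-up gradient:

```python
import numpy as np


def zero_fraction(g):
    """Fraction of entries in g that are exactly zero."""
    return float(np.mean(g == 0))


grad = np.array([[0.0, 0.3], [0.0, -1.2]])  # invented gradient values
print(zero_fraction(grad))  # 0.5
```

A value close to 1.0 for a variable means that almost none of its entries received a gradient in that step.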

(6) Define a single train_step() and dev_step().

def train_step(x_batch, y_batch):
    feed_dict = {charcnn.input_x: x_batch,
                 charcnn.input_y: y_batch,
                 charcnn.dropout_keep_prob: config.model.dropout_keep_prob}
    _, step, summaries, loss, accuracy = sess.run(
        [train_op, global_step, train_summary_op, charcnn.loss, charcnn.accuracy],
        feed_dict=feed_dict)
    nowtime = datetime.datetime.now().isoformat()
    print('{}: step {}, loss {:g}, accuracy {:g}'.format(nowtime, step, loss, accuracy))
    train_summary_writer.add_summary(summaries, step)

def dev_step(x_batch, y_batch):
    # No train_op here, so the weights are not updated. Dropout is switched
    # off for evaluation by feeding a keep probability of 1.0.
    feed_dict = {charcnn.input_x: x_batch,
                 charcnn.input_y: y_batch,
                 charcnn.dropout_keep_prob: 1.0}
    step, summaries, loss, accuracy = sess.run(
        [global_step, dev_summary_op, charcnn.loss, charcnn.accuracy],
        feed_dict=feed_dict)
    nowtime = datetime.datetime.now().isoformat()
    print('{}: step {}, loss {:g}, accuracy {:g}'.format(nowtime, step, loss, accuracy))
    dev_summary_writer.add_summary(summaries, step)

(7) The training loop

# saver and checkpoint_prefix are assumed to be defined earlier in train.py.
batches = train_data.batch_iter(list(zip(train_x, train_y)), config.batch_size, num_epochs=5, shuffle=True)

for batch in batches:
    x_batch, y_batch = zip(*batch)
    train_step(x_batch, y_batch)
    current_step = tf.train.global_step(sess, global_step)

    if current_step % config.training.evaluate_every == 0:
        print('\nEvaluation!')
        dev_step(dev_x, dev_y)  # evaluate on the dev set, not on the training batch
        print('\n')

    if current_step % config.training.checkpoint_every == 0:
        path = saver.save(sess, checkpoint_prefix, global_step=current_step)
        print("Saved model checkpoint to {}\n".format(path))
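To see when the two modulo checks in the loop fire, the schedule can be simulated without TensorFlow. The helper name schedule and the interval values below are hypothetical; the only assumption carried over is that global_step grows by one per training step.

```python
def schedule(num_steps, evaluate_every, checkpoint_every):
    """List the global steps at which evaluation and checkpointing fire."""
    eval_steps, ckpt_steps = [], []
    for current_step in range(1, num_steps + 1):  # global_step after each update
        if current_step % evaluate_every == 0:
            eval_steps.append(current_step)
        if current_step % checkpoint_every == 0:
            ckpt_steps.append(current_step)
    return eval_steps, ckpt_steps


print(schedule(10, evaluate_every=3, checkpoint_every=5))
# ([3, 6, 9], [5, 10])
```

The total number of steps is the number of batches per epoch times the number of epochs, so the intervals are best chosen relative to that product.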
