seq2seq practice notes


data processing

The original training data (and the dev data) consist of two files, English text and French text; the same lines in the two files are corresponding translations of each other.

We need to combine corresponding sentences into tuples,
and use buckets as a compromise between padding length and model structure (number of RNN steps).

The data classified by bucket looks like the following:

[ [ ( , ), ( , ), …, ( , ) ],
[ ( , ), ( , ), …, ( , ) ],
[ ( , ), ( , ), …, ( , ) ],
[ ( , ), ( , ), …, ( , ) ] ]

Each element inside a tuple is an integer token id; with 4 buckets, each tuple looks like ([5, 10, …, 45], [12, 14, …, 78]).
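
A minimal sketch of this bucketing step, assuming illustrative bucket sizes and already-tokenized id lists (the names here are not from the original code):

```python
# illustrative bucket sizes: (encoder_size, decoder_size) pairs
_buckets = [(5, 10), (10, 15), (20, 25), (40, 50)]

def put_into_buckets(sentence_pairs):
    """Assign each (source_ids, target_ids) pair to the smallest bucket it fits in."""
    data_set = [[] for _ in _buckets]
    for source_ids, target_ids in sentence_pairs:
        for bucket_id, (encoder_size, decoder_size) in enumerate(_buckets):
            # strict "<" leaves room for the GO/EOS symbols added later
            if len(source_ids) < encoder_size and len(target_ids) < decoder_size:
                data_set[bucket_id].append((source_ids, target_ids))
                break
    return data_set
```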

The padded data looks like the following (batch_size = 1):

encoder_inputs: [ei1, ei2, ..., eik, ep1, ep2, ..., ept]

k = len(encoder_sentence); t + k = bucket[0] = encoder_size

decoder_inputs: [GO, di1, di2, ..., dik, dp1, dp2, ..., dpt]

k = len(decoder_sentence); t + k + 1 = bucket[1] = decoder_size

decoder_targets: [di1, di2, di3, ..., dp1, dp2, dp3, ..., 0]

decoder_targets is decoder_inputs shifted by one

ei: encoder inputs
di: decoder inputs
ep: encoder padding
dp: decoder padding

len(target_weights) = len(decoder_inputs) = len(decoder_targets)

target_weights is a list whose elements are 0 or 1. It controls sequence_loss: steps whose target is PAD (including the step after the last real target) are ignored by sequence_loss.
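
A sketch of this padding scheme for a single pair; the special-token ids follow the usual data_utils convention but are assumptions here, and the helper name is made up for illustration:

```python
PAD_ID, GO_ID, EOS_ID = 0, 1, 2   # assumed special-token ids

def pad_pair(encoder_ids, decoder_ids, encoder_size, decoder_size):
    # encoder: the sentence followed by PADs up to encoder_size
    encoder_inputs = encoder_ids + [PAD_ID] * (encoder_size - len(encoder_ids))

    # decoder: GO, the sentence, then PADs up to decoder_size
    decoder_inputs = ([GO_ID] + decoder_ids
                      + [PAD_ID] * (decoder_size - len(decoder_ids) - 1))

    # targets are decoder_inputs shifted left by one step
    decoder_targets = decoder_inputs[1:] + [PAD_ID]

    # weight 0 wherever the target is PAD, so sequence_loss ignores that step
    target_weights = [0.0 if t == PAD_ID else 1.0 for t in decoder_targets]
    return encoder_inputs, decoder_inputs, decoder_targets, target_weights
```

With decoder_ids ending in _EOS, the three decoder-side lists all have length decoder_size, matching the relation len(target_weights) = len(decoder_inputs) = len(decoder_targets) above.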


training process

  • manually set the number of iterations; during each iteration, randomly pick a bucket id to train on
  • use the get_batch function to randomly pick some training data from the dataset of the chosen bucket id (for a small corpus)
  • run validation on all buckets of the dev data and calculate the average BLEU score
  • over all iterations, save the best model with a checkpoint; the metric is the BLEU score (see the sketch below)
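
A hedged sketch of that loop; model.get_batch and model.step follow the interface style of the TF translate tutorial's Seq2SeqModel, _buckets is the illustrative list from the earlier sketch, and evaluate_bleu and model.saver are assumptions:

```python
import random

def train_loop(session, model, train_set, dev_set, num_iterations, ckpt_path):
    best_score = float("-inf")
    for it in range(num_iterations):
        # randomly pick a bucket id, then a random batch inside that bucket
        bucket_id = random.randint(0, len(_buckets) - 1)
        encoder_inputs, decoder_inputs, target_weights = model.get_batch(
            train_set, bucket_id)
        model.step(session, encoder_inputs, decoder_inputs, target_weights,
                   bucket_id, False)           # forward_only=False: update weights

        if it % 200 == 0:
            # validate on every bucket of the dev data, average the BLEU score
            scores = [evaluate_bleu(session, model, dev_set, b)
                      for b in range(len(_buckets))]
            avg_bleu = sum(scores) / len(scores)
            if avg_bleu > best_score:           # checkpoint only the best model
                best_score = avg_bleu
                model.saver.save(session, ckpt_path)
```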

use self_test to check whether the model can run end to end

solution for a corpus that is too large

When the corpus is too large, we can't fit it into a list given the limited memory, yet we also can't yield data from the complete dataset.
We can break the corpus into several chunks: feed one chunk per training pass, and move on to the next chunk afterwards.

The outer epoch loop and the inner batch loop are no different from before. Between batches (and between epoch loops), only the weights (trainable variables) are updated; the RNN state (not a variable, never updated by the optimizer) is reset (default: zero_state) whenever a new batch is fed. The state reflects information about a single sentence, so the states of different sequences are independent. To make chunks behave the same as batches, just feed the chunks outside the batch loop but inside the epoch loop, that is, **each epoch loop feeds one chunk.** When all epochs are finished, the large corpus has been consumed. An "epoch" here does not mean a pass over the whole dataset but over one chunk; since the corpus is large enough, there is no need to loop over it many times.
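
A sketch of this chunked feeding, with load_chunk, iterate_batches and train_batch as assumed helpers:

```python
def train_on_large_corpus(chunk_paths, batch_size):
    # one "epoch" per chunk: weights carry over across chunks automatically,
    # while the RNN state is reset to zero_state for every new batch anyway
    for epoch, chunk_path in enumerate(chunk_paths):
        chunk = load_chunk(chunk_path)                    # one chunk fits in memory
        for batch in iterate_batches(chunk, batch_size):
            train_batch(batch)                            # updates only the trainable variables
```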

A model ckpt is only needed to restore the model later, for fine-tuning or other purposes. During the epoch or batch loop there is no need to use a ckpt to restore the model; the weights are updated automatically based on the previous loop.

get_batch

Building a batch with yield & shuffle or with random.choice & append are similar in terms of memory use.
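
Both variants keep only one padded batch in memory at a time; a sketch of the two styles, reusing pad_pair from the earlier sketch:

```python
import random

def batches_by_yield(bucket_data, batch_size, encoder_size, decoder_size):
    # shuffle once, then yield consecutive slices as batches
    random.shuffle(bucket_data)
    for i in range(0, len(bucket_data), batch_size):
        yield [pad_pair(src, tgt, encoder_size, decoder_size)
               for src, tgt in bucket_data[i:i + batch_size]]

def batch_by_choice(bucket_data, batch_size, encoder_size, decoder_size):
    # sample pairs with replacement and append them until the batch is full
    batch = []
    for _ in range(batch_size):
        src, tgt = random.choice(bucket_data)
        batch.append(pad_pair(src, tgt, encoder_size, decoder_size))
    return batch
```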

code logic

using the seq2seq library in tensorflow/python/ops/seq2seq.py

model.py ---- build the seq2seq model
data_utils.py ---- prepare data
train.py ---- start training
infer.py ---- start inferring (decoding)

some notes

data pre-processing

  • the _EOS symbol is added to both encoder inputs and decoder inputs to denote the end of a sentence; treat _EOS as an original part of the sentence
  • the _PAD and _GO symbols are added in get_batch for decoder_inputs, and the _PAD symbol is added for encoder inputs, after the _EOS annotation
  • the data tokenizer needs to be modified
  • data_utils.py uses b"str"; is it faster than an ordinary str?
  • the vocab data is too large, so write it to a disk file
  • if a file is too large, read it using readline() instead of read() or readlines()
  • gfile.GFile() works much like Python's open(); it may be useful on Google Cloud
  • the dev data is small and fits into memory: using read_dev_data(), open the source and target corpora simultaneously, read them line by line, create tuples, decide which bucket list each tuple belongs to, and return a 2-d bucket list
  • the train data is too large and cannot fit into memory:

    1. divide the one large corpus file into several smaller files, using itertools
    2. use read_dev_data to create pieces of the 2-d bucket list from the pieces of the source and target files; merging those lists to obtain the full 2-d bucket list would consume too much memory
    3. write (append mode) the result of step 2 into len(buckets) files instead of storing the dataset in a list, resulting in bucket0, bucket1, … a list of files (see the sketch below)


This process is like a map and reduce operation.
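
A sketch of this map/reduce-like splitting, using itertools.islice to stream one chunk of parallel lines at a time and appending each pair to its bucket's file; the bucket sizes and file names are illustrative:

```python
import itertools

_buckets = [(5, 10), (10, 15), (20, 25), (40, 50)]   # same illustrative buckets as above

def split_into_bucket_files(source_path, target_path, chunk_lines=100000):
    """Stream the parallel corpus and append every pair to its bucket's file."""
    bucket_files = [open("bucket%d.txt" % i, "a") for i in range(len(_buckets))]
    with open(source_path) as src_f, open(target_path) as tgt_f:
        while True:
            # "map": take one chunk of parallel lines from each file
            src_chunk = list(itertools.islice(src_f, chunk_lines))
            tgt_chunk = list(itertools.islice(tgt_f, chunk_lines))
            if not src_chunk:
                break
            # "reduce": route every (source, target) pair to its bucket file
            for src_line, tgt_line in zip(src_chunk, tgt_chunk):
                src_len, tgt_len = len(src_line.split()), len(tgt_line.split())
                for bucket_id, (enc_size, dec_size) in enumerate(_buckets):
                    if src_len < enc_size and tgt_len < dec_size:
                        bucket_files[bucket_id].write(
                            src_line.rstrip("\n") + "\t" + tgt_line.rstrip("\n") + "\n")
                        break
    for f in bucket_files:
        f.close()
```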

training

  • before training, we need to confirm whether the model should be loaded from a previous ckpt file or created with fresh parameters (see the sketch after this list)
  • if one corpus is not trained in a single run, that is, we train the first part, checkpoint, and then start the second part, we lose the best_loss information (used for saving the best model) and must pass best_loss between the two training runs; this is not recommended. It is better to train in one run: if the corpus is too large, just break it into small chunks and loop over them
  • during one epoch, the bucket_seq2seq model only takes data of one bucket (the data items all belong to the same bucket, possibly one chunk of that bucket); train on this, run validation on all buckets of the dev data, use the loss as the metric of model performance, and among epochs save the one with minimum loss
  • the validation metric must be changed to the BLEU score in the future
  • the learning rate decreases over epochs
  • need to add inference (beam_search)
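
A common pattern for the "restore from ckpt or start fresh" decision, sketched with the TF 1.x API (model.saver and the checkpoint directory are assumptions):

```python
import tensorflow as tf

def create_or_restore(session, model, ckpt_dir="./checkpoints"):
    # restore the weights if a checkpoint exists, otherwise start with fresh parameters
    ckpt = tf.train.get_checkpoint_state(ckpt_dir)
    if ckpt and tf.train.checkpoint_exists(ckpt.model_checkpoint_path):
        model.saver.restore(session, ckpt.model_checkpoint_path)
    else:
        session.run(tf.global_variables_initializer())
    return model
```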