seq2seq practice notes


data processing

The original training data (and the dev data) consist of two files, English text and French text; the same lines in the two files are corresponding translations of each other.

We need to combine corresponding sentences into tuples,
and use buckets as a compromise between padding length and model structure (number of RNN steps).

The data classified by bucket looks like the following:

[ [ ( , ), ( , ), …, ( , ) ],
[ ( , ), ( , ), …, ( , ) ],
[ ( , ), ( , ), …, ( , ) ],
[ ( , ), ( , ), …, ( , ) ] ]

Each element inside a tuple is an integer token id; with 4 buckets, each tuple looks like ([5, 10, …, 45], [12, 14, …, 78]).
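
A minimal sketch of this bucketing step, assuming illustrative bucket sizes and already-tokenized id lists (the names here are not from the original code):

```python
# illustrative bucket sizes: (encoder_size, decoder_size) pairs
_buckets = [(5, 10), (10, 15), (20, 25), (40, 50)]

def put_into_buckets(sentence_pairs):
    """Assign each (source_ids, target_ids) pair to the smallest bucket it fits in."""
    data_set = [[] for _ in _buckets]
    for source_ids, target_ids in sentence_pairs:
        for bucket_id, (encoder_size, decoder_size) in enumerate(_buckets):
            # strict "<" leaves room for the GO/EOS symbols added later
            if len(source_ids) < encoder_size and len(target_ids) < decoder_size:
                data_set[bucket_id].append((source_ids, target_ids))
                break
    return data_set
```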

The padded data looks like the following (batch_size = 1):

encoder_inputs: [ei1, ei2, ..., eik, ep1, ep2, ..., ept]

k = len(encoder_sentence); t + k = bucket[0] = encoder_size

decoder_inputs: [GO, di1, di2, ..., dik, dp1, dp2, ..., dpt]

k = len(decoder_sentence); t + k + 1 = bucket[1] = decoder_size

decoder_targets: [di1, di2, di3, ..., dp1, dp2, dp3, ..., 0]

decoder_targets is decoder_inputs shifted by one

ei: encoder inputs
di: decoder inputs
ep: encoder padding
dp: decoder padding

len(target_weights) = len(decoder_inputs) = len(decoder_targets)

target_weights is a list whose elements are 0 or 1. It controls sequence_loss: steps whose target is PAD (including the step after the last real target) are ignored by sequence_loss.
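
A sketch of this padding scheme for a single pair; the special-token ids follow the usual data_utils convention but are assumptions here, and the helper name is made up for illustration:

```python
PAD_ID, GO_ID, EOS_ID = 0, 1, 2   # assumed special-token ids

def pad_pair(encoder_ids, decoder_ids, encoder_size, decoder_size):
    # encoder: the sentence followed by PADs up to encoder_size
    encoder_inputs = encoder_ids + [PAD_ID] * (encoder_size - len(encoder_ids))

    # decoder: GO, the sentence, then PADs up to decoder_size
    decoder_inputs = ([GO_ID] + decoder_ids
                      + [PAD_ID] * (decoder_size - len(decoder_ids) - 1))

    # targets are decoder_inputs shifted left by one step
    decoder_targets = decoder_inputs[1:] + [PAD_ID]

    # weight 0 wherever the target is PAD, so sequence_loss ignores that step
    target_weights = [0.0 if t == PAD_ID else 1.0 for t in decoder_targets]
    return encoder_inputs, decoder_inputs, decoder_targets, target_weights
```

With decoder_ids ending in _EOS, the three decoder-side lists all have length decoder_size, matching the relation len(target_weights) = len(decoder_inputs) = len(decoder_targets) above.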


training process

  • manually set the number of iterations; during each iteration, randomly pick a bucket id to train on
  • use the get_batch function to randomly pick some training data from the dataset of the chosen bucket id (for a small corpus)
  • run validation on all buckets of the dev data and calculate the average BLEU score
  • over all iterations, save the best model with a checkpoint; the metric is the BLEU score (see the sketch below)
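
A hedged sketch of that loop; model.get_batch and model.step follow the interface style of the TF translate tutorial's Seq2SeqModel, _buckets is the illustrative list from the earlier sketch, and evaluate_bleu and model.saver are assumptions:

```python
import random

def train_loop(session, model, train_set, dev_set, num_iterations, ckpt_path):
    best_score = float("-inf")
    for it in range(num_iterations):
        # randomly pick a bucket id, then a random batch inside that bucket
        bucket_id = random.randint(0, len(_buckets) - 1)
        encoder_inputs, decoder_inputs, target_weights = model.get_batch(
            train_set, bucket_id)
        model.step(session, encoder_inputs, decoder_inputs, target_weights,
                   bucket_id, False)           # forward_only=False: update weights

        if it % 200 == 0:
            # validate on every bucket of the dev data, average the BLEU score
            scores = [evaluate_bleu(session, model, dev_set, b)
                      for b in range(len(_buckets))]
            avg_bleu = sum(scores) / len(scores)
            if avg_bleu > best_score:           # checkpoint only the best model
                best_score = avg_bleu
                model.saver.save(session, ckpt_path)
```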

use self_test to check whether the model can run end to end

solution for a corpus that is too large

When the corpus is too large, we can't fit it into a list given the limited memory, yet we also can't yield data from the complete dataset.
We can break the corpus into several chunks: feed one chunk per training pass, and move on to the next chunk afterwards.

The outer epoch loop and the inner batch loop are no different from before. Between batches (and between epoch loops), only the weights (trainable variables) are updated; the RNN state (not a variable, never updated by the optimizer) is reset (default: zero_state) whenever a new batch is fed. The state reflects information about a single sentence, so the states of different sequences are independent. To make chunks behave the same as batches, just feed the chunks outside the batch loop but inside the epoch loop, that is, **each epoch loop feeds one chunk.** When all epochs are finished, the large corpus has been consumed. An "epoch" here does not mean a pass over the whole dataset but over one chunk; since the corpus is large enough, there is no need to loop over it many times.
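
A sketch of this chunked feeding, with load_chunk, iterate_batches and train_batch as assumed helpers:

```python
def train_on_large_corpus(chunk_paths, batch_size):
    # one "epoch" per chunk: weights carry over across chunks automatically,
    # while the RNN state is reset to zero_state for every new batch anyway
    for epoch, chunk_path in enumerate(chunk_paths):
        chunk = load_chunk(chunk_path)                    # one chunk fits in memory
        for batch in iterate_batches(chunk, batch_size):
            train_batch(batch)                            # updates only the trainable variables
```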

A model ckpt is only needed to restore the model later, for fine-tuning or other purposes. During the epoch or batch loop there is no need to use a ckpt to restore the model; the weights are updated automatically based on the previous loop.

get_batch

Building a batch with yield & shuffle or with random.choice & append are similar in terms of memory use.
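
Both variants keep only one padded batch in memory at a time; a sketch of the two styles, reusing pad_pair from the earlier sketch:

```python
import random

def batches_by_yield(bucket_data, batch_size, encoder_size, decoder_size):
    # shuffle once, then yield consecutive slices as batches
    random.shuffle(bucket_data)
    for i in range(0, len(bucket_data), batch_size):
        yield [pad_pair(src, tgt, encoder_size, decoder_size)
               for src, tgt in bucket_data[i:i + batch_size]]

def batch_by_choice(bucket_data, batch_size, encoder_size, decoder_size):
    # sample pairs with replacement and append them until the batch is full
    batch = []
    for _ in range(batch_size):
        src, tgt = random.choice(bucket_data)
        batch.append(pad_pair(src, tgt, encoder_size, decoder_size))
    return batch
```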

code logic

using the seq2seq library in tensorflow/python/ops/seq2seq.py

model.py ---- build the seq2seq model
data_utils.py ---- prepare data
train.py ---- start training
infer.py ---- start inferring (decoding)

some notes

data pre-processing

  • the _EOS symbol is added to both encoder inputs and decoder inputs to denote the end of a sentence; treat _EOS as an original part of the sentence
  • the _PAD and _GO symbols are added in get_batch for decoder_inputs, and the _PAD symbol is added for encoder inputs, after the _EOS annotation
  • the data tokenizer needs to be modified
  • data_utils.py uses b"str"; is it faster than an ordinary str?
  • the vocab data is too large, so write it to a disk file
  • if a file is too large, read it using readline() instead of read() or readlines()
  • gfile.GFile() works much like Python's open(); it may be useful on Google Cloud
  • the dev data is small and fits into memory: using read_dev_data(), open the source and target corpora simultaneously, read them line by line, create tuples, decide which bucket list each tuple belongs to, and return a 2-d bucket list
  • the train data is too large and cannot fit into memory:

    1. divide the one large corpus file into several smaller files, using itertools
    2. use read_dev_data to create pieces of the 2-d bucket list from the pieces of the source and target files; merging those lists to obtain the full 2-d bucket list would consume too much memory
    3. write (append mode) the result of step 2 into len(buckets) files instead of storing the dataset in a list, resulting in bucket0, bucket1, … a list of files (see the sketch below)


This process is like a map and reduce operation.
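
A sketch of this map/reduce-like splitting, using itertools.islice to stream one chunk of parallel lines at a time and appending each pair to its bucket's file; the bucket sizes and file names are illustrative:

```python
import itertools

_buckets = [(5, 10), (10, 15), (20, 25), (40, 50)]   # same illustrative buckets as above

def split_into_bucket_files(source_path, target_path, chunk_lines=100000):
    """Stream the parallel corpus and append every pair to its bucket's file."""
    bucket_files = [open("bucket%d.txt" % i, "a") for i in range(len(_buckets))]
    with open(source_path) as src_f, open(target_path) as tgt_f:
        while True:
            # "map": take one chunk of parallel lines from each file
            src_chunk = list(itertools.islice(src_f, chunk_lines))
            tgt_chunk = list(itertools.islice(tgt_f, chunk_lines))
            if not src_chunk:
                break
            # "reduce": route every (source, target) pair to its bucket file
            for src_line, tgt_line in zip(src_chunk, tgt_chunk):
                src_len, tgt_len = len(src_line.split()), len(tgt_line.split())
                for bucket_id, (enc_size, dec_size) in enumerate(_buckets):
                    if src_len < enc_size and tgt_len < dec_size:
                        bucket_files[bucket_id].write(
                            src_line.rstrip("\n") + "\t" + tgt_line.rstrip("\n") + "\n")
                        break
    for f in bucket_files:
        f.close()
```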

training

  • before training, we need to confirm whether the model should be loaded from a previous ckpt file or created with fresh parameters (see the sketch after this list)
  • if one corpus is not trained in a single run, that is, we train the first part, checkpoint, and then start the second part, we lose the best_loss information (used for saving the best model) and must pass best_loss between the two training runs; this is not recommended. It is better to train in one run: if the corpus is too large, just break it into small chunks and loop over them
  • during one epoch, the bucket_seq2seq model only takes data of one bucket (the data items all belong to the same bucket, possibly one chunk of that bucket); train on this, run validation on all buckets of the dev data, use the loss as the metric of model performance, and among epochs save the one with minimum loss
  • the validation metric must be changed to the BLEU score in the future
  • the learning rate decreases over epochs
  • need to add inference (beam_search)
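
A common pattern for the "restore from ckpt or start fresh" decision, sketched with the TF 1.x API (model.saver and the checkpoint directory are assumptions):

```python
import tensorflow as tf

def create_or_restore(session, model, ckpt_dir="./checkpoints"):
    # restore the weights if a checkpoint exists, otherwise start with fresh parameters
    ckpt = tf.train.get_checkpoint_state(ckpt_dir)
    if ckpt and tf.train.checkpoint_exists(ckpt.model_checkpoint_path):
        model.saver.restore(session, ckpt.model_checkpoint_path)
    else:
        session.run(tf.global_variables_initializer())
    return model
```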