Feeding your own data when training seq2seq

Source: Internet. Published by: 单片机和微处理器功能. Edited by: 程序博客网. Time: 2024/06/05 18:37

Add the following code to this file: https://github.com/tensorflow/models/blob/master/tutorials/rnn/translate/translate.py

# Special tokens and their ids. Real words get ids starting at 4.
_PAD = b"_PAD"
_GO = b"_GO"
_EOS = b"_EOS"
_UNK = b"_UNK"
PAD_ID = 0
GO_ID = 1
EOS_ID = 2
UNK_ID = 3


def vectorize_data(data, word_idx):
    # Map every word to its id; words missing from word_idx become UNK_ID
    # (the original appended 0 here, which is PAD_ID, not the unknown token).
    Q = []
    for line in data:
        ss = []
        for word in line:
            if word not in word_idx:
                ss.append(UNK_ID)
            else:
                ss.append(word_idx[word])
        Q.append(ss)
    return Q


def load_data(file):
    # The file alternates lines: a Chinese sentence, then its English translation.
    # Words are separated by single spaces; lines that are exactly "\n" are skipped.
    with open(file) as f:
        lines = f.readlines()
        chinese_data = []
        english_data = []
        index = 0
        for line in lines:
            if line == "\n":
                continue
            words_list = [word.strip("\n") for word in line.split(' ')]
            if index % 2 == 0:
                chinese_data.append(words_list)
            else:
                english_data.append(words_list)
            index += 1
        return chinese_data, english_data


# Put the rest where translate.py originally builds train_set, indented to match.
# _buckets is the bucket list already defined in translate.py.
chinese_data, english_data = load_data('./data.txt')

# Build the Chinese vocabulary and reserve ids 0-3 for the special tokens.
chinese_vocab = set(word for line in chinese_data for word in line)
chinese_word_idx = dict((c, i + 4) for i, c in enumerate(chinese_vocab))
chinese_word_idx[_PAD] = PAD_ID
chinese_word_idx[_GO] = GO_ID
chinese_word_idx[_EOS] = EOS_ID
chinese_word_idx[_UNK] = UNK_ID
sentence_max_word_number_chinese = max(map(len, chinese_data))

# Same for the English side.
english_vocab = set(word for line in english_data for word in line)
english_word_idx = dict((c, i + 4) for i, c in enumerate(english_vocab))
english_word_idx[_PAD] = PAD_ID
english_word_idx[_GO] = GO_ID
english_word_idx[_EOS] = EOS_ID
english_word_idx[_UNK] = UNK_ID
sentence_max_word_number_english = max(map(len, english_data))

chinese_ids = vectorize_data(chinese_data, chinese_word_idx)
english_ids = vectorize_data(english_data, english_word_idx)

# Target sentences must end with EOS.
for line in english_ids:
    line.append(EOS_ID)

# Put each pair into the first bucket that is large enough for both sides.
data_set = [[] for _ in _buckets]
for chinese_line, english_line in zip(chinese_ids, english_ids):
    for bucket_id, (source_size, target_size) in enumerate(_buckets):
        if len(chinese_line) < source_size and len(english_line) < target_size:
            data_set[bucket_id].append([chinese_line, english_line])
            break

train_set = data_set  # replaces the original train_set
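
The bucketing loop at the end is easy to misread, so here is a small stand-alone sketch of just that step. The bucket sizes below are an assumption (they are meant to mirror the defaults in translate.py; check your copy), and `assign_bucket` is a hypothetical helper name, not something from the original script. Note that a pair which fits no bucket is silently dropped from the training set.

```python
# Assumed bucket sizes (source_size, target_size); verify against your translate.py.
_buckets = [(5, 10), (10, 15), (20, 25), (40, 50)]
EOS_ID = 2


def assign_bucket(source_ids, target_ids, buckets=_buckets):
    """Return the index of the first bucket that fits the pair, or None.

    Uses the same strict '<' comparison as the loop above.
    """
    for bucket_id, (source_size, target_size) in enumerate(buckets):
        if len(source_ids) < source_size and len(target_ids) < target_size:
            return bucket_id
    return None


# Toy id sequences (the values are arbitrary); targets already end with EOS_ID.
short_pair = ([4, 5, 6], [7, 8, 9, EOS_ID])
long_pair = (list(range(4, 16)), list(range(4, 12)) + [EOS_ID])
too_long = (list(range(4, 60)), [4, EOS_ID])

print(assign_bucket(*short_pair))  # 0
print(assign_bucket(*long_pair))   # 2
print(assign_bucket(*too_long))    # None
```

If many of your sentences come back as None, the bucket list probably needs a larger last entry.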

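What the data file looks like (Chinese and English sentences on alternating lines, words separated by spaces):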

纽约比加州早三个小时
New York is 3 hours ahead of California
但这没有让加州变慢
but it does not make California slow
有人 22岁毕业了
Someone graduated at the age of 22
但等了五年才找到好的工作
but waited 5 years before securing a good job
有人 25岁当上 CEO
Someone became a CEO at 25
却在 50岁去世
and died at 50
然而另一个人 50岁当上 CEO
While another became a CEO at 50
然后活到 90岁
and lived to 90 years
有人依然单身
Someone is still single
然而也有人已经结婚
while someone else got married
奥巴马 55岁退休
Obama retires at 55
但川普 70岁开始
but Trump starts at 70
本来世界上每个人在自己的时区工作
Absolutely everyone in this world works based on their Time Zone
身边有人可能看似走在你前面
People around you might seem to go ahead of you
有人可能看似在你后面
some might seem to be behind you.
但每个人正在以他们的速度奔跑在他们自己的时区
But everyone is running their own RACE in their own TIME.
不要嫉妒或嘲笑他们
Don’t envy them or mock them
他们在他们的时区你在你的
They are in their TIME ZONE and you are in yours
生命是关于等待正确的时机行动
Life is about waiting for the right moment to act
所以放轻松
So RELAX
你没有落后
You’re not LATE
你没有领先
You’re not EARLY
你非常准时在命运为你安排的时区
You are very much ON TIME, and in your TIME ZONE Destiny set up for you.
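
Because the loader pairs lines purely by position, one missing or extra line shifts every pair after it. A rough sanity check before training might look like the sketch below. `check_parallel_lines` is a hypothetical helper, and the "contains a CJK character" test is only a heuristic I am assuming is good enough for this kind of file. It assumes Python 3. Unlike the loader, it also skips lines that contain only whitespace.

```python
def check_parallel_lines(lines):
    """Pair non-blank lines as (chinese, english) and flag suspicious pairs.

    A pair is suspicious if the first line has no CJK character
    or the second line has one.
    """
    def has_cjk(text):
        # CJK Unified Ideographs block; a heuristic, not a full language check.
        return any("\u4e00" <= ch <= "\u9fff" for ch in text)

    content = [line.strip() for line in lines if line.strip()]
    if len(content) % 2 != 0:
        raise ValueError("odd number of non-blank lines: %d" % len(content))

    pairs = list(zip(content[0::2], content[1::2]))
    suspicious = [i for i, (zh, en) in enumerate(pairs)
                  if not has_cjk(zh) or has_cjk(en)]
    return pairs, suspicious


sample = [
    "纽约比加州早三个小时\n",
    "New York is 3 hours ahead of California\n",
    "\n",
    "但这没有让加州变慢\n",
    "but it does not make California slow\n",
]
pairs, suspicious = check_parallel_lines(sample)
print(len(pairs), suspicious)  # 2 []
```

To check a real file, pass `open('./data.txt', encoding='utf-8').readlines()` instead of `sample`.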

0 0