First blog post -- Learning TensorFlow (1)

Let me get my thoughts in order. My research topic is word sense disambiguation. I read the Google paper that does word sense disambiguation with a neural network and tried to reproduce it in Keras, but the Keras results were never any good. After discussing it with my senior labmate, he firmly suggested that I reproduce the authors' experiments in TensorFlow. So I started learning TF. Over the holiday I watched Mofan's tutorial videos at home, but they only cover the basics, and reading the code on GitHub was still a struggle: unlike Keras, whose Chinese localization is done so well, TF's many methods have no Chinese usage documentation.

So what to do? I contacted the paper's author to ask for the source code and was told it would be open-sourced shortly. So I'm switching strategy: rather than rushing to write the code for my experiments, I'll concentrate on learning TF properly. (If my advisor pushes, let him push; haste makes waste.) Learning something like this demands a clear head, so my plan now is to read one piece of code from beginning to end and understand what every step means. Diving straight into the English documentation is bewildering, and there is no way to memorize that many methods at once, hence this approach.

Starting from the code in this tutorial:

http://www.tensorfly.cn/tfdoc/tutorials/recurrent.html

https://github.com/tensorflow/models/tree/master/tutorials/rnn/ptb

Start from if __name__ == '__main__'. I never used to know what this line was for. What it means is: the code block under it runs only when the module is executed directly; when the module is imported, it does not run.
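A minimal sketch of the idiom (the file and function names here are made up for illustration):

# demo.py (hypothetical file name)
def greet():
    return "hello"

if __name__ == "__main__":
    # runs on `python demo.py`, but not on `import demo`
    print(greet())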

Now for the code proper, taken step by step in the order the program runs. Simple notes go as comments to the right of the code; for the more complex functions I add an explanation below the code.

# Copyright 2015 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================

"""Example / benchmark for building a PTB LSTM model.

Trains the model described in:                                    # trains the RNN from this paper
(Zaremba, et. al.) Recurrent Neural Network Regularization
http://arxiv.org/abs/1409.2329

There are 3 supported model configurations:
===========================================
| config | epochs | train | valid  | test
===========================================
| small  | 13     | 37.99 | 121.39 | 115.91
| medium | 39     | 48.45 |  86.16 |  82.07
| large  | 55     | 37.87 |  82.62 |  78.29
The exact results may vary depending on the random initialization.  # the exact numbers vary with the random initialization

The hyperparameters used in the model:                              # the parameters the model uses
- init_scale - the initial scale of the weights
- learning_rate - the initial value of the learning rate
- max_grad_norm - the maximum permissible norm of the gradient      # the clipping threshold for the gradient norm
- num_layers - the number of LSTM layers
- num_steps - the number of unrolled steps of LSTM                  # this is the time_step count, i.e. how many words are fed in per sequence
- hidden_size - the number of LSTM units
- max_epoch - the number of epochs trained with the initial learning rate   # epochs at the initial learning rate
- max_max_epoch - the total number of epochs for training
- keep_prob - the probability of keeping weights in the dropout layer       # i.e. 1 - dropout rate
- lr_decay - the decay of the learning rate for each epoch after "max_epoch"  # learning-rate decay
- batch_size - the batch size

The data required for this example is in the data/ dir of the
PTB dataset from Tomas Mikolov's webpage:

$ wget http://www.fit.vutbr.cz/~imikolov/rnnlm/simple-examples.tgz
$ tar xvf simple-examples.tgz

To run:

$ python ptb_word_lm.py --data_path=simple-examples/data/
"""

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import time

import numpy as np
import tensorflow as tf

import reader

# Before building and training the model we first need to set some parameters.
# In TF, tf.flags can be used for global parameter settings.
flags = tf.flags
logging = tf.logging

flags.DEFINE_string(
    "model", "small",
    "A type of model. Possible options are: small, medium, large.")  # defines flag "model" with default "small"; the string after it is the help text
flags.DEFINE_string("data_path", None,
                    "Where the training/test data is stored.")       # where the downloaded data is stored
flags.DEFINE_string("save_path", None,
                    "Model output directory.")                       # where to write the model output
flags.DEFINE_bool("use_fp16", False,
                  "Train using 16-bit floats instead of 32bit floats")  # whether to use the float16 format
FLAGS = flags.FLAGS                                                  # the value of flag "model" can then be read as FLAGS.model
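With these flags defined, their defaults can be overridden on the command line when launching the script, in the same style as the docstring's example (--model=medium is just an illustrative choice):

$ python ptb_word_lm.py --model=medium --data_path=simple-examples/data/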

Here is a fairly good blog post on these parameters; I'll read it first and then continue: http://www.cnblogs.com/wuzhitj/p/6297992.html

init_scale = 0.1     # initial weights are drawn from a uniform distribution over [-init_scale, +init_scale]
learning_rate = 1.0  # the learning rate; it starts decaying once the epoch count passes max_epoch
max_grad_norm = 5    # controls exploding gradients: if the L2 norm of the gradient vector exceeds max_grad_norm, it is scaled down proportionally
num_layers = 2       # number of LSTM layers
num_steps = 20       # length of the sequence within a single sample
hidden_size = 200    # number of units in the hidden layer
max_epoch = 4        # while epoch < max_epoch, lr_decay = 1; once epoch > max_epoch, lr_decay shrinks gradually
max_max_epoch = 13   # total number of passes over the text
keep_prob = 1.0      # used for dropout: on each batch, every unit is switched off with probability 1 - keep_prob, which guards against overfitting
lr_decay = 0.5       # learning-rate decay factor
batch_size = 20      # size of each batch of data, 20 samples per batch
vocab_size = 10000   # vocabulary size, 10K words in total
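As a side note on how learning_rate, max_epoch, max_max_epoch and lr_decay fit together, here is a small stand-alone sketch of the decay schedule (my reading of the training loop further down in ptb_word_lm.py, not a verbatim excerpt):

learning_rate = 1.0
max_epoch = 4
max_max_epoch = 13
lr_decay = 0.5

for i in range(max_max_epoch):
    # no decay for the first max_epoch epochs, then multiply by lr_decay once per epoch
    decay = lr_decay ** max(i + 1 - max_epoch, 0.0)
    print("epoch %2d: lr = %g" % (i + 1, learning_rate * decay))
# epochs 1-4 run at lr = 1.0; epoch 5 at 0.5, epoch 6 at 0.25, and so on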

if __name__ == "__main__":
  tf.app.run()
This is the only line outside the function definitions; see this post for background: http://blog.csdn.net/helei001/article/details/51859423. To be honest I don't fully understand it yet either; let's read on.
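For what it's worth, in TF 1.x tf.app.run() parses the command-line flags into FLAGS and then calls the module's main() function. A minimal self-contained sketch of the whole pattern (not the PTB code itself):

import tensorflow as tf

flags = tf.flags
flags.DEFINE_string("model", "small", "A type of model.")
FLAGS = flags.FLAGS

def main(_):
    # by the time main() runs, tf.app.run() has already parsed sys.argv into FLAGS
    print("model =", FLAGS.model)

if __name__ == "__main__":
    tf.app.run()   # parse flags, then call main()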

The main function:

(To read the code documentation you have to go to the official English-language site, which needs a VPN to access from here.)

def main(_):
  if not FLAGS.data_path:                                  # if data_path is None, raise an error
    raise ValueError("Must set --data_path to PTB data directory")

  raw_data = reader.ptb_raw_data(FLAGS.data_path)

The source of reader.ptb_raw_data is as follows:

def ptb_raw_data(data_path=None):
  """Load PTB raw data from data directory "data_path".

  Reads PTB text files, converts strings to integer ids,
  and performs mini-batching of the inputs.

  The PTB dataset comes from Tomas Mikolov's webpage:

  http://www.fit.vutbr.cz/~imikolov/rnnlm/simple-examples.tgz

  Args:
    data_path: string path to the directory where simple-examples.tgz has
      been extracted.

  Returns:
    tuple (train_data, valid_data, test_data, vocabulary)
    where each of the data objects can be passed to PTBIterator.
  """
  # the path and file name of each data file
  train_path = os.path.join(data_path, "ptb.train.txt")
  valid_path = os.path.join(data_path, "ptb.valid.txt")
  test_path = os.path.join(data_path, "ptb.test.txt")

  word_to_id = _build_vocab(train_path)
def _build_vocab(filename):
  data = _read_words(filename)   # replaces every newline in the text with <eos>, then split();
                                 # returns one list of all the words in order, e.g.
                                 # "I have a pen . \n" becomes ['I', 'have', 'a', 'pen', '.', '<eos>']

where _read_words is:

def _read_words(filename):
  with tf.gfile.GFile(filename, "r") as f:
    return f.read().decode("utf-8").replace("\n", "<eos>").split()

  counter = collections.Counter(data)                                # counts the words; returns e.g. Counter({'N': 2, '<eos>': 2}),
                                                                     # displayed with the largest counts first
  count_pairs = sorted(counter.items(), key=lambda x: (-x[1], x[0])) # converts to a list of tuples, e.g. [('<eos>', 2), ('N', 2), ...],
                                                                     # ordered by count descending, ties broken alphabetically
  words, _ = list(zip(*count_pairs))                                 # unzips into two tuples, one holding all the words and one all the
                                                                     # counts, index-aligned: ('<eos>', 'N') and (2, 2)
                                                                     # (see http://www.cnblogs.com/frydsh/archive/2012/07/10/2585370.html
                                                                     # for an explanation of zip)
  word_to_id = dict(zip(words, range(len(words))))                   # builds a dict such as {'<eos>': 0, 'N': 1}; the more frequent the
                                                                     # word, the smaller its id
  return word_to_id
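To make the whole pipeline concrete, here is a minimal stand-alone rerun of the same steps on a toy corpus (pure Python; the file read is replaced by an in-memory string, and I pad <eos> with spaces because, unlike the PTB files, my toy lines have no surrounding whitespace):

import collections

text = "the cat sat\nthe cat ran\n"
data = text.replace("\n", " <eos> ").split()   # ['the','cat','sat','<eos>','the','cat','ran','<eos>']
counter = collections.Counter(data)            # Counter({'the': 2, 'cat': 2, '<eos>': 2, 'sat': 1, 'ran': 1})
count_pairs = sorted(counter.items(), key=lambda x: (-x[1], x[0]))
words, _ = list(zip(*count_pairs))             # ('<eos>', 'cat', 'the', 'ran', 'sat')
word_to_id = dict(zip(words, range(len(words))))
print(word_to_id)                              # {'<eos>': 0, 'cat': 1, 'the': 2, 'ran': 3, 'sat': 4}
ids = [word_to_id[w] for w in data]            # what _file_to_word_ids produces: [2, 1, 4, 0, 2, 1, 3, 0]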
  train_data = _file_to_word_ids(train_path, word_to_id)
  valid_data = _file_to_word_ids(valid_path, word_to_id)
  test_data = _file_to_word_ids(test_path, word_to_id)
def _file_to_word_ids(filename, word_to_id):
  data = _read_words(filename)                                       # newlines replaced with <eos>, exactly as above
  return [word_to_id[word] for word in data if word in word_to_id]   # converts the file's word list into the corresponding list of ids
  vocabulary = len(word_to_id)                                       # how many distinct words there are
  return train_data, valid_data, test_data, vocabulary               # the id list for each file, plus the vocabulary size
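Putting it together, a call to reader.ptb_raw_data might look like this (the path assumes simple-examples.tgz was unpacked as in the docstring; the training-set size is the commonly quoted PTB figure, so treat it as approximate):

import reader

train_data, valid_data, test_data, vocabulary = reader.ptb_raw_data("simple-examples/data/")
print(vocabulary)       # 10000 -- matches vocab_size above
print(len(train_data))  # roughly 929k word ids in the PTB training text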






                                             