TensorFlow Data Reading

Source: Internet | Published by: 国外的知乎 | Editor: 程序博客网 | Time: 2024/06/04 22:32

The communication mechanism behind TensorFlow data reading

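TensorFlow's underlying communication mechanism and its storage serialization are both based on Protobuf, as are some of its definitions, such as the graph. Protobuf is open source. Download: Protocol Buffers - Google’s data interchange format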

Three ways of doing data I/O in TensorFlow

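Preloaded data (constant). The data is embedded directly in the graph, and the graph is passed to a session to run. The data must be small in this case. A Graph in TensorFlow is a directed acyclic graph, so the data is supplied at the moment the graph is defined, and a session can execute only one graph. If the data is embedded in the graph, then whenever the graph is sent to a different device it is first serialized as Protobuf, and this happens everywhere a device is involved. The data therefore gets copied many times, which is very inefficient. Avoid embedding data unless there are only a few small constants.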
```
import tensorflow as tf  # TensorFlow 1.x API

# The data lives inside the graph as constants.
x = tf.constant([1, 2, 3], name='x')
y = tf.constant([4, 5, 6], name='y')
z = tf.add(x, y, name='z')

with tf.Session() as sess:
    print(sess.run(z))
```
Output: [5 7 9]
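Feeding (placeholder and feed_dict). A placeholder stands in for the data, which is filled in at run time. You define the placeholder first and pass the data in afterwards. The advantage is that the graph itself carries no data: the node exists only to provide a way to feed data in. A placeholder is uninitialized when it is declared and holds no data. If nothing is fed to it, TensorFlow raises an error when the operation runs, so always supply data for every placeholder.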
```
import tensorflow as tf  # TensorFlow 1.x API

# Placeholders carry no data until feed_dict supplies it.
x = tf.placeholder(tf.int16)
y = tf.placeholder(tf.int16)
z = tf.add(x, y, name='z')

with tf.Session() as sess:
    xs = [1, 2, 3]
    ys = [5, 6, 7]
    print(sess.run(z, feed_dict={x: xs, y: ys}))
```
Output: [ 6 8 10]
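Reading from files through a pipeline built on the Reader and Queue mechanism. This mechanism (1) follows the producer-consumer pattern, (2) runs independently of the main thread, and (3) performs asynchronous I/O through reader.read(queue) and tf.train.batch(). Three readers are available: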
- tf.TextLineReader()
- tf.WholeFileReader()
- tf.TFRecordReader()
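
The producer-consumer pattern behind this mechanism can be sketched in plain Python. This is only an illustration built on the standard library, not TensorFlow's implementation; the names producer and consume_all are made up for the example. A background thread plays the reader and fills a bounded queue, while the main thread dequeues.

```python
import queue
import threading


def producer(lines, q):
    """Background 'reader': pushes records into the queue, then a sentinel."""
    for line in lines:
        q.put(line)   # blocks while the queue is full
    q.put(None)       # sentinel: no more data


def consume_all(lines, capacity=2):
    """Start the producer thread and dequeue everything on the calling thread."""
    q = queue.Queue(maxsize=capacity)
    t = threading.Thread(target=producer, args=(lines, q), daemon=True)
    t.start()         # the producer now runs independently of the main thread
    received = []
    while True:
        item = q.get()  # the consumer side, comparable to a dequeue operation
        if item is None:
            break
        received.append(item)
    t.join()
    return received


records = consume_all(["1,2", "3,4", "5,6"])
print(records)
```

Because the queue is bounded, the producer never runs far ahead of the consumer, which is the property that lets I/O overlap with computation without unbounded memory use.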

Reading data asynchronously with a Queue in TensorFlow:

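Reading data into a TensorFlow model asynchronously with a Queue takes four steps.

1. Use tf.train.match_filenames_once to find all the .csv files under the local data directory. The matches are passed to filenames as a list.
2. Define a filename_queue from filenames, and specify whether to shuffle the file order (shuffle=False here). With two files A and B and num_epochs=3, the queue receives A, B, A, B, A, B: six entries that all point at the same two files.
3. Read with tf.TextLineReader(). Each execution of read takes one line from a file. Feeding filename_queue to reader.read returns a key and a value; the key is of little use here, and the value is the content of the line.
4. Parse the value with decode_csv, which converts the line into a list of tensors. record_defaults supplies the default value for each column; no meaningful defaults are defined here, so 'null' is passed for both columns. The results are then used inside a tf.Session.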
```
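import tensorflow as tf  # TensorFlow 1.x API

# Match every .csv file under the local data directory.
filenames = tf.train.match_filenames_once('./data/*.csv')
# Filename queue: fixed order, each file listed once per epoch.
filename_queue = tf.train.string_input_producer(filenames, shuffle=False, num_epochs=3)
# Each read returns one line of one file as (key, value).
reader = tf.TextLineReader()
_, value = reader.read(filename_queue)
# Parse the line into two string columns; 'null' fills empty fields.
example, label = tf.decode_csv(value, record_defaults=[['null'], ['null']])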
```
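
To make the filename queue and the CSV defaults more concrete, here is a rough plain-Python sketch of what the steps above compute. It is not the TensorFlow API: filename_sequence and decode_csv_line are hypothetical helpers, the file names are invented, and the sketch keeps every field as a string rather than converting types.

```python
import csv
import io
import random


def filename_sequence(filenames, num_epochs, shuffle=False, seed=None):
    """List every filename once per epoch, optionally shuffled within each epoch."""
    rng = random.Random(seed)
    sequence = []
    for _ in range(num_epochs):
        epoch = list(filenames)
        if shuffle:
            rng.shuffle(epoch)
        sequence.extend(epoch)
    return sequence


def decode_csv_line(line, record_defaults):
    """Split one CSV line; empty fields fall back to the column default."""
    fields = next(csv.reader(io.StringIO(line)))
    if len(fields) != len(record_defaults):
        raise ValueError(
            "expected %d fields, got %d" % (len(record_defaults), len(fields))
        )
    return [f if f != "" else d for f, d in zip(fields, record_defaults)]


# Two files and three epochs give six queue entries.
print(filename_sequence(["A.csv", "B.csv"], num_epochs=3))
# The empty second field is replaced by its default.
print(decode_csv_line("3.5,", ["null", "null"]))
```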
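tf.local_variables_initializer() initializes the local variables the pipeline depends on, including the epoch counter created by num_epochs. Next, create a tf.train.Coordinator, which acts as a thread manager. Before running any training step you must call tf.train.start_queue_runners, otherwise the dataflow graph hangs forever. This function starts the threads of the input pipeline, which fill the queue with examples so that dequeue operations have something to take. It is best used together with a tf.train.Coordinator, so that the threads are shut down cleanly when an error occurs. If you limit the number of training epochs, an epoch counter is used, and it must be initialized. Because tf.train.start_queue_runners launches background threads behind the scenes, a Coordinator is needed to manage them. Once the queues are running, sess.run([example, label]) keeps taking [example, label] pairs from the queue; the code below does so 5 times. With num_epochs=3 and two files, the filename queue holds 6 entries; if each file contains two records, sess.run can be called 12 times before the queue is exhausted. When all threads are done, call coord.request_stop() and coord.join(threads) to exit.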
```
init_op = tf.local_variables_initializer()

with tf.Session() as sess:
    sess.run(init_op)
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(coord=coord)

    # Alternative: read until the epochs are exhausted.
    # try:
    #     while not coord.should_stop():
    #         print(sess.run([example, label]))
    # except tf.errors.OutOfRangeError:
    #     print('Epochs complete!')
    # finally:
    #     coord.request_stop()

    for _ in range(5):
        print(sess.run([example, label]))
    coord.request_stop()
    coord.join(threads)
```
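
The role of the Coordinator can be illustrated with a small stand-in written with the standard library. MiniCoordinator below is a hypothetical class for this sketch, not the TensorFlow one; it only mirrors the three calls used above (should_stop, request_stop, join) on top of a threading.Event.

```python
import threading
import time


class MiniCoordinator:
    """Hypothetical stand-in for a thread coordinator (not the TensorFlow class)."""

    def __init__(self):
        self._stop = threading.Event()

    def should_stop(self):
        return self._stop.is_set()

    def request_stop(self):
        self._stop.set()

    def join(self, threads):
        for t in threads:
            t.join()


def worker(coord, counter, lock):
    # Keep working until someone asks every thread to stop.
    while not coord.should_stop():
        with lock:
            counter[0] += 1
        time.sleep(0.001)


coord = MiniCoordinator()
counter, lock = [0], threading.Lock()
threads = [
    threading.Thread(target=worker, args=(coord, counter, lock)) for _ in range(2)
]
for t in threads:
    t.start()
time.sleep(0.05)      # let the workers run briefly
coord.request_stop()  # one call signals all threads
coord.join(threads)   # wait for them to exit
print(all(not t.is_alive() for t in threads))
```

The point of the shared stop flag is that any thread, or the main loop, can ask all the others to finish, so no background thread is left running after an error.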
The TensorFlow dataflow graph

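Build the dataflow graph, with three files A, B and C as input. The graph is made up of pipeline stages connected by queues. The first stage generates the filenames, which are read and enqueued in the filename queue. The second stage reads data from the filename queue (using a Reader), produces examples, and places them in an example queue. After the Reader reads the data it hands it to the CSV decoder, and the decoded result goes into the example queue. You do not have to implement this example queue yourself or handle it explicitly in TensorFlow. There are thus two queues behind sess.run([example, label]). During sess.run the model keeps taking data from the in-memory queue; when that queue runs out of data, tf.errors.OutOfRangeError is raised, which the earlier code reports as 'Epochs complete!'.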
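
A two-stage pipeline of this shape can be sketched with two standard-library queues. Everything here is an assumption made for illustration: the in-memory FILES dictionary, the DONE sentinel, and the function names are not part of TensorFlow, and the sentinel merely plays the role that OutOfRangeError plays in the real pipeline.

```python
import queue
import threading

# Hypothetical in-memory "files": two CSV files with two records each.
FILES = {"A.csv": ["1,a", "2,b"], "B.csv": ["3,c", "4,d"]}
DONE = object()  # sentinel marking the end of a queue


def fill_filename_queue(filename_q, num_epochs):
    # Stage 1: enqueue every filename once per epoch.
    for _ in range(num_epochs):
        for name in FILES:
            filename_q.put(name)
    filename_q.put(DONE)


def read_examples(filename_q, example_q):
    # Stage 2: read each file line by line, decode it, enqueue the example.
    while True:
        name = filename_q.get()
        if name is DONE:
            example_q.put(DONE)
            return
        for line in FILES[name]:
            example, label = line.split(",")
            example_q.put((example, label))


def run_pipeline(num_epochs=3):
    filename_q, example_q = queue.Queue(), queue.Queue(maxsize=4)
    stages = [
        threading.Thread(target=fill_filename_queue, args=(filename_q, num_epochs)),
        threading.Thread(target=read_examples, args=(filename_q, example_q)),
    ]
    for t in stages:
        t.start()
    results = []
    while True:               # the "training loop" only dequeues
        item = example_q.get()
        if item is DONE:      # stands in for the out-of-range signal
            break
        results.append(item)
    for t in stages:
        t.join()
    return results


examples = run_pipeline()
print(len(examples))
```

With two files, two records per file, and three epochs, the loop receives twelve examples before the end signal arrives, which matches the count given earlier for the TensorFlow version.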

TFRecord, TensorFlow's standard data format

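TFRecord has two main advantages. First, TFRecords gives different kinds of input files one unified format. Second, TFRecords saves space: the data is stored in binary files serialized with Protocol Buffers. TFRecords itself works much like Protobuf, can store data of any form, and in effect wraps up operations such as data parsing and multithreading.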
```
message Example {
    Features features = 1;
};

message Features {
    map<string, Feature> feature = 1;
};

message Feature {
    oneof kind {
        BytesList bytes_list = 1;
        FloatList float_list = 2;
        Int64List int64_list = 3;
    }
};
```
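In Features features = 1, Features is the field type and features is the field name, much like a variable's type and name; 1 is the field number, which serves as the tag. message Features defines a map whose key is a string and whose value is a Feature. message Feature is defined as oneof kind, so a value holds exactly one of three types: BytesList (bytes or strings), FloatList (float), or Int64List (int). The tag identifies where a field is stored, and it is used during deserialization.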
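
How the field number ends up in the serialized bytes can be shown with a few lines of Python. This is a minimal sketch of the Protobuf wire format for a single length-delimited field, written by hand for illustration; the helper names are made up, and decode_key only handles keys that fit in one byte.

```python
def encode_varint(n):
    """Encode a non-negative integer as a Protobuf base-128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_key(field_number, wire_type):
    """The key written before each field: (field_number << 3) | wire_type."""
    return encode_varint((field_number << 3) | wire_type)


def encode_bytes_field(field_number, payload):
    """Length-delimited field (wire type 2): key, length, raw bytes."""
    return encode_key(field_number, 2) + encode_varint(len(payload)) + payload


def decode_key(byte_value):
    """Split a single-byte key back into (field_number, wire_type)."""
    return byte_value >> 3, byte_value & 0x07


encoded = encode_bytes_field(1, b"abc")
print(encoded.hex())
print(decode_key(encoded[0]))
```

The first byte of the encoded field carries the field number, so a decoder can tell which field the following bytes belong to without any field names being stored.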

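Converting raw .csv data to TFRecord storage: whatever the data type, the data is first converted to TFRecord form and written to storage; the Queue is then started, and the records are read into the program's model.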

    import tensorflow as tf
    import numpy as np
    import pandas as pd

    # Load the training data and split off the label column.
    train_frame = pd.read_csv("train.csv")
    print(train_frame.head())
    train_labels_frame = train_frame.pop(item="label")
    train_values = train_frame.values
    train_labels = train_labels_frame.values
    print("values shape1:", train_values.shape)
    print("labels shape2:", train_labels.shape)

Output: [image in the original post; not recoverable]
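
As a rough illustration of the conversion step described above, the sketch below maps CSV rows onto a nested dictionary shaped like the Example, Features, Feature messages. It is plain Python, not the TensorFlow writer: the CSV contents, the column names pixel0 and pixel1, and the feature keys "label" and "values" are all assumptions made for the example.

```python
import csv
import io

# Hypothetical stand-in for train.csv: a "label" column plus value columns.
CSV_TEXT = "label,pixel0,pixel1\n7,0,255\n2,12,34\n"


def row_to_example(row):
    """Build a dict shaped like Example -> Features -> Feature for one row."""
    row = dict(row)
    label = int(row.pop("label"))              # mirrors popping the label column
    values = [float(v) for v in row.values()]  # remaining columns are the values
    return {
        "features": {
            "feature": {
                "label": {"int64_list": {"value": [label]}},
                "values": {"float_list": {"value": values}},
            }
        }
    }


examples = [row_to_example(r) for r in csv.DictReader(io.StringIO(CSV_TEXT))]
print(examples[0])
```

Each row becomes one record: the label goes into an int64_list and the remaining columns into a float_list, which is the kind of per-feature typing that the oneof in message Feature enforces.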
