Streaming Input in ML

Source: Internet. Editor: 程序博客网. Time: 2024/05/22 06:51

Introduction

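When a dataset is large, loading all of it into memory at once is difficult, so Keras, TensorFlow, and similar libraries provide interfaces for streaming the data in instead.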

1. Keras

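keras.engine.training.Model#fit_generator(self, generator, …)
Compared with Model#fit() below, the two parameters x and y are merged into a single generator parameter. The generator is a user-defined function that returns (X, y) pairs via yield.
keras.engine.training.Model#fit(self, x=None, y=None, …)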

Example:

# Fits the model on data generated batch-by-batch by a Python generator.
def generate_arrays_from_file(path):
    while 1:
        f = open(path)
        for line in f:
            # create Numpy arrays of input data
            # and labels, from each line in the file
            x, y = process_line(line)
            yield (x, y)
        f.close()

model.fit_generator(generate_arrays_from_file('/my_file.txt'),
                    steps_per_epoch=1000, epochs=10)
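The example above leaves process_line undefined and yields one sample at a time. As a minimal sketch of a batched variant, the code below assumes each line holds comma-separated floats with the label in the last column. That file format, the batch_size and convert parameters, and the name generate_batches_from_file are illustrative assumptions, not part of the original example. Plain Python lists keep the sketch dependency-free.

```python
def process_line(line):
    """Parse one 'f1,f2,...,label' line into (features, label).

    The comma-separated layout is an assumption made for this sketch.
    """
    values = [float(v) for v in line.strip().split(",")]
    return values[:-1], values[-1]


def generate_batches_from_file(path, batch_size, convert=lambda rows: rows):
    """Yield (x_batch, y_batch) forever, holding only one batch in memory.

    Pass convert=numpy.asarray to obtain arrays instead of lists.
    """
    while True:  # loop over the file indefinitely, like the example above
        with open(path) as f:
            xs, ys = [], []
            for line in f:
                if not line.strip():
                    continue  # skip blank lines
                x, y = process_line(line)
                xs.append(x)
                ys.append(y)
                if len(xs) == batch_size:
                    yield convert(xs), convert(ys)
                    xs, ys = [], []
            if xs:  # final, smaller batch of this pass over the file
                yield convert(xs), convert(ys)
```

With Keras this could be passed as generate_batches_from_file(path, 32, convert=numpy.asarray), with steps_per_epoch set to the number of batches in one pass over the file. In recent TensorFlow releases fit_generator is deprecated and Model.fit accepts a generator directly.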

2. TensorFlow

2.1 Commonly used classes

tensorflow.python.ops.io_ops.TextLineReader
Outputs the contents of a file one line at a time, splitting on newlines.
- __init__(self, skip_header_lines=None, name=None)
  Constructor.
tensorflow.python.ops.io_ops.ReaderBase
Base class of the various Readers.
- ReaderBase#read(self, queue, name=None)
  Returns the next record (key, value pair) produced by a reader.
tf.decode_csv(records, record_defaults, field_delim=None, use_quote_delim=None, name=None)
Convert CSV records to tensors.
- record_defaults: A list of Tensor objects with types from: float32, int32, int64, string. Each entry sets the type of one column and the value used when that field is empty.
tensorflow.python.training.input.batch(tensors, batch_size, num_threads=1, capacity=32, enqueue_many=False, shapes=None, dynamic_pad=False, allow_smaller_final_batch=False, shared_name=None, name=None)
Reads batches using multiple threads to improve throughput. It can follow a decode_csv op or be used on its own.

2.2 Example

My Python code on GitHub: tensorflow_practice/IO/read_demo_decode_csv.py
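The linked file is not reproduced here. As an independent, minimal sketch of how the classes from section 2.1 might fit together, the code below assumes the TensorFlow 1.x queue-based API (reached through tf.compat.v1 on TensorFlow 2.x) and a CSV file with two float feature columns followed by one integer label. The helper name read_batches, the RECORD_DEFAULTS constant, and the column layout are hypothetical and chosen only for illustration.

```python
# Column types and fallback values for empty fields: two floats, one int.
# This three-column layout is an assumption made for the sketch.
RECORD_DEFAULTS = [[0.0], [0.0], [0]]


def read_batches(csv_paths, batch_size, num_batches):
    """Read num_batches batches from CSV files with the TF 1.x queue pipeline."""
    import tensorflow.compat.v1 as tf  # imported lazily; requires TensorFlow

    with tf.Graph().as_default():  # queues and readers need graph mode
        # Queue of filenames; by default it cycles through them indefinitely.
        filename_queue = tf.train.string_input_producer(csv_paths)

        # TextLineReader emits one (key, value) record per line of the file.
        reader = tf.TextLineReader()
        _, line = reader.read(filename_queue)

        # decode_csv turns one text line into one tensor per column.
        f1, f2, label = tf.decode_csv(line, record_defaults=RECORD_DEFAULTS)
        features = tf.stack([f1, f2])

        # batch groups single examples into tensors of batch_size rows.
        feature_batch, label_batch = tf.train.batch(
            [features, label], batch_size=batch_size)

        results = []
        with tf.Session() as sess:
            coord = tf.train.Coordinator()
            threads = tf.train.start_queue_runners(sess=sess, coord=coord)
            try:
                for _ in range(num_batches):
                    results.append(sess.run([feature_batch, label_batch]))
            finally:
                coord.request_stop()  # stop the background queue threads
                coord.join(threads)
        return results
```

A call such as read_batches(['data.csv'], batch_size=4, num_batches=2), where data.csv is a placeholder path, would return two [features, labels] pairs. Only one batch is materialized per sess.run call, so memory use does not grow with the size of the file.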

References

Geek Academy (极客学院), "Reading data"
Keras official documentation
