Keras: training on large datasets by loading data into memory iteratively


When a dataset is too large to load into memory all at once, Keras can train from an iterator (generator).

Note: I modified the official Keras demo at https://github.com/fchollet/keras/blob/master/examples/imdb_cnn.py

1. A few observations: reading data from a file during training lowers GPU utilization; if the data can be loaded into memory up front, GPU utilization is higher.

Conclusion: with everything loaded into memory, GPU utilization reaches 82%; reading data from disk while training, it only reaches 48%.


2. Keras trains on large datasets through a generator. The idea is simply that the generator reads the file sequentially, one batch_size of samples at a time, so the training data must be shuffled in advance (a small shuffling sketch follows the data description below).

Example: in the data used here, each line is comma-separated; the first 400 values are features and the last value is the label.
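Since the generator reads the file strictly in order, shuffle the file once before training. A minimal sketch, assuming one comma-separated sample per line; the file names train_raw.txt and train are illustrative, not from the original post:

import random

# One-off shuffle of the training file, done before any training run.
with open('train_raw.txt') as f:   # hypothetical unshuffled input, one sample per line
    lines = f.readlines()

random.shuffle(lines)              # shuffle the whole dataset in memory

with open('train', 'w') as f:      # './train' is the file passed to fit_generator below
    f.writelines(lines)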


The official Keras demo is as follows:

def generate_arrays_from_file(path):
    while 1:
        f = open(path)
        for line in f:
            # create Numpy arrays of input data
            # and labels, from each line in the file
            x, y = process_line(line)
            yield (x, y)
        f.close()

model.fit_generator(generate_arrays_from_file('/my_file.txt'),
        samples_per_epoch=10000, nb_epoch=10)
Note: the official demo has a flaw: it does not implement batch_size, so it yields only one sample at a time. For the dataset described above, I implemented a generator that yields batch_size samples at a time:

import numpy as np

def process_line(line):
    tmp = [int(val) for val in line.strip().split(',')]
    x = np.array(tmp[:-1])
    y = np.array(tmp[-1:])
    return x, y

def generate_arrays_from_file(path, batch_size):
    while 1:
        f = open(path)
        cnt = 0
        X = []
        Y = []
        for line in f:
            # create Numpy arrays of input data
            # and labels, from each line in the file
            x, y = process_line(line)
            X.append(x)
            Y.append(y)
            cnt += 1
            if cnt == batch_size:
                cnt = 0
                yield (np.array(X), np.array(Y))
                X = []
                Y = []
        f.close()
        # Note: any samples left over at the end of the file (fewer than
        # batch_size) are discarded for that pass over the file.

The training code is as follows:

model.fit_generator(generate_arrays_from_file('./train', batch_size=batch_size),
        samples_per_epoch=25024, nb_epoch=nb_epoch, validation_data=(X_test, y_test),
        max_q_size=1000, verbose=1, nb_worker=1)

3. About samples_per_epoch:

My training file train has only 25,000 rows and batch_size=32. In principle samples_per_epoch should be 25000, but that produces the warning: UserWarning: Epoch comprised more than `samples_per_epoch` samples, which might affect learning results


Explanation: the warning appears because the number of training samples divided by batch_size is not an integer. Set samples_per_epoch = ceil(train_num / batch_size) * batch_size. With this setting the accuracy is 88.72%.
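As a quick check of that formula with the numbers from this post (train_num = 25000, batch_size = 32):

import math

train_num = 25000
batch_size = 32
samples_per_epoch = math.ceil(train_num / batch_size) * batch_size
print(samples_per_epoch)  # 782 * 32 = 25024, the value used in the fit_generator call above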


The Keras demo instead loads all of the data into memory for training:


The demo's result is 88.86%, so reading the data this way works essentially as well. But be sure to shuffle the data first. And if the data does fit in memory, load it all; training will be noticeably faster.


How can I use Keras with datasets that don't fit in memory?

You can do batch training using model.train_on_batch(X, y) and model.test_on_batch(X, y). See the models documentation.
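A minimal sketch of that batch-by-batch loop, assuming a compiled model and the generate_arrays_from_file generator defined above (the epoch and batch counts are only illustrative):

gen = generate_arrays_from_file('./train', batch_size=32)
batches_per_epoch = 25000 // 32   # full batches per pass; the generator drops the leftover samples

for epoch in range(10):
    for _ in range(batches_per_epoch):
        X_batch, y_batch = next(gen)
        loss = model.train_on_batch(X_batch, y_batch)  # one gradient update per batch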

Alternatively, you can write a generator that yields batches of training data and use the method model.fit_generator(data_generator, steps_per_epoch, epochs).

You can see batch training in action in our CIFAR10 example.
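Note that steps_per_epoch in the newer API counts batches, not samples, unlike the samples_per_epoch argument used elsewhere in this post. A rough equivalent of the earlier call in the newer interface (a sketch, not the code actually run here):

model.fit_generator(generate_arrays_from_file('./train', batch_size=32),
                    steps_per_epoch=782,  # batches per epoch, roughly ceil(25000 / 32)
                    epochs=10)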

Reference: http://blog.csdn.net/xinfeng2005/article/details/71600652

Code 1: image classification

import codecs
import cv2
import numpy as np
from keras.preprocessing.image import ImageDataGenerator, array_to_img, img_to_array, load_img
from keras.layers import *
from keras.models import *
from keras.callbacks import *
from visual_callbacks import AccLossPlotter

plotter = AccLossPlotter(graphs=['acc', 'loss'], save_graph=True)

class LossHistory(Callback):
    def on_train_begin(self, logs={}):
        self.losses = []

    def on_batch_end(self, batch, logs={}):
        self.losses.append(logs.get('loss'))

datagen = ImageDataGenerator(
        rotation_range=0,
        width_shift_range=0.1,
        height_shift_range=0.1,
        rescale=1./255,
        shear_range=0.1,
        zoom_range=0.1,
        horizontal_flip=False,
        fill_mode='nearest')

train_generator = datagen.flow_from_directory(
        r'chars_rec\train',  # this is the target directory
        target_size=(32, 32),  # all images will be resized to 32x32
        batch_size=32,
        shuffle=True,
        class_mode='categorical', color_mode='grayscale')  # categorical labels for categorical_crossentropy

print(train_generator.nb_class)
class_count = train_generator.nb_class
# print(train_generator.class_indices)
# print(type(train_generator.class_indices))
np.save('class_indices.txt', train_generator.class_indices)  # saved to class_indices.txt.npy

Tested and working:
import os
import cv2
import numpy as np
from keras.preprocessing.image import ImageDataGenerator, array_to_img, img_to_array, load_img
from keras.layers import *
from keras.models import *
from keras.callbacks import *
# from visual_callbacks import AccLossPlotter
# plotter = AccLossPlotter(graphs=['acc', 'loss'], save_graph=True)

class LossHistory(Callback):
    def on_train_begin(self, logs={}):
        self.losses = []

    def on_batch_end(self, batch, logs={}):
        self.losses.append(logs.get('loss'))

datagen = ImageDataGenerator(
        rotation_range=0,
        width_shift_range=0.1,
        height_shift_range=0.1,
        rescale=1./255,
        shear_range=0.1,
        zoom_range=0.1,
        horizontal_flip=False,
        fill_mode='nearest')

train_generator = datagen.flow_from_directory(
        r'./../mouse_data\data',  # this is the target directory
        target_size=(32, 32),  # all images will be resized to 32x32
        batch_size=32,
        shuffle=True,
        class_mode='categorical', color_mode='grayscale')  # categorical labels for categorical_crossentropy

print(train_generator.samples)
class_count = train_generator.num_class
print(class_count)

class_indices = np.load('class_indices.txt.npy')
print(class_indices)
# print(type(class_indices))
class_indices = class_indices.tolist()  # 0-d object array back to a Python dict
# print(type(class_indices))
value_indices = {v: k for k, v in class_indices.items()}  # class index -> class name
# exit()
validation_generator = datagen.flow_from_directory(
        r'chars_rec\valication',  # this is the target directory
        target_size=(32, 32),  # all images will be resized to 32x32
        batch_size=32,
        class_mode='categorical', color_mode='grayscale')  # categorical labels for categorical_crossentropy

######################################################
model = Sequential()
model.add(Conv2D(32, 3, 3, input_shape=(32, 32, 1), border_mode='same', activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Conv2D(32, 3, 3, border_mode='same', activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Conv2D(64, 3, 3, border_mode='same', activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())  # this converts our 3D feature maps to 1D feature vectors
model.add(Dense(128))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(class_count))
model.add(Activation('softmax'))
###################################################
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
# Save the weights with the lowest validation loss: whenever validation loss
# improves, the checkpoint is written out immediately.
checkpointer = ModelCheckpoint(filepath="chars_rec.hdf5", verbose=1, save_best_only=True)
history = LossHistory()
if os.path.exists('chars_rec.hdf5'):
    model = load_model('chars_rec.hdf5')
model.fit_generator(
        train_generator,
        # steps_per_epoch=2000,  # // batch_size  (Keras 2 style)
        # epochs=50,
        samples_per_epoch=9150,  # // batch_size
        nb_epoch=500,
        validation_data=validation_generator,
        nb_val_samples=1062,
        callbacks=[checkpointer, history]  # plotter is commented out in this version
        )
# validation_steps=800
model.save('chars_rec_end.hdf5')
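For completeness, a small sketch of how value_indices (built above) can map a prediction back to a class name; img is a hypothetical preprocessed grayscale image of shape (1, 32, 32, 1) scaled to [0, 1], not part of the original post:

probs = model.predict(img)             # class probabilities from the trained model
pred_index = int(np.argmax(probs[0]))  # index of the most likely class
print(value_indices[pred_index])       # folder/class name for that index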

Code 2: text indexing (sequence labeling)

def getXY_gen(batch_size=32):
    # f = file(".\\SheKeYuan_YinWen0303_train.utf8")
    # data = f.read()[0:].decode('utf-8')
    # f.close()
    f = open(".\\train_470000_0427.utf8", 'r', encoding='utf-8')
    lines = f.readlines()
    f.close()
    # print(lines[0:10])
    # exit()
    X = []
    Y = []
    for i in range(len(lines)):  # tqdm(range(len(lines))):
        line = lines[i].strip()
        # print(i+1)  # , line)
        # if i >= 100000:
        #     break
        x = []
        y = []
        y_temp = []
        for j, string in enumerate(line.split('  ')):
            if string.find('=') < 0:
                continue
            str, label = string.rsplit('=', 1)
            # print(str, label)
            label_num = int(label) + 1
            if len(str) > 1 and label_num < 100:
                for k, s in enumerate(str):
                    x.append(ord(s))
                    lab = [0] * class_label_count
                    if k == 0:
                        lab[label_num - 1] = 1
                        y_temp.append(label_num - 1)
                    elif k == len(str) - 1:
                        lab[label_num - 1 + 2] = 1
                        y_temp.append(label_num - 1 + 2)
                    else:
                        lab[label_num - 1 + 1] = 1
                        y_temp.append(label_num - 1 + 1)
                    y.append(lab)
            elif len(str) == 1:
                x.append(ord(str[0]))
                lab = [0] * class_label_count
                lab[label_num - 1 + 3] = 1
                # if label_num - 1 + 3 == 99:
                #     print('xxx')
                y.append(lab)
                y_temp.append(label_num - 1 + 3)
        # print(x)
        # print(y_temp)
        if len(x) < max_len:
            for a in range(0, max_len - len(x)):
                x.append(0)
                lab = [0] * class_label_count
                lab[87] = 1
                y.append(lab)
        X.append(x[0:max_len])
        Y.append(y[0:max_len])
        if len(X) == batch_size:
            # print(np.shape(X))
            # print(X)
            x1 = X[0:batch_size]
            y1 = Y[0:batch_size]
            X = []
            Y = []
            yield np.array(x1), np.array(y1)
sequence = Input(shape=(max_len,), dtype='int32')
embedded = Embedding(65536, 128, input_length=max_len, mask_zero=True, trainable=False)(sequence)
blstm = Bidirectional(LSTM(64, return_sequences=True, dropout_U=0.5, dropout_W=0.5), merge_mode='sum')(embedded)
# blstm = (LSTM(128, return_sequences=True, dropout_U=0.5, dropout_W=0.5), merge_mode='sum')(embedded)
output = TimeDistributed(Dense(class_label_count, activation='softmax'))(blstm)
model = Model(input=sequence, output=output)
if os.path.exists('bilstm_0510.hdf5'):
    model = load_model('bilstm_0510.hdf5')
model.layers[1].trainable = False  # do not train the Embedding layer any further
adam = Adam(lr=0.01, beta_1=0.9, beta_2=0.999, epsilon=1e-08, decay=0.0)
# sgd = SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True)
# other losses: mse mae mape msle binary_crossentropy categorical_crossentropy
# sparse_categorical_crossentropy kullback_leibler_divergence cosine_proximity
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# Save the weights with the lowest validation loss: whenever validation loss
# improves, the checkpoint is written out immediately.
checkpointer = ModelCheckpoint(filepath="bilstm_0510.hdf5", verbose=0, save_best_only=True)
history = LossHistory()
# history = model.fit(np.array(x_train), np.array(y_train).reshape((-1, max_len, class_label_count)),
#                     batch_size=32, nb_epoch=500, validation_data=(x_test, y_test),
#                     callbacks=[checkpointer, history, plotter],
#                     verbose=1
#                     )
model.fit_generator(getXY_gen(batch_size=32), samples_per_epoch=32*100, nb_epoch=10,
                    verbose=1,
                    callbacks=[checkpointer, history]
                    )
