Keras: training on large datasets by loading data into memory iteratively
When training Keras on data too large to load into memory at once, use an iterator (generator).
Note: I made my modifications on top of the official Keras demo: https://github.com/fchollet/keras/blob/master/examples/imdb_cnn.py
1. A few notes first: reading data from a file during training lowers GPU utilization; if the data can be loaded into memory directly, GPU utilization is noticeably higher.
Conclusion: with everything loaded into memory, GPU utilization reaches 82%; loading data while training only reaches 48%.
2. Keras implements training on large data with an iterator. The idea is simple: the iterator reads data from the file sequentially, one batch_size chunk at a time. Because of this sequential reading, you must shuffle your training data beforehand.
For example, suppose each row of the data file contains 400 feature columns followed by one label column.
The official Keras demo handles it as follows:
Note: the official demo has a flaw: it does not implement batch_size and extracts only one sample at a time. For the dataset above I implemented an iterator that extracts batch_size samples at a time, along with the code used at training time; both are sketched below.
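A minimal sketch of such a batch generator, assuming a plain-text file in which each line holds 400 space-separated feature values followed by one label; the file name train.txt and the function name batch_generator are illustrative, not the original listing:

import numpy as np

def batch_generator(path, batch_size=32):
    # Illustrative sketch. fit_generator expects a generator that never
    # terminates, so loop over the file indefinitely.
    while True:
        with open(path) as f:
            X, Y = [], []
            for line in f:
                values = line.strip().split(' ')
                X.append([float(v) for v in values[:400]])  # first 400 columns: features
                Y.append(int(values[400]))                  # last column: label
                if len(X) == batch_size:
                    yield np.array(X), np.array(Y)
                    X, Y = [], []

At training time the generator is passed to fit_generator (Keras 1 signature, as used elsewhere in this post; `model` is assumed to be a compiled Keras model):

model.fit_generator(batch_generator('train.txt', batch_size=32),
                    samples_per_epoch=25024, nb_epoch=10)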
3. About samples_per_epoch:
My training set has only 25,000 rows and batch_size=32. By rights samples_per_epoch should be 25000, but that produces a warning: UserWarning: Epoch comprised more than `samples_per_epoch` samples, which might affect learning results
Note: the warning appears because the number of training samples divided by batch_size is not an integer. Set samples_per_epoch = ceil(train_num / batch_size) * batch_size. With this setting the result is 88.72%.
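For the numbers above this works out as follows (a quick check in Python):

import math
train_num, batch_size = 25000, 32
samples_per_epoch = int(math.ceil(train_num / float(batch_size))) * batch_size
print(samples_per_epoch)  # ceil(781.25) * 32 = 782 * 32 = 25024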
The Keras demo's approach is to load all the data into memory and train on it:
The demo reaches 88.86%, so this way of reading data is basically sound. But the data must be shuffled first. And if everything fits in memory, load it all; training is considerably faster that way.
How can I use Keras with datasets that don't fit in memory?
You can do batch training using model.train_on_batch(X, y) and model.test_on_batch(X, y). See the models documentation.
Alternatively, you can write a generator that yields batches of training data and use the method model.fit_generator(data_generator, steps_per_epoch, epochs).
You can see batch training in action in our CIFAR10 example.
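A minimal sketch of the train_on_batch route, assuming a compiled model and the batch_generator sketched earlier (the epoch and step counts are illustrative):

gen = batch_generator('train.txt', batch_size=32)
for epoch in range(10):
    for step in range(782):  # ceil(25000 / 32) batches per epoch
        x_batch, y_batch = next(gen)
        loss = model.train_on_batch(x_batch, y_batch)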
Reference: http://blog.csdn.net/xinfeng2005/article/details/71600652
Code 1: image classification
import codecs
import cv2
import numpy as np
from keras.preprocessing.image import ImageDataGenerator, array_to_img, img_to_array, load_img
from keras.layers import *
from keras.models import *
from keras.callbacks import *
from visual_callbacks import AccLossPlotter

plotter = AccLossPlotter(graphs=['acc', 'loss'], save_graph=True)

class LossHistory(Callback):
    def on_train_begin(self, logs={}):
        self.losses = []
    def on_batch_end(self, batch, logs={}):
        self.losses.append(logs.get('loss'))

datagen = ImageDataGenerator(
    rotation_range=0,
    width_shift_range=0.1,
    height_shift_range=0.1,
    rescale=1./255,
    shear_range=0.1,
    zoom_range=0.1,
    horizontal_flip=False,
    fill_mode='nearest')

train_generator = datagen.flow_from_directory(
    r'chars_rec\train',       # this is the target directory
    target_size=(32, 32),     # all images will be resized to 32x32
    batch_size=32,
    shuffle=True,
    class_mode='categorical', color_mode='grayscale')  # categorical labels for categorical_crossentropy

print(train_generator.nb_class)
class_count = train_generator.nb_class
# print(train_generator.class_indices)
# print(type(train_generator.class_indices))
np.save('class_indices.txt', train_generator.class_indices)  # numpy appends .npy to the file name
import os
import cv2
import numpy as np
from keras.preprocessing.image import ImageDataGenerator, array_to_img, img_to_array, load_img
from keras.layers import *
from keras.models import *
from keras.callbacks import *
# from visual_callbacks import AccLossPlotter
# plotter = AccLossPlotter(graphs=['acc', 'loss'], save_graph=True)

class LossHistory(Callback):
    def on_train_begin(self, logs={}):
        self.losses = []
    def on_batch_end(self, batch, logs={}):
        self.losses.append(logs.get('loss'))

datagen = ImageDataGenerator(
    rotation_range=0,
    width_shift_range=0.1,
    height_shift_range=0.1,
    rescale=1./255,
    shear_range=0.1,
    zoom_range=0.1,
    horizontal_flip=False,
    fill_mode='nearest')

train_generator = datagen.flow_from_directory(
    r'./../mouse_data\data',  # this is the target directory
    target_size=(32, 32),     # all images will be resized to 32x32
    batch_size=32,
    shuffle=True,
    class_mode='categorical', color_mode='grayscale')  # categorical labels for categorical_crossentropy

print(train_generator.samples)
class_count = train_generator.num_class
print(class_count)

class_indices = np.load('class_indices.txt.npy')
print(class_indices)
# print(type(class_indices))
class_indices = class_indices.tolist()
# print(type(class_indices))
value_indices = {v: k for k, v in class_indices.items()}
# exit()

validation_generator = datagen.flow_from_directory(
    r'chars_rec\valication',  # this is the target directory
    target_size=(32, 32),     # all images will be resized to 32x32
    batch_size=32,
    class_mode='categorical', color_mode='grayscale')

######################################################
model = Sequential()
model.add(Conv2D(32, 3, 3, input_shape=(32, 32, 1), border_mode='same', activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Conv2D(32, 3, 3, border_mode='same', activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Conv2D(64, 3, 3, border_mode='same', activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())  # this converts our 3D feature maps to 1D feature vectors
model.add(Dense(128))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(class_count))
model.add(Activation('softmax'))
###################################################

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
# Save the parameters with the lowest validation loss: checkpoint as soon as validation loss improves
checkpointer = ModelCheckpoint(filepath="chars_rec.hdf5", verbose=1, save_best_only=True)
history = LossHistory()
if os.path.exists('chars_rec.hdf5'):
    model = load_model('chars_rec.hdf5')
model.fit_generator(
    train_generator,
    # steps_per_epoch=2000,  # // batch_size
    # epochs=50,
    samples_per_epoch=9150,  # // batch_size
    nb_epoch=500,
    validation_data=validation_generator,
    nb_val_samples=1062,
    callbacks=[checkpointer, history])  # add plotter here if visual_callbacks is available
# validation_steps=800
model.save('chars_rec_end.hdf5')
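The value_indices mapping built above is never used in the listing; presumably it serves to map a predicted class index back to its class name at inference time. A hedged sketch of that use ('sample.png' is a hypothetical image path; `model` and `value_indices` come from the listing above):

from keras.preprocessing.image import load_img, img_to_array
import numpy as np

img = load_img('sample.png', grayscale=True, target_size=(32, 32))  # hypothetical test image
arr = img_to_array(img) / 255.0                                     # match the generator's rescale
pred = model.predict(arr.reshape((1, 32, 32, 1)))
print(value_indices[int(np.argmax(pred))])  # class index -> original class name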
Code 2: text indexing (sequence labeling)
import numpy as np
# max_len and class_label_count are assumed to be defined earlier in the script

def getXY_gen(batch_size=32):
    # f = file(".\\SheKeYuan_YinWen0303_train.utf8")
    # data = f.read()[0:].decode('utf-8')
    # f.close()
    f = open(".\\train_470000_0427.utf8", 'r', encoding='utf-8')
    lines = f.readlines()
    f.close()
    # print(lines[0:10])
    # exit()
    X = []
    Y = []
    # note: a single pass over the file; wrap the loop in `while True:`
    # if more batches are needed than the file provides
    for i in range(len(lines)):  # tqdm(range(len(lines)))
        line = lines[i].strip()
        # print(i+1)
        # if i >= 100000:
        #     break
        x = []
        y = []
        y_temp = []
        for j, string in enumerate(line.split(' ')):
            if string.find('=') < 0:
                continue
            str, label = string.rsplit('=', 1)  # note: shadows the built-in str
            # print(str, label)
            label_num = int(label) + 1
            if len(str) > 1 and label_num < 100:
                for k, s in enumerate(str):
                    x.append(ord(s))
                    lab = [0] * class_label_count
                    if k == 0:
                        lab[label_num - 1] = 1            # begin-of-chunk tag
                        y_temp.append(label_num - 1)
                    elif k == len(str) - 1:
                        lab[label_num - 1 + 2] = 1        # end-of-chunk tag
                        y_temp.append(label_num - 1 + 2)
                    else:
                        lab[label_num - 1 + 1] = 1        # inside-chunk tag
                        y_temp.append(label_num - 1 + 1)
                    y.append(lab)
            elif len(str) == 1:
                x.append(ord(str[0]))
                lab = [0] * class_label_count
                lab[label_num - 1 + 3] = 1                # single-character chunk tag
                # if label_num - 1 + 3 == 99:
                #     print('xxx')
                y.append(lab)
                y_temp.append(label_num - 1 + 3)
        # print(x)
        # print(y_temp)
        if len(x) < max_len:                              # pad short sequences to max_len
            for a in range(0, max_len - len(x)):
                x.append(0)
                lab = [0] * class_label_count
                lab[87] = 1                               # padding label
                y.append(lab)
        X.append(x[0:max_len])
        Y.append(y[0:max_len])
        if len(X) == batch_size:
            # print(np.shape(X))
            # print(X)
            x1 = X[0:batch_size]
            y1 = Y[0:batch_size]
            X = []
            Y = []
            yield np.array(x1), np.array(y1)
import os
from keras.layers import Input, Embedding, LSTM, Bidirectional, TimeDistributed, Dense
from keras.models import Model, load_model
from keras.optimizers import Adam
from keras.callbacks import ModelCheckpoint

sequence = Input(shape=(max_len,), dtype='int32')
embedded = Embedding(65536, 128, input_length=max_len, mask_zero=True, trainable=False)(sequence)
blstm = Bidirectional(LSTM(64, return_sequences=True, dropout_U=0.5, dropout_W=0.5), merge_mode='sum')(embedded)
# blstm = (LSTM(128, return_sequences=True, dropout_U=0.5, dropout_W=0.5), merge_mode='sum')(embedded)
output = TimeDistributed(Dense(class_label_count, activation='softmax'))(blstm)
model = Model(input=sequence, output=output)
if os.path.exists('bilstm_0510.hdf5'):
    model = load_model('bilstm_0510.hdf5')
model.layers[1].trainable = False  # the Embedding layer is no longer trained
adam = Adam(lr=0.01, beta_1=0.9, beta_2=0.999, epsilon=1e-08, decay=0.0)
# sgd = SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True)
# other losses: mse mae mape msle binary_crossentropy categorical_crossentropy
# sparse_categorical_crossentropy kullback_leibler_divergence cosine_proximity
model.compile(loss='categorical_crossentropy',
              optimizer='adam',  # the string 'adam' uses a default Adam, not the `adam` object above
              metrics=['accuracy'])
# Save the parameters with the lowest validation loss: checkpoint as soon as validation loss improves
checkpointer = ModelCheckpoint(filepath="bilstm_0510.hdf5", verbose=0, save_best_only=True)
history = LossHistory()  # LossHistory as defined in Code 1
# history = model.fit(np.array(x_train), np.array(y_train).reshape((-1, max_len, class_label_count)),
#                     batch_size=32, nb_epoch=500, validation_data=(x_test, y_test),
#                     callbacks=[checkpointer, history, plotter],
#                     verbose=1)
model.fit_generator(getXY_gen(batch_size=32),
                    samples_per_epoch=32 * 100,
                    nb_epoch=10,
                    verbose=1,
                    callbacks=[checkpointer, history])