Keras Chinese Documentation Notes 10: Data Preprocessing


Sequence Preprocessing

Padding sequences: pad_sequences

keras.preprocessing.sequence.pad_sequences(sequences, maxlen=None, dtype='int32', padding='pre', truncating='pre', value=0.)

Transforms a list of nb_samples sequences (of scalars) into a 2D numpy array of shape (nb_samples, nb_timesteps). If maxlen is given, nb_timesteps = maxlen; otherwise it is the length of the longest sequence. Sequences shorter than that are padded with value (0 by default) until they reach the target length, and sequences longer than nb_timesteps are truncated to match it. Whether padding and truncation happen at the beginning or the end of a sequence is controlled by the padding and truncating arguments ('pre' by default for both).
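A minimal sketch of the padding behavior on toy index sequences, contrasting the default 'pre' mode with 'post':

from keras.preprocessing.sequence import pad_sequences

seqs = [[1, 2, 3], [4, 5], [6]]

# Default: pad (and truncate) at the beginning
pad_sequences(seqs, maxlen=4)
# -> array([[0, 1, 2, 3],
#           [0, 0, 4, 5],
#           [0, 0, 0, 6]])

# Pad at the end instead
pad_sequences(seqs, maxlen=4, padding='post')
# -> array([[1, 2, 3, 0],
#           [4, 5, 0, 0],
#           [6, 0, 0, 0]])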

Skip-grams: skipgrams

keras.preprocessing.sequence.skipgrams(sequence, vocabulary_size, window_size=4, negative_samples=1., shuffle=True, categorical=False, sampling_table=None)

skipgrams converts a sequence of word indices into (word, context) tuples:

  • for positive samples: (word, word in the same window)
  • for negative samples: (word, random word from the vocabulary)

[Tips] According to Wikipedia, an n-gram is a contiguous run of n items from a given sequence; when the sequence is a sentence, the items are words, and n-grams are then also called shingles. Skip-grams generalize this: in the n-item subsequences a skip-gram produces, the items need not be contiguous in the original sequence but may skip over up to k words. For example, for the sentence:

“the rain in Spain falls mainly on the plain”

its 2-grams form the following set of subsequences:

the rain, rain in, in Spain, Spain falls, falls mainly, mainly on, on the, the plain

and its 1-skip-2-grams form the set:

the in, rain Spain, in falls, Spain mainly, falls on, mainly the, on plain.
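A minimal sketch of calling skipgrams on a toy index sequence (the vocabulary size and indices below are illustrative, not from the original docs):

from keras.preprocessing.sequence import skipgrams

# Toy sequence of word indices (index 0 is reserved for non-words and is skipped)
sequence = [1, 2, 3, 4, 5]
couples, labels = skipgrams(sequence, vocabulary_size=6,
                            window_size=2, negative_samples=1.)
# couples is a list of [word, context] pairs; labels marks each pair as
# positive (1: context drawn from the real window) or negative (0: random word)
print(couples[:3], labels[:3])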
Sampling table: make_sampling_table

keras.preprocessing.sequence.make_sampling_table(size, sampling_factor=1e-5)

This function produces the sampling_table argument needed by skipgrams. It is a vector of length size in which sampling_table[i] is the probability of sampling the i-th most common word in the dataset (to keep the samples balanced, the more frequent a word is, the lower its sampling probability should be).
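A short sketch of wiring make_sampling_table into skipgrams so that very frequent words are subsampled (the vocabulary size and sequence are illustrative):

from keras.preprocessing.sequence import make_sampling_table, skipgrams

vocab_size = 10000
sampling_table = make_sampling_table(vocab_size)

sequence = [13, 7, 1542, 2, 891]  # toy word indices, ranked by frequency
couples, labels = skipgrams(sequence, vocab_size, window_size=4,
                            sampling_table=sampling_table)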

Text Preprocessing

Sentence splitting: text_to_word_sequence

keras.preprocessing.text.text_to_word_sequence(text, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', lower=True, split=" ")

This function splits a sentence into a list of words.
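A quick illustration (toy sentence); by default the filters argument strips punctuation and lower=True lowercases the text:

from keras.preprocessing.text import text_to_word_sequence

text_to_word_sequence('The rain in Spain falls mainly on the plain.')
# -> ['the', 'rain', 'in', 'spain', 'falls', 'mainly', 'on', 'the', 'plain']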

One-hot encoding: one_hot

keras.preprocessing.text.one_hot(text, n, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', lower=True, split=" ")

This function encodes a text in one-hot form, i.e. it records only each word's index in a vocabulary of size n.

[Tips] By definition, with a vocabulary of size n, each word should be encoded as a vector of length n that is 1 at the word's own index and 0 everywhere else; this is what "one-hot" means.

For convenience, this function records only the position of the 1, i.e. the word's index in the vocabulary.
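A minimal sketch; note that Keras's one_hot assigns indices by hashing rather than by building a vocabulary, so the exact values depend on the hash and distinct words may collide:

from keras.preprocessing.text import one_hot

one_hot('the rain in spain', n=100)
# -> e.g. [72, 5, 43, 51]: one integer index per word (0 is reserved)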

Feature hashing: hashing_trick

keras.preprocessing.text.hashing_trick(text, n, hash_function=None, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', lower=True, split=' ')

Converts a text to a sequence of indices in a fixed-size hashing space.
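A brief sketch; hash_function='md5' is one documented option that, unlike Python's built-in hash, gives indices that are stable across runs:

from keras.preprocessing.text import hashing_trick

hashing_trick('the rain in spain', n=100, hash_function='md5')
# -> one index per word, determined by hashing into n buckets (0 is reserved)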

Tokenizer

keras.preprocessing.text.Tokenizer(num_words=None, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', lower=True, split=" ", char_level=False)

Tokenizer is a class for vectorizing texts, or for turning texts into sequences (lists of word indices in a dictionary, counted from 1).
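A minimal sketch of the typical fit-then-transform workflow (the toy corpus is illustrative):

from keras.preprocessing.text import Tokenizer

texts = ['the rain in spain', 'the plain']
tokenizer = Tokenizer(num_words=10)
tokenizer.fit_on_texts(texts)                    # build the word index from the corpus
tokenizer.texts_to_sequences(texts)              # -> [[1, 2, 3, 4], [1, 5]]; indices start at 1
tokenizer.texts_to_matrix(texts, mode='binary')  # one fixed-size 0/1 row per text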

Image Preprocessing

Image generator: ImageDataGenerator

keras.preprocessing.image.ImageDataGenerator(featurewise_center=False,
    samplewise_center=False,
    featurewise_std_normalization=False,
    samplewise_std_normalization=False,
    zca_whitening=False,
    rotation_range=0.,
    width_shift_range=0.,
    height_shift_range=0.,
    shear_range=0.,
    zoom_range=0.,
    channel_shift_range=0.,
    fill_mode='nearest',
    cval=0.,
    horizontal_flip=False,
    vertical_flip=False,
    rescale=None,
    preprocessing_function=None,
    data_format=K.image_data_format())

Generates batches of image data, with support for real-time data augmentation. During training the generator yields data indefinitely, looping over the dataset until the specified number of epochs has been reached.

Examples

Example using .flow()

from keras.datasets import cifar10
from keras.preprocessing.image import ImageDataGenerator
from keras.utils import np_utils

# model, num_classes and epochs are assumed to be defined elsewhere
(x_train, y_train), (x_test, y_test) = cifar10.load_data()
y_train = np_utils.to_categorical(y_train, num_classes)
y_test = np_utils.to_categorical(y_test, num_classes)

datagen = ImageDataGenerator(
    featurewise_center=True,
    featurewise_std_normalization=True,
    rotation_range=20,
    width_shift_range=0.2,
    height_shift_range=0.2,
    horizontal_flip=True)

# compute quantities required for featurewise normalization
# (std, mean, and principal components if ZCA whitening is applied)
datagen.fit(x_train)

# fits the model on batches with real-time data augmentation:
model.fit_generator(datagen.flow(x_train, y_train, batch_size=32),
                    steps_per_epoch=len(x_train) // 32, epochs=epochs)

# here's a more "manual" example
for e in range(epochs):
    print('Epoch', e)
    batches = 0
    for x_batch, y_batch in datagen.flow(x_train, y_train, batch_size=32):
        loss = model.train_on_batch(x_batch, y_batch)
        batches += 1
        if batches >= len(x_train) / 32:
            # we need to break the loop by hand because
            # the generator loops indefinitely
            break

Example using .flow_from_directory(directory)

train_datagen = ImageDataGenerator(
        rescale=1./255,
        shear_range=0.2,
        zoom_range=0.2,
        horizontal_flip=True)

test_datagen = ImageDataGenerator(rescale=1./255)

train_generator = train_datagen.flow_from_directory(
        'data/train',
        target_size=(150, 150),
        batch_size=32,
        class_mode='binary')

validation_generator = test_datagen.flow_from_directory(
        'data/validation',
        target_size=(150, 150),
        batch_size=32,
        class_mode='binary')

model.fit_generator(
        train_generator,
        steps_per_epoch=2000,
        epochs=50,
        validation_data=validation_generator,
        validation_steps=800)

Transforming images and masks together

# we create two instances with the same arguments
data_gen_args = dict(featurewise_center=True,
                     featurewise_std_normalization=True,
                     rotation_range=90.,
                     width_shift_range=0.1,
                     height_shift_range=0.1,
                     zoom_range=0.2)
image_datagen = ImageDataGenerator(**data_gen_args)
mask_datagen = ImageDataGenerator(**data_gen_args)

# Provide the same seed and keyword arguments to the fit and flow methods
seed = 1
image_datagen.fit(images, augment=True, seed=seed)
mask_datagen.fit(masks, augment=True, seed=seed)

image_generator = image_datagen.flow_from_directory(
    'data/images',
    class_mode=None,
    seed=seed)

mask_generator = mask_datagen.flow_from_directory(
    'data/masks',
    class_mode=None,
    seed=seed)

# combine generators into one which yields image and masks
train_generator = zip(image_generator, mask_generator)

model.fit_generator(
    train_generator,
    steps_per_epoch=2000,
    epochs=50)