How to Grid Search Hyperparameters for Deep Learning Models in Python With Keras


Hyperparameter optimization is a big part of deep learning. The reason is that neural networks are notoriously difficult to configure and there are a lot of parameters that need to be set. On top of that, individual models can be very slow to train.

In this post you will discover how you can use the grid search capability from the scikit-learn Python machine learning library to tune the hyperparameters of Keras deep learning models.

After reading this post you will know:

  • How to wrap Keras models for use in scikit-learn and how to use grid search.
  • How to grid search common neural network parameters such as learning rate, dropout rate, epochs and number of neurons.
  • How to design your own hyperparameter optimization experiments.

Overview

In this post I want to show you how you can use the scikit-learn grid search capability and give you a suite of examples that you can copy and paste into your own project as a starting point.

Below is a list of the topics we are going to cover:

  1. How to use Keras models in scikit-learn.
  2. How to use grid search in scikit-learn.
  3. How to tune batch size and training epochs.
  4. How to tune optimization algorithms.
  5. How to tune learning rate and momentum.
  6. How to tune network weight initialization.
  7. How to tune activation functions.
  8. How to tune dropout regularization.
  9. How to tune the number of neurons in the hidden layer.

How to Use Keras Models in scikit-learn

Keras models can be used in scikit-learn by wrapping them with the KerasClassifier or KerasRegressor class.

To use these wrappers you must define a function that creates and returns your Keras sequential model, then pass this function to the build_fn argument when constructing the KerasClassifier class.

For example:

def create_model():
    ...
    return model

model = KerasClassifier(build_fn=create_model)

The constructor for the KerasClassifier class can take default arguments that are passed on to the calls to model.fit(), such as the number of epochs and the batch size.

For example:

def create_model():
    ...
    return model

model = KerasClassifier(build_fn=create_model, nb_epoch=10)

The constructor for the KerasClassifier class can also take new arguments that are passed on to your custom create_model() function. These new arguments must also be defined in the signature of your create_model() function with default values.

For example:

def create_model(dropout_rate=0.0):
    ...
    return model

model = KerasClassifier(build_fn=create_model, dropout_rate=0.2)

You can learn more about the scikit-learn wrapper in the Keras API documentation.

How to Use Grid Search in scikit-learn

Grid search is a model hyperparameter optimization technique.

In scikit-learn this technique is provided in the GridSearchCV class.

When constructing this class you must provide a dictionary of the hyperparameters to evaluate in the param_grid argument. This is a map of model parameter names to arrays of values to try.

By default, accuracy is the score that is optimized, but other scores can be specified in the scoring argument of the GridSearchCV constructor.

By default, the grid search only uses one thread. By setting the n_jobs argument of the GridSearchCV constructor to -1, the process will use all cores on your machine. Depending on your Keras backend, this may interfere with the main neural network training process.

GridSearchCV will then construct and evaluate one model for each combination of parameters. Cross-validation is used to evaluate each individual model; 3-fold cross-validation is used by default, although this can be overridden by specifying the cv argument to the GridSearchCV constructor.

Below is an example of defining a simple grid search:

param_grid = dict(nb_epoch=[10, 20, 30])
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1)
grid_result = grid.fit(X, Y)

Once completed, you can access the outcome of the grid search in the result object returned from grid.fit(). The best_score_ member provides access to the best score observed during the optimization procedure, and best_params_ describes the combination of parameters that achieved the best results.
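
For instance, a minimal sketch of summarizing the results (assuming the grid_result object from the call above and the scikit-learn 0.18 / Keras 1.1 API used throughout this post) looks like this:

# best cross-validation score and the parameter combination that produced it
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
# mean and standard deviation of the cross-validation scores for every combination tried
for params, mean_score, scores in grid_result.grid_scores_:
    print("%f (%f) with: %r" % (scores.mean(), scores.std(), params))

The same summary loop appears at the end of every full code listing below.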

You can learn more about the GridSearchCV class in the scikit-learn API documentation.

Problem Description

Now that we know how to use Keras models with scikit-learn and how to use grid search in scikit-learn, let's look at a bunch of examples.

All examples will be demonstrated on a small standard machine learning dataset, the Pima Indians onset of diabetes classification dataset. This is a small dataset with all numerical attributes that is easy to work with.

Download the dataset and place it in your current working directory with the name pima-indians-diabetes.csv.

As we proceed through the examples in this post, we will aggregate the best parameters. This is not the best way to grid search because parameters can interact, but it is good for demonstration purposes.

Note on Parallelizing Grid Search

All examples are configured to use parallelism (n_jobs=-1).

If you get an error like the one below:

INFO (theano.gof.compilelock): Waiting for existing lock by process '55614' (I am process '55613')
INFO (theano.gof.compilelock): To manually release the lock, delete ...

Kill the process and change the code so that the grid search is not performed in parallel, by setting n_jobs=1.

How to Tune Batch Size and Number of Epochs

In this first simple example, we look at tuning the batch size and the number of epochs used when fitting the network.

The batch size in iterative gradient descent is the number of patterns shown to the network before the weights are updated. It is also an optimization in the training of the network, defining how many patterns to read at a time and keep in memory.

The number of epochs is the number of times the entire training dataset is shown to the network during training. Some networks are sensitive to the batch size, such as LSTM recurrent neural networks and convolutional neural networks.

Here we will evaluate a suite of different mini-batch sizes from 10 to 100 in steps of 20.

The full code listing is provided below.

# Use scikit-learn to grid search the batch size and epochs
import numpy
from sklearn.grid_search import GridSearchCV
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier

# Function to create model, required for KerasClassifier
def create_model():
    # create model
    model = Sequential()
    model.add(Dense(12, input_dim=8, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    # Compile model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

# fix random seed for reproducibility
seed = 7
numpy.random.seed(seed)
# load dataset
dataset = numpy.loadtxt("pima-indians-diabetes.csv", delimiter=",")
# split into input (X) and output (Y) variables
X = dataset[:,0:8]
Y = dataset[:,8]
# create model
model = KerasClassifier(build_fn=create_model, verbose=0)
# define the grid search parameters
batch_size = [10, 20, 40, 60, 80, 100]
epochs = [10, 50, 100]
param_grid = dict(batch_size=batch_size, nb_epoch=epochs)
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1)
grid_result = grid.fit(X, Y)
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
for params, mean_score, scores in grid_result.grid_scores_:
    print("%f (%f) with: %r" % (scores.mean(), scores.std(), params))

Running this example produces the following output.

Best: 0.686198 using {'nb_epoch': 100, 'batch_size': 20}
0.348958 (0.024774) with: {'nb_epoch': 10, 'batch_size': 10}
0.348958 (0.024774) with: {'nb_epoch': 50, 'batch_size': 10}
0.466146 (0.149269) with: {'nb_epoch': 100, 'batch_size': 10}
0.647135 (0.021236) with: {'nb_epoch': 10, 'batch_size': 20}
0.660156 (0.014616) with: {'nb_epoch': 50, 'batch_size': 20}
0.686198 (0.024774) with: {'nb_epoch': 100, 'batch_size': 20}
0.489583 (0.075566) with: {'nb_epoch': 10, 'batch_size': 40}
0.652344 (0.019918) with: {'nb_epoch': 50, 'batch_size': 40}
0.654948 (0.027866) with: {'nb_epoch': 100, 'batch_size': 40}
0.518229 (0.032264) with: {'nb_epoch': 10, 'batch_size': 60}
0.605469 (0.052213) with: {'nb_epoch': 50, 'batch_size': 60}
0.665365 (0.004872) with: {'nb_epoch': 100, 'batch_size': 60}
0.537760 (0.143537) with: {'nb_epoch': 10, 'batch_size': 80}
0.591146 (0.094954) with: {'nb_epoch': 50, 'batch_size': 80}
0.658854 (0.054904) with: {'nb_epoch': 100, 'batch_size': 80}
0.402344 (0.107735) with: {'nb_epoch': 10, 'batch_size': 100}
0.652344 (0.033299) with: {'nb_epoch': 50, 'batch_size': 100}
0.542969 (0.157934) with: {'nb_epoch': 100, 'batch_size': 100}

We can see that a batch size of 20 and 100 epochs achieved the best result of about 68% accuracy.

How to Tune the Training Optimization Algorithm

Keras offers a suite of different state-of-the-art optimization algorithms.

In this example, we tune the optimization algorithm used to train the network, each with default parameters.

This is an odd example, because often you will choose one approach a priori and instead focus on tuning its parameters on your problem (see the next example).

Here we will evaluate the suite of optimization algorithms supported by the Keras API.

The full code listing is provided below.

# Use scikit-learn to grid search the training optimization algorithm
import numpy
from sklearn.grid_search import GridSearchCV
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier

# Function to create model, required for KerasClassifier
def create_model(optimizer='adam'):
    # create model
    model = Sequential()
    model.add(Dense(12, input_dim=8, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    # Compile model
    model.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=['accuracy'])
    return model

# fix random seed for reproducibility
seed = 7
numpy.random.seed(seed)
# load dataset
dataset = numpy.loadtxt("pima-indians-diabetes.csv", delimiter=",")
# split into input (X) and output (Y) variables
X = dataset[:,0:8]
Y = dataset[:,8]
# create model
model = KerasClassifier(build_fn=create_model, nb_epoch=100, batch_size=10, verbose=0)
# define the grid search parameters
optimizer = ['SGD', 'RMSprop', 'Adagrad', 'Adadelta', 'Adam', 'Adamax', 'Nadam']
param_grid = dict(optimizer=optimizer)
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1)
grid_result = grid.fit(X, Y)
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
for params, mean_score, scores in grid_result.grid_scores_:
    print("%f (%f) with: %r" % (scores.mean(), scores.std(), params))

Running this example produces the following output.

Best: 0.704427 using {'optimizer': 'Adam'}
0.348958 (0.024774) with: {'optimizer': 'SGD'}
0.348958 (0.024774) with: {'optimizer': 'RMSprop'}
0.471354 (0.156586) with: {'optimizer': 'Adagrad'}
0.669271 (0.029635) with: {'optimizer': 'Adadelta'}
0.704427 (0.031466) with: {'optimizer': 'Adam'}
0.682292 (0.016367) with: {'optimizer': 'Adamax'}
0.703125 (0.003189) with: {'optimizer': 'Nadam'}

The results suggest that the Adam optimization algorithm is the best, with a score of about 70% accuracy.

How to Tune Learning Rate and Momentum

It is common to pre-select an optimization algorithm to train your network and then tune its parameters. By far the most common optimization algorithm is plain old Stochastic Gradient Descent (SGD) because it is so well understood. In this example, we will look at optimizing the SGD learning rate and momentum.

The learning rate controls how much the weights are updated at the end of each batch, and the momentum controls how much the previous update influences the current weight update.

We will try a suite of small standard learning rates and momentum values from 0.2 to 0.8 in steps of 0.2, as well as 0.9 (a popular value in practice).

Generally, it is a good idea to also include the number of epochs in an optimization like this, as there is a dependency between the amount of learning per batch (learning rate), the number of updates per epoch (batch size) and the number of epochs.
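
As a rough sketch of that idea (not run in this post, and with hypothetical value ranges), the epochs and batch size can simply be added to the same param_grid, since the KerasClassifier accepts nb_epoch and batch_size alongside the create_model() arguments. This assumes the model object defined in the full listing below:

# hypothetical combined grid: learning rate, momentum, epochs and batch size together
# (this multiplies the number of models to fit, so it can be very slow)
param_grid = dict(learn_rate=[0.001, 0.01, 0.1],
                  momentum=[0.0, 0.5, 0.9],
                  nb_epoch=[50, 100],
                  batch_size=[10, 20])
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1)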

The full code listing is provided below.

# Use scikit-learn to grid search the learning rate and momentum
import numpy
from sklearn.grid_search import GridSearchCV
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier
from keras.optimizers import SGD

# Function to create model, required for KerasClassifier
def create_model(learn_rate=0.01, momentum=0):
    # create model
    model = Sequential()
    model.add(Dense(12, input_dim=8, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    # Compile model
    optimizer = SGD(lr=learn_rate, momentum=momentum)
    model.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=['accuracy'])
    return model

# fix random seed for reproducibility
seed = 7
numpy.random.seed(seed)
# load dataset
dataset = numpy.loadtxt("pima-indians-diabetes.csv", delimiter=",")
# split into input (X) and output (Y) variables
X = dataset[:,0:8]
Y = dataset[:,8]
# create model
model = KerasClassifier(build_fn=create_model, nb_epoch=100, batch_size=10, verbose=0)
# define the grid search parameters
learn_rate = [0.001, 0.01, 0.1, 0.2, 0.3]
momentum = [0.0, 0.2, 0.4, 0.6, 0.8, 0.9]
param_grid = dict(learn_rate=learn_rate, momentum=momentum)
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1)
grid_result = grid.fit(X, Y)
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
for params, mean_score, scores in grid_result.grid_scores_:
    print("%f (%f) with: %r" % (scores.mean(), scores.std(), params))

Running this example produces the following output.

Best: 0.680990 using {'learn_rate': 0.01, 'momentum': 0.0}
0.348958 (0.024774) with: {'learn_rate': 0.001, 'momentum': 0.0}
0.348958 (0.024774) with: {'learn_rate': 0.001, 'momentum': 0.2}
0.467448 (0.151098) with: {'learn_rate': 0.001, 'momentum': 0.4}
0.662760 (0.012075) with: {'learn_rate': 0.001, 'momentum': 0.6}
0.669271 (0.030647) with: {'learn_rate': 0.001, 'momentum': 0.8}
0.666667 (0.035564) with: {'learn_rate': 0.001, 'momentum': 0.9}
0.680990 (0.024360) with: {'learn_rate': 0.01, 'momentum': 0.0}
0.677083 (0.026557) with: {'learn_rate': 0.01, 'momentum': 0.2}
0.427083 (0.134575) with: {'learn_rate': 0.01, 'momentum': 0.4}
0.427083 (0.134575) with: {'learn_rate': 0.01, 'momentum': 0.6}
0.544271 (0.146518) with: {'learn_rate': 0.01, 'momentum': 0.8}
0.651042 (0.024774) with: {'learn_rate': 0.01, 'momentum': 0.9}
0.651042 (0.024774) with: {'learn_rate': 0.1, 'momentum': 0.0}
0.651042 (0.024774) with: {'learn_rate': 0.1, 'momentum': 0.2}
0.572917 (0.134575) with: {'learn_rate': 0.1, 'momentum': 0.4}
0.572917 (0.134575) with: {'learn_rate': 0.1, 'momentum': 0.6}
0.651042 (0.024774) with: {'learn_rate': 0.1, 'momentum': 0.8}
0.651042 (0.024774) with: {'learn_rate': 0.1, 'momentum': 0.9}
0.533854 (0.149269) with: {'learn_rate': 0.2, 'momentum': 0.0}
0.427083 (0.134575) with: {'learn_rate': 0.2, 'momentum': 0.2}
0.427083 (0.134575) with: {'learn_rate': 0.2, 'momentum': 0.4}
0.651042 (0.024774) with: {'learn_rate': 0.2, 'momentum': 0.6}
0.651042 (0.024774) with: {'learn_rate': 0.2, 'momentum': 0.8}
0.651042 (0.024774) with: {'learn_rate': 0.2, 'momentum': 0.9}
0.455729 (0.146518) with: {'learn_rate': 0.3, 'momentum': 0.0}
0.455729 (0.146518) with: {'learn_rate': 0.3, 'momentum': 0.2}
0.455729 (0.146518) with: {'learn_rate': 0.3, 'momentum': 0.4}
0.348958 (0.024774) with: {'learn_rate': 0.3, 'momentum': 0.6}
0.348958 (0.024774) with: {'learn_rate': 0.3, 'momentum': 0.8}
0.348958 (0.024774) with: {'learn_rate': 0.3, 'momentum': 0.9}

We can see that SGD is not particularly good on this problem; nevertheless, the best results were achieved using a learning rate of 0.01 and a momentum of 0.0, with an accuracy of about 68%.

How to Tune Network Weight Initialization

Neural network weight initialization used to be simple: use small random values.

Now there is a suite of different techniques to choose from; Keras provides a laundry list of them.

In this example, we will look at tuning the selection of network weight initialization by evaluating all of the available techniques.

We will use the same weight initialization method on each layer. Ideally, it may be better to use different weight initialization schemes according to the activation function used on each layer. In the example below we use a rectifier for the hidden layer, and sigmoid for the output layer because the predictions are binary.
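
As a sketch of that per-layer idea (my own variant, not the function used in the listing below), create_model() could take one init mode per layer; a common heuristic is He initialization for rectifier layers and Glorot initialization for sigmoid layers:

# hypothetical variant: a separate init mode per layer, matched to each layer's activation
def create_model(hidden_init='he_uniform', output_init='glorot_uniform'):
    model = Sequential()
    model.add(Dense(12, input_dim=8, init=hidden_init, activation='relu'))
    model.add(Dense(1, init=output_init, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model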

The full code listing is provided below.

# Use scikit-learn to grid search the weight initialization
import numpy
from sklearn.grid_search import GridSearchCV
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier

# Function to create model, required for KerasClassifier
def create_model(init_mode='uniform'):
    # create model
    model = Sequential()
    model.add(Dense(12, input_dim=8, init=init_mode, activation='relu'))
    model.add(Dense(1, init=init_mode, activation='sigmoid'))
    # Compile model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

# fix random seed for reproducibility
seed = 7
numpy.random.seed(seed)
# load dataset
dataset = numpy.loadtxt("pima-indians-diabetes.csv", delimiter=",")
# split into input (X) and output (Y) variables
X = dataset[:,0:8]
Y = dataset[:,8]
# create model
model = KerasClassifier(build_fn=create_model, nb_epoch=100, batch_size=10, verbose=0)
# define the grid search parameters
init_mode = ['uniform', 'lecun_uniform', 'normal', 'zero', 'glorot_normal', 'glorot_uniform', 'he_normal', 'he_uniform']
param_grid = dict(init_mode=init_mode)
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1)
grid_result = grid.fit(X, Y)
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
for params, mean_score, scores in grid_result.grid_scores_:
    print("%f (%f) with: %r" % (scores.mean(), scores.std(), params))

Running this example produces the following output.

Best: 0.720052 using {'init_mode': 'uniform'}
0.720052 (0.024360) with: {'init_mode': 'uniform'}
0.348958 (0.024774) with: {'init_mode': 'lecun_uniform'}
0.712240 (0.012075) with: {'init_mode': 'normal'}
0.651042 (0.024774) with: {'init_mode': 'zero'}
0.700521 (0.010253) with: {'init_mode': 'glorot_normal'}
0.674479 (0.011201) with: {'init_mode': 'glorot_uniform'}
0.661458 (0.028940) with: {'init_mode': 'he_normal'}
0.678385 (0.004872) with: {'init_mode': 'he_uniform'}

We can see that the best results were achieved with a uniform weight initialization scheme, achieving a performance of about 72%.

How to Tune the Neuron Activation Function

The activation function controls the non-linearity of individual neurons and when they fire.

Generally, the rectifier activation function is the most popular, but it used to be sigmoid and tanh, and these functions may still be more suitable for different problems.

In this example, we will evaluate the suite of different activation functions available in Keras. We will only use these functions in the hidden layer, as a sigmoid activation function is required in the output layer for the binary classification problem.

Generally, it is a good idea to prepare the data to suit the ranges of the different transfer functions, which we will not do in this case.
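
If you did want to do that, a minimal sketch (my addition, not part of the listing below) would be to standardize the inputs before running the grid search, for example with scikit-learn's StandardScaler:

from sklearn.preprocessing import StandardScaler

# rescale each input column to zero mean and unit variance so it sits in a range
# that suits most transfer functions (tanh, sigmoid, relu, ...)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# then pass X_scaled instead of X to grid.fit(...)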

The full code listing is provided below.

# Use scikit-learn to grid search the activation function
import numpy
from sklearn.grid_search import GridSearchCV
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier

# Function to create model, required for KerasClassifier
def create_model(activation='relu'):
    # create model
    model = Sequential()
    model.add(Dense(12, input_dim=8, init='uniform', activation=activation))
    model.add(Dense(1, init='uniform', activation='sigmoid'))
    # Compile model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

# fix random seed for reproducibility
seed = 7
numpy.random.seed(seed)
# load dataset
dataset = numpy.loadtxt("pima-indians-diabetes.csv", delimiter=",")
# split into input (X) and output (Y) variables
X = dataset[:,0:8]
Y = dataset[:,8]
# create model
model = KerasClassifier(build_fn=create_model, nb_epoch=100, batch_size=10, verbose=0)
# define the grid search parameters
activation = ['softmax', 'softplus', 'softsign', 'relu', 'tanh', 'sigmoid', 'hard_sigmoid', 'linear']
param_grid = dict(activation=activation)
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1)
grid_result = grid.fit(X, Y)
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
for params, mean_score, scores in grid_result.grid_scores_:
    print("%f (%f) with: %r" % (scores.mean(), scores.std(), params))

Running this example produces the following output.

Best: 0.722656 using {'activation': 'linear'}
0.649740 (0.009744) with: {'activation': 'softmax'}
0.720052 (0.032106) with: {'activation': 'softplus'}
0.688802 (0.019225) with: {'activation': 'softsign'}
0.720052 (0.018136) with: {'activation': 'relu'}
0.691406 (0.019401) with: {'activation': 'tanh'}
0.680990 (0.009207) with: {'activation': 'sigmoid'}
0.691406 (0.014616) with: {'activation': 'hard_sigmoid'}
0.722656 (0.003189) with: {'activation': 'linear'}

Surprisingly (to me at least), the 'linear' activation function achieved the best results with an accuracy of about 72%.

How to Tune Dropout Regularization

In this example, we will look at tuning the dropout rate for regularization in an effort to limit overfitting and improve the model's ability to generalize. To get good results, dropout is best combined with a weight constraint such as the max norm constraint.

For more on using dropout in deep learning models with Keras, see the post:

  • Dropout Regularization in Deep Learning Models With Keras

This involves fitting both the dropout percentage and the weight constraint. We will try dropout percentages between 0.0 and 0.9 (1.0 does not make sense) and maxnorm weight constraint values between 0 and 5.

The full code listing is provided below.

# Use scikit-learn to grid search the dropout rate
import numpy
from sklearn.grid_search import GridSearchCV
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.wrappers.scikit_learn import KerasClassifier
from keras.constraints import maxnorm

# Function to create model, required for KerasClassifier
def create_model(dropout_rate=0.0, weight_constraint=0):
    # create model
    model = Sequential()
    model.add(Dense(12, input_dim=8, init='uniform', activation='linear', W_constraint=maxnorm(weight_constraint)))
    model.add(Dropout(dropout_rate))
    model.add(Dense(1, init='uniform', activation='sigmoid'))
    # Compile model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

# fix random seed for reproducibility
seed = 7
numpy.random.seed(seed)
# load dataset
dataset = numpy.loadtxt("pima-indians-diabetes.csv", delimiter=",")
# split into input (X) and output (Y) variables
X = dataset[:,0:8]
Y = dataset[:,8]
# create model
model = KerasClassifier(build_fn=create_model, nb_epoch=100, batch_size=10, verbose=0)
# define the grid search parameters
weight_constraint = [1, 2, 3, 4, 5]
dropout_rate = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
param_grid = dict(dropout_rate=dropout_rate, weight_constraint=weight_constraint)
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1)
grid_result = grid.fit(X, Y)
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
for params, mean_score, scores in grid_result.grid_scores_:
    print("%f (%f) with: %r" % (scores.mean(), scores.std(), params))

Running this example produces the following output.

Best: 0.723958 using {'dropout_rate': 0.2, 'weight_constraint': 4}
0.696615 (0.031948) with: {'dropout_rate': 0.0, 'weight_constraint': 1}
0.696615 (0.031948) with: {'dropout_rate': 0.0, 'weight_constraint': 2}
0.691406 (0.026107) with: {'dropout_rate': 0.0, 'weight_constraint': 3}
0.708333 (0.009744) with: {'dropout_rate': 0.0, 'weight_constraint': 4}
0.708333 (0.009744) with: {'dropout_rate': 0.0, 'weight_constraint': 5}
0.710937 (0.008438) with: {'dropout_rate': 0.1, 'weight_constraint': 1}
0.709635 (0.007366) with: {'dropout_rate': 0.1, 'weight_constraint': 2}
0.709635 (0.007366) with: {'dropout_rate': 0.1, 'weight_constraint': 3}
0.695312 (0.012758) with: {'dropout_rate': 0.1, 'weight_constraint': 4}
0.695312 (0.012758) with: {'dropout_rate': 0.1, 'weight_constraint': 5}
0.701823 (0.017566) with: {'dropout_rate': 0.2, 'weight_constraint': 1}
0.710938 (0.009568) with: {'dropout_rate': 0.2, 'weight_constraint': 2}
0.710938 (0.009568) with: {'dropout_rate': 0.2, 'weight_constraint': 3}
0.723958 (0.027126) with: {'dropout_rate': 0.2, 'weight_constraint': 4}
0.718750 (0.030425) with: {'dropout_rate': 0.2, 'weight_constraint': 5}
0.721354 (0.032734) with: {'dropout_rate': 0.3, 'weight_constraint': 1}
0.707031 (0.036782) with: {'dropout_rate': 0.3, 'weight_constraint': 2}
0.707031 (0.036782) with: {'dropout_rate': 0.3, 'weight_constraint': 3}
0.694010 (0.019225) with: {'dropout_rate': 0.3, 'weight_constraint': 4}
0.709635 (0.006639) with: {'dropout_rate': 0.3, 'weight_constraint': 5}
0.704427 (0.008027) with: {'dropout_rate': 0.4, 'weight_constraint': 1}
0.717448 (0.031304) with: {'dropout_rate': 0.4, 'weight_constraint': 2}
0.718750 (0.030425) with: {'dropout_rate': 0.4, 'weight_constraint': 3}
0.718750 (0.030425) with: {'dropout_rate': 0.4, 'weight_constraint': 4}
0.722656 (0.029232) with: {'dropout_rate': 0.4, 'weight_constraint': 5}
0.720052 (0.028940) with: {'dropout_rate': 0.5, 'weight_constraint': 1}
0.703125 (0.009568) with: {'dropout_rate': 0.5, 'weight_constraint': 2}
0.716146 (0.029635) with: {'dropout_rate': 0.5, 'weight_constraint': 3}
0.709635 (0.008027) with: {'dropout_rate': 0.5, 'weight_constraint': 4}
0.703125 (0.011500) with: {'dropout_rate': 0.5, 'weight_constraint': 5}
0.707031 (0.017758) with: {'dropout_rate': 0.6, 'weight_constraint': 1}
0.701823 (0.018688) with: {'dropout_rate': 0.6, 'weight_constraint': 2}
0.701823 (0.018688) with: {'dropout_rate': 0.6, 'weight_constraint': 3}
0.690104 (0.027498) with: {'dropout_rate': 0.6, 'weight_constraint': 4}
0.695313 (0.022326) with: {'dropout_rate': 0.6, 'weight_constraint': 5}
0.697917 (0.014382) with: {'dropout_rate': 0.7, 'weight_constraint': 1}
0.697917 (0.014382) with: {'dropout_rate': 0.7, 'weight_constraint': 2}
0.687500 (0.008438) with: {'dropout_rate': 0.7, 'weight_constraint': 3}
0.704427 (0.011201) with: {'dropout_rate': 0.7, 'weight_constraint': 4}
0.696615 (0.016367) with: {'dropout_rate': 0.7, 'weight_constraint': 5}
0.680990 (0.025780) with: {'dropout_rate': 0.8, 'weight_constraint': 1}
0.699219 (0.019401) with: {'dropout_rate': 0.8, 'weight_constraint': 2}
0.701823 (0.015733) with: {'dropout_rate': 0.8, 'weight_constraint': 3}
0.684896 (0.023510) with: {'dropout_rate': 0.8, 'weight_constraint': 4}
0.696615 (0.017566) with: {'dropout_rate': 0.8, 'weight_constraint': 5}
0.653646 (0.034104) with: {'dropout_rate': 0.9, 'weight_constraint': 1}
0.677083 (0.012075) with: {'dropout_rate': 0.9, 'weight_constraint': 2}
0.679688 (0.013902) with: {'dropout_rate': 0.9, 'weight_constraint': 3}
0.669271 (0.017566) with: {'dropout_rate': 0.9, 'weight_constraint': 4}
0.669271 (0.012075) with: {'dropout_rate': 0.9, 'weight_constraint': 5}

We can see that a dropout rate of 0.2 and a maxnorm weight constraint of 4 resulted in the best accuracy of about 72%.

How to Tune the Number of Neurons in the Hidden Layer

The number of neurons in a layer is an important parameter to tune. Generally, the number of neurons in a layer controls the representational capacity of the network, at least at that point in the topology.

Also, generally, a large enough single-layer network can approximate any other neural network, at least in theory.

In this example, we will look at tuning the number of neurons in a single hidden layer. We will try values from 1 to 30 in steps of 5.

A larger network requires more training, and at least the batch size and number of epochs should ideally be optimized together with the number of neurons.
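
To see why this quickly becomes expensive, here is a small back-of-the-envelope sketch (hypothetical ranges, not the grid used below): with 3-fold cross-validation, every extra dimension multiplies the number of models that have to be fit.

# hypothetical combined grid, just to count the cost
neurons = [1, 5, 10, 15, 20, 25, 30]   # 7 values
batch_size = [10, 20, 40]               # 3 values
nb_epoch = [50, 100]                    # 2 values
# 7 * 3 * 2 = 42 parameter combinations, each fit 3 times by 3-fold CV
n_models = len(neurons) * len(batch_size) * len(nb_epoch) * 3
print(n_models)  # 126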

The full code listing is provided below.

# Use scikit-learn to grid search the number of neurons
import numpy
from sklearn.grid_search import GridSearchCV
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.wrappers.scikit_learn import KerasClassifier
from keras.constraints import maxnorm

# Function to create model, required for KerasClassifier
def create_model(neurons=1):
    # create model
    model = Sequential()
    model.add(Dense(neurons, input_dim=8, init='uniform', activation='linear', W_constraint=maxnorm(4)))
    model.add(Dropout(0.2))
    model.add(Dense(1, init='uniform', activation='sigmoid'))
    # Compile model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

# fix random seed for reproducibility
seed = 7
numpy.random.seed(seed)
# load dataset
dataset = numpy.loadtxt("pima-indians-diabetes.csv", delimiter=",")
# split into input (X) and output (Y) variables
X = dataset[:,0:8]
Y = dataset[:,8]
# create model
model = KerasClassifier(build_fn=create_model, nb_epoch=100, batch_size=10, verbose=0)
# define the grid search parameters
neurons = [1, 5, 10, 15, 20, 25, 30]
param_grid = dict(neurons=neurons)
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1)
grid_result = grid.fit(X, Y)
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
for params, mean_score, scores in grid_result.grid_scores_:
    print("%f (%f) with: %r" % (scores.mean(), scores.std(), params))

Running this example produces the following output.

Best: 0.714844 using {'neurons': 5}
0.700521 (0.011201) with: {'neurons': 1}
0.714844 (0.011049) with: {'neurons': 5}
0.712240 (0.017566) with: {'neurons': 10}
0.705729 (0.003683) with: {'neurons': 15}
0.696615 (0.020752) with: {'neurons': 20}
0.713542 (0.025976) with: {'neurons': 25}
0.705729 (0.008027) with: {'neurons': 30}

We can see that the best results were achieved with a network with 5 neurons in the hidden layer, with an accuracy of about 71%.

Tips for Hyperparameter Optimization

This section lists some handy tips to consider when tuning the hyperparameters of your neural network.

  • k-fold Cross-Validation. You can see that the results from the examples in this post show some variance. The default 3-fold cross-validation was used, but perhaps k=5 or k=10 would be more stable. Carefully choose your cross-validation configuration to ensure your results are stable.
  • Review the Whole Grid. Do not just focus on the best result; review the whole grid of results and look for trends to support configuration decisions.
  • Parallelize. Use all your cores if you can; neural networks are slow to train and we often want to try a lot of different parameters. Consider spinning up many AWS instances.
  • Use a Sample of Your Dataset. Because networks are slow to train, try training them on a smaller sample of your training dataset, just to get an idea of the general direction of parameters rather than optimal configurations.
  • Start with Coarse Grids. Start with coarse-grained grids and zoom into finer-grained grids once you can narrow the scope (see the sketch after this list).
  • Do not Transfer Results. Results are generally problem-specific. Try to avoid reusing your favorite configuration on each new problem. It is unlikely that the optimal results you discover on one problem will transfer to your next project. Instead, look for broader trends such as the number of layers or relationships between parameters.
  • Reproducibility is a Problem. Although we set the seed for the random number generator in NumPy, the results are not 100% reproducible. There is more to reproducibility when grid searching wrapped Keras models than is covered in this post.
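
As a rough sketch of the sampling and coarse-to-fine tips above (my own illustration with hypothetical value ranges, assuming X, Y and a KerasClassifier model with a learn_rate argument as in the learning-rate example earlier):

# 1) work on a random subsample of the data to get a rough direction quickly
idx = numpy.random.choice(len(X), size=300, replace=False)
X_small, Y_small = X[idx], Y[idx]

# 2) coarse grid first: widely spaced learning rates on the small sample
coarse = GridSearchCV(estimator=model, param_grid=dict(learn_rate=[0.0001, 0.001, 0.01, 0.1]), n_jobs=-1)
coarse_result = coarse.fit(X_small, Y_small)

# 3) finer grid around the best coarse value (here assuming 0.01 won), on the full data
fine = GridSearchCV(estimator=model, param_grid=dict(learn_rate=[0.005, 0.01, 0.02, 0.05]), n_jobs=-1)
fine_result = fine.fit(X, Y)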

Summary

In this post, you discovered how you can tune the hyperparameters of your deep learning networks in Python using Keras and scikit-learn.

Specifically, you learned:

  • How to wrap Keras models for use in scikit-learn and how to use grid search.
  • How to grid search a suite of different standard neural network parameters for Keras models.
  • How to design your own hyperparameter optimization experiments.
