Hyperopt-sklearn, a Python tool for machine learning model selection and hyperparameter tuning (1): overview and classification

Source: Internet. Editor: 程序博客网. Time: 2024/06/05 14:25

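Choosing a suitable machine learning algorithm for a particular dataset is a lengthy process. Even once an algorithm has been chosen, its parameters still take considerable time and effort to tune before the model performs well. Hyperopt-sklearn can help with both problems.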

Homepage: http://hyperopt.github.io/hyperopt-sklearn/

Installation:

    git clone https://github.com/hyperopt/hyperopt-sklearn.git
    cd hyperopt-sklearn
    pip install -e .

Basic example:

    from hpsklearn import HyperoptEstimator

    # Load Data
    # ...

    # Create the estimator object
    estim = HyperoptEstimator()

    # Search the space of classifiers and preprocessing steps and their
    # respective hyperparameters in sklearn to fit a model to the data
    estim.fit(train_data, train_label)

    # Make a prediction using the optimized model
    prediction = estim.predict(unknown_data)

    # Report the accuracy of the classifier on a given set of data
    score = estim.score(test_data, test_label)

    # Return instances of the classifier and preprocessing steps
    model = estim.best_model()

For classification problems, HyperoptEstimator can be configured as follows:

    from hyperopt import tpe
    from hpsklearn import HyperoptEstimator, any_classifier

    estim = HyperoptEstimator(classifier=any_classifier('clf'), algo=tpe.suggest)
    estim.fit(X_train, y_train)

Here any_classifier is a collection of commonly used classifiers. According to the source code:

    def any_classifier(name):
        return hp.choice('%s' % name, [
            svc(name + '.svc'),
            knn(name + '.knn'),
            random_forest(name + '.random_forest'),
            extra_trees(name + '.extra_trees'),
            ada_boost(name + '.ada_boost'),
            gradient_boosting(name + '.grad_boosting', loss='deviance'),
            sgd(name + '.sgd'),
        ])

This shows that the classifiers currently supported are:
(1) svc (based on sklearn.svm.SVC)
(2) knn (based on sklearn.neighbors.KNeighborsClassifier)
(3) random_forest (based on sklearn.ensemble.RandomForestClassifier)
(4) extra_trees (based on sklearn.ensemble.ExtraTreesClassifier)
(5) ada_boost (based on sklearn.ensemble.AdaBoostClassifier)
(6) gradient_boosting (based on sklearn.ensemble.GradientBoostingClassifier)
(7) sgd (based on sklearn.linear_model.SGDClassifier)
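Because any_classifier wraps its candidates in hp.choice, each trial first picks one classifier and then only that classifier's hyperparameters matter. The following is a minimal standard-library sketch of that "choose a branch, then sample within it" idea. It is not hyperopt's implementation: it draws uniformly at random rather than using TPE, and the branch names, parameter names and ranges in SEARCH_SPACE are hypothetical, chosen only for illustration.

```python
import math
import random

# Hypothetical two-level search space: pick a classifier first, then sample
# only that classifier's hyperparameters. Names and ranges are illustrative
# and are NOT the ranges used by hpsklearn.
SEARCH_SPACE = {
    "svc": {
        # log-uniform draw between 1e-3 and 1e3
        "C": lambda rng: math.exp(rng.uniform(math.log(1e-3), math.log(1e3))),
    },
    "knn": {
        "n_neighbors": lambda rng: rng.randint(1, 30),
    },
    "random_forest": {
        "n_estimators": lambda rng: rng.randint(10, 200),
    },
}


def sample_config(rng):
    """Draw one configuration: a branch name plus that branch's parameters."""
    name = rng.choice(sorted(SEARCH_SPACE))
    params = {key: draw(rng) for key, draw in SEARCH_SPACE[name].items()}
    return name, params


# A fixed seed makes the sequence of sampled configurations repeatable.
rng = random.Random(0)
configs = [sample_config(rng) for _ in range(5)]
for name, params in configs:
    print(name, params)
```

Each sampled configuration contains parameters for exactly one classifier, which is why the keys in the source above are namespaced per branch (for example name + '.svc').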

By default, HyperoptEstimator also tries to preprocess the data. According to the source code:

    def any_preprocessing(name):
        """Generic pre-processing appropriate for a wide variety of data
        """
        return hp.choice('%s' % name, [
            [pca(name + '.pca')],
            [standard_scaler(name + '.standard_scaler')],
            [min_max_scaler(name + '.min_max_scaler')],
            [normalizer(name + '.normalizer')],
            # -- not putting in one-hot because it can make vectors huge
            #[one_hot_encoder(name + '.one_hot_encoder')],
            []
        ])

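This shows that the preprocessing methods currently supported are listed below; the empty list in the source means that applying no preprocessing is also an option:
(1) pca (based on sklearn.decomposition.PCA)
(2) standard_scaler (based on sklearn.preprocessing.StandardScaler)
(3) min_max_scaler (based on sklearn.preprocessing.MinMaxScaler)
(4) normalizer (based on sklearn.preprocessing.Normalizer)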
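To get a feel for what this choice covers, the sketch below loops by hand over the same four scikit-learn transformers plus the "no preprocessing" option. It uses plain scikit-learn, not hpsklearn. The fixed KNeighborsClassifier, the default transformer settings and the 3-fold cross-validation are assumptions made for illustration; hpsklearn also tunes the transformers' own parameters, which this loop does not.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, Normalizer, StandardScaler

X, y = load_digits(return_X_y=True)

# One entry per branch of any_preprocessing; the empty list means "no step".
preprocessing_options = {
    "pca": [PCA()],
    "standard_scaler": [StandardScaler()],
    "min_max_scaler": [MinMaxScaler()],
    "normalizer": [Normalizer()],
    "none": [],
}

cv_scores = {}
for label, steps in preprocessing_options.items():
    # Chain the (possibly empty) preprocessing steps in front of the classifier.
    model = make_pipeline(*steps, KNeighborsClassifier())
    cv_scores[label] = cross_val_score(model, X, y, cv=3).mean()
    print(label, round(cv_scores[label], 4))
```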

Classification example:

First, load the data:

    import time
    import numpy as np
    from sklearn.datasets import load_digits
    from sklearn.svm import SVC
    from hyperopt import tpe
    from hpsklearn import HyperoptEstimator, any_classifier
    from hpsklearn import svc

    digits = load_digits()
    X = digits.data
    y = digits.target

    # Hold out 20% of the shuffled samples as the test set
    test_size = int(0.2 * len(y))
    np.random.seed(0)
    indices = np.random.permutation(len(X))
    X_train = X[indices[:-test_size]]
    y_train = y[indices[:-test_size]]
    X_test = X[indices[-test_size:]]
    y_test = y[indices[-test_size:]]
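The manual shuffle-and-slice split above can also be written with scikit-learn's train_test_split. This is an alternative sketch, not the code used in this article: it produces the same train and test sizes but a different assignment of samples, so scores obtained with it will not match the numbers quoted here exactly. The stratify argument is an optional extra that keeps class proportions similar in both parts.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)

# An integer test_size is an absolute number of samples, matching
# int(0.2 * len(y)) in the manual split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=int(0.2 * len(y)), random_state=0, stratify=y
)
print(X_train.shape, X_test.shape)
```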

Then run the classification:

    estim = HyperoptEstimator(classifier=any_classifier('clf'), algo=tpe.suggest)
    estim.fit(X_train, y_train)
    print(estim.score(X_test, y_test))
    print(estim.best_model())

The output looks like this (results may differ from run to run):

    0.983286908078
    {'learner': KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='euclidean',
               metric_params=None, n_jobs=1, n_neighbors=10, p=2,
               weights='uniform'), 'preprocs': (), 'ex_preprocs': ()}

To get the same result on every run, set the seed parameter:

    # ensure that the result is the same
    estim = HyperoptEstimator(classifier=any_classifier('clf'), algo=tpe.suggest, seed=0)
    estim.fit(X_train, y_train)
    print(estim.score(X_test, y_test))
    print(estim.best_model())

The output is:

    0.980501392758
    {'learner': SVC(C=61953.1811067, cache_size=512, class_weight=None, coef0=0.0,
      decision_function_shape=None, degree=1, gamma='auto', kernel='linear',
      max_iter=18658754.0, probability=False, random_state=3, shrinking=False,
      tol=7.18807580055e-05, verbose=False), 'preprocs': (StandardScaler(copy=True, with_mean=False, with_std=True),), 'ex_preprocs': ()}

To optimize one specific algorithm, pass it through the classifier parameter.
Taking SVM as an example, test-set accuracy is 39.28% before optimization and 98.61% after.

    # Baseline: SVC with default parameters
    start = time.time()
    clf1 = SVC()
    clf1.fit(X_train, y_train)
    end = time.time()
    print('old test score:', clf1.score(X_test, y_test))
    print('old time:', (end - start), 's')
    print('old model:', clf1)

Output:

    old test score: 0.392757660167
    old time: 0.422000169754 s
    old model: SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
      decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
      max_iter=-1, probability=False, random_state=None, shrinking=True,
      tol=0.001, verbose=False)

    # significant improvement
    start = time.time()
    clf2 = HyperoptEstimator(classifier=svc('mySVC'), seed=0)
    clf2.fit(X_train, y_train)
    end = time.time()
    print('new score', clf2.score(X_test, y_test))
    print('new time:', (end - start), 's')
    print('new model:', clf2.best_model())

Output:

    new score 0.986072423398
    new time: 9.24400019646 s
    new model: {'learner': SVC(C=3148.38646281, cache_size=512, class_weight=None, coef0=0.0,
      decision_function_shape=None, degree=1, gamma=0.0475906452129,
      kernel='rbf', max_iter=46434501.0, probability=False, random_state=4,
      shrinking=False, tol=0.00158569665523, verbose=False), 'preprocs': (MinMaxScaler(copy=True, feature_range=(-1.0, 1.0)),), 'ex_preprocs': ()}
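The printed models hint at where part of the gain may come from: the baseline uses gamma='auto' (that is, 1 / n_features) on unscaled pixel values, while the tuned model puts a MinMaxScaler with feature_range=(-1.0, 1.0) in front of the SVC. The sketch below, in plain scikit-learn, checks how much scaling alone changes the score. It is an illustration under assumptions, not a reproduction of the run above: the split comes from train_test_split, C and gamma are left untuned, and gamma='auto' is passed explicitly because newer scikit-learn releases default to gamma='scale'.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Baseline: RBF SVC on raw pixel values with gamma = 1 / n_features.
raw_svc = SVC(gamma="auto").fit(X_train, y_train)
raw_score = raw_svc.score(X_test, y_test)

# Same SVC settings, but with features rescaled to [-1, 1] first.
scaled_svc = make_pipeline(
    MinMaxScaler(feature_range=(-1.0, 1.0)),
    SVC(gamma="auto"),
)
scaled_svc.fit(X_train, y_train)
scaled_score = scaled_svc.score(X_test, y_test)

print("raw features:   ", raw_score)
print("scaled features:", scaled_score)
```

Whatever gap remains between the scaled score and the tuned score would then be attributable to the searched values of C, gamma and the other SVC parameters.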