Postmodel Workflow

Source: Internet | Editor: 程序博客网 | Time: 2024/06/09 16:15

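K-Fold Cross-Validation

This article covers model validation and selection, starting with k-fold cross-validation.
We split the dataset into K parts, train on K-1 of them, and test on the remaining one. Repeating this so that each part serves as the test set once gives K validation errors, and their mean is the cross-validation estimate of the test error.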
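Written out, the averaging step looks like this. The symbols are assumptions introduced for illustration, not notation from the original text: E_k is the error measured on the k-th held-out part.

```latex
\mathrm{CV}_K = \frac{1}{K} \sum_{k=1}^{K} E_k
```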

    from sklearn.datasets import make_regression
    from sklearn.model_selection import KFold
    from sklearn.model_selection import train_test_split

    X, y = make_regression(1000, shuffle=True)
    # Hold out 20% as a final test set; cross-validate on the remaining 800 samples
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

    kf = KFold(n_splits=4)
    output_string = "Fold: {}, N_train: {}, N_test: {}"

    for i, (train_index, test_index) in enumerate(kf.split(X_train)):
        # The indices refer to rows of the array passed to split(), so index X_train, not X
        X_fold_train, X_fold_test = X_train[train_index], X_train[test_index]
        y_fold_train, y_fold_test = y_train[train_index], y_train[test_index]
        print(output_string.format(i, len(X_fold_train), len(X_fold_test)))

The output is:

    Fold: 0, N_train: 600, N_test: 200
    Fold: 1, N_train: 600, N_test: 200
    Fold: 2, N_train: 600, N_test: 200
    Fold: 3, N_train: 600, N_test: 200

This way, in each round we train a model on the K-1 training folds and test it on the one remaining fold.

    from sklearn.datasets import make_classification
    from sklearn.model_selection import KFold
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(1000)
    kf = KFold(n_splits=4)
    scores = []

    for train_index, test_index in kf.split(X):
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]
        lr = LogisticRegression()
        lr.fit(X_train, y_train)
        # score() returns the accuracy on the held-out fold
        scores.append(lr.score(X_test, y_test))

    print(scores)
    print(sum(scores) / len(scores))

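The last value printed is the mean score across the folds. This cross-validation used the default C=1.0; we can repeat it with different values of C and keep the one with the lowest validation error. (In scikit-learn, C is the inverse of the regularization strength, so a smaller C means stronger regularization.)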
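Trying several values of C could look roughly like the sketch below. The candidate list, the fixed random_state, and max_iter=1000 are assumptions chosen for illustration, not values from the original article, and cross_val_score is used here in place of the manual loop.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data; random_state is fixed only to make the sketch repeatable.
X, y = make_classification(1000, random_state=0)

# Hypothetical candidate values for C (inverse regularization strength).
candidate_Cs = [0.01, 0.1, 1.0, 10.0]

mean_scores = {}
for C in candidate_Cs:
    lr = LogisticRegression(C=C, max_iter=1000)
    # 4-fold cross-validation; each score is the accuracy on one held-out fold.
    fold_scores = cross_val_score(lr, X, y, cv=4)
    mean_scores[C] = fold_scores.mean()

# Highest mean accuracy is the same as lowest mean validation error.
best_C = max(mean_scores, key=mean_scores.get)
print(mean_scores)
print("best C:", best_C)
```

On a dataset this easy the scores may differ only slightly between values of C, so the selected value should not be read as meaningful beyond this toy setup.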

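StratifiedKFold: Stratified K-Fold

This class also performs k-fold cross-validation. The difference from KFold is that KFold cuts the samples into folds directly, in order, so if the classes are not evenly spread through the data, the folds end up with uneven class proportions. StratifiedKFold instead preserves the percentage of samples of each class in every fold, so each fold keeps the class balance of the full dataset.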

    from sklearn import datasets
    from sklearn.model_selection import KFold, StratifiedKFold

    iris = datasets.load_iris()
    iris.target

    Out:
    array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
           0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
           0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
           1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
           1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
           2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
           2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

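The three classes are equal in size (50 samples each), but the samples are sorted by class rather than mixed. Let's see what happens if we use KFold directly: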

    kf = KFold(n_splits=3)
    for train_index, test_index in kf.split(iris.target):
        train_X, test_X = iris.data[train_index], iris.data[test_index]
        train_y, test_y = iris.target[train_index], iris.target[test_index]
        print(test_y)

    Out:
    [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
    [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
    [2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]

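Each test fold contains a single class, which means the corresponding training set contains only the other two. The model never sees the class it is tested on, so it will perform about as badly as possible. Let's try StratifiedKFold: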

    skf = StratifiedKFold(n_splits=3)
    for train_index, test_index in skf.split(iris.data, iris.target):
        train_X, test_X = iris.data[train_index], iris.data[test_index]
        train_y, test_y = iris.target[train_index], iris.target[test_index]
        print(test_y)

    Out:
    [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]
    [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]
    [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]

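In all three splits, the test set keeps the original class proportions. The split method of StratifiedKFold requires the class labels, which shows that it is mainly intended for classification. KFold only needs an array with one entry per sample (either the data or the labels) and derives the fold indices from it. So classification should use StratifiedKFold and regression should use KFold (regression targets are continuous values rather than class labels, so there are no classes to stratify on).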
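One way to check the class balance without reading the raw label arrays is to count the classes in each test fold. The sketch below uses np.bincount for that; the variable name fold_class_counts is made up for this illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold

iris = load_iris()
skf = StratifiedKFold(n_splits=3)

fold_class_counts = []
for train_index, test_index in skf.split(iris.data, iris.target):
    # Number of test samples per class (classes 0, 1, 2) in this fold.
    counts = np.bincount(iris.target[test_index], minlength=3)
    fold_class_counts.append(counts)
    print(counts)
```

With 50 samples per class and 3 folds, each class contributes 16 or 17 samples to every test fold. Exactly which folds receive the extra sample depends on the scikit-learn version.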

Automatic Cross-Validation

Above we ran the cross-validation loop by hand. scikit-learn can also do it for us.

    from sklearn.linear_model import LogisticRegression
    from sklearn import datasets
    from sklearn import model_selection

    lr = LogisticRegression()
    X, y = datasets.make_classification(10000)
    scores = model_selection.cross_val_score(lr, X, y)
    print(scores)

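cross_val_score returns the cross-validation scores directly (the cv parameter defaulted to 3 folds in older scikit-learn releases and defaults to 5 folds since version 0.22). A related function is cross_val_predict, which returns the prediction for every sample, each one made by a model trained on the folds that do not contain that sample.

The verbose parameter gives more detailed output: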
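A minimal sketch of cross_val_predict, assuming the same kind of synthetic data as above (the dataset size, random_state, cv=4, and max_iter=1000 are illustrative choices, not taken from the article):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X, y = make_classification(1000, random_state=0)
lr = LogisticRegression(max_iter=1000)

# Each sample is predicted by the model trained on the folds that exclude it.
y_pred = cross_val_predict(lr, X, y, cv=4)

print(y_pred.shape)
print("out-of-fold accuracy:", (y_pred == y).mean())
```

The result has one prediction per sample, which makes it convenient for things like a confusion matrix. The accuracy computed this way pools all folds and is not necessarily identical to the mean of the per-fold scores.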

    scores = model_selection.cross_val_score(lr, X, y, verbose=3, cv=4)

The output is:

    [Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
    [Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.1s remaining:    0.0s
    [Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:    0.2s finished
    [CV]  ................................................................
    [CV] ................................. , score=0.882800, total=   0.0s
    [CV]  ................................................................
    [CV] ................................. , score=0.879600, total=   0.0s
    [CV]  ................................................................
    [CV] ................................. , score=0.874400, total=   0.0s
    [CV]  ................................................................
    [CV] ................................. , score=0.880400, total=   0.0s
    [ 0.8828  0.8796  0.8744  0.8804]

ShuffleSplit

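ShuffleSplit is a variation on train_test_split: it splits the dataset K times, each time shuffling the samples and dividing them into a training set and a test set at the given ratio.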

    import numpy as np
    from sklearn.model_selection import ShuffleSplit

    X = np.arange(5)
    ss = ShuffleSplit(n_splits=3, test_size=0.25, random_state=0)
    for train_index, test_index in ss.split(X):
        print("%s %s" % (train_index, test_index))

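This repeats the split 3 times with a test fraction of 0.25 each time (with 5 samples, that comes to 2 test samples). Output:

    [1 3 4] [2 0]
    [1 4 3] [0 2]
    [4 0 2] [1 3]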

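random_state controls the randomness of the splits. With a fixed value, the same splits are produced every time split is called; otherwise each call reshuffles and produces different splits.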

    for train_index, test_index in ss.split(X):
        print("%s %s" % (train_index, test_index))

Output:

    [1 3 4] [2 0]
    [1 4 3] [0 2]
    [4 0 2] [1 3]

Both runs produce the same splits.
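The same property can be checked in code rather than by eye. In this sketch, collect_splits is a hypothetical helper written for this example (not part of scikit-learn), and the array of 10 samples is an arbitrary choice.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

X = np.arange(10)

def collect_splits(splitter, data):
    """Materialize all (train, test) index pairs as plain lists for comparison."""
    return [(train.tolist(), test.tolist()) for train, test in splitter.split(data)]

# Two independent splitters with the same seed.
seeded_a = ShuffleSplit(n_splits=3, test_size=0.25, random_state=0)
seeded_b = ShuffleSplit(n_splits=3, test_size=0.25, random_state=0)

splits_a = collect_splits(seeded_a, X)
splits_b = collect_splits(seeded_b, X)

same = splits_a == splits_b
print(same)
```

Because the seed is an integer, both splitters generate identical index pairs. Unlike KFold, the test sets of different ShuffleSplit iterations may overlap, as the output above already shows.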

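Grid Search

When a model has several hyperparameters, a grid search evaluates every combination of their values, and we can visualize the score of each combination. As an example, take two hyperparameters of a basic decision tree classifier: max_features, the maximum number of features considered at each split, and criterion, the measure of split quality. Together they form a 2-dimensional parameter space that can be plotted.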

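criterion = {gini, entropy}

max_features = {auto, log2, None}

(In recent scikit-learn versions the value auto has been removed; for classifiers it was equivalent to sqrt.)

The full set of parameter combinations is the Cartesian product of the two sets:

parameter_space = criterion × max_features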

itertools.product can generate these combinations.

    import itertools

    from matplotlib import pyplot as plt
    from sklearn import datasets
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = datasets.make_classification(n_samples=2000, n_features=10)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

    # Lists (not sets) keep the order of the plot axes deterministic
    criteria = ['gini', 'entropy']
    # 'auto' was removed in recent scikit-learn; for classifiers it was the same as 'sqrt'
    max_features = ['sqrt', 'log2', None]

    accuracies = {}
    for criterion, max_feature in itertools.product(criteria, max_features):
        dt = DecisionTreeClassifier(criterion=criterion, max_features=max_feature)
        dt.fit(X_train, y_train)
        accuracies[(criterion, max_feature)] = (dt.predict(X_test) == y_test).mean()

    # Rows: max_features, columns: criterion
    plot_array = [[accuracies[(c, m)] for c in criteria] for m in max_features]

    fig, ax = plt.subplots()
    colors = ax.matshow(plot_array)
    ax.set_xticks(range(len(criteria)))
    ax.set_xticklabels(criteria)
    ax.set_yticks(range(len(max_features)))
    ax.set_yticklabels([str(m) for m in max_features])
    fig.colorbar(colors)
    plt.show()


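Brute-Force Search

Here we automate the hyperparameter search to find the best model. There are two approaches. If we know the specific candidate values, GridSearchCV picks the best combination from that set. If we do not know specific values but do know the distribution of each parameter, we use RandomizedSearchCV.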

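    import pprint

    import scipy.stats as st
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

    X, y = make_classification(1000, n_features=5)

    # liblinear supports both 'l1' and 'l2'; the default lbfgs solver does not support 'l1'
    lr = LogisticRegression(solver='liblinear', class_weight='balanced')

    grid_search_params = {'penalty': ['l1', 'l2'], 'C': [1, 2, 3, 4]}
    # st.randint(1, 4) draws integers from 1 to 3 (the upper bound is exclusive)
    random_search_params = {'penalty': ['l1', 'l2'], 'C': st.randint(1, 4)}

    gs = GridSearchCV(lr, grid_search_params)
    gs.fit(X, y)
    pprint.pprint(gs.cv_results_)
    pprint.pprint(gs.best_params_)
    pprint.pprint(gs.best_score_)

    # Randomized search samples n_iter parameter settings from the distributions
    rs = RandomizedSearchCV(lr, random_search_params, n_iter=5)
    rs.fit(X, y)
    pprint.pprint(rs.best_params_)
    pprint.pprint(rs.best_score_)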

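From the fitted search object we can read off the best score (best_score_), the best hyperparameters (best_params_), and the best estimator (best_estimator_).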
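As a rough sketch of how the best estimator might be used afterwards, the example below runs the search on a training portion only and scores the refit model on data the search never saw. The 80/20 split, cv=4, and the fixed random_state values are assumptions added for this illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(1000, n_features=5, random_state=0)
# Keep 20% aside so the final score is not influenced by the search itself.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# liblinear is used because it accepts both the 'l1' and 'l2' penalties.
lr = LogisticRegression(solver='liblinear', class_weight='balanced')
gs = GridSearchCV(lr, {'penalty': ['l1', 'l2'], 'C': [1, 2, 3, 4]}, cv=4)
gs.fit(X_train, y_train)

# With the default refit=True, best_estimator_ is already refit on all of X_train.
best_model = gs.best_estimator_
test_accuracy = best_model.score(X_test, y_test)

print(gs.best_params_)
print("held-out accuracy:", test_accuracy)
```

Scoring on a held-out set gives a less optimistic figure than best_score_, which was itself used to choose the parameters.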

Regularization and model selection
Tuning the hyper-parameters of an estimator
Cross-validation: evaluating estimator performance
