The sklearn.cross_validation Module and Data-Splitting Methods

Source: Internet. Editor: 程序博客网. Time: 2024/06/05 13:30

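1. The sklearn.cross_validation module

Note: sklearn.cross_validation was deprecated in scikit-learn 0.18 and removed in 0.20. The same functions are now in sklearn.model_selection, which is the module the code in section 2 imports from.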

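(1) The sklearn.cross_validation.cross_val_score() function: returns the score obtained on each cross-validation fold (for a classifier, the classification accuracy by default).

For details see http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.cross_val_score.html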

sklearn.cross_validation.cross_val_score(estimator, X, y=None, scoring=None, cv=None, n_jobs=1, verbose=0, fit_params=None, pre_dispatch='2*n_jobs')

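Some of the parameters:

estimator: the model to evaluate. Any classifier can be passed here, for example a logistic regression classifier: clf = sklearn.linear_model.LogisticRegression(C=1.0, penalty='l1', tol=1e-6). (In recent scikit-learn versions the L1 penalty also requires a solver that supports it, such as solver='liblinear'.)

cv: the cross-validation strategy. It can be an int, a cross-validation generator, or an iterable. The default, None, means 3-fold cross-validation (5-fold from scikit-learn 0.22 on). An integer such as cv=5 means 5-fold cross-validation; an object is treated as a generator of train/test splits. When cv is an int or None, the estimator is a classifier, and y is binary or multiclass, StratifiedKFold is used; otherwise KFold is used.

scoring: the scoring method; defaults to None. If it is not specified, the estimator's own score method is used, which for classifiers is accuracy.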

Example:

>>> sklearn.cross_validation.cross_val_score(clf, x, y, cv=5)

array([ 0.81564246,  0.81564246,  0.78651685,  0.78651685,  0.81355932])
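
The call above depends on clf, x and y defined elsewhere. As a self-contained sketch of the same idea with the current sklearn.model_selection import path, the following uses the bundled iris dataset and a plain LogisticRegression; both are stand-ins chosen for illustration, not the data or model behind the numbers above.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Iris is only a stand-in dataset for this sketch (an assumption).
X, y = load_iris(return_X_y=True)

# max_iter is raised so the default solver converges on this data.
clf = LogisticRegression(C=1.0, max_iter=1000)

# cv=5 with a classifier and a multiclass y selects StratifiedKFold.
scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")

print(scores)         # one accuracy value per fold
print(scores.mean())  # average accuracy over the 5 folds
```

The returned array has one entry per fold, so its mean is the usual single-number summary of the run.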

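(2) The sklearn.cross_validation.train_test_split() function

For details see http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html

sklearn.cross_validation.train_test_split(*arrays, **options) randomly splits the given arrays into a training set and a test set in the requested proportion and returns the pieces.

Some of the parameters:

*arrays: the input samples.

train_size: a float between 0 and 1 gives the proportion of samples that go into the training set (an int gives an absolute number of samples).

test_size: a float between 0 and 1 gives the proportion of samples that go into the test set (an int gives an absolute number of samples). If both test_size and train_size are None, test_size is set to 0.25.

random_state: if it is an int, it is the seed of the random number generator.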
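
To make the parameters concrete, here is a minimal sketch using the current sklearn.model_selection import path. The toy arrays and the seed value 42 are assumptions made for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 8 samples with 2 features each.
X = np.arange(16).reshape(8, 2)
y = np.arange(8)

# Neither test_size nor train_size is given, so test_size falls back to 0.25.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
print(X_train.shape, X_test.shape)

# The same integer seed reproduces the same split.
X_train2, X_test2, _, _ = train_test_split(X, y, random_state=42)
same_split = np.array_equal(X_test, X_test2)
print(same_split)
```

With 8 samples, the default split puts 6 samples in the training set and 2 in the test set.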

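2. Data-splitting methods

(1) K-fold cross-validation: KFold, GroupKFold, StratifiedKFold

Example:

K-fold: the default CV strategy. In the old sklearn.cross_validation API its two main parameters were the number of samples and the number of folds. In sklearn.model_selection, used below, the main parameter is n_splits (the number of folds), and the data is passed to split().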

import numpy as np
from sklearn.model_selection import KFold

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([1, 2, 3, 4])
kf = KFold(n_splits=2)
kf.get_n_splits(X)  # number of folds; returns 2
print(kf)
# Output: KFold(n_splits=2, random_state=None, shuffle=False)
for train_index, test_index in kf.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
# Output:
# TRAIN: [2 3] TEST: [0 1]
# TRAIN: [0 1] TEST: [2 3]

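Here kf.split(X) yields the indices of the train and test portions of X for each split; the samples in X have indices 0, 1, 2, 3. In the first split the test set is chosen first: samples 0 and 1 form the test set, and the remaining samples 2 and 3 form the training set. In the second split, samples 2 and 3 form the test set, and the remaining samples 0 and 1 form the training set.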

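Stratified k-fold: like k-fold, it splits the dataset into k parts. The difference is that within each part the proportion of each class matches the class proportions of the original dataset.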

import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([0, 0, 1, 1])
skf = StratifiedKFold(n_splits=2)
skf.get_n_splits(X, y)  # number of folds; returns 2
print(skf)
# Output: StratifiedKFold(n_splits=2, random_state=None, shuffle=False)
for train_index, test_index in skf.split(X, y):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
# Output:
# TRAIN: [1 3] TEST: [0 2]
# TRAIN: [0 2] TEST: [1 3]
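
The claim that every fold keeps the original class proportions can be checked directly by counting classes in each test fold. The sketch below uses hypothetical labels with a 2:1 class ratio; the data is made up for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical toy labels: 8 samples of class 0 and 4 of class 1 (ratio 2:1).
y = np.array([0] * 8 + [1] * 4)
X = np.zeros((len(y), 1))  # feature values do not influence the split

skf = StratifiedKFold(n_splits=4)

# Count how many samples of each class land in every test fold.
fold_counts = [np.bincount(y[test_index], minlength=2)
               for _, test_index in skf.split(X, y)]

for counts in fold_counts:
    print("class counts in test fold:", counts)
```

Each of the four test folds holds two samples of class 0 and one of class 1, the same 2:1 ratio as the full label vector.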

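(2) Leave-out methods: LeaveOneGroupOut, LeavePGroupsOut, LeaveOneOut, LeavePOut

Example:

leave-one-out: each sample in turn serves as the validation set on its own, and the other N-1 samples form the training set. LOO-CV therefore produces N models, and the average of the classification accuracies these N models reach on their validation samples is taken as the performance figure of the classifier under LOO-CV. In the old sklearn.cross_validation API the only parameter was the number of samples; in sklearn.model_selection, used below, LeaveOneOut() takes no arguments and the data is passed to split().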

from sklearn.model_selection import LeaveOneOut

X = [1, 2, 3, 4]
loo = LeaveOneOut()
for train, test in loo.split(X):
    print("%s %s" % (train, test))
# Output:
# [1 2 3] [0]
# [0 2 3] [1]
# [0 1 3] [2]
# [0 1 2] [3]
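
The description of LOO-CV says that the N per-sample results are averaged into one performance figure. A minimal sketch of that step follows, assuming the bundled iris dataset and a 3-nearest-neighbours classifier as stand-ins (neither comes from the original post).

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data and model, chosen only for illustration.
X, y = load_iris(return_X_y=True)
clf = KNeighborsClassifier(n_neighbors=3)

# One model is fitted per sample; each score is 1.0 (hit) or 0.0 (miss).
scores = cross_val_score(clf, X, y, cv=LeaveOneOut())

print(len(scores))    # equals the number of samples N
print(scores.mean())  # LOO-CV accuracy estimate
```

Because every validation set holds a single sample, LOO-CV fits N models, which gets expensive on large datasets.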

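leave-P-out: each split removes p samples from the full set to use as the test set. With n samples in total, this produces C(n, p) = n! / (p! (n - p)!) train/test pairs. Unlike LOO and KFold, the test sets of different splits overlap when p > 1.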

import numpy as np
from sklearn.model_selection import LeavePOut

X = np.ones(4)
lpo = LeavePOut(p=2)
for train, test in lpo.split(X):
    print("%s %s" % (train, test))
# Output:
# [2 3] [0 1]
# [1 3] [0 2]
# [1 2] [0 3]
# [0 3] [1 2]
# [0 2] [1 3]
# [0 1] [2 3]
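
The count of train/test pairs and the overlap between test sets can both be verified numerically. The sketch below assumes n = 5 and p = 2, values picked only for illustration; math.comb requires Python 3.8 or later.

```python
from math import comb

import numpy as np
from sklearn.model_selection import LeavePOut

n, p = 5, 2          # hypothetical sizes for this sketch
X = np.ones(n)
lpo = LeavePOut(p=p)

# Number of splits reported by scikit-learn vs. the binomial coefficient.
n_splits = lpo.get_n_splits(X)
print(n_splits, comb(n, p))  # 10 10

# Sample 0 appears in several test sets, so the test sets overlap.
times_in_test = sum(0 in test for _, test in lpo.split(X))
print(times_in_test)  # 4, i.e. C(n - 1, p - 1)
```

The number of splits grows combinatorially with n, so leave-P-out is practical only for small datasets.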

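leave-one-label-out: this strategy splits the samples according to integer labels supplied from outside (group labels). In each split, all samples carrying one particular label form the test set and the rest form the training set. In sklearn.model_selection this class is named LeaveOneGroupOut (LeaveOneLabelOut was its name in the old sklearn.cross_validation module), and the labels are passed to split() as groups.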

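from sklearn.model_selection import LeaveOneGroupOut

X = [1, 2, 3, 4]  # placeholder samples; only their count matters here
labels = [1, 1, 2, 2]
logo = LeaveOneGroupOut()
for train, test in logo.split(X, groups=labels):
    print("%s %s" % (train, test))
# Output:
# [2 3] [0 1]
# [0 1] [2 3]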

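leave-P-label-out: similar to leave-one-label-out, but each split takes the samples of p different labels as the test set and the rest as the training set. In sklearn.model_selection this class is named LeavePGroupsOut, and p is passed as n_groups.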

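from sklearn.model_selection import LeavePGroupsOut

X = [1, 2, 3, 4, 5, 6]  # placeholder samples; only their count matters here
labels = [1, 1, 2, 2, 3, 3]
lpgo = LeavePGroupsOut(n_groups=2)
for train, test in lpgo.split(X, groups=labels):
    print("%s %s" % (train, test))
# Output:
# [4 5] [0 1 2 3]
# [2 3] [0 1 4 5]
# [0 1] [2 3 4 5]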

(3) Random-split methods: ShuffleSplit, GroupShuffleSplit, StratifiedShuffleSplit
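
The post gives no example for this group, so the following is only a minimal sketch of ShuffleSplit, the first class listed. The toy array, n_splits=3, test_size=0.25 and the seed 0 are all assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

# Hypothetical toy data: 8 samples with 2 features each.
X = np.arange(16).reshape(8, 2)

# Each split shuffles the indices and holds out 25% of them as the test set.
ss = ShuffleSplit(n_splits=3, test_size=0.25, random_state=0)
splits = list(ss.split(X))

for train_index, test_index in splits:
    print("TRAIN:", train_index, "TEST:", test_index)
```

Unlike KFold, the splits are drawn independently, so the same sample may land in the test set of more than one split.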
