sklearn: model_selection

Source: Internet. Editor: 程序博客网. Time: 2024/05/21 10:10

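model_selection mainly provides tools for cross-validation and result evaluation. The cross_validation module is the legacy module from older versions and offers the same methods; it is removed in version 0.20.0, so avoid using cross_validation.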

sklearn.model_selection.train_test_split()

Purpose: randomly split a dataset into a training set and a test set.

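'''
Parameters:
    *arrays: the datasets to split; all must have the same length (shape[0] or len). Lists, numpy arrays and dataframes are accepted;
    test_size: size of the test set; float, int or None, default None;
        float: a value between 0.0 and 1.0, the proportion of the dataset placed in the test set;
        int: the absolute number of test samples;
        None: derived from train_size; if train_size is also None, test_size is set to 0.25;
    train_size: size of the training set; float, int or None, default None;
        float: a value between 0.0 and 1.0, the proportion of the dataset placed in the training set;
        int: the absolute number of training samples;
        None: derived from test_size; if test_size is also None, train_size is set to 0.75;
    random_state: random seed;
    stratify: class labels for the samples, default None. May be array-like with the same length as the arrays being split, where each value is the class of the corresponding sample; e.g. [0,0,0,1,1] means the first 3 samples belong to class 0 and the last two to class 1;
        When this parameter is not None, the training and test sets are drawn proportionally from each class, so the class ratios in the training set and in the test set match those of the full dataset.
'''
import numpy as np
from sklearn.model_selection import train_test_split

# Using the stratify parameter
X = np.random.rand(100, 3)
y = np.random.rand(100) * 2 + 100
stratify = np.ones(100)
stratify[50:] = 2
print(train_test_split(X, y, stratify=stratify))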
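To make the effect of stratify more concrete, here is a minimal sketch on a made-up dataset with an 80/20 class split. The data, the variable names and the chosen test_size are illustrative assumptions, not part of the original post.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 samples, 80 of class 0 and 20 of class 1.
X = np.arange(200).reshape(100, 2)
labels = np.array([0] * 80 + [1] * 20)

X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, stratify=labels, random_state=0
)

# Because the labels are 0/1, the mean is the fraction of class 1.
train_ratio = y_train.mean()
test_ratio = y_test.mean()
print("train size:", len(y_train), "class-1 fraction:", train_ratio)
print("test size:", len(y_test), "class-1 fraction:", test_ratio)
```

Both subsets should keep the 20% share of class 1 that the full dataset has.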

The following classes split the samples; they all share the methods get_n_splits() and split().

These two shared methods are introduced first:

get_n_splits()

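'''
Purpose: returns the number of subsets (folds) produced by the split, i.e. the n_splits parameter set when the splitter was initialized.
Parameters:
    X: the data to be split. The splitters described below return n_splits regardless of X, so the argument mainly exists for a consistent interface.
Returns: the number of subsets (folds), i.e. n_splits.
'''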

split()

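'''
Purpose: splits the samples. It is implemented as a generator (using yield), so call it in a for loop; each iteration returns one split.
Parameters:
    X: (required) sample features, with shape (n_samples, n_features);
    y: sample labels, of length n_samples;
    groups: shape (n_samples,), the group each sample belongs to, e.g. [0,0,0,1,1].
Yields:
    train: indices of the training samples for this split;
    test: indices of the test samples for this split.
'''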
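The two shared methods can be exercised together as in the rough sketch below. It uses KFold as a stand-in for any splitter and a made-up 6-sample array; both choices are assumptions made only for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical data: 6 samples with 2 features each.
X = np.arange(12).reshape(6, 2)
splitter = KFold(n_splits=3)

expected = splitter.get_n_splits(X)

n_yielded = 0
sizes = []
for train_idx, test_idx in splitter.split(X):
    # split() yields index arrays, not the data itself.
    sizes.append((len(train_idx), len(test_idx)))
    n_yielded += 1

print("get_n_splits:", expected, "splits yielded:", n_yielded)
print("(train, test) sizes per split:", sizes)
```

The number of iterations of split() should equal the value returned by get_n_splits().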

The splitter classes follow.

class sklearn.model_selection.KFold

'''
Purpose: K-Folds cross-validation splitter. The dataset is divided into K subsets (each one a fold), and every fold is used once as the validation set.
Parameters:
    n_splits: number of subsets (folds), default 3 (5 from scikit-learn 0.22 onward); must be at least 2;
    shuffle: bool, whether to shuffle before splitting; default False, i.e. samples are split in order without shuffling;
    random_state: random seed used for shuffling.
'''
'''Here the split method can be iterated n_splits times.'''
import numpy as np
from sklearn.model_selection import KFold

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([1, 2, 3, 4])
kf = KFold(n_splits=2)
print(kf.get_n_splits(X))
# Output: 2
print(kf)
# Output: KFold(n_splits=2, random_state=None, shuffle=False)
for train_index, test_index in kf.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
'''
Output:
TRAIN: [2 3] TEST: [0 1]
TRAIN: [0 1] TEST: [2 3]
'''
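As a small illustration of the shuffle and random_state parameters, the sketch below counts how often each sample lands in a test fold. The 10-sample array and the seed value are arbitrary assumptions.

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical data: 10 samples with 2 features each.
X = np.arange(20).reshape(10, 2)
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Count how many times each sample index is used for testing.
test_counts = np.zeros(len(X), dtype=int)
for train_index, test_index in kf.split(X):
    test_counts[test_index] += 1

print(test_counts)
```

Even with shuffling, the folds partition the data, so each sample should be used for testing exactly once.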

class sklearn.model_selection.GroupKFold

'''
Purpose: every sample must be assigned to a group, and the samples of one group never appear in two different folds. The folds are balanced by group. Note that the number of distinct groups must be at least the number of folds.
Parameters:
    n_splits: number of subsets (folds), default 3 (5 from scikit-learn 0.22 onward); must be at least 2.
'''
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
y = np.array([1, 2, 3, 4])
groups = np.array([0, 0, 2, 2])
group_kfold = GroupKFold(n_splits=2)
group_kfold.get_n_splits(X, y, groups)
# Output: 2
print(group_kfold)
# Output: GroupKFold(n_splits=2)
for train_index, test_index in group_kfold.split(X, y, groups):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    print(X_train, X_test, y_train, y_test)
'''
Output:
TRAIN: [0 1] TEST: [2 3]
[[1 2]
 [3 4]] [[5 6]
 [7 8]] [1 2] [3 4]
TRAIN: [2 3] TEST: [0 1]
[[5 6]
 [7 8]] [[1 2]
 [3 4]] [3 4] [1 2]
'''
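The property that a group never straddles the training and test sides can be checked directly. The following is only a sketch, with made-up group ids for 10 samples.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical data: 10 samples belonging to 4 groups.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)
groups = np.array([0, 0, 0, 1, 1, 2, 2, 2, 3, 3])

gkf = GroupKFold(n_splits=2)
overlaps = []
for train_index, test_index in gkf.split(X, y, groups):
    # Groups present on both sides of this split (expected: none).
    shared = set(groups[train_index]) & set(groups[test_index])
    overlaps.append(shared)

print(overlaps)
```

Each entry of overlaps should be an empty set.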

class sklearn.model_selection.StratifiedKFold

'''
Purpose: splits according to the sample labels, so that the label proportions in each fold are the same and equal to those of the whole dataset.
Parameters:
    n_splits: number of subsets (folds), default 3 (5 from scikit-learn 0.22 onward); must be at least 2;
    shuffle: bool, whether to shuffle before splitting; default False, i.e. samples are split in order without shuffling;
    random_state: random seed used for shuffling.
'''
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([0, 0, 1, 1])
skf = StratifiedKFold(n_splits=2)
print(skf.get_n_splits(X, y))
# Output: 2
print(skf)
# Output: StratifiedKFold(n_splits=2, random_state=None, shuffle=False)
for train_index, test_index in skf.split(X, y):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
'''
Output:
TRAIN: [1 3] TEST: [0 2]
TRAIN: [0 2] TEST: [1 3]
'''
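To see the per-fold label proportions on slightly larger data, one possible check is sketched below. The 12-sample dataset with a 2:1 class ratio is an assumption chosen so that the folds divide evenly.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical data: 8 samples of class 0 and 4 samples of class 1.
X = np.arange(24).reshape(12, 2)
y = np.array([0] * 8 + [1] * 4)

skf = StratifiedKFold(n_splits=4)
fold_counts = []
for train_index, test_index in skf.split(X, y):
    # Number of class-0 and class-1 samples in this test fold.
    fold_counts.append(np.bincount(y[test_index], minlength=2).tolist())

print(fold_counts)
```

Every test fold should contain two samples of class 0 and one of class 1, mirroring the 2:1 ratio of the full set.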

class sklearn.model_selection.ShuffleSplit

'''
Purpose: shuffles the samples and splits them according to the given train and test proportions, repeated a number of times.
Parameters:
    n_splits: number of shuffle-and-split iterations, default 10, i.e. how many times split() can be iterated;
    test_size: default 0.1 (in newer releases the default is None, which resolves to 0.1 when train_size is also None); see the corresponding parameter of train_test_split();
    train_size: default None; see the corresponding parameter of train_test_split();
    random_state: random seed used for sampling.
'''
import numpy as np
from sklearn.model_selection import ShuffleSplit

X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
y = np.array([1, 2, 1, 2])
rs = ShuffleSplit(n_splits=3, test_size=.25, random_state=0)
rs.get_n_splits(X)
# Output: 3
print(rs)
# Output: ShuffleSplit(n_splits=3, random_state=0, test_size=0.25, train_size=None)
for train_index, test_index in rs.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
'''
Output:
TRAIN: [3 1 0] TEST: [2]
TRAIN: [2 1 3] TEST: [0]
TRAIN: [0 2 1] TEST: [3]
'''
rs = ShuffleSplit(n_splits=3, train_size=0.5, test_size=.25, random_state=0)
for train_index, test_index in rs.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
'''
Output:
TRAIN: [3 1] TEST: [2]
TRAIN: [2 1] TEST: [0]
TRAIN: [0 2] TEST: [3]
'''
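Unlike KFold, each ShuffleSplit iteration is an independent random draw, so samples may be left out of a split when train_size and test_size do not add up to 1. The sketch below shows this on made-up data with 8 samples.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

# Hypothetical data: 8 samples with 2 features each.
X = np.arange(16).reshape(8, 2)
rs = ShuffleSplit(n_splits=3, train_size=0.5, test_size=0.25, random_state=0)

split_sizes = []
disjoint = []
for train_index, test_index in rs.split(X):
    split_sizes.append((len(train_index), len(test_index)))
    # Within one split, train and test never share a sample.
    disjoint.append(set(train_index).isdisjoint(test_index))

print(split_sizes)
print(disjoint)
```

With these settings each split uses 4 training and 2 test samples, leaving 2 samples unused in that iteration.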

class sklearn.model_selection.StratifiedShuffleSplit

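'''
Purpose: shuffles the samples and splits them according to the given train and test proportions, repeated a number of times, while keeping the label proportions in both train and test equal to those of the whole dataset. It combines StratifiedKFold and ShuffleSplit.
Parameters:
    n_splits: number of shuffle-and-split iterations, default 10, i.e. how many times split() can be iterated;
    test_size: default 0.1 (in newer releases the default is None, which resolves to 0.1 when train_size is also None); see the corresponding parameter of train_test_split();
    train_size: default None; see the corresponding parameter of train_test_split();
    random_state: random seed used for sampling.
'''
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([0, 0, 1, 1])
sss = StratifiedShuffleSplit(n_splits=3, test_size=0.5, random_state=0)
sss.get_n_splits(X, y)
# Output: 3
print(sss)
# Output: StratifiedShuffleSplit(n_splits=3, random_state=0, ...)
for train_index, test_index in sss.split(X, y):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
'''
Output:
TRAIN: [1 2] TEST: [3 0]
TRAIN: [0 2] TEST: [1 3]
TRAIN: [0 2] TEST: [3 1]
'''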
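A quick way to confirm that the label proportions survive every random split is sketched here. The imbalanced 8-sample dataset (6 of class 0, 2 of class 1) is an illustrative assumption.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Hypothetical data: 6 samples of class 0 and 2 samples of class 1.
X = np.arange(16).reshape(8, 2)
y = np.array([0] * 6 + [1] * 2)

sss = StratifiedShuffleSplit(n_splits=3, test_size=0.5, random_state=0)
train_counts, test_counts = [], []
for train_index, test_index in sss.split(X, y):
    # Per-class sample counts on each side of the split.
    train_counts.append(np.bincount(y[train_index], minlength=2).tolist())
    test_counts.append(np.bincount(y[test_index], minlength=2).tolist())

print(train_counts)
print(test_counts)
```

Each side of every split should hold three samples of class 0 and one of class 1, the same 3:1 ratio as the full dataset.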
