Machine Learning
Source: Internet  Editor: 程序博客网  Time: 2024/04/29 09:38
1. bagging
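The core idea of bagging: bootstrap sampling, i.e. drawing n (n ≤ m) samples at random, with replacement, from a dataset of m samples. Repeating this k times yields k different datasets that are used as training data.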
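The sampling step above can be sketched with NumPy. This is only an illustration: the helper name bootstrap_datasets, its parameters, and the toy arrays are assumptions made for the example, not part of the original post.

```python
import numpy as np


def bootstrap_datasets(X, y, k, n=None, seed=0):
    """Return k resampled copies of (X, y), each with n rows drawn with replacement."""
    rng = np.random.default_rng(seed)
    m = X.shape[0]
    n = m if n is None else n  # n <= m in the description above
    datasets = []
    for _ in range(k):
        idx = rng.integers(0, m, size=n)  # row indices, repeats allowed
        datasets.append((X[idx], y[idx]))
    return datasets


# Toy data (hypothetical): 10 samples, 2 features.
X = np.arange(20).reshape(10, 2)
y = np.arange(10) % 2
samples = bootstrap_datasets(X, y, k=3, n=8)
```

Because rows are drawn with replacement, one resample can contain the same original row several times and miss others entirely, which is what makes the k datasets differ from one another.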
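Random forest: random forest is the classic application of bagging. It goes one step further than plain bagging: besides bootstrap sampling the training samples, it also randomly samples the features (in the standard formulation, each split of each tree only considers a random subset of the features), which produces a forest made up of many decision trees.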
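A minimal sketch of a random forest with sklearn, assuming synthetic data from make_classification; the dataset and the parameter values are chosen only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary classification data, purely for illustration.
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           random_state=0)

forest = RandomForestClassifier(
    n_estimators=50,
    bootstrap=True,       # each tree is trained on a bootstrap sample of the rows
    max_features="sqrt",  # each split considers a random subset of the features
    random_state=0,
)
forest.fit(X, y)
train_accuracy = forest.score(X, y)
```

The two sources of randomness map directly onto the two arguments: bootstrap controls the row sampling, max_features controls the feature sampling.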
Advantages: easy to parallelize as an ensemble; reduces model variance; resists overfitting well.
# -*- coding: utf-8 -*-
"""Bagging implemented with sklearn."""
from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier


def load_data():
    """Placeholder: must return (train_x, train_y, test)."""
    raise NotImplementedError("load_data must return (train_x, train_y, test)")


def bagging(train_x, train_y, test):
    """Fit a bagged KNN ensemble and return its predictions for test."""
    clf = KNeighborsClassifier(n_neighbors=15)
    # Each of the 50 base learners is trained on 90% of the samples and
    # 90% of the features, both drawn with replacement.
    bagging_clf = BaggingClassifier(clf, n_estimators=50, max_samples=0.9,
                                    max_features=0.9, bootstrap=True,
                                    bootstrap_features=True, n_jobs=1,
                                    random_state=1)
    bagging_clf.fit(train_x, train_y)
    prediction = bagging_clf.predict(test)
    return prediction


def main():
    (train_x, train_y, test) = load_data()
    prediction = bagging(train_x, train_y, test)
    return prediction


if __name__ == "__main__":
    prediction = main()
2. boosting
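The core idea of boosting: adjust the sample distribution, so that each new learner concentrates on the samples the earlier learners got wrong.
The boosting procedure: predict, change the sample distribution; predict, change the sample distribution; ...; then combine the learners with a weighted vote.
In the accompanying figure, the bar charts on the left represent the changing sample distribution, and the triangle connections represent the weights used to combine the classification results.
Pros & cons: the "dragon-slaying sword" of the era before deep learning, with high accuracy; however, it overfits easily and is hard to parallelize.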
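The predict-then-reweight loop can be sketched with a small discrete AdaBoost. The post does not name a specific boosting algorithm, so AdaBoost with decision stumps is an assumption here, as are the helper names adaboost_fit and adaboost_predict and the synthetic data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier


def adaboost_fit(X, y, n_rounds=10):
    """Discrete AdaBoost sketch; y must contain the labels -1 and +1."""
    n = X.shape[0]
    w = np.full(n, 1.0 / n)  # start from a uniform sample distribution
    learners, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        # Weighted error, clipped to keep the logarithm finite.
        err = np.clip(w[pred != y].sum(), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)  # weight of this learner
        w = w * np.exp(-alpha * y * pred)      # raise weights of mistakes
        w = w / w.sum()                        # renormalize the distribution
        learners.append(stump)
        alphas.append(alpha)
    return learners, np.array(alphas)


def adaboost_predict(learners, alphas, X):
    """Weighted vote of all learners."""
    votes = sum(a * clf.predict(X) for a, clf in zip(alphas, learners))
    return np.where(votes >= 0, 1, -1)


# Synthetic data, purely for illustration; map labels {0, 1} to {-1, +1}.
X, y01 = make_classification(n_samples=200, n_features=10, random_state=0)
y = 2 * y01 - 1
learners, alphas = adaboost_fit(X, y, n_rounds=10)
pred = adaboost_predict(learners, alphas, X)
```

Each round depends on the weights produced by the previous one, which is why the rounds cannot simply be trained in parallel the way bagged learners can.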
3. stacking
The core of stacking: make predictions on the training set and use them to build a higher-level learner.
3.1 stacking training procedure
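1) Split the training set. Split the training data randomly and roughly evenly into m parts.
2) Train models on the split training set and predict on the test set at the same time. Train on m-1 parts and predict the remaining part; at the same time, use the model trained on those same m-1 parts to predict the real test set. Repeat this m times, stack the m sets of training-set predictions into 1 column, and merge the m sets of test-set predictions into 1 column by averaging (or, for class labels, by majority vote).
3) Repeat step 2 with k classifiers. This yields k columns of training-set predictions and k columns of test-set predictions.
4) Train on the data obtained in step 3. Fit the second-level learner on the k columns of training-set predictions together with the true training labels, and use the k columns of test-set predictions as its test set.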
# -*- coding: utf-8 -*-
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
import xgboost as xgb
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.linear_model import LogisticRegression


def load_data():
    """Placeholder: must return (train_x, train_y, test) as NumPy arrays."""
    raise NotImplementedError("load_data must return (train_x, train_y, test)")


def stacking(train_x, train_y, test):
    """Stacking.

    input: train_x, train_y, test (NumPy arrays; labels are assumed to be 0/1)
    output: predictions for test
    clfs: the 5 first-level classifiers
    dataset_blend_train: first-level predictions, train_x of the second level
    dataset_blend_test: test input of the second-level classifier
    """
    # 5 first-level classifiers
    clfs = [SVC(C=3, kernel="rbf"),
            RandomForestClassifier(n_estimators=100, max_features="log2",
                                   max_depth=10, min_samples_leaf=1,
                                   bootstrap=True, n_jobs=-1, random_state=1),
            KNeighborsClassifier(n_neighbors=15, n_jobs=-1),
            xgb.XGBClassifier(n_estimators=100, objective="binary:logistic",
                              gamma=1, max_depth=10, subsample=0.8,
                              n_jobs=-1, random_state=1),
            ExtraTreesClassifier(n_estimators=100, criterion="gini",
                                 max_features="log2", max_depth=10,
                                 min_samples_split=2, min_samples_leaf=1,
                                 bootstrap=True, n_jobs=-1, random_state=1)]

    # train_x and test of the second-level classifier
    # (np.int was removed from NumPy; the builtin int is used instead)
    dataset_blend_train = np.zeros((train_x.shape[0], len(clfs)), dtype=int)
    dataset_blend_test = np.zeros((test.shape[0], len(clfs)), dtype=int)

    # Each of the 5 classifiers predicts with 8 folds
    n_folds = 8
    skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=1)
    for i, clf in enumerate(clfs):
        # Test-set predictions of this classifier, one column per fold
        dataset_blend_test_j = np.zeros((test.shape[0], n_folds))
        for j, (train_index, test_index) in enumerate(skf.split(train_x, train_y)):
            tr_x = train_x[train_index]
            tr_y = train_y[train_index]
            clf.fit(tr_x, tr_y)
            # Out-of-fold predictions fill this classifier's training column
            dataset_blend_train[test_index, i] = clf.predict(train_x[test_index])
            dataset_blend_test_j[:, j] = clf.predict(test)
        # Majority vote over the folds (valid for 0/1 labels only):
        # the result is 1 when more than half of the folds predicted 1
        dataset_blend_test[:, i] = dataset_blend_test_j.sum(axis=1) // (n_folds // 2 + 1)

    # Second-level classifier; the L1 penalty needs a solver that supports it,
    # such as liblinear (the default lbfgs solver does not)
    clf = LogisticRegression(penalty="l1", solver="liblinear", tol=1e-6,
                             C=1.0, random_state=1)
    clf.fit(dataset_blend_train, train_y)
    prediction = clf.predict(dataset_blend_test)
    return prediction


def main():
    (train_x, train_y, test) = load_data()
    prediction = stacking(train_x, train_y, test)
    return prediction


if __name__ == "__main__":
    prediction = main()
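For comparison, sklearn also ships a StackingClassifier that builds the second-level training data from cross-validated predictions. The sketch below is an alternative to the hand-written loop, not the same procedure: as far as I know, it refits each first-level classifier on the full training set for test-time predictions instead of combining per-fold test predictions, and by default it feeds class probabilities rather than hard labels to the second level. The synthetic data and the choice of two base learners are assumptions for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification data, purely for illustration.
X, y = make_classification(n_samples=400, n_features=20, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=1)),
        ("knn", KNeighborsClassifier(n_neighbors=15)),
    ],
    final_estimator=LogisticRegression(),  # second-level learner
    cv=5,                                  # folds for out-of-fold predictions
)
stack.fit(X_train, y_train)
stack_prediction = stack.predict(X_test)
```

The hand-written version keeps full control over how fold predictions are merged; the built-in class trades that control for less code and compatibility with sklearn pipelines and model selection tools.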