Python Machine Learning chap3
Source: Internet | Editor: 程序博客网 | Date: 2024/06/04 19:16
Chapter 3
A Tour of Machine Learning Classifiers Using Scikit-learn
Three topics:
- Introduction to popular classification algorithms
- Using the scikit-learn machine learning library
- Questions to ask when selecting a machine learning algorithm
Choosing a classification algorithm
The five main steps in training a machine learning algorithm:
- Selection of features.
- Choosing a performance metric.
- Choosing a classifier and optimization algorithm.
- Evaluating the performance of the model.
- Tuning the algorithm.
First steps with scikit-learn
Now we will take a look at the scikit-learn API, which **combines a user-friendly interface with a highly optimized implementation of several classification algorithms**.
However, the scikit-learn library offers not only a large variety of learning algorithms, but also many convenient functions to preprocess data and to fine-tune and evaluate our models.
Training a perceptron via scikit-learn
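We use the Iris dataset again. The petal length and petal width of all 150 samples form the feature matrix X, and the corresponding class labels form the target vector y:
>>> from sklearn import datasets
>>> import numpy as np
>>> iris = datasets.load_iris()
>>> X = iris.data[:, [2, 3]]
>>> y = iris.target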
If we run np.unique(y), it returns the distinct class labels stored in iris.target. We can see that the class names Iris-Setosa, Iris-Versicolor, and Iris-Virginica are encoded as the integers 0, 1, and 2.
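The label encoding described above can be checked directly. This is a minimal sketch that only assumes the bundled Iris dataset; the variable names `labels` and `names` are chosen here for illustration.

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()
y = iris.target

labels = np.unique(y)        # distinct integer class labels in iris.target
names = iris.target_names    # class names, indexed by integer label

# Print each integer label next to the class name it stands for
for label in labels:
    print(label, names[label])
```

Storing class labels as integers is the common convention for scikit-learn estimators, and `target_names` keeps the mapping back to readable names.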
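To evaluate how well the trained model performs on unseen data, we split the dataset into a test set (30%, 45 samples) and a training set (70%, 105 samples):
>>> # train_test_split randomly splits the dataset into a training set and a test set
>>> # (scikit-learn releases before 0.18 provide it in sklearn.cross_validation instead)
>>> from sklearn.model_selection import train_test_split
>>> X_train, X_test, y_train, y_test = train_test_split(
...     X, y, test_size=0.3, random_state=0)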
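As a quick sanity check of the 70/30 split, the shapes of the resulting arrays can be inspected. This sketch assumes a scikit-learn release that provides `sklearn.model_selection`.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X = iris.data[:, [2, 3]]   # petal length and petal width
y = iris.target

# Hold out 30% of the 150 samples for testing; random_state fixes the shuffle
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

print(X_train.shape, X_test.shape)
```

Fixing `random_state` makes the split reproducible, so repeated runs evaluate on the same 45 held-out samples.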
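Feature standardization:
>>> from sklearn.preprocessing import StandardScaler
>>> sc = StandardScaler()
>>> sc.fit(X_train)  # fit estimates the parameters $\mu$ (sample mean) and $\sigma$ (standard deviation) of each feature
>>> X_train_std = sc.transform(X_train)  # standardize training and test data with the same $\mu$ and $\sigma$
>>> X_test_std = sc.transform(X_test)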
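To make the role of $\mu$ and $\sigma$ explicit, the sketch below compares StandardScaler with a manual computation of (x - mu) / sigma. The small arrays are made-up illustrative values, not Iris data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: 3 training samples and 1 test sample, 2 features each
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])
X_test = np.array([[2.0, 30.0]])

sc = StandardScaler().fit(X_train)

# Per-feature mean and (population) standard deviation of the training set
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)

# Standardize the test sample by hand, using the training-set statistics
manual = (X_test - mu) / sigma

print(np.allclose(manual, sc.transform(X_test)))
```

The test data is transformed with the parameters estimated on the training data, so that values in both sets remain comparable.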
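Most algorithms in scikit-learn support multiclass classification out of the box, by default via the One-vs.-Rest (OvR) method, so we can feed all three flower classes to the perceptron at once:
>>> from sklearn.linear_model import Perceptron
>>> # max_iter replaces the n_iter argument of older scikit-learn releases
>>> ppn = Perceptron(max_iter=40, eta0=0.1, random_state=0)
>>> ppn.fit(X_train_std, y_train)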
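The Perceptron in scikit-learn closely resembles the perceptron we implemented ourselves earlier. eta0 corresponds to the eta we used before, and both denote the learning rate. n_iter (named max_iter in newer scikit-learn releases) is the number of epochs, that is, passes over the training set. random_state makes the shuffling of the training data after each epoch reproducible.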
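One way to see the One-vs.-Rest scheme at work is to inspect the fitted model: it holds one weight vector per class, and the predicted class is the one with the highest decision score. This is a minimal sketch that fits on all 150 standardized samples and leaves the number of epochs at its default for brevity, so it is not the exact model trained above.

```python
from sklearn import datasets
from sklearn.linear_model import Perceptron
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X = StandardScaler().fit_transform(iris.data[:, [2, 3]])
y = iris.target

ppn = Perceptron(eta0=0.1, random_state=0)
ppn.fit(X, y)

# One row of weights and one bias term per class (3 classes, 2 features)
print(ppn.coef_.shape, ppn.intercept_.shape)

# Each column of the decision scores belongs to one binary "class vs. rest" model
scores = ppn.decision_function(X)
predicted = scores.argmax(axis=1)   # index of the highest-scoring class
print((predicted == ppn.predict(X)).all())
```

Because the Iris labels are already 0, 1, and 2, the column index of the highest score coincides with the predicted label.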
Making predictions:
>>> y_pred = ppn.predict(X_test_std)
>>> print('Misclassified samples: %d' % (y_test != y_pred).sum())
Misclassified samples: 4
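Scikit-learn also implements a large variety of performance metrics. For example, we can calculate the classification accuracy on the test set:
>>> from sklearn.metrics import accuracy_score
>>> print('Accuracy: %.2f' % accuracy_score(y_test, y_pred))
Accuracy: 0.91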
y_test and y_pred are the true class labels and the predicted class labels of the test set, respectively.
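The link between the misclassification count and the accuracy is simple arithmetic: accuracy = 1 - errors / number of test samples. The sketch below reproduces the 4-out-of-45 case with made-up label vectors (`y_true` and `y_hat` are hypothetical, not the actual Iris predictions).

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels: 45 test samples, 15 per class
y_true = np.array([0] * 15 + [1] * 15 + [2] * 15)

# Copy the true labels and flip four of them to simulate 4 mistakes
y_hat = y_true.copy()
y_hat[[0, 16, 17, 40]] = [1, 2, 2, 1]

n_errors = (y_true != y_hat).sum()
accuracy = 1 - n_errors / len(y_true)   # 41 correct out of 45

print('Misclassified samples: %d' % n_errors)
print('Accuracy: %.2f' % accuracy)
print(np.isclose(accuracy, accuracy_score(y_true, y_hat)))
```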
Visualizing the results:
from matplotlib.colors import ListedColormap
import matplotlib.pyplot as plt
import numpy as np

def plot_decision_regions(X, y, classifier, test_idx=None, resolution=0.02):
    # set up marker generator and color map
    markers = ('s', 'x', 'o', '^', 'v')
    colors = ('red', 'blue', 'lightgreen', 'gray', 'cyan')
    cmap = ListedColormap(colors[:len(np.unique(y))])

    # plot the decision surface by predicting every point on a fine grid
    x1_min, x1_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    x2_min, x2_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, resolution),
                           np.arange(x2_min, x2_max, resolution))
    Z = classifier.predict(np.array([xx1.ravel(), xx2.ravel()]).T)
    Z = Z.reshape(xx1.shape)
    plt.contourf(xx1, xx2, Z, alpha=0.4, cmap=cmap)
    plt.xlim(xx1.min(), xx1.max())
    plt.ylim(xx2.min(), xx2.max())

    # plot all samples, one marker and color per class
    for idx, cl in enumerate(np.unique(y)):
        plt.scatter(x=X[y == cl, 0], y=X[y == cl, 1],
                    alpha=0.8, color=cmap(idx),
                    marker=markers[idx], label=cl)

    # highlight test samples with hollow circles
    if test_idx is not None:
        X_test = X[test_idx, :]
        plt.scatter(X_test[:, 0], X_test[:, 1], facecolors='none',
                    edgecolors='black', alpha=1.0, linewidths=1,
                    marker='o', s=55, label='test set')
Now we specify the indices of the samples that we want to mark on the resulting plot:
>>> X_combined_std = np.vstack((X_train_std, X_test_std))
>>> y_combined = np.hstack((y_train, y_test))
>>> plot_decision_regions(X=X_combined_std,
...                       y=y_combined,
...                       classifier=ppn,
...                       test_idx=range(105, 150))
>>> plt.xlabel('petal length [standardized]')
>>> plt.ylabel('petal width [standardized]')
>>> plt.legend(loc='upper left')
>>> plt.show()
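The perceptron algorithm never converges on datasets that are not perfectly linearly separable, which is why it is generally not recommended in practice.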
Modeling class probabilities via logistic regression
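As noted above, the perceptron never converges when the classes are not perfectly linearly separable. Intuitively, the weights keep being updated because every epoch contains at least one misclassified sample.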
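The non-convergence can be observed on the smallest dataset that is not linearly separable, the XOR problem. The sketch below is a simplified perceptron written for this illustration (the function `train_perceptron` and its parameters are not from the original text); it counts how many weight updates happen in each epoch.

```python
import numpy as np

# XOR: no straight line separates the two classes
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, -1])

def train_perceptron(X, y, eta=0.1, n_epochs=50):
    """Train a basic perceptron and record the number of updates per epoch."""
    w = np.zeros(X.shape[1] + 1)   # w[0] is the bias unit
    updates_per_epoch = []
    for _ in range(n_epochs):
        updates = 0
        for xi, target in zip(X, y):
            prediction = 1 if w[0] + xi @ w[1:] >= 0.0 else -1
            delta = eta * (target - prediction)
            if delta != 0.0:       # misclassified sample: adjust the weights
                w[1:] += delta * xi
                w[0] += delta
                updates += 1
        updates_per_epoch.append(updates)
    return w, updates_per_epoch

w, updates_per_epoch = train_perceptron(X, y)
print(min(updates_per_epoch))   # stays above zero: no error-free epoch occurs
```

An epoch without any update would mean the current weights classify all four points correctly, which is impossible for XOR, so the update count never drops to zero no matter how many epochs are run.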
Logistic regression intuition and conditional probabilities
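The odds ratio is the odds in favor of a particular event. It can be written as $\frac{p}{1-p}$, where $p$ stands for the probability of the positive event.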
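To make the odds ratio concrete, here is a small numeric sketch. It assumes the usual definition odds = p / (1 - p), with p the probability of the positive event; the helper name `odds_ratio` is chosen for this illustration.

```python
import math

def odds_ratio(p):
    """Odds in favor of an event that occurs with probability p (0 <= p < 1)."""
    return p / (1.0 - p)

# A probability of 0.5 gives even odds; higher probabilities give odds above 1
for p in (0.2, 0.5, 0.8):
    print('p = %.1f -> odds = %.2f' % (p, odds_ratio(p)))
```

For example, p = 0.8 corresponds to odds of about 4, meaning the event is four times as likely to happen as not.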