Python Machine Learning Chapter 3


Chapter 3

A Tour of Machine Learning Classifiers Using Scikit-learn

Three topics:
- An overview of popular classification algorithms
- Using the scikit-learn machine learning library
- Questions to ask when selecting a machine learning algorithm

Choosing a classification algorithm

The five main steps in training a machine learning algorithm:

  1. Selection of features.
  2. Choosing a performance metric.
  3. Choosing a classifier and optimization algorithm.
  4. Evaluating the performance of the model.
  5. Tuning the algorithm.

First steps with scikit-learn

Now we will take a look at the scikit-learn API, which **combines a user-friendly interface with a highly optimized
implementation of several classification algorithms**.

The scikit-learn library offers not only a large variety of learning algorithms, but also many convenient functions to preprocess data and to fine-tune and evaluate our models.

Training a perceptron via scikit-learn

The dataset is again Iris: we assign the petal length and petal width of the 150 flower samples to the feature matrix X, and the corresponding class labels to the label vector y:

>>> from sklearn import datasets
>>> import numpy as np
>>> iris = datasets.load_iris()
>>> X = iris.data[:, [2, 3]]
>>> y = iris.target

Running np.unique(y) returns the distinct class labels stored in iris.target; we can see that the class names Iris-Setosa, Iris-Versicolor, and Iris-Virginica have already been encoded as the integers 0, 1, and 2.
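As a quick check (a minimal sketch added here; the output is what the standard Iris dataset yields):

>>> np.unique(y)
array([0, 1, 2])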
To evaluate how well the trained model generalizes to unseen data, we split the dataset into a test set (30%, 45 samples) and a training set (70%, 105 samples):

>>> # train_test_split randomly splits the dataset into a training set (70%,
>>> # 105 samples) and a test set (30%, 45 samples); note that in
>>> # scikit-learn >= 0.18 it lives in sklearn.model_selection instead of
>>> # sklearn.cross_validation
>>> from sklearn.cross_validation import train_test_split
>>> X_train, X_test, y_train, y_test = train_test_split(
...     X, y, test_size=0.3, random_state=0)
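To verify the 70/30 split, we can inspect the resulting array shapes (a small sanity check, not part of the book's listing):

>>> X_train.shape, X_test.shape
((105, 2), (45, 2))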

Feature standardization:

>>> from sklearn.preprocessing import StandardScaler
>>> sc = StandardScaler()
>>> sc.fit(X_train)  # estimate the parameters mu (mean) and sigma (std) from the training data
>>> X_train_std = sc.transform(X_train)  # standardize training and test data
>>> X_test_std = sc.transform(X_test)    # using the same mu and sigma
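What fit and transform compute can be reproduced by hand (a minimal sketch; StandardScaler uses the per-feature mean and the population standard deviation, i.e. ddof=0, which matches NumPy's default):

>>> mu, sigma = X_train.mean(axis=0), X_train.std(axis=0)
>>> np.allclose(X_train_std, (X_train - mu) / sigma)
True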

Most algorithms in scikit-learn support multiclass classification out of the box, by default via the One-vs.-Rest (OvR) method, which lets us feed the perceptron all three flower classes at once:

>>> from sklearn.linear_model import Perceptron
>>> ppn = Perceptron(n_iter=40, eta0=0.1, random_state=0)
>>> ppn.fit(X_train_std, y_train)
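Because OvR trains one binary classifier per class, the fitted model holds one weight vector per class. A quick check of the coefficient matrix, added here for illustration, confirms this:

>>> ppn.coef_.shape  # (n_classes, n_features): three binary classifiers, two features each
(3, 2)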

The scikit-learn Perceptron is very similar to the perceptron we implemented ourselves earlier: eta0 corresponds to our eta and denotes the learning rate, and n_iter is likewise the number of epochs (passes over the training set). random_state makes the shuffling of the training dataset after each epoch reproducible.
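One caveat for readers on newer library versions (a hedged note: in scikit-learn >= 0.21 the n_iter parameter was removed in favor of max_iter, which interacts with a tol stopping criterion, so results may differ slightly):

>>> ppn = Perceptron(max_iter=40, eta0=0.1, random_state=0)
>>> ppn.fit(X_train_std, y_train)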

Making predictions:

>>> y_pred = ppn.predict(X_test_std)
>>> print('Misclassified samples: %d' % (y_test != y_pred).sum())
Misclassified samples: 4

Scikit-learn also implements a large variety of performance metrics. For example, we can compute the classification accuracy on the test set:

>>> from sklearn.metrics import accuracy_score
>>> print('Accuracy: %.2f' % accuracy_score(y_test, y_pred))
Accuracy: 0.91

Here, y_test contains the true class labels of the test set and y_pred the predicted labels.
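Equivalently, accuracy is simply the fraction of correct predictions, i.e. one minus the misclassification error (a minimal sketch):

>>> (y_test == y_pred).mean()  # 41 of the 45 test samples are classified correctly
0.9111111111111111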

Visualizing the decision regions:

from matplotlib.colors import ListedColormap
import matplotlib.pyplot as plt

def plot_decision_regions(X, y, classifier, test_idx=None, resolution=0.02):
    # setup marker generator and color map
    markers = ('s', 'x', 'o', '^', 'v')
    colors = ('red', 'blue', 'lightgreen', 'gray', 'cyan')
    cmap = ListedColormap(colors[:len(np.unique(y))])

    # plot the decision surface
    x1_min, x1_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    x2_min, x2_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, resolution),
                           np.arange(x2_min, x2_max, resolution))
    Z = classifier.predict(np.array([xx1.ravel(), xx2.ravel()]).T)
    Z = Z.reshape(xx1.shape)
    plt.contourf(xx1, xx2, Z, alpha=0.4, cmap=cmap)
    plt.xlim(xx1.min(), xx1.max())
    plt.ylim(xx2.min(), xx2.max())

    # plot all samples
    for idx, cl in enumerate(np.unique(y)):
        plt.scatter(x=X[y == cl, 0], y=X[y == cl, 1],
                    alpha=0.8, c=cmap(idx),
                    marker=markers[idx], label=cl)

    # highlight test samples
    if test_idx:
        X_test, y_test = X[test_idx, :], y[test_idx]
        plt.scatter(X_test[:, 0], X_test[:, 1], c='',
                    alpha=1.0, linewidths=1, marker='o',
                    s=55, label='test set')

Now we specify the indices of the test samples that we want to mark on the resulting plot:

>>> X_combined_std = np.vstack((X_train_std, X_test_std))
>>> y_combined = np.hstack((y_train, y_test))
>>> plot_decision_regions(X=X_combined_std,
...                       y=y_combined,
...                       classifier=ppn,
...                       test_idx=range(105, 150))
>>> plt.xlabel('petal length [standardized]')
>>> plt.ylabel('petal width [standardized]')
>>> plt.legend(loc='upper left')
>>> plt.show()

On a dataset that is not perfectly linearly separable, the perceptron algorithm never converges, which is why its use is generally not recommended in practice.

Modeling class probabilities via logistic regression

As mentioned above, the perceptron never converges on such data. Intuitively, because the weights are updated continuously, every epoch contains at least one misclassified sample that triggers yet another weight update.
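We can observe this behavior directly by training one epoch at a time and counting the training-set misclassifications (a hypothetical illustration using Perceptron.partial_fit, not code from the book):

>>> ppn = Perceptron(eta0=0.1, random_state=0)
>>> errors = []
>>> for epoch in range(15):
...     # each partial_fit call performs one pass over the training data
...     ppn.partial_fit(X_train_std, y_train, classes=np.unique(y))
...     errors.append((ppn.predict(X_train_std) != y_train).sum())
>>> errors  # the error count keeps fluctuating instead of settling at 0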

Logistic regression intuition and conditional probabilities

The odds ratio, literally the odds in favor of a particular event, can be written as

$$\frac{p}{1-p}$$

where $p$ stands for the probability of the event we want to predict.
