Support Vector Machines in Principle and Practice (Part 2): Using SVM in scikit-learn

Source: Internet. Editor: 程序博客网. Time: 2024/05/21 08:51

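In my previous post, Support Vector Machines (SVM) in Principle and Practice (Part 1), I covered the main theory behind support vector machines. In this post I walk through using SVM in scikit-learn, a very widely used machine learning library for Python. I worked through most of the examples in its official documentation. The examples also rely on numpy and matplotlib; I had not used either library in depth, so I took the opportunity to study them as well. A few points remain unclear to me, but fortunately they do not matter much. Here are the key learning resources: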

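The scikit-learn website, where the official documentation can be found; the documentation can be downloaded here.
numpy functions can be looked up here; this page is very convenient for searching.
matplotlib.pyplot functions can be looked up here. Presenting results well is important, and matplotlib handles that job nicely. The page linked above has a Quick Search box on the right that is very handy.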

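There are many SVM examples, so I can only pick a few. For the less important parts I simply add comments; the more important parts get a fuller explanation.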

Example 1

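    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn import svm

    xx, yy = np.meshgrid(np.linspace(-3, 3, 500),
                         np.linspace(-3, 3, 500))
    # linspace generates an evenly spaced ndarray: the first two arguments are
    # the start and end of the interval, the third is the number of elements.
    # arange is similar, but its third argument is the step size.
    # meshgrid generates coordinates; the examples mostly use it for plotting,
    # so only the 2-D case is considered here.
    # meshgrid(a, b) returns xx, yy: every row of xx is the vector a, repeated
    # len(b) times; every column of yy is the vector b, repeated len(a) times.

    np.random.seed(0)
    X = np.random.randn(300, 2)
    # randn draws from the standard normal distribution; the two numbers give
    # the shape of the generated matrix
    Y = np.logical_xor(X[:, 0] > 0, X[:, 1] > 0)
    # element-wise XOR

    # fit the model
    clf = svm.NuSVC()
    # create a NuSVC estimator; it has not been trained yet
    clf.fit(X, Y)
    # train on the data; most estimators have a fit method

    # plot the decision function for each datapoint on the grid
    Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    # ravel() flattens a multi-dimensional array into a 1-D array. c_ joins the
    # elements at the same position of two ndarrays, which here builds one
    # coordinate pair per grid point. Example:
    # >>> np.c_[np.array([1, 2, 3]), np.array([4, 5, 6])]
    # array([[1, 4],
    #        [2, 5],
    #        [3, 6]])
    # Note that c_ is followed directly by [].
    # decision_function returns the signed value of the decision function,
    # which is proportional to the signed distance from the point to the
    # hyperplane.
    # Finally, reshape(tuple) turns the 1-D array back into the given shape.

    plt.imshow(Z, interpolation='nearest',
               extent=(xx.min(), xx.max(), yy.min(), yy.max()), aspect='auto',
               origin='lower', cmap=plt.cm.PuOr_r)
    # draw the image; cmap selects the color scheme
    contours = plt.contour(xx, yy, Z, levels=[0], linewidths=2,
                           linestyles='--')
    # contour draws contour lines. About the levels parameter: "A list of
    # floating point numbers indicating the level curves to draw, in increasing
    # order; e.g., to draw just the zero contour pass levels=[0]".
    # As I understand it, this is the value on the right-hand side of wx+b=0
    # once the SVM is fitted: 0 is the hyperplane, +1 and -1 are the planes
    # through the support vectors. My experiments suggest this reading is right.
    plt.scatter(X[:, 0], X[:, 1], s=30, c=Y, cmap=plt.cm.Paired,
                edgecolors='k')
    # scatter plot. I did not fully follow the API description of c ("c can be
    # a single color format string, or a sequence of color specifications of
    # length N, or a sequence of N numbers to be mapped to colors using the
    # cmap and norm specified via kwargs"), but in my experiments passing the
    # labels colors the points of each class differently.
    plt.xticks(())
    plt.yticks(())
    plt.axis([-3, 3, -3, 3])
    plt.show()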
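The array helpers annotated in the comments above are easier to follow on a grid small enough to print. The following is a minimal sketch, not part of the original example: the axes `a` and `b` and the `values` array are made-up illustrations of what `meshgrid`, `ravel`, `np.c_` and `reshape` do.

```python
import numpy as np

a = np.linspace(-1, 1, 3)   # 3 x-coordinates: -1, 0, 1 (illustrative)
b = np.linspace(0, 1, 2)    # 2 y-coordinates: 0, 1 (illustrative)

# Both outputs have shape (len(b), len(a)) = (2, 3):
# every row of xx is a copy of a, every column of yy is a copy of b.
xx, yy = np.meshgrid(a, b)

# One (x, y) pair per grid node, shape (6, 2)
points = np.c_[xx.ravel(), yy.ravel()]
print(points)

# Any per-point result can be folded back onto the grid with reshape
values = points.sum(axis=1).reshape(xx.shape)
print(values)
```

This is the same pattern the example uses to evaluate `decision_function` on every grid node and then reshape the result for plotting.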

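Let's look at NuSVC next. Not because it is especially important, but because it appears in the first example and several of its parameters are shared by the other estimators in the SVM module.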

class sklearn.svm.NuSVC(nu=0.5, kernel='rbf', degree=3, gamma='auto', coef0=0.0, shrinking=True, probability=False, tol=0.001, cache_size=200, class_weight=None, verbose=False, max_iter=-1, decision_function_shape='ovr', random_state=None)
Nu-Support Vector Classification.
Similar to SVC but uses a parameter to control the number of support vectors.
The implementation is based on libsvm.

SVC stands for Support Vector Classification.
The main parameters above are described below:
nu:float, optional (default=0.5)
An upper bound on the fraction of training errors and a lower bound of the fraction of support vectors. Should be in the interval (0, 1].
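nu is an upper bound on the fraction of training errors and a lower bound on the fraction of support vectors (presumably the support vectors actually used). By this reading, a smaller nu means a smaller training error.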
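The lower bound on the fraction of support vectors can be observed directly. This is a small sketch under assumptions of my own: it reuses the XOR data from Example 1, picks three arbitrary values of nu, and fixes `gamma=0.5` (which is 1/n_features, the value the 'auto' default quoted above would give) so the result does not depend on the library's default.

```python
import numpy as np
from sklearn.svm import NuSVC

# Same XOR-style data as in Example 1
rng = np.random.RandomState(0)
X = rng.randn(300, 2)
y = np.logical_xor(X[:, 0] > 0, X[:, 1] > 0)

fractions = {}
for nu in (0.1, 0.3, 0.5):          # arbitrary illustrative values
    clf = NuSVC(nu=nu, gamma=0.5).fit(X, y)
    # support_ holds the indices of the training samples kept as support vectors
    fractions[nu] = len(clf.support_) / len(X)
    print(f"nu={nu}: fraction of support vectors = {fractions[nu]:.2f}")
```

Each printed fraction should be at least as large as the corresponding nu.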

kernel:string, optional (default='rbf')
Specifies the kernel type to be used in the algorithm. It must be one of 'linear', 'poly', 'rbf', 'sigmoid', 'precomputed' or a callable. If none is given, 'rbf' will be used. If a callable is given it is used to precompute the kernel matrix.
This selects the kernel function. The default 'rbf' is the Gaussian kernel and 'poly' is the polynomial kernel.

degree:int, optional (default=3)
Degree of the polynomial kernel function (‘poly’). Ignored by all other kernels.
Only the polynomial kernel has a degree, namely the exponent outside the parentheses.

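gamma:float, optional (default='auto')
Kernel coefficient for 'rbf', 'poly' and 'sigmoid'. If gamma is 'auto' then 1/n_features will be used instead.
gamma is arguably a very important parameter: it determines how far the influence of a single training sample reaches. In practice the rbf kernel is written slightly differently from the textbook Gaussian kernel; in scikit-learn the rbf function is: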

$$\exp\left(-\gamma \lVert u - v \rVert^2\right)$$

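gamma changes the geometry of the feature space and, with it, the number of support vectors; for rbf, a very large gamma typically turns almost every training sample into a support vector. The number of support vectors in turn affects the speed of training and prediction. Incidentally, the polynomial kernel has the form: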

$$\left(\gamma \, u^\top v + \mathrm{coef0}\right)^{\mathrm{degree}}$$
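Both kernel formulas can be checked numerically against scikit-learn's pairwise kernel helpers. The sketch below is only an illustration: the random matrices `U` and `V` and the parameter values are arbitrary choices, not taken from the examples.

```python
import numpy as np
from sklearn.metrics.pairwise import polynomial_kernel, rbf_kernel

rng = np.random.RandomState(0)
U = rng.randn(4, 3)   # 4 arbitrary samples with 3 features
V = rng.randn(5, 3)   # 5 arbitrary samples with 3 features
gamma, coef0, degree = 0.5, 1.0, 3

# RBF: exp(-gamma * ||u - v||^2) for every pair (u, v)
sq_dist = ((U[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)
rbf_manual = np.exp(-gamma * sq_dist)

# Polynomial: (gamma * <u, v> + coef0) ** degree
poly_manual = (gamma * (U @ V.T) + coef0) ** degree

rbf_lib = rbf_kernel(U, V, gamma=gamma)
poly_lib = polynomial_kernel(U, V, degree=degree, gamma=gamma, coef0=coef0)

print(np.allclose(rbf_manual, rbf_lib))
print(np.allclose(poly_manual, poly_lib))
```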

class_weight : {dict, 'balanced'}, optional
Set the parameter C of class i to class_weight[i]*C for SVC. If not given, all classes are supposed to have weight one. The "balanced" mode uses the values of y to automatically adjust weights inversely proportional to class frequencies as n_samples / (n_classes * np.bincount(y))
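This parameter is meant for data whose classes differ greatly in size: the class with fewer samples can be given a larger weight, for example class_weight = {1:10}, where 1 is the class label and 10 is the weight.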
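To make the "balanced" formula concrete, here is a small sketch that evaluates it by hand and compares it with scikit-learn's `compute_class_weight` helper. The 90/10 label vector is a hypothetical example chosen for illustration.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y = np.array([0] * 90 + [1] * 10)   # hypothetical imbalanced labels
classes = np.unique(y)

# n_samples / (n_classes * np.bincount(y)), as quoted from the documentation
manual = len(y) / (len(classes) * np.bincount(y))
balanced = compute_class_weight("balanced", classes=classes, y=y)

print(manual)
print(balanced)
```

The rare class receives the larger weight, which is the same effect as passing a hand-written dictionary such as `{1: 10}`.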

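The remaining parameters are probably less important; look them up in the API section of the official documentation when needed.
The final output should look like this:
[Figure: NuSVC decision function and zero-level contour on the XOR data]
I took the figure directly from the documentation. The whole problem is an XOR problem, which a linear SVM cannot solve; a Gaussian kernel is used here. The line in the figure is the hyperplane, also called the decision boundary.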
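The claim that a linear SVM cannot handle XOR while a Gaussian kernel can is easy to check by comparing training accuracy. This is a rough sketch on the same data as Example 1; using `SVC` rather than `NuSVC` and fixing `gamma=0.5` are my own choices for the illustration.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.randn(300, 2)
y = np.logical_xor(X[:, 0] > 0, X[:, 1] > 0)

acc = {}
for kernel in ("linear", "rbf"):
    # gamma is ignored by the linear kernel
    clf = SVC(kernel=kernel, gamma=0.5).fit(X, y)
    acc[kernel] = clf.score(X, y)   # mean accuracy on the training data
    print(f"{kernel}: training accuracy = {acc[kernel]:.2f}")
```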

Example 2: Comparing different kernels

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn import svm

    X = np.c_[(.4, -.7), (-1.5, -1), (-1.4, -.9), (-1.3, -1.2),
              (-1.1, -.2), (-1.2, -.4), (-.5, 1.2), (-1.5, 2.1),
              (1, 1), (1.3, .8), (1.2, .5), (.2, -2),
              (.5, -2.4), (.2, -2.3), (0, -2.7), (1.3, 2.1)].T
    # c_ stacks the tuples as columns; after the transpose X is a 16x2 matrix
    Y = [0] * 8 + [1] * 8
    # Y is a list of eight 0s followed by eight 1s

    fignum = 1
    for kernel in ('linear', 'poly', 'rbf'):
        # fit and plot each of the three kernels in turn
        clf = svm.SVC(kernel=kernel, gamma=2)
        # SVC is described below
        clf.fit(X, Y)

        plt.figure(fignum, figsize=(4, 3))
        # the plotting area
        plt.clf()

        plt.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1], s=80,
                    facecolors='none', zorder=10, edgecolors='k')
        # mark the support vectors
        plt.scatter(X[:, 0], X[:, 1], c=Y, zorder=10, cmap=plt.cm.Paired,
                    edgecolors='k')

        plt.axis('tight')
        x_min = -3
        x_max = 3
        y_min = -3
        y_max = 3

        XX, YY = np.mgrid[x_min:x_max:200j, y_min:y_max:200j]
        # also builds a grid, much like meshgrid. The slice x_min:x_max:200j
        # asks for 200 points; as far as I can tell this complex-step syntax
        # only works inside index helpers such as mgrid, so take care.
        Z = clf.decision_function(np.c_[XX.ravel(), YY.ravel()])
        Z = Z.reshape(XX.shape)

        plt.figure(fignum, figsize=(4, 3))
        plt.pcolormesh(XX, YY, Z > 0, cmap=plt.cm.Paired)
        # pcolormesh: Plot a quadrilateral mesh. The parameter C may be a
        # masked array.
        plt.contour(XX, YY, Z, colors=['k', 'k', 'k'],
                    linestyles=['--', '-', '--'], levels=[-.5, 0, .5])
        # three lines are drawn, where wx+b equals -0.5, 0 and 0.5

        plt.xlim(x_min, x_max)
        plt.ylim(y_min, y_max)
        plt.xticks(())
        plt.yticks(())
        fignum = fignum + 1

    plt.show()
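The relationship between `np.mgrid` and `np.meshgrid` mentioned in the comments can be verified on a tiny grid. A minimal sketch, with a 5-point axis chosen purely for illustration:

```python
import numpy as np

# A complex "step" such as 5j asks for 5 points with both endpoints included
XX, YY = np.mgrid[-3:3:5j, -3:3:5j]

axis = np.linspace(-3, 3, 5)
xx_ij, yy_ij = np.meshgrid(axis, axis, indexing="ij")  # same layout as mgrid
xx_xy, yy_xy = np.meshgrid(axis, axis)                 # default "xy" layout

print(np.allclose(XX, xx_ij), np.allclose(YY, yy_ij))
print(np.allclose(XX, xx_xy.T))   # the default layout is the transpose
```

So `mgrid` varies x along the first axis, whereas the default `meshgrid` varies x along the second; either works for plotting as long as `Z` is reshaped to the same grid.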

SVC: class sklearn.svm.SVC(C=1.0, kernel='rbf', degree=3, gamma='auto', coef0=0.0, shrinking=True, probability=False, tol=0.001, cache_size=200, class_weight=None, verbose=False, max_iter=-1, decision_function_shape='ovr', random_state=None)
C-Support Vector Classification.
The implementation is based on libsvm. The fit time complexity is more than quadratic with the number of samples which makes it hard to scale to dataset with more than a couple of 10000 samples.
The multiclass support is handled according to a one-vs-one scheme.
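As the name suggests, SVC has one key parameter, C: float, optional (default=1.0), "Penalty parameter C of the error term." It is simply the coefficient in front of the penalty term in the objective function, that is,

$$\frac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{N} \xi_i$$

where $\xi_i$ is the penalty (slack) for each sample and measures how well it is classified; a value of 0 means the sample is classified correctly. A smaller C makes the decision surface smoother, while a larger C pushes toward classifying every training sample correctly.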
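One way to get a feel for C is to fit the same model with a few values and look at the training accuracy and the number of support vectors. The sketch below is only illustrative: it reuses the XOR data from Example 1, and the three values of C and `gamma=0.5` are arbitrary choices of mine.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.randn(300, 2)
y = np.logical_xor(X[:, 0] > 0, X[:, 1] > 0)

results = {}
for C in (0.01, 1.0, 100.0):        # arbitrary illustrative values
    clf = SVC(C=C, kernel="rbf", gamma=0.5).fit(X, y)
    # n_support_ holds the number of support vectors per class
    results[C] = (clf.score(X, y), int(clf.n_support_.sum()))
    print(f"C={C:g}: training accuracy {results[C][0]:.2f}, "
          f"{results[C][1]} support vectors")
```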

decision_function_shape: 'ovo', 'ovr', default='ovr'
Whether to return a one-vs-rest ('ovr') decision function of shape (n_samples, n_classes) as all other classifiers, or the original one-vs-one ('ovo') decision function of libsvm which has shape (n_samples, n_classes * (n_classes - 1) / 2).
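This parameter sets the form of the decision function, in other words which strategy is used for multiclass classification. The commonly used 'ovo' (one-vs-one) trains a classifier for every pair of classes, n(n-1)/2 in total for n classes, computes each one's predicted class, and decides the result by voting. 'ovr' (one-vs-rest) separates one class from all the remaining ones, n classifiers in total, and then computes the output. OvR trains fewer classifiers. At this stage I do not pay much attention to this parameter.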
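The two documented shapes can be seen on a toy problem. This sketch assumes a synthetic four-class data set from `make_blobs` (not part of the original examples), so that n_classes = 4 and n_classes * (n_classes - 1) / 2 = 6 differ.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Synthetic data with 4 classes, 50 samples each (illustrative only)
X, y = make_blobs(n_samples=200, centers=4, random_state=0)

ovr = SVC(kernel="linear", decision_function_shape="ovr").fit(X, y)
ovo = SVC(kernel="linear", decision_function_shape="ovo").fit(X, y)

print(ovr.decision_function(X).shape)   # one column per class
print(ovo.decision_function(X).shape)   # one column per pair of classes
```

As the SVC description quoted above notes, the multiclass problem itself is handled one-vs-one in both cases; with the default settings the option changes the shape of the returned decision values, not the predictions.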
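Another multiclass strategy, which the SVM estimators themselves do not offer, is many-vs-many (MvM). A common MvM technique is error-correcting output codes (ECOC), which predicts the class through encoding and decoding; scikit-learn provides it separately as sklearn.multiclass.OutputCodeClassifier. The output is shown below: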
[Figure: output for the linear kernel]

[Figure: output for the Gaussian kernel]

[Figure: output for the polynomial kernel]

Example 3

This example explores how the two important parameters gamma and C affect an SVM with the RBF kernel.

    import numpy as np
    import matplotlib.pyplot as plt
    from matplotlib.colors import Normalize

    from sklearn.svm import SVC
    from sklearn.preprocessing import StandardScaler
    from sklearn.datasets import load_iris
    from sklearn.model_selection import StratifiedShuffleSplit
    from sklearn.model_selection import GridSearchCV


    # Utility function to move the midpoint of a colormap to be around
    # the values of interest.
    class MidpointNormalize(Normalize):

        def __init__(self, vmin=None, vmax=None, midpoint=None, clip=False):
            self.midpoint = midpoint
            Normalize.__init__(self, vmin, vmax, clip)

        def __call__(self, value, clip=None):
            x, y = [self.vmin, self.midpoint, self.vmax], [0, 0.5, 1]
            return np.ma.masked_array(np.interp(value, x, y))

    # #############################################################################
    # Load and prepare data set
    #
    # dataset for grid search

    iris = load_iris()
    X = iris.data
    y = iris.target

    # Dataset for decision function visualization: we only keep the first two
    # features in X and sub-sample the dataset to keep only 2 classes and
    # make it a binary classification problem.

    X_2d = X[:, :2]
    # keep only the first two features
    X_2d = X_2d[y > 0]
    y_2d = y[y > 0]
    y_2d -= 1

    # It is usually a good idea to scale the data for SVM training.
    # We are cheating a bit in this example in scaling all of the data,
    # instead of fitting the transformation on the training set and
    # just applying it on the test set.

    scaler = StandardScaler()
    # standardizes each feature to zero mean and unit variance; this line only
    # creates the object
    X = scaler.fit_transform(X)
    # fit_transform standardizes the data
    X_2d = scaler.fit_transform(X_2d)

    # #############################################################################
    # Train classifiers
    #
    # For an initial search, a logarithmic grid with basis
    # 10 is often helpful. Using a basis of 2, a finer
    # tuning can be achieved but at a much higher cost.

    C_range = np.logspace(-2, 10, 13)
    gamma_range = np.logspace(-9, 3, 13)
    param_grid = dict(gamma=gamma_range, C=C_range)
    cv = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=42)
    grid = GridSearchCV(SVC(), param_grid=param_grid, cv=cv)
    # GridSearchCV is described below
    grid.fit(X, y)

    print("The best parameters are %s with a score of %0.2f"
          % (grid.best_params_, grid.best_score_))

    # Now we need to fit a classifier for all parameters in the 2d version
    # (we use a smaller set of parameters here because it takes a while to train)

    C_2d_range = [1e-2, 1, 1e2]
    gamma_2d_range = [1e-1, 1, 1e1]
    classifiers = []
    for C in C_2d_range:
        for gamma in gamma_2d_range:
            clf = SVC(C=C, gamma=gamma)
            clf.fit(X_2d, y_2d)
            classifiers.append((C, gamma, clf))
    # this is only a small data set

    # #############################################################################
    # Visualization
    #
    # draw visualization of parameter effects

    plt.figure(figsize=(8, 6))
    xx, yy = np.meshgrid(np.linspace(-3, 3, 200), np.linspace(-3, 3, 200))
    for (k, (C, gamma, clf)) in enumerate(classifiers):
        # evaluate decision function in a grid
        Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
        Z = Z.reshape(xx.shape)

        # visualize decision function for these parameters
        plt.subplot(len(C_2d_range), len(gamma_2d_range), k + 1)
        plt.title("gamma=10^%d, C=10^%d" % (np.log10(gamma), np.log10(C)),
                  size='medium')

        # visualize parameter's effect on decision function
        plt.pcolormesh(xx, yy, -Z, cmap=plt.cm.RdBu)
        plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y_2d, cmap=plt.cm.RdBu_r,
                    edgecolors='k')
        plt.xticks(())
        plt.yticks(())
        plt.axis('tight')

    scores = grid.cv_results_['mean_test_score'].reshape(len(C_range),
                                                         len(gamma_range))

    # Draw heatmap of the validation accuracy as a function of gamma and C
    #
    # The score are encoded as colors with the hot colormap which varies from dark
    # red to bright yellow. As the most interesting scores are all located in the
    # 0.92 to 0.97 range we use a custom normalizer to set the mid-point to 0.92 so
    # as to make it easier to visualize the small variations of score values in the
    # interesting range while not brutally collapsing all the low score values to
    # the same color.

    plt.figure(figsize=(8, 6))
    plt.subplots_adjust(left=.2, right=0.95, bottom=0.15, top=0.95)
    plt.imshow(scores, interpolation='nearest', cmap=plt.cm.hot,
               norm=MidpointNormalize(vmin=0.2, midpoint=0.92))
    plt.xlabel('gamma')
    plt.ylabel('C')
    plt.colorbar()
    plt.xticks(np.arange(len(gamma_range)), gamma_range, rotation=45)
    plt.yticks(np.arange(len(C_range)), C_range)
    plt.title('Validation accuracy')
    plt.show()
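The code comments admit to "cheating a bit" by scaling all of the data before the split. One common way to avoid this, sketched here as an alternative rather than as part of the original example, is to put the scaler and the classifier in a `Pipeline` so the scaler is refit on the training part of every split. The reduced parameter grid is my own choice to keep the run short.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# The scaler is fit only on the training portion of each split
pipe = make_pipeline(StandardScaler(), SVC())

# make_pipeline names each step after its lowercased class name,
# hence the "svc__" prefix for the SVC parameters
param_grid = {"svc__C": [0.1, 1, 10, 100], "svc__gamma": [0.01, 0.1, 1]}
cv = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=42)

grid = GridSearchCV(pipe, param_grid=param_grid, cv=cv)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 2))
```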

ShuffleSplit: class sklearn.model_selection.ShuffleSplit(n_splits=10, test_size='default', train_size=None, random_state=None)
Random permutation cross-validator
Yields indices to split data into training and test sets.
Note: contrary to other cross-validation strategies, random splits do not guarantee that all folds will be different,
although this is still very likely for sizeable datasets.
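It randomly splits the data set for cross-validation. The example above uses StratifiedShuffleSplit, which does the same while preserving the class proportions in each split.
Main parameters:
n_splits : int, default 10
The number of times the data is shuffled and split.
train_size: the proportion of the data used as the training set.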
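The difference between the plain and the stratified splitter shows up in the class counts of the test sets. A brief sketch on the iris data, reusing the `n_splits`, `test_size` and `random_state` values from Example 3:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import ShuffleSplit, StratifiedShuffleSplit

X, y = load_iris(return_X_y=True)   # 150 samples, 50 per class

plain = ShuffleSplit(n_splits=5, test_size=0.2, random_state=42)
strat = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=42)

# Count how many test samples of each class every split contains
plain_counts = [np.bincount(y[test], minlength=3) for _, test in plain.split(X)]
strat_counts = [np.bincount(y[test], minlength=3) for _, test in strat.split(X, y)]

print("ShuffleSplit test-set class counts:          ", plain_counts[0])
print("StratifiedShuffleSplit test-set class counts:", strat_counts[0])
```

Each splitter yields five train/test index pairs; only the stratified one guarantees that every test set keeps the class proportions of the full data set.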

GridSearchCV: sklearn.model_selection.GridSearchCV(estimator, param_grid, scoring=None, fit_params=None, n_jobs=1, iid=True, refit=True, cv=None, verbose=0, pre_dispatch='2*n_jobs', error_score='raise', return_train_score='warn')
Exhaustive search over specified parameter values for an estimator.
Important members are fit, predict.
GridSearchCV implements a “fit” and a “score” method. It also implements “predict”, “predict_proba”, “decision_function”, “transform” and “inverse_transform” if they are implemented in the estimator used.
The parameters of the estimator used to apply these methods are optimized by cross-validated grid-search over a parameter grid.
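Its main job is to search over param_grid (a grid of parameters, to translate it loosely) and pick the best parameters according to the scores obtained. So the most important parameters are clearly: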

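estimator: the classifier (or, translated literally, the predictor) to be evaluated. Every estimator instance implements the estimator interface, so it should provide a score method. For classifiers such as SVC the default score is the mean accuracy, so higher is better.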
param_grid: dict or list of dictionaries
Dictionary with parameters names (string) as keys and lists of parameter settings to try as values, or a list of such dictionaries, in which case the grids spanned by each dictionary in the list are explored.
This parameter should be a dictionary or a list of dictionaries, with parameter names as keys and lists of parameter values as values.
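scoring: string, callable, list/tuple, dict or None, default: None
This is what does the scoring. It can use scorers defined elsewhere, such as the scoring functions in sklearn.metrics. The default is None, in which case the estimator's own score method is used.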
cv: int, cross-validation generator or an iterable, optional
cv comes from cross validation, which is used to evaluate the model more thoroughly.

The fitted object has many attributes; only two are listed here:
cv_results_: dict of numpy (masked) ndarrays
A summary report of the final results.
best_estimator_: estimator or dict
The best estimator found by the search.
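The parameters and attributes described above fit together as in the following sketch. It is a scaled-down illustration, not the search from Example 3: the small grids, `scoring="accuracy"` and `cv=3` are choices of mine to keep it fast.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
C_range = [0.1, 1, 10]
gamma_range = [0.01, 0.1, 1, 10]

grid = GridSearchCV(SVC(), {"C": C_range, "gamma": gamma_range},
                    scoring="accuracy", cv=3)
grid.fit(X, y)

# cv_results_ has one entry per parameter combination. The combinations are
# enumerated with C varying slowest and gamma fastest, which is why the
# reshape to (len(C_range), len(gamma_range)) lines up with the grid.
scores = grid.cv_results_["mean_test_score"].reshape(len(C_range), len(gamma_range))
print(scores.round(2))
print(grid.best_params_, type(grid.best_estimator_).__name__)
```

Row i of `scores` corresponds to `C_range[i]` and column j to `gamma_range[j]`, the same layout Example 3 relies on when drawing its heatmap.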

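The final results are shown below.
[Figures: (1) decision functions for each combination of gamma and C; (2) heatmap of validation accuracy as a function of gamma and C]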
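Let me discuss the roles of gamma and C once more.
gamma: gamma can be seen as the inverse of the radius of influence of the support vectors, so a smaller gamma means each support vector influences a wider region, and a larger gamma means a narrower one. If gamma is very large, the region each support vector influences shrinks to the support vector itself. The first figure above shows this: when gamma is 10, the decision boundary essentially just wraps around the individual vectors. At that point tuning C no longer helps either, as the second figure shows: once gamma exceeds 10, accuracy is low and barely changes as C varies.
C: the smaller C is, the smoother the decision surface, because misclassification is penalized less. The larger C is, the more the model tries to classify every sample exactly, and it then has more freedom to select more samples as support vectors. The figures therefore show that, for a fixed gamma, a larger C gives a more complex decision boundary.
Finally, the parameter combinations with high scores, that is, high accuracy, lie roughly along the diagonal.
The best parameters are: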
The best parameters are {'C': 1.0, 'gamma': 0.10000000000000001} with a score of 0.97
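The effect of gamma described above can also be read off the number of support vectors. The sketch below rebuilds the two-class, two-feature iris subset from Example 3 and fits one SVC per gamma; the gamma values and `C=1.0` are arbitrary illustrative choices, and the printed counts are meant to be inspected rather than taken as a general rule.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
# Keep classes 1 and 2 and the first two features, then standardize
X_2d = StandardScaler().fit_transform(X[y > 0][:, :2])
y_2d = y[y > 0] - 1

n_sv = {}
for gamma in (1e-1, 1, 1e1, 1e3):
    clf = SVC(C=1.0, gamma=gamma).fit(X_2d, y_2d)
    n_sv[gamma] = int(clf.n_support_.sum())
    print(f"gamma={gamma:g}: {n_sv[gamma]} support vectors, "
          f"training accuracy {clf.score(X_2d, y_2d):.2f}")
```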

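To sum up, these are three examples picked from the official scikit-learn documentation. It contains many more SVM examples for anyone interested. Even after going through them once I would not claim to understand everything, so I still need to find a chance to practice.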
