Support Vector Machines: Handwritten Digit Recognition

Source: Internet | Posted by: linux循环执行命令 | Editor: 程序博客网 | Time: 2024/04/28 13:26

Support vector machine classifier:

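The position of the separating line is not determined by all of the training data. It is determined by the data points from the two different classes that lie closest to the boundary between the two regions, that is, the points with the smallest margin. These data points, which actually help to decide the optimal linear classification model, are called "support vectors". A logistic regression (LR) model, by contrast, takes the influence of every training sample on its parameters into account during training, so it does not necessarily arrive at the best classifier.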
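To make the idea of support vectors concrete, here is a minimal sketch on a made-up, linearly separable 2-D dataset. The data points and the large `C` value (used to approximate a hard margin) are assumptions chosen for illustration; they do not come from the digits experiment in this article.

```python
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable 2-D data (invented for illustration).
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [3.0, 3.0], [3.0, 4.0], [4.0, 3.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# A very large C approximates a hard-margin linear SVM.
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

# Only the points that lie on the margin are retained as support vectors;
# the remaining training points do not affect the decision boundary.
print("number of training points:", len(X))
print("indices of support vectors:", clf.support_)
print("support vectors:")
print(clf.support_vectors_)
```

Removing a training point that is not a support vector and refitting should leave the separating line unchanged, which is the sense in which only these points "decide" the model.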

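This article applies a support vector machine classifier to the handwritten digit image dataset bundled with sklearn. (The handwritten digit images bundled with sklearn are only the test set of https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits.)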

Python source code

# coding=utf-8
from sklearn.datasets import load_digits
# train_test_split lives in sklearn.model_selection
# (the older sklearn.cross_validation module was removed in scikit-learn 0.20)
from sklearn.model_selection import train_test_split
# StandardScaler standardizes the features
from sklearn.preprocessing import StandardScaler
# LinearSVC: an SVM classifier based on a linear hypothesis
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report

# ------------- store the handwritten digit data in `digits`
digits = load_digits()
print('Total dataset shape', digits.data.shape)

# ------------- data preparation
# 75% training set, 25% testing set
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=33)
print('training data shape', y_train.shape)
print('testing data shape', y_test.shape)

# ------------- training
# fit the scaler on the training set only, then apply it to both sets
ss = StandardScaler()
X_train = ss.fit_transform(X_train)
X_test = ss.transform(X_test)

# initialize LinearSVC
lsvc = LinearSVC()
# train the model
lsvc.fit(X_train, y_train)
# use the trained model to predict the testing set and store the result in y_predict
y_predict = lsvc.predict(X_test)

# ------------- performance measure
print('The Accuracy of Linear SVC is', lsvc.score(X_test, y_test))
print(classification_report(y_test, y_predict,
                            target_names=digits.target_names.astype(str)))

Result:

Total dataset shape (1797, 64)
training data shape (1347,)
testing data shape (450,)
The Accuracy of Linear SVC is 0.953333333333
             precision    recall  f1-score   support

          0       0.92      1.00      0.96        35
          1       0.96      0.98      0.97        54
          2       0.98      1.00      0.99        44
          3       0.93      0.93      0.93        46
          4       0.97      1.00      0.99        35
          5       0.94      0.94      0.94        48
          6       0.96      0.98      0.97        51
          7       0.92      1.00      0.96        35
          8       0.98      0.84      0.91        58
          9       0.95      0.91      0.93        44

avg / total       0.95      0.95      0.95       450

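Recall (R), precision (P) and F1 were originally defined for binary classification tasks. Digit recognition has 10 classes, 0-9, so these three metrics cannot be computed directly. The usual approach is to evaluate the classes one at a time: each class in turn is treated as the positive class and all other classes as negative samples, which creates ten binary classification tasks.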
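The one-class-at-a-time evaluation described above can be sketched as follows. The labels are a hypothetical 3-class example invented for illustration, and `one_vs_rest_scores` is a helper name introduced here, not part of sklearn. The result is compared with sklearn's `precision_recall_fscore_support`, which reports the same per-class values when called with `average=None`.

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

# Hypothetical true and predicted labels for a 3-class problem.
y_true = np.array([0, 0, 1, 1, 2, 2, 2, 1])
y_pred = np.array([0, 1, 1, 1, 2, 0, 2, 1])

def one_vs_rest_scores(y_true, y_pred, positive_class):
    """Precision, recall and F1 with one class as positive, all others as negative."""
    true_pos = np.sum((y_pred == positive_class) & (y_true == positive_class))
    pred_pos = np.sum(y_pred == positive_class)
    actual_pos = np.sum(y_true == positive_class)
    precision = true_pos / pred_pos if pred_pos else 0.0
    recall = true_pos / actual_pos if actual_pos else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# One binary task per class.
manual = [one_vs_rest_scores(y_true, y_pred, c) for c in np.unique(y_true)]
for c, (p, r, f) in zip(np.unique(y_true), manual):
    print("class %d: precision=%.2f recall=%.2f f1=%.2f" % (c, p, r, f))

# Per-class values computed by sklearn for comparison.
sk_p, sk_r, sk_f, sk_support = precision_recall_fscore_support(
    y_true, y_pred, average=None)
print("matches sklearn:", np.allclose([m[0] for m in manual], sk_p)
      and np.allclose([m[1] for m in manual], sk_r)
      and np.allclose([m[2] for m in manual], sk_f))
```

The per-class rows of `classification_report` are computed in this same one-vs-rest fashion; the summary row then averages them.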

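SVM models flourished in the ML field for a long time. Thanks to their elegant model assumptions, they can pick out, from massive or even higher-dimensional data, the small number of samples that are most useful for the prediction task. This not only reduces the memory needed to hold data during model learning but also improves the model's predictive performance. These advantages, however, come at a higher computational cost. When using the model in practice, weigh the pros and cons against the goals of the task.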
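As a rough illustration of how few samples the model retains, the sketch below counts the support vectors on the same digits split. It is an assumption of this sketch to swap in `SVC(kernel="linear")` for the article's `LinearSVC`, because `LinearSVC` does not expose its support vectors; the two are different implementations and need not produce identical models.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=33)

# Standardize the features using statistics from the training set.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)

# Linear-kernel SVC (swapped in for LinearSVC so that the
# retained support vectors can be inspected).
clf = SVC(kernel="linear")
clf.fit(X_train, y_train)

n_train = X_train.shape[0]
n_sv = clf.support_vectors_.shape[0]
print("training samples:", n_train)
print("support vectors kept:", n_sv)
print("fraction kept: %.2f" % (n_sv / n_train))
```

Only the retained support vectors are needed at prediction time, which is the memory saving the paragraph refers to; finding them during training is where the extra computation goes.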
