Study notes: the MNIST dataset

Source: Internet  Editor: 程序博客网  Time: 2024/06/06 02:08

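Data preview

scikit-learn has built-in helpers for fetching datasets from open repositories, so MNIST can be loaded online. Note that fetch_mldata, used below, was deprecated in scikit-learn 0.20 and removed in 0.22. Newer versions provide fetch_openml instead, which returns the labels as strings and the samples in a different order, so the outputs quoted in these notes apply to the fetch_mldata version.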

from sklearn.datasets import fetch_mldata
mnist = fetch_mldata('MNIST original')
mnist

The data is stored in a dictionary-like object: the 'data' key holds the samples and the 'target' key holds the labels. Next, extract the two.

X, Y = mnist["data"], mnist["target"]
print(X.shape, Y.shape)

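Output: (70000, 784) (70000,). There are 70,000 samples with 784 features each. Every MNIST sample is an image, and the 784 features are the pixels of a 28×28 grid. Below, an arbitrary sample is visualized.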

numpy.reshape() : Gives a new shape to an array without changing its data.

matplotlib.pyplot.imshow() : Display an image on the axes.

Parameters:
- X : array_like, shape (n, m) or (n, m, 3) or (n, m, 4).
- cmap : Colormap.
- interpolation : Acceptable values are ‘none’, ‘nearest’, ‘bilinear’, ‘bicubic’, ‘spline16’, ‘spline36’, ‘hanning’, ‘hamming’, ‘hermite’, ‘kaiser’, ‘quadric’, ‘catrom’, ‘gaussian’, ‘bessel’, ‘mitchell’, ‘sinc’, ‘lanczos’.

%matplotlib inline
import matplotlib
from matplotlib import pyplot as plt

index = 36000        # pick an arbitrary sample
some_digit = X[index]
print(type(some_digit))
some_digit_image = some_digit.reshape(28, 28)
plt.imshow(some_digit_image, cmap=matplotlib.cm.binary, interpolation="nearest")
plt.axis("off")      # hide the axes
plt.show()
print(Y[index])

Output: 5.0.
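As a small sanity check on the 784 = 28 × 28 layout, the sketch below flattens and restores a synthetic image. The array `fake_image` is made up for illustration and is not MNIST data.

```python
import numpy as np

# A synthetic 28x28 "image" (not real MNIST data): pixel value = row * 28 + col.
fake_image = np.arange(28 * 28).reshape(28, 28)

# Flatten to the 784-feature layout used for one row of the data matrix...
flat = fake_image.reshape(-1)

# ...and restore the 2-D layout, as is done before calling imshow().
restored = flat.reshape(28, 28)

print(flat.shape)                             # (784,)
print(restored.shape)                         # (28, 28)
print(np.array_equal(fake_image, restored))   # True
```

Because reshape only changes the shape and not the data, the pixel at row 1, column 1 ends up at flat index 29.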

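Splitting the dataset

MNIST already comes split into a training set (the first 60,000 samples) and a test set (the remaining 10,000), so it is enough to slice them out and shuffle the training set. The test set does not need to be shuffled.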

numpy.random.permutation() : Randomly permute a sequence, or return a permuted range. If x is a multi-dimensional array, it is only shuffled along its first index.

import numpy as np

X_train, X_test, Y_train, Y_test = X[:60000], X[60000:], Y[:60000], Y[60000:]
shuffle_index = np.random.permutation(60000)
X_train, Y_train = X_train[shuffle_index], Y_train[shuffle_index]
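The important detail in the shuffle is that samples and labels are indexed with the same permutation, so each row keeps its label. A minimal sketch on toy arrays (the names `X_demo`, `y_demo` and the seed are assumptions for illustration):

```python
import numpy as np

rng = np.random.RandomState(0)          # fixed seed, chosen arbitrarily
X_demo = np.arange(10).reshape(5, 2)    # 5 toy samples, 2 features each
y_demo = X_demo[:, 0] // 2              # label derived from the first feature

idx = rng.permutation(len(X_demo))      # one shared index permutation
X_shuf, y_shuf = X_demo[idx], y_demo[idx]

# Every shuffled row still carries its own label.
print(np.array_equal(y_shuf, X_shuf[:, 0] // 2))
```

Shuffling the two arrays with two separate calls to permutation() would break this pairing.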

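Binary classification

Start with the simplest binary problem: train a linear classifier with SGD to decide whether the digit in an image is a "5". The model is judged by its F1 score.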

sklearn.model_selection.cross_val_score() : Evaluate a score by cross-validation.

Parameters:
- estimator : The object to use to fit the data.
- X : The data to fit.
- y : The target variable to try to predict in the case of supervised learning.

from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score

Y_train_bin = (Y_train == 5)
Y_test_bin = (Y_test == 5)

sgd_clf = SGDClassifier(random_state=42)
sgd_clf.fit(X_train, Y_train_bin)
cross_val_score(sgd_clf, X_train, Y_train_bin, cv=3, scoring="f1")

Output: array([ 0.7653012 , 0.78112633, 0.76090226])

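Model evaluation

The precision/recall trade-off

Precision is the fraction of the model's positive predictions that are correct, while recall (also called the true positive rate) is the fraction of all actual positives in the dataset that the model finds. In general the two pull against each other: raising one tends to lower the other.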
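To make the two definitions concrete, here is a small worked example on invented labels (not MNIST output). It counts true positives, false positives and false negatives by hand and compares the result with scikit-learn's metric functions.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

# Toy ground truth and predictions, made up for illustration.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 0, 1, 0, 0, 0, 0, 0])

tp = np.sum((y_true == 1) & (y_pred == 1))   # predicted positive, actually positive
fp = np.sum((y_true == 0) & (y_pred == 1))   # predicted positive, actually negative
fn = np.sum((y_true == 1) & (y_pred == 0))   # predicted negative, actually positive

precision = tp / (tp + fp)                   # 2 / 3
recall = tp / (tp + fn)                      # 2 / 4
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean, 4 / 7

print(precision, precision_score(y_true, y_pred))
print(recall, recall_score(y_true, y_pred))
print(f1, f1_score(y_true, y_pred))
```

The F1 score used as the criterion above is the harmonic mean of the two, so it is only high when both precision and recall are high.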

sklearn.model_selection.cross_val_predict() : Generate cross-validated estimates for each input data point.

Parameters:
- method : string, optional. When set to "decision_function", the decision score of each sample is returned.

sklearn.metrics.precision_recall_curve() : Compute precision-recall pairs for different probability thresholds. This implementation is restricted to the binary classification task.

Parameters:
- y_true : True targets of binary classification in range {-1, 1} or {0, 1}.
- probas_pred : Estimated probabilities or decision function.

Returns:
- precision
- recall
- thresholds : a sample whose score is at or above this value is classified as positive

Note: there is an issue introduced in Scikit-Learn 0.19.0 where the result of cross_val_predict() is incorrect in the binary classification case when using method="decision_function", as in the code above. The resulting array has an extra first dimension full of 0s. We need to add this small hack for now to work around this issue.

from sklearn.model_selection import cross_val_predict
from sklearn.metrics import precision_recall_curve

# decision score for every training sample
Y_scores = cross_val_predict(sgd_clf, X_train, Y_train_bin, cv=3, method="decision_function")

# hack to work around issue #9589 introduced in Scikit-Learn 0.19.0
if Y_scores.ndim == 2:
    Y_scores = Y_scores[:, 1]

precisions, recalls, thresholds = precision_recall_curve(Y_train_bin, Y_scores)

def plot_precision_recall_vs_thresholds(precisions, recalls, thresholds):
    # precisions and recalls have one more element than thresholds,
    # so the last value of each is dropped when plotting
    plt.plot(thresholds, precisions[:-1], "b--", label="Precision")
    plt.plot(thresholds, recalls[:-1], "g-", label="Recall")
    plt.xlabel("Threshold")
    plt.legend(loc="best")
    plt.axis([-650000, 650000, 0, 1])

plot_precision_recall_vs_thresholds(precisions, recalls, thresholds)
plt.show()

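Recall never increases as the threshold rises (it is a non-increasing function of the threshold). Precision tends to rise with the threshold, but not monotonically: raising the threshold can occasionally lower it.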
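The non-monotonic behaviour of precision is easy to reproduce on a handful of invented scores. In the toy data below (not MNIST output), raising the threshold from 0.2 to 0.3 discards a true positive while a false positive remains, so precision falls from 0.75 to about 0.67.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Toy labels and scores, invented for illustration.
y_true = np.array([0, 1, 1, 0, 1])
scores = np.array([0.1, 0.2, 0.3, 0.4, 0.5])

p, r, t = precision_recall_curve(y_true, scores)

# p and r have one more entry than t, hence the [:-1].
for thr, prec, rec in zip(t, p[:-1], r[:-1]):
    print(f"threshold={thr:.1f}  precision={prec:.3f}  recall={rec:.3f}")

recall_never_increases = bool(np.all(np.diff(r) <= 0))
precision_drops_somewhere = bool(np.any(np.diff(p[:-1]) < 0))
print(recall_never_increases, precision_drops_somewhere)
```

The exact number of thresholds returned can differ between scikit-learn versions, which is why the sketch checks the shape of the curves rather than specific array lengths.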

The relationship between precision and recall:

def plot_recall_vs_precision(recalls, precisions):
    plt.plot(recalls, precisions, "k")
    plt.xlabel("Recall")
    plt.ylabel("Precision")
    plt.axis([0, 1, 0, 1])

plot_recall_vs_precision(recalls, precisions)
plt.show()

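ROC curve

Another way to evaluate a model is to compute the area under its ROC curve (AUC). The ROC curve plots the true positive rate against the false positive rate.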

sklearn.metrics.roc_curve() : Compute Receiver operating characteristic (ROC). This implementation is restricted to the binary classification task.

Returns:
- fpr
- tpr
- thresholds

from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(Y_train_bin, Y_scores)

def plot_roc_curve(fpr, tpr, label=None):
    plt.plot(fpr, tpr, linewidth=2, label=label)
    plt.axis([0, 1, 0, 1])
    plt.xlabel("False Positive Rate")
    plt.ylabel("True Positive Rate")

plot_roc_curve(fpr, tpr)
plt.show()
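The notes plot the ROC curve but do not compute the area under it. A minimal sketch of how that could be done with scikit-learn's roc_auc_score, shown on four invented scores rather than on the MNIST scores:

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Toy labels and scores, invented for illustration.
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

# Directly from labels and scores.
auc_direct = roc_auc_score(y_true, scores)

# Or by integrating the curve returned by roc_curve().
fpr_demo, tpr_demo, _ = roc_curve(y_true, scores)
auc_from_curve = auc(fpr_demo, tpr_demo)

print(auc_direct, auc_from_curve)   # 0.75 0.75
```

Three of the four (positive, negative) pairs are ranked correctly, which is where 0.75 comes from. With the variables used in these notes, the call would presumably be roc_auc_score(Y_train_bin, Y_scores).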

Switching models

The linear model trained with SGD does not perform very well, so try a random forest, an ensemble model, instead.

sklearn.model_selection.cross_val_predict() : Generate cross-validated estimates for each input data point.

from sklearn.ensemble import RandomForestClassifier

forest_clf = RandomForestClassifier(random_state=42)
Y_probas_forest = cross_val_predict(forest_clf, X_train, Y_train_bin, cv=3, method="predict_proba")
Y_scores_forest = Y_probas_forest[:, 1]   # probability of the positive class, used as the score
fpr_forest, tpr_forest, threshold_forest = roc_curve(Y_train_bin, Y_scores_forest)

Compare the ROC curves of the linear model and the random forest.

plt.plot(fpr, tpr, "b:", label="sgd")
plot_roc_curve(fpr_forest, tpr_forest, "Random Forest")
plt.legend(loc="best")
plt.show()

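Multiclass classification

The simplest way to handle a multiclass task is to train several binary classifiers and combine them. Some classifiers, however, such as random forests, handle multiple classes directly and need no extra classifiers.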

from sklearn.ensemble import RandomForestClassifier

forest_clf = RandomForestClassifier(random_state=42)
forest_clf.fit(X_train, Y_train)
forest_clf.predict_proba([some_digit])

Output: array([[ 0. , 0. , 0. , 0.1, 0. , 0.9, 0. , 0. , 0. , 0. ]]). The model assigns the highest predicted probability to the position of the digit '5', which matches the true label.

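There are two strategies for combining binary classifiers: one-versus-one (OvO) and one-versus-all (OvA). For n classes, OvO trains n(n−1)/2 binary classifiers, one per pair of classes, and the class that wins the most duels is chosen. OvA trains only n binary classifiers, one per class, and the class whose classifier gives the highest score is chosen.

scikit-learn creates these binary classifiers automatically during training. It uses OvA by default in most cases, but OvO for SVM classifiers.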

sgd_clf.fit(X_train, Y_train)       # OvA strategy
sgd_clf.classes_

Output: array([ 0., 1., 2., 3., 4., 5., 6., 7., 8., 9.])

To choose the strategy explicitly, wrap the estimator as follows.

from sklearn.multiclass import OneVsOneClassifier

ovo_clf = OneVsOneClassifier(SGDClassifier(random_state=42))
ovo_clf.fit(X_train, Y_train)
print(len(ovo_clf.estimators_))

Output: 45.
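The 45 is just n(n−1)/2 for n = 10 classes. For comparison, the sketch below fits both wrappers and counts their estimators. To keep it quick and runnable offline it uses scikit-learn's small bundled digits dataset (load_digits, 8×8 images) as a stand-in for MNIST, which is an assumption of this example and not what the notes use.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

# Small bundled dataset used as a stand-in for MNIST (10 classes as well).
X_small, y_small = load_digits(return_X_y=True)
n_classes = len(set(y_small))

ovo = OneVsOneClassifier(SGDClassifier(random_state=42)).fit(X_small, y_small)
ovr = OneVsRestClassifier(SGDClassifier(random_state=42)).fit(X_small, y_small)

# OvO: one classifier per pair of classes; OvA/OvR: one per class.
print(len(ovo.estimators_), n_classes * (n_classes - 1) // 2)   # 45 45
print(len(ovr.estimators_), n_classes)                          # 10 10
```

OneVsRestClassifier is scikit-learn's name for the OvA strategy described above.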

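Model evaluation

The binary F1 score used earlier does not carry over directly to a multiclass model (it would have to be averaged over the classes), so only accuracy is used here.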

cross_val_score(sgd_clf,X_train,Y_train,cv=3,scoring="accuracy")

Output: array([ 0.85952809, 0.859943 , 0.88363254]).
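For reference, if an F1-style number is wanted for a multiclass model, one common option is to compute F1 per class and average the results. A small sketch on invented labels (not MNIST output) using f1_score's average parameter:

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy 3-class labels and predictions, invented for illustration.
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])

per_class = f1_score(y_true, y_pred, average=None)    # one F1 value per class
macro = f1_score(y_true, y_pred, average="macro")     # unweighted mean of those

print(per_class)
print(macro, per_class.mean())
```

With cross_val_score the corresponding scorer name is "f1_macro". The notes stay with accuracy, so this is only an aside.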

Scaling the input features improves the model's performance noticeably.

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train.astype(np.float64))
cross_val_score(sgd_clf, X_train_scaled, Y_train, cv=3, scoring="accuracy")
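What StandardScaler does can be checked on a tiny invented matrix: each column is shifted to zero mean and divided by its (population) standard deviation. The names below are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two toy features on very different scales.
X_toy = np.array([[0.0, 100.0],
                  [5.0, 200.0],
                  [10.0, 300.0]])

scaler_demo = StandardScaler()
X_toy_scaled = scaler_demo.fit_transform(X_toy)

# The same result computed by hand, column by column.
manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

print(np.allclose(X_toy_scaled, manual))   # True
print(X_toy_scaled.mean(axis=0))           # approximately [0, 0]
print(X_toy_scaled.std(axis=0))            # approximately [1, 1]
```

Any data passed to the model later should go through the same fitted scaler with transform(), not through a new fit_transform().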

Error analysis

In a confusion matrix, each row corresponds to a true class and each column to a predicted class.

from sklearn.metrics import confusion_matrix

Y_train_pred = cross_val_predict(sgd_clf, X_train_scaled, Y_train, cv=3)
conf_mx = confusion_matrix(Y_train, Y_train_pred)
print(conf_mx)

plt.matshow(conf_mx, cmap=plt.cm.gray)      # grayscale: the larger the value, the brighter the cell
plt.show()

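Process the confusion matrix a little: divide each entry by the number of samples in its row to get relative error rates, then set the diagonal to 0, since the correctly classified samples are not of interest for now.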

row_sum = conf_mx.sum(axis=1, keepdims=True)
norm_conf_mx = conf_mx / row_sum
np.fill_diagonal(norm_conf_mx, 0)
plt.matshow(norm_conf_mx, cmap=plt.cm.gray)
plt.show()
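The same normalization on a small invented 3×3 confusion matrix shows why dividing by the row totals matters: the largest raw error count and the largest error rate need not be in the same row when classes have different sizes.

```python
import numpy as np

# Toy confusion matrix (rows: true class, columns: predicted class), invented.
conf_demo = np.array([[50,  2,  3],
                      [ 4, 40,  6],
                      [ 1,  9, 30]])

row_sums = conf_demo.sum(axis=1, keepdims=True)   # samples per true class
err_rates = conf_demo / row_sums                  # each row now sums to 1
np.fill_diagonal(err_rates, 0)                    # keep only the errors

# Location of the largest relative error: (true class, predicted class).
worst = np.unravel_index(np.argmax(err_rates), err_rates.shape)
print(worst, err_rates[worst])
```

Here 9 of the 40 samples of class 2 are predicted as class 1, an error rate of 0.225, which is the brightest off-diagonal cell in the normalized matrix.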

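The two brightest cells in the plot are row 4, column 6 and row 6, column 4 (counting from 1); that is, the digits "3" and "5" are often confused with each other.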

a3p5 = X_train[(Y_train == 3) & (Y_train_pred == 5)]   # actual 3, predicted 5
a5p3 = X_train[(Y_train == 5) & (Y_train_pred == 3)]   # actual 5, predicted 3
print(len(a3p5), len(a5p3))

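Output: 227 194. The model predicted 227 '3's as '5' and 194 '5's as '3'. Take a look at these misclassified images. The implementation of plot_digits() is given in the appendix at the end.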

plt.figure(figsize=(8, 8))
plt.subplot(221)
plot_digits(a3p5[:25], images_per_row=5)
plt.subplot(222)
plot_digits(a5p3[:25], images_per_row=5)
plt.show()

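On the left are "3"s misclassified as "5", and on the right "5"s misclassified as "3". Many of these digits are written very sloppily, and a few are hard to tell apart even by eye.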

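KNN model

Train a KNN model for the MNIST classification task, with its best hyperparameters determined by grid search. Working with KNN on a dataset of this size is very slow: the reference running time of the grid search here was about 23 hours.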

from sklearn.neighbors import KNeighborsClassifier

# shortcut: use the hyperparameters directly
knn_clf = KNeighborsClassifier(n_neighbors=4, weights='distance')
knn_clf.fit(X_train, Y_train)

# the grid search below is what should actually be run
# knn_clf = KNeighborsClassifier()
# from sklearn.model_selection import GridSearchCV
# param_grid = [
#     {'n_neighbors': [3, 4, 5], 'weights': ['uniform', 'distance']}
# ]
# grid_search = GridSearchCV(knn_clf, param_grid, cv=5, scoring='accuracy', verbose=2, n_jobs=4)
# grid_search.fit(X_train, Y_train)

# evaluate the KNN model (the original line passed sgd_clf here by mistake)
cross_val_score(knn_clf, X_train, Y_train, cv=3, scoring="accuracy")

Output: array([ 0.97160568, 0.97449872, 0.9713457 ]). Now check the model's performance on the test set.

from sklearn.metrics import accuracy_score

Y_knn_pred = knn_clf.predict(X_test)
accuracy_score(Y_test, Y_knn_pred)

Output: 0.97140000000000004.
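Since the full grid search on MNIST takes many hours, the mechanics can be tried on something small first. The sketch below runs the same parameter grid on scikit-learn's bundled digits dataset (load_digits), which is an assumption made here to keep the example fast. Its best parameters need not match those for MNIST.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Small bundled dataset used as a fast stand-in for MNIST.
X_small, y_small = load_digits(return_X_y=True)

param_grid = {"n_neighbors": [3, 4, 5], "weights": ["uniform", "distance"]}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=3, scoring="accuracy")
search.fit(X_small, y_small)

print(search.best_params_)            # best combination found on this small dataset
print(round(search.best_score_, 4))   # its mean cross-validated accuracy
```

After fitting, search.best_estimator_ is a KNN model refit on all the data with the best parameters, so it can be used directly for prediction.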

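Data augmentation

For image data, new images produced by shifting or flipping existing ones can be added to the dataset to enlarge it. This technique is called data augmentation.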

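from scipy.ndimage import shift   # the scipy.ndimage.interpolation namespace is deprecated

# shift an image stored as a flat ndarray;
# dx moves along axis 0 (rows), dy along axis 1 (columns)
def shift_ndarray(ndarray, dx, dy):
    image = ndarray.reshape((28, 28))
    shifted_image = shift(image, [dx, dy])
    return shifted_image.reshape([-1])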
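To see which way a given offset moves the pixels, here is a sketch on a 3×3 array with a single bright pixel in the centre. It passes order=0 (nearest-neighbour interpolation) so that the values stay exact; that parameter choice is an assumption of this example, whereas the function above uses the default interpolation order.

```python
import numpy as np
from scipy.ndimage import shift

# A tiny "image" with one bright pixel at row 1, column 1.
img = np.zeros((3, 3))
img[1, 1] = 1.0

# The first offset component moves along axis 0 (rows),
# the second along axis 1 (columns); vacated cells are filled with cval.
moved_down = shift(img, [1, 0], order=0, cval=0)
moved_right = shift(img, [0, 1], order=0, cval=0)

print(moved_down)    # bright pixel now at row 2, column 1
print(moved_right)   # bright pixel now at row 1, column 2
```

So a positive first component shifts the image down and a positive second component shifts it to the right.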

# start from lists of the original samples and labels
X_train_ex = [ndarray for ndarray in X_train]
Y_train_ex = [label for label in Y_train]

# shift by one pixel in each direction; the first component is the row (vertical) offset
down, up, right, left = (1, 0), (-1, 0), (0, 1), (0, -1)
for dx, dy in (down, up, right, left):
    for ndarray, label in zip(X_train, Y_train):
        X_train_ex.append(shift_ndarray(ndarray, dx, dy))
        Y_train_ex.append(label)

# convert the lists back to ndarrays and shuffle
X_train_ex = np.array(X_train_ex)
Y_train_ex = np.array(Y_train_ex)
shuffle_idx = np.random.permutation(len(X_train_ex))
X_train_ex = X_train_ex[shuffle_idx]
Y_train_ex = Y_train_ex[shuffle_idx]

Fitting the model on the augmented data takes quite a long time.

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

knn_clf = KNeighborsClassifier(n_neighbors=4, weights='distance')
knn_clf.fit(X_train_ex, Y_train_ex)
Y_knn_pred = knn_clf.predict(X_test)
acc = accuracy_score(Y_test, Y_knn_pred)
print(acc)

Output: 0.9763.

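Multilabel classification

Sometimes a sample has several attributes, all of which need to be predicted. The digit "5", for example, is odd and is also less than 7. Predicting both of these properties of a digit at once is called a multilabel classification task.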

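from sklearn.neighbors import KNeighborsClassifier

Y_train_large = (Y_train >= 7)
Y_train_odd = (Y_train % 2 == 1)     # odd digits (the original tested == 0, which selects even digits)
Y_multilabel = np.c_[Y_train_large, Y_train_odd]

knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train_scaled, Y_multilabel)
# the model is fit on scaled features, so the query is scaled the same way
knn_clf.predict(scaler.transform([some_digit]))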

Output: array([[False, True]], dtype=bool)
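The two-column label layout can be tried on a toy problem where the single feature is simply the digit value itself. This is an invented setup for illustration and has nothing to do with pixel data.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

digits = np.arange(10)
X_toy = digits.reshape(-1, 1).astype(float)    # toy feature: the digit value
Y_toy = np.c_[digits >= 7, digits % 2 == 1]    # columns: [is_large, is_odd]

# With a 2-column target, KNeighborsClassifier predicts both labels at once.
knn_demo = KNeighborsClassifier(n_neighbors=1).fit(X_toy, Y_toy)
pred = knn_demo.predict([[5.0]])

print(pred)   # 5 is not >= 7, and 5 is odd
```

Each row of the prediction has one entry per label, in the same column order as the training target.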

Appendix

def plot_digits(instances, images_per_row=10, **options):
    size = 28
    images_per_row = min(len(instances), images_per_row)
    images = [instance.reshape(size,size) for instance in instances]
    n_rows = (len(instances) - 1) // images_per_row + 1
    row_images = []
    n_empty = n_rows * images_per_row - len(instances)
    images.append(np.zeros((size, size * n_empty)))
    for row in range(n_rows):
        rimages = images[row * images_per_row : (row + 1) * images_per_row]
        row_images.append(np.concatenate(rimages, axis=1))
    image = np.concatenate(row_images, axis=0)
    plt.imshow(image, cmap = matplotlib.cm.binary, **options)
    plt.axis("off")
