Machine Learning Notes: A First Look at sklearn (Part 1)

Source: Internet | Posted by: java中嵌套循环 | Editor: 程序博客网 | Time: 2024/05/21 01:28

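The following is a mini-project from Udacity's Intro to Machine Learning course: given a collection of emails written by Sara (label 0) and Chris (label 1), split the data set, then train and predict with several classifiers provided by sklearn.

Preprocessing

Dependencies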

import sys
import time

import nltk
import numpy
import scipy
import sklearn

# email_preprocess.py is expected under ../tools/, so extend the path before importing it
sys.path.append("../tools/")
from email_preprocess import preprocess

Data processing

email_preprocess.py：

import pickle

import numpy
from sklearn import model_selection
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectPercentile, f_classif


def preprocess(words_file="../tools/word_data.pkl", authors_file="../tools/email_authors.pkl"):
    """
    This function takes a pre-made list of email texts and
    the corresponding authors and performs
    a number of preprocessing steps:
        -- splits into training/testing sets (10% testing)
        -- vectorizes into tfidf matrix
        -- selects/keeps most helpful features

    After this, the features and labels are put into numpy arrays
    which play nice with sklearn functions.

    4 objects are returned:
        -- training/testing features
        -- training/testing labels
    """
    ### the words (features) and authors (labels), already largely preprocessed
    with open(authors_file, "rb") as authors_file_handler:
        authors = pickle.load(authors_file_handler)
    with open(words_file, "rb") as words_file_handler:
        word_data = pickle.load(words_file_handler)

    ### split into training and test sets (a hold-out split);
    ### test_size is the fraction of samples assigned to the test set
    features_train, features_test, labels_train, labels_test = \
        model_selection.train_test_split(word_data, authors, test_size=0.1, random_state=42)

    ### text vectorization--go from strings to lists of numbers
    vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words='english')
    features_train_transformed = vectorizer.fit_transform(features_train)
    features_test_transformed = vectorizer.transform(features_test)

    ### feature selection: text yields a very large number of features, so keep only a subset
    selector = SelectPercentile(f_classif, percentile=10)  # percentage of features to keep
    selector.fit(features_train_transformed, labels_train)
    features_train_transformed = selector.transform(features_train_transformed).toarray()
    features_test_transformed = selector.transform(features_test_transformed).toarray()

    print("no. of Chris training emails:", sum(labels_train))
    print("no. of Sara training emails:", len(labels_train) - sum(labels_train))

    return features_train_transformed, features_test_transformed, labels_train, labels_test
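To make the vectorization step more concrete, here is a minimal pure-Python sketch of what TF-IDF weighting with sublinear term frequency and a max_df cutoff computes. It is not sklearn's implementation: the whitespace tokenizer, the helper name tfidf_rows and the toy documents are assumptions made for illustration. The smoothed idf formula ln((1 + n) / (1 + df)) + 1 and the per-row L2 normalization follow sklearn's documented defaults.

```python
import math
from collections import Counter


def tfidf_rows(docs, max_df=0.5):
    """Return (vocabulary, rows) of sublinear-tf TF-IDF weights, L2-normalised per row.

    Hypothetical helper for illustration only; tokenisation is a plain whitespace split.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    # Drop terms that occur in more than max_df (as a fraction) of the documents.
    vocab = sorted(t for t, count in df.items() if count / n_docs <= max_df)

    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Sublinear tf: replace the raw count tf with 1 + ln(tf).
        row = [(1 + math.log(counts[t])) * idf[t] if counts[t] else 0.0 for t in vocab]
        norm = math.sqrt(sum(v * v for v in row))
        rows.append([v / norm if norm else 0.0 for v in row])
    return vocab, rows


docs = ["meeting schedule meeting today", "budget report today", "lunch today"]
vocab, rows = tfidf_rows(docs)
print(vocab)
for row in rows:
    print([round(v, 3) for v in row])
```

In this toy corpus the word "today" occurs in every document, so the max_df=0.5 cutoff removes it from the vocabulary, much like an automatically discovered stop word.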

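Naive Bayes

Overview

Bayesian probability needs no introduction here.

The official sklearn documentation gives the following example for the Gaussian naive Bayes model: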

import numpy as np
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
Y = np.array([1, 1, 1, 2, 2, 2])
from sklearn.naive_bayes import GaussianNB
clf = GaussianNB()
clf.fit(X, Y)
print(clf.predict([[-0.8, -1]]))
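For readers who want to see what a Gaussian naive Bayes classifier does with this toy data, the following is a rough from-scratch sketch in plain Python. It is not how sklearn implements GaussianNB: the function names fit_gaussian_nb and predict_gaussian_nb and the tiny variance floor are assumptions of this sketch. Each class gets a prior plus a per-feature mean and variance, and prediction picks the class with the largest log posterior.

```python
import math


def fit_gaussian_nb(X, y):
    """Estimate, for every class, its prior and the mean/variance of each feature."""
    model = {}
    for label in sorted(set(y)):
        rows = [x for x, lab in zip(X, y) if lab == label]
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # The small constant keeps every variance strictly positive (sketch assumption).
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-9
                     for col, m in zip(columns, means)]
        model[label] = (n / len(X), means, variances)
    return model


def predict_gaussian_nb(model, x):
    """Return the class whose log prior plus Gaussian log likelihood is largest."""
    def log_posterior(params):
        prior, means, variances = params
        score = math.log(prior)
        for v, m, var in zip(x, means, variances):
            score += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        return score

    return max(model, key=lambda label: log_posterior(model[label]))


X = [[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]]
Y = [1, 1, 1, 2, 2, 2]
model = fit_gaussian_nb(X, Y)
print(predict_gaussian_nb(model, [-0.8, -1]))
```

The query point lies close to the class-1 samples, so the sketch assigns it to class 1.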

Dependencies

from sklearn.naive_bayes import GaussianNB

Usage

features_train, features_test, labels_train, labels_test = preprocess()

clf = GaussianNB()

tic = time.time()
clf.fit(features_train, labels_train)
toc = time.time()
print("training time:{}s.".format(round(toc - tic, 3)))

# .score() returns the model's accuracy on the given test data
accuracy = clf.score(features_test, labels_test)
print("accuracy:{}".format(accuracy))

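Support Vector Machines

I found that Udacity explains SVMs in a very accessible way: the mapping from a low-dimensional space to a high-dimensional one is achieved by adding combined features. I did not understand this when I read the book 《机器学习》 (Machine Learning), but now I do.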
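The idea of lifting data into a higher dimension by adding a combined feature can be sketched in a few lines of plain Python. The points, the helper name add_combined_feature and the threshold below are made up for illustration. Two classes arranged as an inner cluster and an outer ring cannot be separated by a straight line in (x, y), but after adding z = x^2 + y^2 a single threshold on z separates them.

```python
def add_combined_feature(point):
    """Map (x, y) to (x, y, x^2 + y^2): one extra feature built from the originals."""
    x, y = point
    return (x, y, x * x + y * y)


# Toy data (assumed for this sketch): class 0 near the origin, class 1 on a ring around it.
inner = [(0.5, 0.0), (0.0, -0.5), (-0.3, 0.4)]
outer = [(2.0, 0.0), (0.0, -2.0), (-1.5, 1.5)]

lifted_inner = [add_combined_feature(p) for p in inner]
lifted_outer = [add_combined_feature(p) for p in outer]

# In the lifted space the plane z = 1.0 is a linear decision boundary.
threshold = 1.0


def classify(point):
    return 1 if add_combined_feature(point)[2] > threshold else 0


print([classify(p) for p in inner])
print([classify(p) for p in outer])
```

A kernel SVM achieves a similar effect without building the extra features explicitly; this sketch only shows why the extra dimension helps.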

Overview

The official sklearn documentation gives a simple example of the SVM classifier:

from sklearn import svm
X = [[0, 0], [1, 1]]
y = [0, 1]
clf = svm.SVC()
clf.fit(X, y)

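The SVC object has several important parameters: C=1.0, gamma='auto', kernel='rbf'. (gamma='auto' was the default in older releases; since scikit-learn 0.22 the default is gamma='scale'.)

kernel specifies the kernel function. The kernels built into sklearn are 'linear', 'poly', 'rbf', 'sigmoid' and 'precomputed'; a custom callable can also be passed. The parameters C and gamma have a large influence on an SVM with the 'rbf' kernel.
C controls the trade-off between a smooth decision boundary and classifying every training sample correctly: a higher C yields a more complex boundary that may overfit, while a lower C yields a smoother boundary.
gamma determines how far the influence of a single sample reaches when the decision boundary is drawn: a high gamma means a short reach, a low gamma means a long reach. gamma can be viewed as the inverse of a sample's radius of influence. It is not used by the linear kernel.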
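The effect of gamma can be checked numerically with the RBF kernel formula K(a, b) = exp(-gamma * ||a - b||^2). The snippet below is a small standalone sketch; the function name rbf_kernel and the two sample points are chosen here for illustration.

```python
import math


def rbf_kernel(a, b, gamma):
    """RBF kernel value exp(-gamma * squared Euclidean distance between a and b)."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-gamma * sq_dist)


sample = (0.0, 0.0)
neighbour = (1.0, 1.0)  # squared distance to sample is 2

# The larger gamma is, the faster the similarity decays with distance,
# i.e. the smaller the region a single sample influences.
for gamma in (0.1, 1.0, 10.0):
    print(gamma, rbf_kernel(sample, neighbour, gamma))
```

With gamma=0.1 the neighbour still has a similarity of about 0.82, with gamma=1.0 about 0.14, and with gamma=10.0 it is practically zero.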

Dependencies

from sklearn import svm

Usage

features_train, features_test, labels_train, labels_test = preprocess()

clf = svm.SVC(kernel='linear')

# discard most of the training data to speed up fitting (keep 1%)
features_train = features_train[:len(features_train) // 100]
labels_train = labels_train[:len(labels_train) // 100]

tic = time.time()
clf.fit(features_train, labels_train)
toc = time.time()
print("train time:{}".format(round(toc - tic, 3)))

acc = clf.score(features_test, labels_test)
print(acc)

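Note that only 1% of the training data is used for fitting, because on text data an SVM fits much more slowly than naive Bayes. In practice, the linear SVM took about 2 minutes to fit the full training set and reached 98% accuracy, whereas fitting 1% of the training set took under 0.1 s and reached 88% accuracy.

Tuning parameter C for the rbf kernel

Next, switch the kernel to rbf and look at tuning the parameter C. The program sets four different values of C (1.0, 10.0, 1000.0 and 10000.0) and checks the accuracy of the model for each. To see the effect more quickly, only 1% of the training set is used: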

features_train, features_test, labels_train, labels_test = preprocess()

clf1 = svm.SVC(kernel='rbf', C=1.)
clf2 = svm.SVC(kernel='rbf', C=10.)
clf3 = svm.SVC(kernel='rbf', C=1000.)
clf4 = svm.SVC(kernel='rbf', C=10000.)

# discard most of the training data to speed up fitting (keep 1%)
features_train = features_train[:len(features_train) // 100]
labels_train = labels_train[:len(labels_train) // 100]

clf1.fit(features_train, labels_train)
clf2.fit(features_train, labels_train)
clf3.fit(features_train, labels_train)
clf4.fit(features_train, labels_train)

acc1 = clf1.score(features_test, labels_test)
acc2 = clf2.score(features_test, labels_test)
acc3 = clf3.score(features_test, labels_test)
acc4 = clf4.score(features_test, labels_test)
print(acc1, acc2, acc3, acc4)

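Taking classifier no. 4 (C=10000) and training clf4 on the entire training set, the final accuracy reached:

[output not preserved]

Decision Trees

Overview

A decision tree is an algorithm that makes non-linear decisions out of linear splits: it partitions the data linearly with conditions at successive levels, and the resulting regions together form a non-linear decision.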

Entropy

A very important property in a decision tree is the entropy of a node, computed as:

$$\mathrm{Ent} = -\sum_i P_i \log_2 P_i$$

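P_i is the proportion of samples of class i in the node. For example, if four samples A1, A2, B1 and B2 end up in the same node, then P_A and P_B are both 0.5 and the entropy of the node is 1, which means the node is as impure as it can be (an entropy of 0 means the node is completely pure).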
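The entropy formula is easy to verify with a few lines of Python. This is only a sketch; the function name entropy and the label lists are chosen here for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of the class labels that fall into one node."""
    total = len(labels)
    proportions = [count / total for count in Counter(labels).values()]
    return sum(-p * math.log2(p) for p in proportions)


mixed_node = ["A", "A", "B", "B"]  # P_A = P_B = 0.5
pure_node = ["A", "A", "A", "A"]   # a single class only

print(entropy(mixed_node))
print(entropy(pure_node))
```

The evenly mixed node gives an entropy of 1 and the pure node gives 0, matching the example in the text.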

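Information gain

A decision tree has to split the data on different features, that is, split each node further. The best split is the one after which the child nodes are as pure as possible, and information gain is the quantity that expresses this: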

$$\mathrm{Gain}(\text{parent}, \text{feature}) = \mathrm{Ent}(\text{parent}) - \sum_{\text{child}} w_{\text{child}} \, \mathrm{Ent}(\text{child})$$

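The second term is the weighted average entropy of all child nodes produced by splitting the parent; the weight w_child is the number of samples in a child node divided by the number of samples in the parent.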

Suppose we have the following samples, with three binary features (grads, bumpiness and speed limit), a binary class (speed), and 4 samples:

grads   bumpiness   speed limit   speed
steep   bumpy       yes           slow
steep   smooth      yes           slow
flat    bumpy       no            fast
steep   smooth      no            fast

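Suppose the decision tree first splits the data on the grads feature. Then we get:

[Figure: splitting on grads sends the three steep samples (slow, slow, fast) to the left child and the single flat sample (fast) to the right child.]

The information gain after splitting on grads is computed as follows.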

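The entropy of the parent node is:

import math
parent_ent = -((2/4)*math.log(2/4, 2) + (2/4)*math.log(2/4, 2))

The right child contains a single sample, so its entropy is clearly 0:

r_child_ent = 0

Compute the entropy of the left child:

l_child_ent = -((2/3)*math.log(2/3, 2) + (1/3)*math.log(1/3, 2))

The weights of the child nodes:

l_child_w = 3/4
r_child_w = 1/4

The information gain after the split is:

Gain = parent_ent - l_child_w*l_child_ent - r_child_w*r_child_ent

With parent_ent equal to 1 and l_child_ent about 0.918, Gain comes out at about 0.311.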

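In this way, compute the information gain obtained by splitting the samples on each feature, and split on the feature that gives the largest gain. In this example the best feature to split on is speed limit, because both child nodes then have an entropy of 0 and the information gain is 1.

[Figure: the tree split on speed limit, with two pure child nodes.]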
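The comparison across features can be reproduced with a short script over the four-sample table. This is a sketch, with function names (entropy, information_gain) and the dictionary layout of the rows chosen for illustration.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return sum(-(c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(rows, feature, target):
    """Parent entropy minus the weighted average entropy of the children after the split."""
    parent_entropy = entropy([row[target] for row in rows])
    children = defaultdict(list)
    for row in rows:
        children[row[feature]].append(row[target])
    weighted_child_entropy = sum(
        len(labels) / len(rows) * entropy(labels) for labels in children.values()
    )
    return parent_entropy - weighted_child_entropy


rows = [
    {"grads": "steep", "bumpiness": "bumpy", "speed limit": "yes", "speed": "slow"},
    {"grads": "steep", "bumpiness": "smooth", "speed limit": "yes", "speed": "slow"},
    {"grads": "flat", "bumpiness": "bumpy", "speed limit": "no", "speed": "fast"},
    {"grads": "steep", "bumpiness": "smooth", "speed limit": "no", "speed": "fast"},
]

gains = {f: information_gain(rows, f, "speed") for f in ("grads", "bumpiness", "speed limit")}
best_feature = max(gains, key=gains.get)
for feature, gain in gains.items():
    print(feature, round(gain, 3))
print("best:", best_feature)
```

Splitting on grads gives a gain of about 0.311, bumpiness gives 0, and speed limit gives 1, so speed limit is selected.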

Example

The official sklearn documentation gives this usage example for the decision tree classifier:

from sklearn import tree
X = [[0, 0], [1, 1]]
Y = [0, 1]
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X, Y)

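The decision tree classifier has several important parameters:

criterion='gini': the criterion used to split nodes; it also accepts 'entropy' to split by information gain.
min_samples_split=2: the minimum number of samples a node must contain to be split; a node with at least this many samples may be split further. Setting this value too low can lead to overfitting.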

Dependencies

from sklearn import tree

Usage

features_train, features_test, labels_train, labels_test = preprocess()

clf = tree.DecisionTreeClassifier(criterion='entropy', min_samples_split=40)

tic = time.time()
clf.fit(features_train, labels_train)
toc = time.time()
print("training time:{}".format(round(toc - tic, 3)))

acc = clf.score(features_test, labels_test)
print(acc)

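The fitting time of a decision tree grows with the number of features in the data, and the fitting time of the code above is clearly unacceptable, so the number of features has to be cut down. In email_preprocess.py, change the following line so that only the top 1% of features are kept: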

selector = SelectPercentile(f_classif, percentile=1)

Run the model again:
