Machine Learning: A First Look at scikit-learn

Source: Internet. Published by: 电子签章软件. Editor: 程序博客网. Time: 2024/06/03 23:48

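I am taking a machine learning course, and for hands-on practice the instructor recommended Python and the scikit-learn library. scikit-learn has thorough documentation and a rich set of machine learning algorithms; the official documentation explains every algorithm and gives usage examples (it easily rivals the instructor's lecture slides).

So I looked into the library to learn how it is used.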

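Data loading

The first step is naturally loading data. You can download datasets from the UCI Machine Learning Repository, a public collection of machine learning datasets contributed by various universities, organizations, labs, and databases. The datasets are small, which makes them good for practicing ML algorithms.

Python makes this easy: we can fetch the data directly from the site with urllib and then load it with a numpy function:

(This downloads the classic Iris dataset: 150 samples in 3 classes of 50 each, where every sample has 4 attributes and 1 class label.)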

import numpy as np
from urllib import request

# UCI dataset url
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
raw_data = request.urlopen(url)
x = np.loadtxt(raw_data, delimiter=",", usecols=(0, 1, 2, 3))
raw_data = request.urlopen(url)
y = np.loadtxt(raw_data, delimiter=",", usecols=4, dtype=str)

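Note: raw_data is the response to the HTTP request and can only be read once, so a second request is needed for y. To fetch the data only once, save it to a local file first and load from there.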

import numpy as np
from urllib import request

# UCI dataset url
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
raw_data = request.urlopen(url)
page = raw_data.read().decode('utf-8')
# save the downloaded text locally so it can be read as often as needed
with open('iris.csv', 'w') as local_file:
    local_file.write(page)
x = np.loadtxt('iris.csv', delimiter=',', usecols=(0, 1, 2, 3))
y = np.loadtxt('iris.csv', delimiter=',', usecols=4, dtype=str)
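A third option, sketched here as an assumption rather than something from the original post, is to keep the downloaded text in memory and wrap it in io.StringIO each time it needs to be parsed. The SAMPLE string and the parse_iris_text helper are hypothetical names used only for illustration; SAMPLE stands in for the decoded response text so the sketch runs without network access.

```python
import io
import numpy as np

# A few rows in the same comma-separated layout as iris.data
# (hypothetical stand-in for the decoded download).
SAMPLE = (
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "7.0,3.2,4.7,1.4,Iris-versicolor\n"
    "6.3,3.3,6.0,2.5,Iris-virginica\n"
)

def parse_iris_text(text):
    """Split iris-style CSV text into a float feature matrix and string labels."""
    # Each StringIO is a fresh, independent stream over the same in-memory text,
    # so the text can be parsed twice without a second request or a local file.
    x = np.loadtxt(io.StringIO(text), delimiter=",", usecols=(0, 1, 2, 3))
    y = np.loadtxt(io.StringIO(text), delimiter=",", usecols=4, dtype=str)
    return x, y

x_demo, y_demo = parse_iris_text(SAMPLE)
print(x_demo.shape)
print(y_demo)
```

With the real dataset, the text passed to parse_iris_text would be request.urlopen(url).read().decode('utf-8').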

scikit-learn also ships a built-in loader for the classic Iris dataset. It is worth a look, but it is so neatly packaged that it does not generalize to other data:

from sklearn.datasets import load_iris

iris = load_iris()
x, y = iris.data, iris.target
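One difference from the UCI file is worth a quick look: load_iris returns integer class codes in target, with the class names stored separately in target_names. The following minimal sketch (variable names are my own) maps the codes back to names and counts the samples per class.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
x_demo, y_demo = iris.data, iris.target

# target holds integer codes; target_names maps each code to a class name
label_names = iris.target_names[y_demo]
values, counts = np.unique(label_names, return_counts=True)

print(iris.feature_names)
print(dict(zip(values, counts)))
```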

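Data normalization

Most gradient-based algorithms are sensitive to the scale of the data, so before running an algorithm the dataset should be normalized (rescaling each individual sample to unit norm) or standardized (giving every feature zero mean and unit variance).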

from sklearn import preprocessing

# normalize the data attributes
normalized_x = preprocessing.normalize(x)
# standardize the data attributes
standardized_x = preprocessing.scale(x)
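To make the difference between the two operations concrete, here is a numpy-only sketch on a tiny made-up matrix. It mirrors what preprocessing.normalize and preprocessing.scale do with their default settings (L2 norm per row; zero mean and unit population standard deviation per column); the array values are invented for illustration.

```python
import numpy as np

x_demo = np.array([[3.0, 4.0],
                   [6.0, 8.0],
                   [0.0, 5.0]])

# Normalization: divide each row (sample) by its L2 norm
row_norms = np.linalg.norm(x_demo, axis=1, keepdims=True)
normalized = x_demo / row_norms

# Standardization: per column (feature), subtract the mean and
# divide by the population standard deviation
standardized = (x_demo - x_demo.mean(axis=0)) / x_demo.std(axis=0)

print(normalized)
print(standardized.mean(axis=0), standardized.std(axis=0))
```

Normalization works across each row, standardization across each column, so the two are not interchangeable.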

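Feature selection

Choosing which features to use is a very important step in machine learning and has a large effect on how well an algorithm performs.

• L1-based feature selection: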

from sklearn.svm import LinearSVC
from sklearn.feature_selection import SelectFromModel
...
lsvc = LinearSVC(C=0.01, penalty="l1", dual=False).fit(x, y)
model = SelectFromModel(lsvc, prefit=True)
x_new = model.transform(x)

Here x has shape (150, 4) and x_new has shape (150, 3).

• Tree-based feature selection:

from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel
...
clf = ExtraTreesClassifier()
clf = clf.fit(x, y)
print(clf.feature_importances_)
model = SelectFromModel(clf, prefit=True)
x_new = model.transform(x)

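Here x has shape (150, 4) and x_new has shape (150, 2); the result can vary between runs because the trees are randomized.

• Recursive feature elimination (RFE), which searches the feature set for the best subset: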

from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
...
model = LogisticRegression()
# create the RFE model and select 3 attributes
# (n_features_to_select must be passed by keyword in recent scikit-learn versions)
rfe = RFE(model, n_features_to_select=3)
rfe = rfe.fit(x, y)
# summarize the selection of the attributes
print(rfe.support_)
print(rfe.ranking_)
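The support_ mask printed by RFE can be applied directly to the feature matrix. The sketch below uses the built-in Iris loader so it is self-contained, and sets max_iter=1000 (my assumption, not a value from the post) to give the solver room to converge.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

x_demo, y_demo = load_iris(return_X_y=True)

rfe_demo = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3)
rfe_demo.fit(x_demo, y_demo)

# support_ is a boolean mask over the columns; ranking_ is 1 for kept features
x_selected = x_demo[:, rfe_demo.support_]

print(rfe_demo.support_, rfe_demo.ranking_)
print(x_selected.shape)
```

Indexing with the mask gives the same matrix as calling rfe_demo.transform(x_demo).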

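Some machine learning algorithms

Logistic regression

Logistic regression is a well-known machine learning algorithm. It is mostly used for classification (typically binary), though it also works for multiclass problems. Its advantage is that for every object it classifies, it also gives a probability for the corresponding class.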

from sklearn import metrics
from sklearn.linear_model import LogisticRegression
...
model = LogisticRegression()
model.fit(x, y)
# make predictions
expected = y
predicted = model.predict(x)
# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))
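The per-class probabilities mentioned above are available through predict_proba. This is a minimal sketch on the built-in Iris data; max_iter=1000 is my own choice to avoid convergence warnings.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

x_demo, y_demo = load_iris(return_X_y=True)
clf_demo = LogisticRegression(max_iter=1000).fit(x_demo, y_demo)

# one row per sample, one column per class; each row sums to 1
proba = clf_demo.predict_proba(x_demo[:3])
# the predicted class is the one with the highest probability
predicted_demo = clf_demo.classes_[proba.argmax(axis=1)]

print(proba.round(3))
print(predicted_demo)
```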

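Naive Bayes

This is also one of the best-known machine learning algorithms. Its main task is to recover the distribution density of the training samples. The method usually performs well on multiclass classification problems.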

from sklearn import metrics
from sklearn.naive_bayes import GaussianNB
...
model = GaussianNB()
model.fit(x, y)
# make predictions
expected = y
predicted = model.predict(x)
# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))
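For GaussianNB, "recovering the distribution" amounts to estimating a Gaussian per class and per feature. As a rough illustration on the built-in Iris data, the fitted theta_ attribute holds the per-class feature means, which can be checked against a direct numpy computation.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

x_demo, y_demo = load_iris(return_X_y=True)
nb_demo = GaussianNB().fit(x_demo, y_demo)

# per-class feature means computed by hand, one row per class
class_means = np.array([x_demo[y_demo == k].mean(axis=0) for k in nb_demo.classes_])

print(nb_demo.theta_)        # means estimated by the model
print(nb_demo.class_prior_)  # class frequencies in the training data
```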

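k-nearest neighbors

The kNN method is often used as a component of more complex classification algorithms, for example by using its estimate as a feature of an object. It can also give good results on regression problems.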

from sklearn import metrics
from sklearn.neighbors import KNeighborsClassifier
...
model = KNeighborsClassifier()
model.fit(x, y)
# make predictions
expected = y
predicted = model.predict(x)
# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))
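The idea behind kNN is simple enough to sketch in a few lines of numpy. This is an illustrative toy version, not scikit-learn's implementation: the knn_predict helper, the made-up points, and the choice k=3 are all my own assumptions.

```python
import numpy as np

def knn_predict(train_x, train_y, query, k=3):
    """Classify one query point by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training row
    distances = np.linalg.norm(train_x - query, axis=1)
    # indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # majority vote over the labels of those neighbors
    labels, votes = np.unique(train_y[nearest], return_counts=True)
    return labels[np.argmax(votes)]

train_x = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
train_y = np.array(["a", "a", "a", "b", "b", "b"])

near_origin = knn_predict(train_x, train_y, np.array([0.3, 0.3]))
near_five = knn_predict(train_x, train_y, np.array([4.8, 5.0]))
print(near_origin, near_five)
```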

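Decision trees

Classification and regression trees (CART) are often used for problems in which the objects have categorical features, and they apply to both regression and classification. Decision trees are well suited to multiclass classification.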

from sklearn import metrics
from sklearn.tree import DecisionTreeClassifier
...
model = DecisionTreeClassifier()
model.fit(x, y)
# make predictions
expected = y
predicted = model.predict(x)
# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))

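Support vector machine

SVM is one of the most popular machine learning algorithms. It is used mainly for classification and can handle multiclass problems; it can also be applied to regression (for example with sklearn.svm.SVR).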

from sklearn import metrics
from sklearn.svm import SVC
...
model = SVC()
model.fit(x, y)
# make predictions
expected = y
predicted = model.predict(x)
# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))

Beyond classification and regression, scikit-learn offers many more complex algorithms, including clustering and ensemble methods such as Bagging and Boosting.
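The five classifiers above all share the same fit/predict interface, so they can be compared in a single loop. The examples above score each model on the data it was trained on; the sketch below instead holds out part of the data for testing. The 70/30 split and the fixed random_state are my own assumptions, not something the original post specifies.

```python
from sklearn import metrics
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

x_demo, y_demo = load_iris(return_X_y=True)
# hold out 30% of the samples, keeping the class proportions equal in both parts
x_train, x_test, y_train, y_test = train_test_split(
    x_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "naive Bayes": GaussianNB(),
    "kNN": KNeighborsClassifier(),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "SVM": SVC(),
}

scores = {}
for name, model in models.items():
    model.fit(x_train, y_train)
    # accuracy on samples the model has not seen during training
    scores[name] = metrics.accuracy_score(y_test, model.predict(x_test))

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```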

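Tuning algorithm parameters

If you lack experience, scikit-learn also provides functions for finding a model's best parameters. Take choosing a regularization parameter as an example: you can sweep over values in a given range, though sometimes random sampling works better: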

import numpy as np
from sklearn.linear_model import Ridge
# GridSearchCV lives in sklearn.model_selection; the old sklearn.grid_search module has been removed
from sklearn.model_selection import GridSearchCV
...
# prepare a range of alpha values to test
alphas = np.array([1, 0.1, 0.01, 0.001, 0.0001, 0])
# create and fit a ridge regression model, testing each alpha
model = Ridge()
search = GridSearchCV(estimator=model, param_grid=dict(alpha=alphas))
search.fit(x, y)
# summarize the results of the grid search
print(search.best_score_)
print(search.best_estimator_.alpha)

search.best_score_ is the best cross-validated score, and search.best_estimator_.alpha is the alpha value that achieved it.

import numpy as np
from scipy.stats import uniform as sp_rand
from sklearn.linear_model import Ridge
# RandomizedSearchCV lives in sklearn.model_selection; the old sklearn.grid_search module has been removed
from sklearn.model_selection import RandomizedSearchCV
...
# prepare a uniform distribution to sample for the alpha parameter
param_grid = {'alpha': sp_rand()}
# create and fit a ridge regression model, testing random alpha values
model = Ridge()
search = RandomizedSearchCV(estimator=model, param_distributions=param_grid, n_iter=100)
search.fit(x, y)
# summarize the results of the random parameter search
print(search.best_score_)
print(search.best_estimator_.alpha)
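Ridge is a regression model, so a parameter search for it is easier to interpret on a regression target than on Iris class labels. The following self-contained sketch runs a grid search on synthetic data; the data, the coefficients, the alpha grid, and cv=5 are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# synthetic regression data: a linear signal plus a little noise
rng = np.random.RandomState(0)
x_demo = rng.normal(size=(60, 3))
y_demo = x_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=60)

alphas = [1.0, 0.1, 0.01, 0.001]
search_demo = GridSearchCV(estimator=Ridge(), param_grid={"alpha": alphas}, cv=5)
search_demo.fit(x_demo, y_demo)

# best_params_ holds the winning setting, best_score_ its mean cross-validated R^2
print(search_demo.best_params_, search_demo.best_score_)
# mean score for every alpha that was tried, in grid order
print(search_demo.cv_results_["mean_test_score"])
```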

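These are simple applications of some of scikit-learn's basic algorithms; how well the other algorithms work is best judged on real problems.

scikit-learn is a convenient and powerful machine learning library and an excellent platform for learning ML. No wonder so many course materials recommend it.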

References:

[1] http://scikit-learn.org/stable/user_guide.html
