The scikit-learn library for Python

Source: Internet. Editor: 程序博客网. Date: 2024/06/03 21:19

1. Introduction

scikit-learn is a machine learning framework for Python.
If you use Anaconda, it comes preinstalled.
Official website
Official tutorial

2. The Bunch class

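The package name of scikit-learn is sklearn. It ships with a few small datasets that are convenient for getting started. A dataset is usually represented by an object of the Bunch class.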

    from sklearn import datasets
    iris = datasets.load_iris()

The code above loads the built-in iris dataset, which describes iris flowers.
Figure 2-1: A Bunch object, the iris dataset
Some of its attributes are listed below:

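data
An ndarray of shape (150, 4): 150 samples with 4 features each.
feature_names
Sepal length, sepal width, petal length, petal width. Each of the four names corresponds to one dimension of the feature vector.
target
The class labels, encoded as the integers 0, 1, 2. Also an ndarray, of shape (150,).
target_names
The class names. Iris is the broad category; the three specific kinds are 'setosa', 'versicolor', and 'virginica'.
DESCR
A description of the whole dataset, as a string meant for human readers.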
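
The attributes listed above can be checked directly on the loaded object. The following is a minimal sketch, assuming a reasonably recent scikit-learn installation; the variable names are illustrative only.

```python
from sklearn import datasets

iris = datasets.load_iris()

# data: one row per flower, one column per feature
data_shape = iris.data.shape
# target: one integer class label per row
target_shape = iris.target.shape

feature_names = list(iris.feature_names)
# Convert NumPy strings to plain str so they print cleanly
target_names = [str(name) for name in iris.target_names]

print(data_shape, target_shape)
print(feature_names)
print(target_names)

# A Bunch is a dict whose keys can also be read as attributes
same_object = iris["data"] is iris.data
print(same_object)
```

Because a Bunch is dictionary-like, `iris.keys()` is a quick way to see which attributes a given dataset provides.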

2.1 Built-in datasets

general-dataset-api

load_boston([return_X_y]) Load and return the boston house-prices dataset (regression). Note: this loader was deprecated in scikit-learn 1.0 and removed in 1.2.
load_iris([return_X_y]) Load and return the iris dataset (classification). Four features, three classes of iris.
load_diabetes([return_X_y]) Load and return the diabetes dataset (regression).
load_digits([n_class, return_X_y]) Load and return the digits dataset (classification).
load_linnerud([return_X_y]) Load and return the linnerud dataset (multivariate regression).
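
The `return_X_y` and `n_class` parameters in the signatures above can be sketched as follows. This is a small illustration, not a full tour of the loaders; it assumes the parameters are passed by keyword, which current scikit-learn versions require.

```python
from sklearn.datasets import load_digits, load_iris

# Default: a Bunch with data, target, DESCR, ...
iris = load_iris()

# return_X_y=True: only the (data, target) tuple
X, y = load_iris(return_X_y=True)
print(X.shape, y.shape)

# n_class restricts load_digits to the first n_class digit classes
X_digits, y_digits = load_digits(n_class=3, return_X_y=True)
classes = sorted(set(y_digits.tolist()))
print(classes)
```

The tuple form is handy when the data goes straight into `fit`, while the Bunch form keeps the names and the description alongside the arrays.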

3. General workflow

Prepare the dataset

    from sklearn.model_selection import train_test_split

    X_train, X_test, y_train, y_test = train_test_split(
        iris.data, iris.target, test_size=0.3)
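
A self-contained version of the split, shown as a sketch. The `random_state` and `stratify` arguments are assumptions added here for reproducibility and balanced classes; the original call uses neither.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()

X_train, X_test, y_train, y_test = train_test_split(
    iris.data,
    iris.target,
    test_size=0.3,         # 30% of the samples go to the test set
    random_state=0,        # assumption: fixed seed so the split is reproducible
    stratify=iris.target,  # assumption: keep class proportions in both parts
)

print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)
```

With 150 samples and `test_size=0.3`, 45 samples land in the test set and 105 in the training set.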

3.1 Training
3.2 Model persistence
3.3 Predicting unseen data
3.4 Code
The code below covers all three steps.

    from sklearn import datasets
    from sklearn import svm
    import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23


    def predict_and_display(clf, iris):
        # Predict the first three samples and print their class names.
        predict_result = []
        for i in clf.predict(iris.data[:3]):
            predict_result.append(str(iris.target_names[i]))
        print(predict_result)


    iris = datasets.load_iris()
    clf = svm.SVC()
    clf.fit(iris.data, iris.target)  # 3.1 training
    predict_and_display(clf, iris)   # 3.3 prediction

    # 3.2 persistence: serialize the model to disk, then load it back
    joblib.dump(clf, 'iris_model_persistence.pkl')
    clf2 = joblib.load('iris_model_persistence.pkl')
    predict_and_display(clf2, iris)

    """
    ['setosa', 'setosa', 'setosa']
    ['setosa', 'setosa', 'setosa']
    """

4. Evaluation with metrics

    from sklearn import metrics

Let y_pred be the model's output on the test set. Comparing y_true with y_pred shows how good the model is.
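
The comparison of y_true and y_pred can be sketched end to end as follows. The choice of `svm.SVC` with default parameters and the fixed `random_state` are assumptions made for illustration; any classifier with `fit` and `predict` would fit the same pattern.

```python
from sklearn import datasets, metrics, svm
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0
)

# Fit on the training part only
clf = svm.SVC()
clf.fit(X_train, y_train)

# y_true: known labels of the test set; y_pred: what the model predicts
y_true = y_test
y_pred = clf.predict(X_test)

acc = metrics.accuracy_score(y_true, y_pred)
print(f"test accuracy: {acc:.3f}")
```

Evaluating on held-out data, rather than on the samples used for training, gives a more honest picture of how the model handles unseen inputs.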

1. Binary classification

2. Multiclass classification

accuracy_score
Accuracy: the fraction of predictions that are correct.
accuracy_score(y_true, y_pred, *, normalize=True, sample_weight=None)
It follows from the definitions that accuracy_score = 1 - zero_one_loss (with normalize=True).
zero_one_loss
The 0-1 loss.
$f(y_t, y_p) = \begin{cases} 1, & y_t \neq y_p \\ 0, & y_t = y_p \end{cases}$
sklearn.metrics.zero_one_loss(y_true, y_pred, *, normalize=True, sample_weight=None)
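
The relation between the two metrics can be checked on a small hand-made example. The label lists below are made up for illustration.

```python
from sklearn.metrics import accuracy_score, zero_one_loss

y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 2, 2, 2, 1, 1]  # two of the six predictions are wrong

acc = accuracy_score(y_true, y_pred)   # fraction of correct predictions
loss = zero_one_loss(y_true, y_pred)   # fraction of wrong predictions

# normalize=False returns counts instead of fractions
n_correct = accuracy_score(y_true, y_pred, normalize=False)
n_wrong = zero_one_loss(y_true, y_pred, normalize=False)

print(acc, loss)
print(n_correct, n_wrong)
```

With `normalize=True` the two values sum to 1; with `normalize=False` they sum to the number of samples.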

3. Regression
