Machine Learning with sklearn: A Beginner's Introduction, Part 1

Source: Internet | Published by: 淘宝店铺管控记录 | Editor: 程序博客网 | Time: 2024/05/10 11:53

I. Let's start with a program

root@ubuntu:~/work# vi t1.py
from sklearn.datasets import load_iris
from sklearn.cross_validation import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

iris = load_iris()

X = iris.data
y = iris.target

for i in xrange(1,5):
    print "random_state is ", i,", and accuracy score is:"
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=i)

    knn = KNeighborsClassifier(n_neighbors=5)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    print metrics.accuracy_score(y_test, y_pred)
"t1.py" 22L, 585C written

root@ubuntu:~/work# python t1.py
/usr/local/lib/python2.7/dist-packages/sklearn/cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
"This module will be removed in 0.20.", DeprecationWarning)
random_state is 1 , and accuracy score is:
1.0
random_state is 2 , and accuracy score is:
1.0
random_state is 3 , and accuracy score is:
0.966666666667
random_state is 4 , and accuracy score is:
0.966666666667
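
The session above runs under Python 2.7 and imports from the deprecated sklearn.cross_validation module, which is why the warning shows up. As a rough sketch only, assuming Python 3 and a scikit-learn release where train_test_split lives in sklearn.model_selection, the same experiment might look like the following. The helper name knn_accuracy is made up for illustration, and the scores on a newer install are not guaranteed to match the transcript.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier


def knn_accuracy(seed, test_size=0.2, k=5):
    """Hypothetical helper: split iris with the given seed, fit KNN, return accuracy."""
    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=seed
    )
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    return accuracy_score(y_test, knn.predict(X_test))


# Same seeds as the original loop: 1, 2, 3, 4
scores = [knn_accuracy(seed) for seed in range(1, 5)]
for seed, score in zip(range(1, 5), scores):
    print("random_state is", seed, ", and accuracy score is:", score)
```

Wrapping the steps in a function keeps the loop short and makes it easy to try other values of test_size or k later.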

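Explanation:

1 iris = load_iris() loads the classic iris dataset.

2 X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=i)

X: the feature matrix of the samples to split.

y: the labels (targets) of the samples to split.

test_size: the share of samples assigned to the test set; if it is an integer, it is the absolute number of test samples. Here it is 0.2, so the training set takes 80% of the samples and the test set 20%. For example, with 150 samples, the training set has 120 samples and the test set has 30.

random_state: the seed of the random number generator. You can think of the seed as an ID for one particular random sequence: it guarantees the same split whenever an experiment has to be repeated. If you pass 1 every time and keep the other arguments unchanged, you get the same split every time. Any fixed integer behaves this way, including 0; only when random_state is left unset (None) does the split change from run to run. Two rules relate seeds to random numbers: different seeds generally produce different sequences, and the same seed produces the same sequence even across different generator instances.

3 knn = KNeighborsClassifier(n_neighbors=5) selects the k-nearest-neighbors model.

4 knn.fit(X_train, y_train) trains the model on the training data.

5 y_pred = knn.predict(X_test) predicts labels for the test data.

6 print metrics.accuracy_score(y_test, y_pred) calls the accuracy metric on the true labels and the predictions, i.e. the fraction of test samples classified correctly.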
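
The test_size and random_state parameters are easy to check on a tiny array. The sketch below assumes Python 3 and the sklearn.model_selection import path; the toy data and variable names are invented for the demonstration. Note that an integer seed, including 0, is a fixed seed, so the split repeats exactly.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 samples with 2 features each (made up for this demo)
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Same seed twice: the two test sets should be identical
_, test_a, _, _ = train_test_split(X, y, test_size=0.2, random_state=1)
_, test_b, _, _ = train_test_split(X, y, test_size=0.2, random_state=1)
same_seed_equal = np.array_equal(test_a, test_b)

# Seed 0 is also a fixed seed, so it is reproducible as well
_, test_c, _, _ = train_test_split(X, y, test_size=0.2, random_state=0)
_, test_d, _, _ = train_test_split(X, y, test_size=0.2, random_state=0)
zero_seed_equal = np.array_equal(test_c, test_d)

# An integer test_size is an absolute number of test samples
_, test_e, _, _ = train_test_split(X, y, test_size=3, random_state=1)

print(same_seed_equal, zero_seed_equal)
print(test_a.shape, test_e.shape)  # 20% of 10 samples, then exactly 3 samples
```

Leaving random_state unset cannot be shown with a fixed assertion, since the whole point is that the split changes between runs.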

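Also, the value chosen for test_size matters.

With test_size=0.95, the test set takes 95% of the samples and the training set only 5%, and accuracy drops sharply, as shown below: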

root@ubuntu:~/work# vi t1.py

from sklearn.datasets import load_iris
from sklearn.cross_validation import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

# read in the iris data
iris = load_iris()

X = iris.data
y = iris.target

for i in xrange(1,5):
    print "random_state is ", i,", and accuracy score is:"
    X_train, X_test, y_train, y_test = train_test_split(X, y,test_size=0.95, random_state=i)

    knn = KNeighborsClassifier(n_neighbors=5)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    print metrics.accuracy_score(y_test, y_pred)
"t1.py" 22L, 586C written
root@ubuntu:~/work# python t1.py
/usr/local/lib/python2.7/dist-packages/sklearn/cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
"This module will be removed in 0.20.", DeprecationWarning)
random_state is 1 , and accuracy score is:
0.363636363636
random_state is 2 , and accuracy score is:
0.321678321678
random_state is 3 , and accuracy score is:
0.314685314685
random_state is 4 , and accuracy score is:
0.545454545455
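
To look at what happens between the two extremes, here is a small sketch, again assuming Python 3 and sklearn.model_selection. The intermediate test_size values and the results dictionary are illustrative choices, not part of the original experiment. For each test_size it records the training-set size, the test-set size, and the mean accuracy over the same four seeds.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

results = {}  # test_size -> (n_train, n_test, mean accuracy)
for test_size in (0.2, 0.5, 0.8, 0.95):
    scores = []
    for seed in range(1, 5):
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=test_size, random_state=seed
        )
        knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
        # score() returns the mean accuracy on the given test data
        scores.append(knn.score(X_test, y_test))
    results[test_size] = (len(X_train), len(X_test), sum(scores) / len(scores))
    print(test_size, results[test_size])
```

With test_size=0.95 only 7 of the 150 samples are left for training, so a 5-neighbor vote draws on almost the entire training set. That goes a long way toward explaining the low scores in the transcript.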

II. Another train_test_split example

root@ubuntu:~/work# vi t3.py
from sklearn.datasets import load_iris
x=load_iris().data
y=load_iris().target
print x.shape
print y.shape
print "111"
from sklearn.cross_validation import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2)
print x_train.shape
print x_test.shape
"t3.py" 11L, 279C written

root@ubuntu:~/work# python t3.py
(150, 4)
(150,)
111
"This module will be removed in 0.20.", DeprecationWarning)
(120, 4)
(30, 4)

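Notes:

The iris dataset contains 150 samples, each with 4 features.

x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2)

test_size=0.2 means the test set takes 20% of the samples: 150 × 20% = 30 samples, leaving 120 for the training set.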
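
The arithmetic can be checked directly. This is a minimal sketch, assuming Python 3 and sklearn.model_selection. The helper split_sizes is hypothetical: it encodes the assumption that, when only a fractional test_size is given, scikit-learn rounds the test count up and hands the remainder to the training set.

```python
import math

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split


def split_sizes(n_samples, test_size):
    """Hypothetical helper: (n_train, n_test) for a fractional test_size."""
    n_test = math.ceil(test_size * n_samples)  # assumed rounding rule
    return n_samples - n_test, n_test


x, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

expected = split_sizes(len(x), 0.2)
actual = (x_train.shape[0], x_test.shape[0])
print(expected, actual)
print(x_train.shape, x_test.shape, y_train.shape, y_test.shape)
```

The labels are split along with the features, so y_train and y_test have the same number of rows as x_train and x_test.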

0 0