Machine Learning: Unsupervised Clustering

Source: Internet | Published by: 知乎矢仓枫子 | Edited by: 程序博客网 | Time: 2024/06/01 08:24

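Clustering groups samples by measuring the distance between them: samples that lie close together end up in the same cluster.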

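# Unsupervised learning: the data carries no labels saying which class each row
# belongs to, so no model is trained against a known target.
import pandas as pd

votes = pd.read_csv("D:\\test\\machineLearning\\114_congress.csv")
# Number of senators per party
print(votes["party"].value_counts())
# Mean of each vote column (non-numeric columns are skipped)
print(votes.mean(numeric_only=True))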

R    54
D    44
I     2
Name: party, dtype: int64
00001    0.325
00004    0.575
00005    0.535
00006    0.945
00007    0.545
00008    0.415
00009    0.545
00010    0.985
00020    0.525
00026    0.545
00032    0.410
00038    0.480
00039    0.510
00044    0.460
00047    0.370
dtype: float64

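from sklearn.metrics.pairwise import euclidean_distances

# Euclidean distance between senators' vote vectors (columns 3 onward).
# reshape(1, -1) fixes the row count at 1; -1 lets NumPy infer the column count.
first = votes.iloc[0, 3:].to_numpy(dtype=float).reshape(1, -1)
second = votes.iloc[1, 3:].to_numpy(dtype=float).reshape(1, -1)
third = votes.iloc[2, 3:].to_numpy(dtype=float).reshape(1, -1)

print(euclidean_distances(first, second))
distance = euclidean_distances(first, third)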

[[ 1.73205081]]
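The printed value 1.73205081 is √3. That is consistent with, for example, two senators whose votes differ by a full vote (0 versus 1) in exactly three roll calls. The following is a minimal sketch of the calculation using only the standard library; the two vectors are made up for illustration and are not rows from the CSV.

```python
import math

# Hypothetical vote vectors (1 = yes, 0 = no); not taken from 114_congress.csv.
senator_a = [0.0, 1.0, 1.0, 1.0, 0.0]
senator_b = [0.0, 1.0, 0.0, 0.0, 1.0]

# Euclidean distance: square root of the sum of squared differences.
manual = math.sqrt(sum((a - b) ** 2 for a, b in zip(senator_a, senator_b)))

# math.dist (Python 3.8+) computes the same quantity.
builtin = math.dist(senator_a, senator_b)

print(manual, builtin)
```

The vectors differ in three positions, so both values equal √3.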

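import pandas as pd
from sklearn.cluster import KMeans

# KMeans is one clustering algorithm; n_clusters is the number of clusters to form.
# random_state=1 fixes the random seed so the clustering result is reproducible.
kmeans_model = KMeans(n_clusters=2, random_state=1)
# fit_transform fits the model and returns each senator's distance to every cluster center
senator_distances = kmeans_model.fit_transform(votes.iloc[:, 3:])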
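To make the return value of fit_transform concrete, here is a small sketch on synthetic data. The array X and the two groups are assumptions standing in for the vote matrix; n_init=10 is set explicitly only to keep the behavior the same across scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the vote matrix: two well-separated groups of rows.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=0.0, scale=0.1, size=(10, 4))
group_b = rng.normal(loc=1.0, scale=0.1, size=(10, 4))
X = np.vstack([group_a, group_b])

model = KMeans(n_clusters=2, random_state=1, n_init=10)
# Distance from each row to each cluster center
distances = model.fit_transform(X)

print(distances.shape)  # (20, 2): one column per cluster
# Each row's label is the index of its nearest center.
print(np.array_equal(distances.argmin(axis=1), model.labels_))
```

Each column of the result is the distance to one cluster center, which is why the two columns can later serve as x and y coordinates in a scatter plot.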

# Two clusters were requested, so each senator gets label 0 or 1.
labels = kmeans_model.labels_
# print(labels)
# crosstab builds a table counting how many senators of each party carry each label.
# In the result, D and R separate cleanly, which suggests the clustering is meaningful.
print(pd.crosstab(labels, votes["party"]))

party   D  I   R
row_0
0      41  2   0
1       3  0  54
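For readers unfamiliar with pd.crosstab, the sketch below shows what it counts. The labels and parties are invented for six hypothetical senators, and the name "cluster" for the row index is an illustrative choice.

```python
import pandas as pd

# Hypothetical cluster labels and party affiliations for six senators.
labels = [0, 0, 1, 1, 1, 0]
party = pd.Series(["D", "D", "R", "R", "D", "I"], name="party")

# Rows are cluster labels, columns are parties, cells are counts.
table = pd.crosstab(pd.Series(labels, name="cluster"), party)
print(table)
```

A cell such as row 1, column D holds the number of Democrats assigned to cluster 1, which is how the three outliers in the table above show up.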

# The table above shows 3 Democrats in cluster 1, voting closely with the Republicans.
# Select them to see who they are.
democratic = votes[(labels == 1) & (votes["party"] == "D")]
print(democratic)

        name party state  00001  00004  00005  00006  00007  00008  00009  \
42  Heitkamp     D    ND    0.0    1.0    0.0    1.0    0.0    0.0    1.0
56   Manchin     D    WV    0.0    1.0    0.0    1.0    0.0    0.0    1.0
74      Reid     D    NV    0.5    0.5    0.5    0.5    0.5    0.5    0.5

    00010  00020  00026  00032  00038  00039  00044  00047
42    1.0    0.0    0.0    0.0    1.0    0.0    0.0    0.0
56    1.0    1.0    0.0    0.0    1.0    1.0    0.0    0.0
74    0.5    0.5    0.5    0.5    0.5    0.5    0.5    0.5

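import matplotlib.pyplot as plt

# Scatter plot to make the clusters easier to see:
# x = distance to the first cluster center, y = distance to the second, colored by label.
plt.scatter(x=senator_distances[:, 0], y=senator_distances[:, 1], c=labels)
plt.show()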


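# Outlier analysis: find the senators farthest from the cluster centers.
# Cubing the distances before summing makes large distances count much more.
extremism = (senator_distances ** 3).sum(axis=1)
votes["extremism"] = extremism
votes.sort_values("extremism", inplace=True, ascending=False)
print(votes.head(2))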

        name party state  00001  00004  00005  00006  00007  00008  00009  \
98    Wicker     R    MS    0.0    1.0    1.0    1.0    1.0    0.0    1.0
53  Lankford     R    OK    0.0    1.0    1.0    0.0    1.0    0.0    1.0

    00010  00020  00026  00032  00038  00039  00044  00047  extremism
98    0.0    1.0    1.0    0.0    0.0    1.0    0.0    0.0  46.250476
53    1.0    1.0    1.0    0.0    0.0    1.0    0.0    0.0  46.046873
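One plausible reading of why the distances are cubed rather than summed directly: a plain sum can rank a moderate senator above an extreme one, while cubing lets a single large distance dominate. The sketch below uses invented distances to show the ranking flip; the names moderate and extreme and the helper extremism_score are illustrative only.

```python
# Hypothetical distances to the two cluster centers for two senators.
moderate = [1.0, 1.0]  # equally far from both centers
extreme = [0.1, 1.5]   # very close to one center, far from the other

def extremism_score(distances):
    # Same formula as in the code above: cube each distance, then sum.
    return sum(d ** 3 for d in distances)

# A plain sum ranks the moderate senator higher.
print(sum(moderate), sum(extreme))
# The cubed sum ranks the extreme senator higher.
print(extremism_score(moderate), extremism_score(extreme))
```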

0 0