Implementing the k-means algorithm in Python

Source: Internet. Editor: 程序博客网. Date: 2024/05/17 10:28

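This is also Exercise 9.4 in Zhou Zhihua's Machine Learning (《机器学习》).
The dataset is watermelon dataset 4.0, listed below as CSV. The columns are 编号 (ID), 密度 (density) and 含糖率 (sugar content); the header is kept in Chinese because the code selects columns by these names.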

编号,密度,含糖率
1,0.697,0.46
2,0.774,0.376
3,0.634,0.264
4,0.608,0.318
5,0.556,0.215
6,0.403,0.237
7,0.481,0.149
8,0.437,0.211
9,0.666,0.091
10,0.243,0.267
11,0.245,0.057
12,0.343,0.099
13,0.639,0.161
14,0.657,0.198
15,0.36,0.37
16,0.593,0.042
17,0.719,0.103
18,0.359,0.188
19,0.339,0.241
20,0.282,0.257
21,0.784,0.232
22,0.714,0.346
23,0.483,0.312
24,0.478,0.437
25,0.525,0.369
26,0.751,0.489
27,0.532,0.472
28,0.473,0.376
29,0.725,0.445
30,0.446,0.459
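For readers who do not have the CSV file at hand, the listing can also be parsed straight from a string. The following is only a sketch: it uses NumPy's loadtxt rather than the pandas call in the script, the variable names are made up for illustration, and only the first three rows are included to keep it short.

```python
import io

import numpy as np

# First three rows of the listing; paste the remaining rows to get the full set.
csv_text = """编号,密度,含糖率
1,0.697,0.46
2,0.774,0.376
3,0.634,0.264
"""

# Skip the header line and keep only the density and sugar-content columns.
samples = np.loadtxt(io.StringIO(csv_text), delimiter=",", skiprows=1, usecols=(1, 2))
print(samples.shape)
```

Each row of samples is then one (density, sugar content) point, the same layout the script obtains from pandas via .values.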

The algorithm is simple, so I won't explain it here. The code is not complicated either, so here it is:

# -*- coding: utf-8 -*-
"""Exercise 9.4"""
import random
import sys

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

data = pd.read_csv('../dataset/watermelon4.0.csv', sep=',')[["密度", "含糖率"]].values

########################################## K-means #######################################
k = int(sys.argv[1])
# Randomly choose k samples from data as the initial mean vectors
mean_vectors = [data[i] for i in random.sample(range(len(data)), k)]


def dist(p1, p2):
    """Euclidean distance between two points."""
    return np.sqrt(np.sum((p1 - p2) ** 2))


while True:
    print(mean_vectors)
    # Start every cluster empty so that a mean vector is never counted as a sample
    clusters = [[] for _ in mean_vectors]
    for sample in data:
        distances = [dist(sample, m) for m in mean_vectors]
        min_index = distances.index(min(distances))
        clusters[min_index].append(sample)
    new_mean_vectors = []
    for c, v in zip(clusters, mean_vectors):
        # Keep the old mean vector if no sample was assigned to this cluster
        new_mean_vector = np.mean(c, axis=0) if c else v
        # If the relative change between the new and the old mean vector is
        # below 0.0001 in every component, do not update the mean vector
        if np.all(np.abs((new_mean_vector - v) / v) < 0.0001):
            new_mean_vectors.append(v)
        else:
            new_mean_vectors.append(new_mean_vector)
    if np.array_equal(mean_vectors, new_mean_vectors):
        break
    mean_vectors = new_mean_vectors

# Show the clustering result
total_colors = ['r', 'y', 'g', 'b', 'c', 'm', 'k']
colors = random.sample(total_colors, k)
for cluster, color in zip(clusters, colors):
    density = [arr[0] for arr in cluster]
    sugar_content = [arr[1] for arr in cluster]
    plt.scatter(density, sugar_content, c=color)
plt.show()
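The loop in the script alternates between two steps: assign every sample to its nearest mean vector, then recompute each mean vector from its assigned samples. As a rough illustration, the same idea can be written as a vectorized NumPy function. This is a sketch rather than the script's exact logic: the function name kmeans, the seed and max_iter parameters, and the np.allclose stopping test are all assumptions made for this example, and the toy points are made up.

```python
import numpy as np


def kmeans(data, k, seed=None, max_iter=100):
    """Cluster the rows of `data` into k groups; return (labels, centers)."""
    rng = np.random.default_rng(seed)
    # Initialise the centers with k distinct samples chosen at random.
    centers = data[rng.choice(len(data), size=k, replace=False)].astype(float)
    for _ in range(max_iter):
        # Assignment step: index of the nearest center for every sample.
        distances = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: each center moves to the mean of its samples;
        # a center with no samples stays where it is.
        new_centers = centers.copy()
        for j in range(k):
            members = data[labels == j]
            if len(members) > 0:
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers


# Two well-separated toy groups (made-up points, not the watermelon data).
toy = np.array([[0.0, 0.0], [0.2, 0.0], [0.0, 0.2],
                [5.0, 5.0], [5.2, 5.0], [5.0, 5.2]])
labels, centers = kmeans(toy, k=2, seed=0)
print(labels)
print(centers)
```

Passing a fixed seed makes the initial choice, and therefore the result, repeatable between runs; leaving it as None gives a different initialization each time.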

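To run it, enter python k_means.py 4 on the command line, where 4 is the value of k.
Below are the results for k = 3, 4 and 5. Because the initial mean vectors are chosen at random, the result can differ from run to run.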
k=3

k=4

k=5
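Since different runs can end in different clusterings, it helps to have a number for comparing them. One common choice is the squared error that k-means tries to minimize: the sum of squared distances from each sample to the mean vector of its cluster, where lower is better. The sketch below computes it for two hand-made groupings of four made-up points; the function name squared_error and all the data are assumptions for illustration only.

```python
import numpy as np


def squared_error(data, labels, centers):
    """Sum of squared distances from each sample to its cluster's center."""
    diffs = data - centers[labels]
    return float(np.sum(diffs ** 2))


points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])

# Grouping A: left pair versus right pair.
labels_a = np.array([0, 0, 1, 1])
centers_a = np.array([[0.0, 1.0], [10.0, 1.0]])

# Grouping B: bottom pair versus top pair.
labels_b = np.array([0, 1, 0, 1])
centers_b = np.array([[5.0, 0.0], [5.0, 2.0]])

error_a = squared_error(points, labels_a, centers_a)
error_b = squared_error(points, labels_b, centers_b)
print(error_a, error_b)
```

Grouping A has the smaller error here, so it would be the one to keep. The same comparison could be applied to several runs of the script with different random initializations.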

0 0