Learning Clustering Algorithms: sklearn.cluster.KMeans

Source: Internet. Editor: 程序博客网. Time: 2024/05/18 00:46

class sklearn.cluster.KMeans(n_clusters=8, init='k-means++', n_init=10, max_iter=300, tol=0.0001, precompute_distances='auto', verbose=0, random_state=None, copy_x=True, n_jobs=1, algorithm='auto')

(I) Input parameters:

(1) n_clusters: the number of clusters to form, which is also the number of centroids to generate. int, optional, default 8.

n_clusters : int, optional, default: 8
The number of clusters to form as well as the number of centroids to generate.

(2) init: the method used to initialize the centroids. Accepts 'k-means++', 'random', or an ndarray; defaults to 'k-means++'.

init : {'k-means++', 'random' or an ndarray}

'k-means++' : selects initial cluster centers for k-means clustering in a smart way to speed up convergence. See the Notes section in k_init for more details.
'random' : chooses k observations (rows) at random from the data as the initial centroids.
If an ndarray is passed, it should be of shape (n_clusters, n_features) and gives the initial centers.
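The idea behind 'k-means++' can be sketched in plain Python. This is a simplified illustration of the seeding principle (each new center is drawn with probability proportional to its squared distance from the nearest center already chosen), not scikit-learn's k_init routine. The helper names squared_distance and kmeans_pp_seeds are hypothetical.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as sequences."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_pp_seeds(points, n_clusters, seed=0):
    """Pick n_clusters initial centers from points using squared-distance sampling."""
    rng = random.Random(seed)
    # The first center is drawn uniformly at random from the data.
    centers = [rng.choice(points)]
    while len(centers) < n_clusters:
        # Weight every point by its squared distance to the nearest chosen center,
        # so points far from all current centers are more likely to be picked.
        weights = [min(squared_distance(p, c) for c in centers) for p in points]
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers


X = [(1, 2), (1, 4), (1, 0), (4, 2), (4, 4), (4, 0)]
seeds = kmeans_pp_seeds(X, n_clusters=2)
print(seeds)
```

A point that is already a center has weight 0, so it cannot be picked twice as long as the data contains enough distinct points.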

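(3) n_init: the number of times the algorithm is run with different initial centroids. int, default 10. The final result is the best of these runs in terms of inertia.

(PS: the centroid seeds at the start of each run are generated randomly, so the result can be better or worse from run to run. The algorithm is therefore run n_init times and the best result is kept.)

n_init : int, default: 10

Number of times the k-means algorithm will be run with different centroid seeds. The final result will be the best output of n_init consecutive runs in terms of inertia.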
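The "best of several runs" behaviour can be reproduced by hand. The sketch below, which assumes a reasonably recent scikit-learn and NumPy, fits one single-run model per seed and keeps the one with the lowest inertia; the seed range 0 to 4 is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0],
              [4, 2], [4, 4], [4, 0]], dtype=float)

# Run k-means once per seed (n_init=1) and record the inertia of each run.
runs = [KMeans(n_clusters=2, init="random", n_init=1, random_state=seed).fit(X)
        for seed in range(5)]
inertias = [run.inertia_ for run in runs]

# Keep the run with the lowest inertia, which is the selection rule
# the n_init description refers to.
best = min(runs, key=lambda run: run.inertia_)
print(inertias, best.inertia_)
```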

(4) max_iter: the maximum number of iterations of the algorithm for a single run. int, default 300.

max_iter : int, default: 300

Maximum number of iterations of the k-means algorithm for a single run.

(5) tol: used together with inertia to decide when the algorithm has converged. float, default 1e-4.

tol : float, default: 1e-4

Relative tolerance with regards to inertia to declare convergence.
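To see how max_iter and tol interact in practice, one can inspect how many iterations a run actually used. This sketch assumes the fitted estimator exposes an n_iter_ attribute (it is not listed in the text above, but scikit-learn sets it after fit); the two synthetic blobs are made up for the example.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
# Two well-separated blobs of 50 two-dimensional points each.
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
               rng.normal(8.0, 1.0, size=(50, 2))])

model = KMeans(n_clusters=2, n_init=1, max_iter=300, tol=1e-4, random_state=0).fit(X)

# n_iter_ reports how many iterations the single run needed;
# it can never exceed max_iter.
print(model.n_iter_)
```

On easy data like this the run typically stops long before max_iter because the convergence test controlled by tol is satisfied first.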

(6) precompute_distances: whether to precompute distances, which is faster but takes more memory. One of 'auto', True, False; default 'auto'.

This parameter trades memory for speed. With True, the whole distance matrix is kept in memory. With 'auto', distances are not precomputed (the same as False) once n_samples * n_clusters exceeds 12e6. With False, the core computation is implemented in Cython.

precompute_distances : {'auto', True, False}

Precompute distances (faster but takes more memory).

'auto' : do not precompute distances if n_samples * n_clusters > 12 million. This corresponds to about 100MB overhead per job using double precision.

True : always precompute distances

False : never precompute distances
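The "about 100MB" figure follows from simple arithmetic: a distance matrix with n_samples * n_clusters entries in double precision needs 8 bytes per entry. The helpers below are hypothetical and only illustrate the rule quoted above; they are not part of scikit-learn, and newer scikit-learn releases no longer accept this parameter at all.

```python
def precompute_overhead_bytes(n_samples, n_clusters, bytes_per_value=8):
    """Size in bytes of a dense n_samples x n_clusters matrix of doubles."""
    return n_samples * n_clusters * bytes_per_value


def auto_precompute(n_samples, n_clusters, threshold=12_000_000):
    """Mimic the documented 'auto' rule: precompute only at or below the threshold."""
    return n_samples * n_clusters <= threshold


# At the 12 million entry cutoff the matrix takes 96 MB, i.e. roughly 100MB.
overhead_mb = precompute_overhead_bytes(12_000_000, 1) / 1e6
print(overhead_mb)

print(auto_precompute(100_000, 8))     # 800,000 entries: below the cutoff
print(auto_precompute(2_000_000, 8))   # 16,000,000 entries: above the cutoff
```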

(7) verbose: verbosity mode, i.e. whether detailed information is printed while fitting. int, default 0.

verbose : int, default 0

Verbosity mode.

(8) random_state: the random number generator used to initialize the centroids.

random_state : int, RandomState instance or None, optional, default: None

If int, random_state is the seed used by the random number generator;

If RandomState instance, random_state is the random number generator;

If None, the random number generator is the RandomState instance used by np.random.
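Passing an int makes the result reproducible. A minimal sketch, assuming scikit-learn and NumPy are installed; the seed value 42 is arbitrary.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0],
              [4, 2], [4, 4], [4, 0]], dtype=float)

# Two independent estimators seeded with the same integer.
first = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)
second = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)

# With identical seeds both fits start from the same initial centroids,
# so they end up with the same centers and labels.
same_centers = np.allclose(first.cluster_centers_, second.cluster_centers_)
print(same_centers)
```

With random_state=None the two fits may still agree on such a small dataset, but nothing guarantees it.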

(9) copy_x: whether to work on a copy of the input data. boolean, default True.

When precomputing distances, centering the data first gives numerically more accurate results. If copy_x is True, the original data is left unchanged. If it is False, the original data is modified in place and restored before the function returns. Because the data mean is subtracted and then added back during the computation, the restored data may differ slightly from the original.

copy_x : boolean, default True

When pre-computing distances it is more numerically accurate to center the data first.

If copy_x is True, then the original data is not modified.

If False, the original data is modified, and put back before the function returns, but small numerical differences may be introduced by subtracting and then adding the data mean.
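The "small numerical differences" come from floating-point rounding when a mean is subtracted and then added back. The plain-Python sketch below illustrates that effect on made-up values; it does not call scikit-learn.

```python
values = [0.1, 0.2, 0.3, 1e8]
mean = sum(values) / len(values)

# Center the data, then shift it back, as described for copy_x=False.
centered = [v - mean for v in values]
restored = [c + mean for c in centered]

# Any difference between restored and original values is pure rounding error.
max_diff = max(abs(r - v) for r, v in zip(restored, values))
print(max_diff)
```

The difference is tiny compared with the data itself, which is why it matters only when the caller needs the input array to stay bit-for-bit identical.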

(10) n_jobs: the number of processes to use, which depends on the machine's CPUs. int, default 1.

n_jobs : int

The number of jobs to use for the computation. This works by computing each of the n_init runs in parallel.

If -1 all CPUs are used.

If 1 is given, no parallel computing code is used at all, which is useful for debugging.

For n_jobs below -1, (n_cpus + 1 + n_jobs) are used. Thus for n_jobs = -2, all CPUs but one are used.
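The rule for negative values can be written as a small function. effective_cpus is a hypothetical helper that only encodes the formula quoted above; it assumes n_jobs is non-zero, and clamping the result to at least 1 is an assumption of this sketch.

```python
def effective_cpus(n_jobs, n_cpus):
    """Number of CPUs used for a given n_jobs on a machine with n_cpus cores."""
    if n_jobs < 0:
        # -1 means all CPUs, -2 means all but one, and so on.
        # Clamp to 1 so a very negative value still uses one CPU (assumption).
        return max(n_cpus + 1 + n_jobs, 1)
    return n_jobs


for n_jobs in (1, -1, -2):
    print(n_jobs, effective_cpus(n_jobs, n_cpus=8))
```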

(11) algorithm: which k-means algorithm to use. One of "auto", "full" or "elkan"; default "auto".

algorithm : "auto", "full" or "elkan", default="auto"

K-means algorithm to use. The classical EM-style algorithm is "full".

The "elkan" variation is more efficient by using the triangle inequality, but currently doesn't support sparse data.

"auto" chooses "elkan" for dense data and "full" for sparse data.
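The triangle-inequality trick can be illustrated in a few lines. If the distance between the current best center and another center is at least twice the distance from the point to the current best center, the other center cannot be closer, so its distance never needs to be computed. The sketch below is a simplified illustration of that single pruning rule, not scikit-learn's Elkan implementation; the function names are hypothetical.

```python
import math


def dist(p, q):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def assign_with_pruning(points, centers):
    """Assign each point to its nearest center, skipping provably farther centers."""
    # Center-to-center distances are computed once and reused for every point.
    center_dist = [[dist(a, b) for b in centers] for a in centers]
    labels, skipped = [], 0
    for x in points:
        best, best_d = 0, dist(x, centers[0])
        for j in range(1, len(centers)):
            # Triangle inequality: if d(c_best, c_j) >= 2 * d(x, c_best),
            # then c_j cannot be closer to x than c_best, so skip it.
            if center_dist[best][j] >= 2 * best_d:
                skipped += 1
                continue
            d = dist(x, centers[j])
            if d < best_d:
                best, best_d = j, d
        labels.append(best)
    return labels, skipped


X = [(1, 2), (1, 4), (1, 0), (4, 2), (4, 4), (4, 0)]
centers = [(1, 2), (4, 2)]
labels, skipped = assign_with_pruning(X, centers)

# Brute-force assignment for comparison: compute every distance.
brute = [min(range(len(centers)), key=lambda j: dist(x, centers[j])) for x in X]
print(labels, skipped)
```

The pruned assignment matches the brute-force one while computing fewer point-to-center distances; the saving grows with the number of clusters.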

(II) Attributes

cluster_centers_ : array, [n_clusters, n_features]. Coordinates of the cluster centers.

labels_ : label of each point.

inertia_ : float. Sum of squared distances of samples to their closest cluster center.
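inertia_ is the sum of squared distances from each sample to its assigned center, which can be checked by recomputing it from cluster_centers_ and labels_. A minimal sketch, assuming scikit-learn and NumPy are installed:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0],
              [4, 2], [4, 4], [4, 0]], dtype=float)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Look up the center assigned to each sample, then sum the squared distances.
assigned_centers = kmeans.cluster_centers_[kmeans.labels_]
manual_inertia = np.sum((X - assigned_centers) ** 2)
print(manual_inertia, kmeans.inertia_)
```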

(III) Example

>>> from sklearn.cluster import KMeans
>>> import numpy as np
>>> X = np.array([[1, 2], [1, 4], [1, 0],
...               [4, 2], [4, 4], [4, 0]])
>>> kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
>>> kmeans.labels_
array([0, 0, 0, 1, 1, 1], dtype=int32)
>>> kmeans.predict([[0, 0], [4, 4]])
array([0, 1], dtype=int32)
>>> kmeans.cluster_centers_
array([[ 1.,  2.],
       [ 4.,  2.]])

(IV) Methods

fit(X[, y]): Compute k-means clustering.
fit_predict(X[, y]): Compute cluster centers and predict the cluster index for each sample.
fit_transform(X[, y]): Compute clustering and transform X to cluster-distance space.
get_params([deep]): Get parameters for this estimator.
predict(X): Predict the closest cluster each sample in X belongs to.
score(X[, y]): Opposite of the value of X on the k-means objective.
set_params(**params): Set the parameters of this estimator.
transform(X): Transform X to a cluster-distance space.
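A short sketch of how several of these methods fit together, assuming scikit-learn and NumPy are installed. It reuses the toy data from the example section.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0],
              [4, 2], [4, 4], [4, 0]], dtype=float)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)

# fit_predict fits the model and returns the cluster index of each sample.
labels = kmeans.fit_predict(X)

# transform maps each sample to its distances from every cluster center,
# giving an array of shape (n_samples, n_clusters).
distances = kmeans.transform(X)

# predict assigns new samples to the closest learned center.
new_labels = kmeans.predict(np.array([[0.0, 0.0], [4.0, 4.0]]))

# score returns the negative of the k-means objective evaluated on X.
score = kmeans.score(X)
print(distances.shape, new_labels, score)
```

Because score is the negated objective, a value closer to zero means a tighter clustering of the data passed in.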

(V) Test

# encoding = utf-8
"""
@version: ??
@author: xq
@contact: xiaoq_xiaoq@163.com
@file: test.py
@time: 2017/10/18 14:29
"""
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from matplotlib.font_manager import FontProperties


class clusterApi(object):
    def __init__(self, data):
        self.data = data
        # Font for the plot titles (Windows path; adjust on other systems)
        self.font = FontProperties(fname='C:/Windows/Fonts/msyh.ttf')

    def initData(self):
        '''
        Preprocess the data into a uniform format.
        :return: data in a fixed format
        '''
        initdata = pd.DataFrame(self.data)
        scatterData = initdata[['Id', 'lat', 'lng']]
        return scatterData

    def k_meansUp(self):
        pointsData = self.initData()  # data to cluster
        coords = pointsData[['lat', 'lng']]  # cluster on coordinates only, not on Id
        plt.figure()
        plt.subplot(3, 2, 1)  # first cell of the 3x2 grid used below
        lats = pointsData.lat
        lngs = pointsData.lng
        plt.title(u'Samples', fontproperties=self.font)  # subplot title
        plt.scatter(lngs, lats, s=3)  # scatter plot of the raw samples
        colors = ['b', 'g', 'r', 'c', 'm', 'y', 'k', 'b']  # plot colors
        markers = ['o', 's', 'D', 'v', '^', 'p', '*', '+']  # plot markers
        testsK = [2, 3, 4, 5, 8]  # values of k to try
        subplot_counter = 1  # position of the current subplot
        for t in testsK:
            subplot_counter += 1
            plt.subplot(3, 2, subplot_counter)
            kmeans_model = KMeans(n_clusters=t).fit(coords)
            # Draw every point with the color and marker of its cluster
            for i, l in enumerate(kmeans_model.labels_):
                plt.plot(lngs[i], lats[i], markersize=2, color=colors[l],
                         marker=markers[l], ls='None')
            plt.title(u'K = %s' % t, fontproperties=self.font)
        plt.show()


def main():
    # Test data
    stopList = [{'Id': '50001', 'lat': 28.571906, 'lng': 112.337788},
                {'Id': '50001', 'lat': 28.573678, 'lng': 112.381103},
                {'Id': '50001', 'lat': 28.571915, 'lng': 112.337533},
                {'Id': '50001', 'lat': 28.573978, 'lng': 112.35765},
                {'Id': '50001', 'lat': 28.572656, 'lng': 112.3366},
                {'Id': '50001', 'lat': 28.578011, 'lng': 112.330688},
                {'Id': '50001', 'lat': 28.572228, 'lng': 112.335841},
                {'Id': '50001', 'lat': 28.57849, 'lng': 112.3338},
                {'Id': '50001', 'lat': 28.57239, 'lng': 112.336491},
                {'Id': '50001', 'lat': 28.577943, 'lng': 112.330995},
                {'Id': '50001', 'lat': 28.571921, 'lng': 112.337783},
                {'Id': '50001', 'lat': 28.572401, 'lng': 112.3359},
                {'Id': '50001', 'lat': 28.569629, 'lng': 112.34005},
                {'Id': '50001', 'lat': 28.588048, 'lng': 112.337783},
                {'Id': '50001', 'lat': 28.572035, 'lng': 112.335683},
                {'Id': '50001', 'lat': 28.560938, 'lng': 112.378183},
                {'Id': '50001', 'lat': 28.544781, 'lng': 112.494936},
                {'Id': '50001', 'lat': 28.572296, 'lng': 112.336288},
                {'Id': '50001', 'lat': 28.571951, 'lng': 112.337806},
                {'Id': '50001', 'lat': 28.571551, 'lng': 112.32685}]
    print('Total points: %d' % len(stopList))
    clustertest = clusterApi(stopList)  # instantiate
    clustertest.k_meansUp()  # cluster and plot


if __name__ == '__main__':
    main()

References:

http://blog.csdn.net/xiaoyi_zhang/article/details/52269242

http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html

http://www.cnblogs.com/wuchuanying/p/6264025.html
