Other Clustering Methods

Source: Internet | Published by: js url decode | Editor: 程序博客网 | Time: 2024/06/06 05:18

Grid-based methods:

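Grid methods are a common way to discretize spatial data in spatial data processing. They quantize the object space into a finite number of cells; these cells form a grid structure, and all clustering operations are carried out on that structure. Like hierarchical clustering, grid-based methods come in top-down and bottom-up forms, and the underlying ideas are broadly similar. Representative bottom-up algorithms are WaveCluster and CLIQUE; the main top-down ones are OptiGrid and CLTree. The main advantages of grid methods are fast processing, easy incremental implementation, suitability for parallel processing, and good handling of high-dimensional data.
STING is also a grid-based clustering technique, and it takes a multi-resolution approach. However, the quality of its clusters depends on the granularity of the lowest level of the grid structure, so there is a trade-off between run time and accuracy.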
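The basic grid idea can be illustrated with a small dependency-free sketch. It is not WaveCluster, CLIQUE, or STING; it only shows the shared steps of quantizing points into cells, keeping the dense cells, and merging adjacent dense cells into clusters. The function name `grid_cluster` and the parameters `cell_size` and `min_points` are hypothetical names chosen for this illustration.

```python
from collections import defaultdict, deque
from itertools import product


def grid_cluster(points, cell_size, min_points):
    """Illustrative grid clustering: returns one label per point, -1 for sparse cells."""
    # 1. Quantize: map every point to the integer coordinates of its grid cell.
    cells = defaultdict(list)
    for idx, point in enumerate(points):
        key = tuple(int(coord // cell_size) for coord in point)
        cells[key].append(idx)

    # 2. Keep only the cells that contain at least `min_points` points.
    dense = {key for key, members in cells.items() if len(members) >= min_points}

    # 3. Flood-fill over neighbouring dense cells (diagonal neighbours included).
    dims = len(points[0]) if points else 0
    offsets = [o for o in product((-1, 0, 1), repeat=dims) if any(o)]
    labels = [-1] * len(points)
    seen = set()
    cluster_id = 0
    for start in sorted(dense):
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        while queue:
            cell = queue.popleft()
            for idx in cells[cell]:
                labels[idx] = cluster_id
            for off in offsets:
                neighbour = tuple(c + o for c, o in zip(cell, off))
                if neighbour in dense and neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        cluster_id += 1
    return labels


sample = [(0.1, 0.1), (0.2, 0.3), (0.4, 0.4), (1.1, 0.2), (1.3, 0.4),
          (5.0, 5.0), (5.1, 5.2), (5.3, 5.1), (9.0, 0.0)]
print(grid_cluster(sample, cell_size=1.0, min_points=2))
```

The work after the quantization step depends on the number of occupied cells rather than the number of points, which is where the speed of grid methods comes from. The choice of `cell_size` plays the role of the lowest-level granularity mentioned above: coarser cells run faster but blur cluster boundaries.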

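Model-based methods:
Model-based methods assume a model for each cluster and then look for the data that best fit that model. Such a model might be the density distribution function of the data points in space, or something else. Examples include the EM algorithm, Monte Carlo methods, the neural-network-based SOON, and various hybrid algorithms.
Moreover, single-dimensional clustering methods no longer suit the diversity of big data; only multidimensional cluster analysis, which extends the single-dimensional clustering problem, can provide a new way of exploratory analysis for complex data.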
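As a concrete instance of the model-based idea, here is a minimal sketch of the EM algorithm for a one-dimensional Gaussian mixture, written without third-party libraries. The function names, the quantile-based initialization, and the fixed iteration count are assumptions made for this illustration, not something the text above prescribes.

```python
import math
import random


def gaussian_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def em_gmm_1d(data, n_components=2, n_iter=50):
    """Fit a 1-D Gaussian mixture with EM; returns (weights, means, variances)."""
    n = len(data)
    ordered = sorted(data)
    overall_mean = sum(data) / n
    overall_var = sum((x - overall_mean) ** 2 for x in data) / n

    # Assumed initialization: means at evenly spaced quantiles, shared variance.
    means = [ordered[int((k + 0.5) * n / n_components)] for k in range(n_components)]
    variances = [max(overall_var, 1e-6)] * n_components
    weights = [1.0 / n_components] * n_components

    for _ in range(n_iter):
        # E-step: responsibility of each component for each data point.
        resp = []
        for x in data:
            dens = [w * gaussian_pdf(x, m, v)
                    for w, m, v in zip(weights, means, variances)]
            total = sum(dens) or 1e-300  # guard against underflow to zero
            resp.append([d / total for d in dens])

        # M-step: re-estimate weight, mean, and variance of each component.
        for k in range(n_components):
            nk = max(sum(r[k] for r in resp), 1e-12)
            weights[k] = nk / n
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var_k = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            variances[k] = max(var_k, 1e-6)  # keep the variance strictly positive
    return weights, means, variances


rng = random.Random(42)
demo = ([rng.gauss(0.0, 1.0) for _ in range(200)]
        + [rng.gauss(10.0, 1.0) for _ in range(200)])
print(em_gmm_1d(demo))
```

Each point ends up with a probability of belonging to every component, and a hard cluster label can be read off as the component with the largest responsibility. For multidimensional data, a library implementation such as scikit-learn's `GaussianMixture` is the more practical choice.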

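Fuzzy cluster analysis
In real life, some questions cannot be answered with complete certainty, and the answer is often quite fuzzy. To avoid the large errors that an either-or classification can cause, cluster analysis based on fuzzy mathematics emerged. Fuzzy cluster analysis clusters objects by building a fuzzy similarity relation from their features, closeness, and similarity.
FCM (fuzzy c-means) is an algorithm that uses membership degrees to determine how strongly each data point belongs to a cluster. It builds a fuzzy similarity matrix, initializes the membership matrix, iterates until convergence, and finally uses the membership matrix to decide which cluster each data point belongs to.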
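The iteration described above can be sketched in a few lines of plain Python. This is a minimal version of the standard fuzzy c-means updates, assuming Euclidean distance and a fuzzifier `m = 2`; the function name, the random initialization of the membership matrix, and the stopping tolerance are choices made for this illustration.

```python
import math
import random


def fuzzy_c_means(points, n_clusters, m=2.0, max_iter=100, tol=1e-5, seed=0):
    """Illustrative fuzzy c-means; returns (centers, membership matrix)."""
    rng = random.Random(seed)
    n, dims = len(points), len(points[0])

    # Random initial membership matrix; every row sums to 1.
    u = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(n_clusters)]
        total = sum(row)
        u.append([value / total for value in row])

    exponent = 2.0 / (m - 1.0)
    centers = []
    for _ in range(max_iter):
        # Update centers: mean of the points weighted by membership ** m.
        centers = []
        for j in range(n_clusters):
            wts = [u[i][j] ** m for i in range(n)]
            total = sum(wts)
            centers.append([sum(w * p[d] for w, p in zip(wts, points)) / total
                            for d in range(dims)])

        # Update memberships from the distances to the new centers.
        new_u = []
        for p in points:
            dists = [math.dist(p, c) for c in centers]
            if any(d == 0.0 for d in dists):
                # The point sits exactly on a center: give it full membership there.
                hits = [1.0 if d == 0.0 else 0.0 for d in dists]
                total = sum(hits)
                new_u.append([h / total for h in hits])
            else:
                inv = [d ** -exponent for d in dists]
                total = sum(inv)
                new_u.append([v / total for v in inv])

        # Stop once no membership value changes by more than `tol`.
        change = max(abs(a - b) for old, new in zip(u, new_u) for a, b in zip(old, new))
        u = new_u
        if change < tol:
            break
    return centers, u


sample = [(0, 0), (0, 1), (1, 0), (1, 1), (8, 8), (8, 9), (9, 8), (9, 9)]
centers, memberships = fuzzy_c_means(sample, n_clusters=2)
# Hard labels: the cluster with the largest membership for each point.
print([max(range(2), key=row.__getitem__) for row in memberships])
```

Unlike k-means, every point keeps a membership degree in every cluster, so points near a boundary are not forced into an either-or assignment. Which cluster receives index 0 depends on the random initialization.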

Clustering algorithms in practice:

import time  # used to compare run times between algorithms
import warnings
from itertools import cycle, islice

import numpy as np
import matplotlib.pyplot as plt

from sklearn import cluster, datasets, mixture
from sklearn.neighbors import kneighbors_graph
from sklearn.preprocessing import StandardScaler

np.random.seed(0)

# Datasets with several different shapes
n_samples = 1500
noisy_circles = datasets.make_circles(n_samples=n_samples, factor=.5,
                                      noise=.05)
noisy_moons = datasets.make_moons(n_samples=n_samples, noise=.05)
blobs = datasets.make_blobs(n_samples=n_samples, random_state=8)
no_structure = np.random.rand(n_samples, 2), None

# Linearly transformed (anisotropic) blobs
random_state = 170
X, y = datasets.make_blobs(n_samples=n_samples, random_state=random_state)
transformation = [[0.6, -0.6], [-0.4, 0.8]]
X_aniso = np.dot(X, transformation)
aniso = (X_aniso, y)

# Blobs with different variances
varied = datasets.make_blobs(n_samples=n_samples,
                             cluster_std=[1.0, 2.5, 0.5],
                             random_state=random_state)

# Set up the figure layout
plt.figure(figsize=(9 * 2 + 3, 12.5))
plt.subplots_adjust(left=.02, right=.98, bottom=.001, top=.96, wspace=.05,
                    hspace=.01)
plot_num = 1

# Default parameter values
default_base = {'quantile': .3,
                'eps': .3,
                'damping': .9,
                'preference': -200,
                'n_neighbors': 10,
                'n_clusters': 3}

# Renamed from `datasets` so the sklearn `datasets` module is not shadowed
dataset_configs = [
    (noisy_circles, {'damping': .77, 'preference': -240,
                     'quantile': .2, 'n_clusters': 2}),
    (noisy_moons, {'damping': .75, 'preference': -220, 'n_clusters': 2}),
    (varied, {'eps': .18, 'n_neighbors': 2}),
    (aniso, {'eps': .15, 'n_neighbors': 2}),
    (blobs, {}),
    (no_structure, {})]

# Loop over each dataset
for i_dataset, (dataset, algo_params) in enumerate(dataset_configs):
    # Override the defaults with the dataset-specific values
    params = default_base.copy()
    params.update(algo_params)

    X, y = dataset

    # Standardize the data
    X = StandardScaler().fit_transform(X)

    # Parameters specific to certain algorithms
    bandwidth = cluster.estimate_bandwidth(X, quantile=params['quantile'])
    connectivity = kneighbors_graph(
        X, n_neighbors=params['n_neighbors'], include_self=False)
    # Make the connectivity matrix symmetric
    connectivity = 0.5 * (connectivity + connectivity.T)

    # The clustering models
    ms = cluster.MeanShift(bandwidth=bandwidth, bin_seeding=True)
    two_means = cluster.MiniBatchKMeans(n_clusters=params['n_clusters'])
    ward = cluster.AgglomerativeClustering(
        n_clusters=params['n_clusters'], linkage='ward',
        connectivity=connectivity)
    spectral = cluster.SpectralClustering(
        n_clusters=params['n_clusters'], eigen_solver='arpack',
        affinity="nearest_neighbors")
    dbscan = cluster.DBSCAN(eps=params['eps'])
    affinity_propagation = cluster.AffinityPropagation(
        damping=params['damping'], preference=params['preference'])
    # `metric` replaces the original `affinity` argument, which
    # scikit-learn deprecated in 1.2; use affinity="cityblock" on older versions.
    average_linkage = cluster.AgglomerativeClustering(
        linkage="average", metric="cityblock",
        n_clusters=params['n_clusters'], connectivity=connectivity)
    birch = cluster.Birch(n_clusters=params['n_clusters'])
    gmm = mixture.GaussianMixture(
        n_components=params['n_clusters'], covariance_type='full')

    clustering_algorithms = (
        ('MiniBatchKMeans', two_means),
        ('AffinityPropagation', affinity_propagation),
        ('MeanShift', ms),
        ('SpectralClustering', spectral),
        ('Ward', ward),
        ('AgglomerativeClustering', average_linkage),
        ('DBSCAN', dbscan),
        ('Birch', birch),
        ('GaussianMixture', gmm)
    )

    for name, algorithm in clustering_algorithms:
        t0 = time.time()

        # Silence two known, harmless warnings while fitting
        with warnings.catch_warnings():
            warnings.filterwarnings(
                "ignore",
                message="the number of connected components of the " +
                "connectivity matrix is [0-9]{1,2}" +
                " > 1. Completing it to avoid stopping the tree early.",
                category=UserWarning)
            warnings.filterwarnings(
                "ignore",
                message="Graph is not fully connected, spectral embedding" +
                " may not work as expected.",
                category=UserWarning)
            algorithm.fit(X)

        t1 = time.time()
        # Some estimators expose labels_, others only predict()
        if hasattr(algorithm, 'labels_'):
            # np.int was removed from NumPy; the builtin int is equivalent here
            y_pred = algorithm.labels_.astype(int)
        else:
            y_pred = algorithm.predict(X)

        plt.subplot(len(dataset_configs), len(clustering_algorithms), plot_num)
        if i_dataset == 0:
            plt.title(name, size=18)

        colors = np.array(list(islice(cycle(['#377eb8', '#ff7f00', '#4daf4a',
                                             '#f781bf', '#a65628', '#984ea3',
                                             '#999999', '#e41a1c', '#dede00']),
                                      int(max(y_pred) + 1))))
        plt.scatter(X[:, 0], X[:, 1], s=10, color=colors[y_pred])

        plt.xlim(-2.5, 2.5)
        plt.ylim(-2.5, 2.5)
        plt.xticks(())
        plt.yticks(())
        # Print the run time in the lower-right corner of each panel
        plt.text(.99, .01, ('%.2fs' % (t1 - t0)).lstrip('0'),
                 transform=plt.gca().transAxes, size=15,
                 horizontalalignment='right')
        plot_num += 1

plt.show()

[Figure: clustering results and run times of each algorithm on each dataset]

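The figure above gives an intuitive sense of the run times of the different clustering algorithms, and a rough idea of the dataset shapes each one suits.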
