sklearn Hierarchical Clustering

Source: Internet  Editor: 程序博客网  Time: 2024/05/22 06:56

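Hierarchical clustering
Builds a tree of nested clusters according to a merge rule, so a single tree covers every possible number of clusters. It is computationally expensive.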
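
To make the "one tree, every number of clusters" idea concrete, here is a minimal sketch using scipy.cluster.hierarchy instead of scikit-learn. The six toy points and the names Z and cuts are assumptions chosen for illustration; they are not part of the original notes.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Six 1-D toy points forming three obvious pairs (illustrative data only).
points = np.array([[0.0], [0.1], [5.0], [5.1], [10.0], [10.1]])

# Build the full merge tree once; each of the n - 1 rows of Z records one merge.
Z = linkage(points, method="average")

# Cut the same tree at different cluster counts without re-clustering.
cuts = {}
for k in (2, 3):
    cuts[k] = fcluster(Z, t=k, criterion="maxclust")
    print(k, cuts[k])
```

The tree is built once and then cut at two different levels, which is why the cost is paid up front in the linkage step.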

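AgglomerativeClustering: uses a bottom-up approach, starting with each sample as its own cluster and merging clusters step by step.
There are three main linkage criteria:
complete (maximum) linkage: the distance between two clusters is the distance between their two farthest points.
average linkage: the distance between two clusters is the average distance over all pairs of points, one from each cluster.
Ward's method: merges the pair of clusters that gives the smallest increase in the within-cluster sum of squares (equivalently, it keeps the between-cluster sum of squares as large as possible).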
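
The three criteria can be computed by hand for two tiny clusters. This is only a sketch of the quantities being compared, with made-up points a and b and a hypothetical helper within_ss. The value scikit-learn uses internally for Ward may be scaled differently, but it ranks merges by the same increase in within-cluster sum of squares.

```python
import numpy as np

a = np.array([[0.0, 0.0], [1.0, 0.0]])  # cluster A (toy data)
b = np.array([[4.0, 0.0], [6.0, 0.0]])  # cluster B (toy data)

# All pairwise Euclidean distances between a point of A and a point of B.
pairwise = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)

complete_dist = pairwise.max()   # farthest pair
average_dist = pairwise.mean()   # mean over all pairs


def within_ss(points):
    """Sum of squared distances from each point to the cluster centroid."""
    return ((points - points.mean(axis=0)) ** 2).sum()


# Ward: how much the within-cluster sum of squares grows if A and B merge.
ward_cost = within_ss(np.vstack([a, b])) - within_ss(a) - within_ss(b)

print(complete_dist, average_dist, ward_cost)  # 6.0 4.5 20.25
```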

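numpy.apply_along_axis(func1d, axis, arr)
Applies func1d to each 1-D slice of arr along the given axis and returns the resulting array.
When func1d returns a 1-D array of the same length as its input (that is, it processes every element of the slice), the result has the same shape as the input array.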
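
A small sketch of both cases, using an illustrative 2 x 3 array: a function that reduces each row to a scalar, and one that returns an array of the same length.

```python
import numpy as np

arr = np.array([[3, 1, 2],
                [9, 7, 8]])

# func1d returns a scalar: the axis it was applied along disappears.
row_ranges = np.apply_along_axis(lambda row: row.max() - row.min(), 1, arr)
print(row_ranges.shape)   # (2,)

# func1d returns a 1-D array of the same length: the output keeps arr's shape.
sorted_rows = np.apply_along_axis(np.sort, 1, arr)
print(sorted_rows.shape)  # (2, 3)
```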

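scipy.ndimage.shift:
Here shift means translation. The data is resampled with spline interpolation, by default a cubic spline (order = 3). In other words, when an image is translated, values at positions that do not coincide with the original pixel grid are generated by interpolation.
mode = 'constant': positions that fall outside the input are filled with a constant (cval, 0.0 by default).
input is the input array and shift is the translation. A single float is applied to every axis, or a sequence can give one shift per axis.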
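
The following sketch contrasts a whole-pixel shift with a fractional one on a toy 5 x 5 image that has a single bright pixel. The image and variable names are assumptions for illustration.

```python
import numpy as np
from scipy import ndimage

image = np.zeros((5, 5))
image[2, 2] = 1.0

# Per-axis shift: one row down, one column left. With order=0 and whole-pixel
# offsets no new values are invented; the vacated border is filled with cval.
moved = ndimage.shift(image, (1, -1), order=0, mode="constant", cval=0.0)
print(np.argwhere(moved == 1.0))  # [[3 1]]

# A single float shifts every axis by the same amount. With the default
# order=3 the fractional offset is resolved by cubic spline interpolation,
# so the bright pixel is spread over its neighbours.
nudged = ndimage.shift(image, 0.3, mode="constant")
print(nudged.shape)  # (5, 5)
```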

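The nudge_image function below uses interpolated shifts to double the number of samples.
This offers one way to enlarge a dataset (working from the original data, when its distribution is unknown).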

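sklearn.manifold.SpectralEmbedding
Performs dimensionality reduction with the steps used in spectral clustering (the eigenvectors of the graph Laplacian that correspond to its smallest eigenvalues give the representation of the data).
affinity specifies how the similarity (affinity) matrix is built.
This is a dimensionality reduction method and can be compared with PCA: its aim is an embedding that best preserves the grouping structure of the samples, which plays a role similar to PCA's aim of retaining the maximum variance in the data.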
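
The steps can be sketched directly in NumPy. This is a simplified illustration, not scikit-learn's implementation: it assumes a Gaussian affinity with a hand-picked width and an unnormalized Laplacian, whereas the library differs in details (its default affinity, for example, is a nearest-neighbor graph). The toy blobs and all variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two toy blobs in 5-D, separated along the first coordinate.
offset = np.array([4.0, 0.0, 0.0, 0.0, 0.0])
blob_a = rng.normal(scale=0.3, size=(20, 5))
blob_b = rng.normal(scale=0.3, size=(20, 5)) + offset
X = np.vstack([blob_a, blob_b])

# 1. Affinity matrix: Gaussian kernel on squared distances (width is assumed).
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
W = np.exp(-sq_dists / 2.0)
np.fill_diagonal(W, 0.0)

# 2. Unnormalized graph Laplacian L = D - W.
L = np.diag(W.sum(axis=1)) - W

# 3. Eigenvectors for the smallest eigenvalues (eigh sorts them ascending).
#    The first one is constant and carries no information, so skip it.
eigvals, eigvecs = np.linalg.eigh(L)
embedding = eigvecs[:, 1:3]

# The first embedding coordinate already separates the two blobs by sign.
print(np.sign(embedding[:20, 0]).sum(), np.sign(embedding[20:, 0]).sum())
```

Samples that are strongly connected in the affinity graph end up close together in the embedding, which is the sense in which the result keeps the grouping structure.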

An example follows (spectral embedding and hierarchical clustering are both slow on high-dimensional data, so the run takes a while):

from time import time

import numpy as np
from scipy import ndimage
from matplotlib import pyplot as plt
from sklearn import manifold, datasets
from sklearn.cluster import AgglomerativeClustering

digits = datasets.load_digits(n_class=10)
X = digits.data
y = digits.target
n_samples, n_features = X.shape

np.random.seed(0)


def nudge_image(X, y):
    # Shift every 8x8 digit by a small random offset and append the shifted
    # copies, which doubles the dataset.
    shift = lambda x: ndimage.shift(x.reshape((8, 8)),
                                    .3 * np.random.normal(size=2),
                                    mode='constant').ravel()
    X = np.concatenate([X, np.apply_along_axis(shift, 1, X)])
    Y = np.concatenate([y, y], axis=0)
    return X, Y


X, y = nudge_image(X, y)


def plot_clustering(X_red, X, labels, title=None):
    # Rescale the embedding to [0, 1], then draw each sample as its true digit
    # (read from the global y), colored by its cluster label.
    x_min, x_max = np.min(X_red, axis=0), np.max(X_red, axis=0)
    X_red = (X_red - x_min) / (x_max - x_min)

    plt.figure(figsize=(6, 4))
    for i in range(X_red.shape[0]):
        # plt.cm.spectral was removed from matplotlib; nipy_spectral replaces it.
        plt.text(X_red[i, 0], X_red[i, 1], str(y[i]),
                 color=plt.cm.nipy_spectral(labels[i] / 10.),
                 fontdict={'weight': 'bold', 'size': 9})

    plt.xticks([])
    plt.yticks([])
    if title is not None:
        plt.title(title, size=17)
    plt.axis('off')
    plt.tight_layout()


# Reduce the 64-dimensional digits to 2-D before clustering.
print("Computing embedding")
X_red = manifold.SpectralEmbedding(n_components=2).fit_transform(X)
print("Done.")

# Fit and time each linkage criterion on the 2-D embedding.
for linkage in ('ward', 'average', 'complete'):
    clustering = AgglomerativeClustering(linkage=linkage, n_clusters=10)
    t0 = time()
    clustering.fit(X_red)
    print("%s : %.2fs" % (linkage, time() - t0))
    plot_clustering(X_red, X, clustering.labels_, "%s linkage" % linkage)

plt.show()

0 0