KMeans Clustering Algorithm Examples
Source: Internet  Published: python config  Editor: 程序博客网  Time: 2024/05/20 14:43
Three examples: 1. Clustering 2-D points 2. Clustering handwritten digits 3. Image compression
Clustering: K-Means In-Depth
Here we’ll explore K Means Clustering, which is an unsupervised clustering technique.
We’ll start with our standard set of initial imports:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
Introducing K-Means
K Means is an algorithm for unsupervised clustering: that is, finding clusters in data based on the data attributes alone (not the labels).
K Means is a relatively easy-to-understand algorithm. It searches for cluster centers which are the mean of the points within them, such that every point is closest to the cluster center it is assigned to.
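The two properties in that description can be checked directly. The following is a minimal sketch using NumPy only; the six points and their cluster assignment are made-up illustrative values, not data from this tutorial.

```python
import numpy as np

# Tiny hand-made data set: two obvious groups in 2-D (illustrative values).
points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels = np.array([0, 0, 0, 1, 1, 1])  # an assumed cluster assignment

# Property 1: each center is the mean of the points assigned to it.
centers = np.array([points[labels == k].mean(axis=0) for k in range(2)])

# Property 2: every point is closest to the center it is assigned to.
# distances[i, k] is the Euclidean distance from point i to center k.
distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
nearest = distances.argmin(axis=1)

print(centers)
print(np.array_equal(nearest, labels))
```

When both properties hold at once, the assignment is a fixed point of the procedure: reassigning points or recomputing means changes nothing.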
Let’s look at how KMeans operates on the simple clusters we looked at previously. To emphasize that this is unsupervised, we’ll not plot the colors of the clusters:
from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=300, centers=4, random_state=0, cluster_std=0.60)
plt.scatter(X[:, 0], X[:, 1], s=50);
By eye, it is relatively easy to pick out the four clusters. If you were to perform an exhaustive search for the different segmentations of the data, however, the search space would be exponential in the number of points. Fortunately, there is a well-known Expectation Maximization (EM) procedure which scikit-learn implements, so that KMeans can be solved relatively quickly.
from sklearn.cluster import KMeans
est = KMeans(4)  # 4 clusters
est.fit(X)
y_kmeans = est.predict(X)
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='rainbow');
The algorithm identifies the four clusters of points in a manner very similar to what we would do by eye!
The K-Means Algorithm: Expectation Maximization
K-Means is an example of an algorithm which uses an Expectation-Maximization approach to arrive at the solution.
Expectation-Maximization is a two-step approach which works as follows:
- Guess some cluster centers
- Repeat until converged
  A. Assign points to the nearest cluster center
  B. Set the cluster centers to the mean of the points assigned to them
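The steps above can be sketched in a few lines of NumPy. This is a simplified illustration, not scikit-learn's implementation: the function name `kmeans_sketch`, its parameters, and the synthetic `X_demo` data are all hypothetical, and the initial guess is simply a random choice of data points.

```python
import numpy as np

def kmeans_sketch(X, n_clusters, n_iter=100, seed=0):
    """Plain NumPy sketch of the two-step loop (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # Guess some cluster centers: pick distinct data points at random.
    centers = X[rng.choice(len(X), size=n_clusters, replace=False)]
    for _ in range(n_iter):
        # Step A: assign each point to the nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step B: move each center to the mean of its points
        # (an empty cluster keeps its previous center).
        new_centers = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
            for k in range(n_clusters)
        ])
        # Converged once the centers stop moving.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    # Final assignment, so the labels match the returned centers.
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return centers, dists.argmin(axis=1)

# Usage on three synthetic blobs (illustrative data).
rng = np.random.default_rng(42)
X_demo = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2))
                    for c in ([0, 0], [5, 5], [0, 5])])
centers, labels = kmeans_sketch(X_demo, n_clusters=3)
print(np.round(centers, 1))
```

Depending on the initial guess, the loop may stop at centers that do not match the three blobs; it only guarantees that the result is stable under steps A and B.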
Let’s quickly visualize this process:
from fig_code import plot_kmeans_interactive
plot_kmeans_interactive();
This algorithm will (often) converge to the optimal cluster centers.
KMeans Caveats
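Convergence to the globally optimal solution is not guaranteed: depending on the initial guess, the algorithm can settle in a poorer local optimum. For that reason, scikit-learn can repeat the fit from several different initializations (the n_init parameter) and keep the best result.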
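The effect of the starting guess can be seen by running single-initialization fits from several seeds and comparing their `inertia_` (the sum of squared distances from each point to its nearest center). This is a rough sketch; the variable names, the five seeds, and the choice of `init="random"` are illustrative assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X_demo, _ = make_blobs(n_samples=300, centers=4, random_state=0, cluster_std=0.60)

# One initialization per run (n_init=1) with purely random starting centers,
# so each seed may end in a different local optimum.
inertias = []
for seed in range(5):
    est = KMeans(n_clusters=4, init="random", n_init=1, random_state=seed).fit(X_demo)
    inertias.append(est.inertia_)  # lower is better

best = min(inertias)
print([round(v, 1) for v in inertias], "-> best:", round(best, 1))
```

Keeping the run with the lowest inertia is what setting `n_init` above 1 does internally.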
Also, the number of clusters must be set beforehand… there are other clustering algorithms for which this requirement may be lifted.
Application of KMeans to Digits
For a closer-to-real-world example, let’s again take a look at the digits data. Here we’ll use KMeans to automatically cluster the data in 64 dimensions, and then look at the cluster centers to see what the algorithm has found.
from sklearn.datasets import load_digits
digits = load_digits()
est = KMeans(n_clusters=10)
clusters = est.fit_predict(digits.data)
est.cluster_centers_.shape
(10, 64)
We see ten clusters in 64 dimensions. Let’s visualize each of these cluster centers to see what they represent:
fig = plt.figure(figsize=(8, 3))
for i in range(10):
    ax = fig.add_subplot(2, 5, 1 + i, xticks=[], yticks=[])
    ax.imshow(est.cluster_centers_[i].reshape((8, 8)), cmap=plt.cm.binary)
We see that even without the labels, KMeans is able to find clusters whose means are recognizable digits (with apologies to the number 8)!
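One way to put a number on this is to relabel each cluster with the most common true digit among its members and count the matches. The sketch below is an illustration only: the true labels are used for evaluation, never for fitting, and the variable names and the `n_init`/`random_state` settings are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits

digits = load_digits()
clusters = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(digits.data)

# Cluster ids are arbitrary, so relabel each cluster with the most common
# true digit among its members.
predicted = np.zeros_like(clusters)
for k in range(10):
    mask = clusters == k
    predicted[mask] = np.bincount(digits.target[mask], minlength=10).argmax()

accuracy = np.mean(predicted == digits.target)
print("fraction of digits matching their cluster's majority label:", accuracy)
```

Two clusters can share the same majority digit, so this measures cluster purity and is not a true classification accuracy.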
Example: KMeans for Color Compression
One interesting application of clustering is in color image compression. For example, imagine you have an image with millions of colors. In most images, a large number of the colors will be unused, and conversely a large number of pixels will have similar or identical colors.
Scikit-learn has a number of images that you can play with, accessed through the datasets module. For example:
from sklearn.datasets import load_sample_image
china = load_sample_image("china.jpg")
plt.imshow(china)
plt.grid(False);
The image itself is stored in a 3-dimensional array of shape (height, width, RGB):
china.shape
(427, 640, 3)
We can envision this image as a cloud of points in a 3-dimensional color space. We’ll rescale the colors so they lie between 0 and 1, then reshape the array to be a typical scikit-learn input:
X = (china / 255.0).reshape(-1, 3)
print(X.shape)
(273280, 3)
We now have 273,280 points in 3 dimensions.
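Our task is to use KMeans to compress the image’s colors into a much smaller palette (64 colors in the code that follows): we find that many clusters in color space and replace each pixel’s color with the center of the cluster it belongs to.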
# reduce the size of the image for speed
image = china[::3, ::3]
print(image.shape)

n_colors = 64

X = (image / 255.0).reshape(-1, 3)

model = KMeans(n_colors)
labels = model.fit_predict(X)
colors = model.cluster_centers_
new_image = colors[labels].reshape(image.shape)
new_image = (255 * new_image).astype(np.uint8)

plt.figure()
plt.imshow(image)
plt.title('input')

plt.figure()
plt.imshow(new_image)
plt.title('{0} colors'.format(n_colors))
(143, 214, 3)
Compare the input and output images: we’ve reduced the image to just 64 colors.
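For a rough sense of what a 64-color palette buys in storage, the arithmetic can be worked out by hand. This is a back-of-the-envelope sketch that assumes raw, uncompressed pixel storage: three 8-bit channels per pixel before, one palette index per pixel plus the palette itself after. It ignores real image formats such as JPEG or PNG.

```python
import math

height, width = 143, 214      # shape of the downsampled image in the example
n_colors = 64                 # palette size used in the example

n_pixels = height * width
raw_bits = n_pixels * 3 * 8   # three 8-bit channels per pixel

bits_per_index = math.ceil(math.log2(n_colors))   # 6 bits address 64 colors
palette_bits = n_colors * 3 * 8                   # the palette itself
quantized_bits = n_pixels * bits_per_index + palette_bits

print(n_pixels, raw_bits, quantized_bits)
print(round(quantized_bits / raw_bits, 3))        # about a quarter of the raw size
```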