t-SNE for Beginners


http://www.datakit.cn/blog/2015/08/06/t_SNE.html

This post draws mainly on Wikipedia to introduce the t-SNE algorithm, and then shows a visualization example using a Python implementation.

1. Overview

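I first came across t-SNE in Kaggle competitions, where many people mentioned it for dimensionality reduction and visualization. Before that, when I needed to visualize high-dimensional data I would reduce it to 2 dimensions, usually with PCA; but PCA is linear, and the results are often mediocre. t-SNE (t-distributed stochastic neighbor embedding) is a machine learning algorithm for dimensionality reduction, proposed by Laurens van der Maaten and Geoffrey Hinton in 2008; see the paper JMLR-Visualizing High-Dimensional Data Using t-SNE. It is a nonlinear dimensionality reduction algorithm that is well suited to reducing high-dimensional data to 2 or 3 dimensions for visualization.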

2. Principles

2.1 Basic idea

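t-SNE has two main steps. First, it constructs a probability distribution over pairs of high-dimensional objects, such that similar objects have a high probability of being picked and dissimilar objects have a low probability. Second, it constructs a probability distribution over the points in the low-dimensional space and makes the two distributions as similar as possible, using the KL divergence (Kullback–Leibler divergence) to measure how far apart they are.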

2.2 Detailed procedure

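Concretely, given N high-dimensional data points $x_1, \dots, x_N$ (note that N is the number of points, not the dimensionality!), t-SNE first computes probabilities $p_{ij}$ that are proportional to the similarity between $x_i$ and $x_j$ (these probabilities are something we construct ourselves), using the following formulas: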

$$p_{j \mid i} = \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / (2\sigma_i^2)\right)}{\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / (2\sigma_i^2)\right)}$$

$$p_{ij} = \frac{p_{j \mid i} + p_{i \mid j}}{2N}$$

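As the formulas show, a Gaussian kernel is used to build the probability distribution. How is $\sigma_i$ in the Gaussian kernel chosen? For each point, a binary search finds the $\sigma_i$ at which the conditional distribution has the specified perplexity (more on this later).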
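The two formulas above can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the function names `conditional_probabilities` and `joint_probabilities` are invented for this example, and the bandwidths $\sigma_i$ are simply assumed to be given (all set to 1 here) rather than tuned by perplexity.

```python
import numpy as np


def conditional_probabilities(X, sigmas):
    """Return a matrix C with C[i, j] = p_{j|i} for the rows of X."""
    # Squared Euclidean distances between all pairs of rows.
    sq_norms = np.sum(X ** 2, axis=1)
    D = np.maximum(sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T, 0.0)
    # Gaussian kernel exponent, with one bandwidth per row i.
    logits = -D / (2.0 * sigmas[:, None] ** 2)
    np.fill_diagonal(logits, -np.inf)            # the sum excludes k == i
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability only
    C = np.exp(logits)
    return C / C.sum(axis=1, keepdims=True)


def joint_probabilities(C):
    """Symmetrize: p_ij = (p_{j|i} + p_{i|j}) / (2N)."""
    n = C.shape[0]
    return (C + C.T) / (2.0 * n)


# Toy data: 5 points in 3 dimensions, sigma_i = 1 for every point (an assumption).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(5, 3))
P_cond = conditional_probabilities(X_demo, sigmas=np.ones(5))
P_joint = joint_probabilities(P_cond)
```

Each row of `P_cond` sums to 1, while `P_joint` is symmetric and sums to 1 over all pairs, which is what allows it to be compared with the low-dimensional distribution later.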

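The goal of t-SNE is to learn a d-dimensional map $y_1, \dots, y_N$ with $y_i \in \mathbb{R}^d$. The similarity $q_{ij}$ between $y_i$ and $y_j$ is defined as: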

$$q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}$$
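As a rough sketch, the $q_{ij}$ formula above could be computed like this. The name `low_dim_affinities` is made up for the example; the unnormalized weights are returned as well, only because they are handy when working with the gradient.

```python
import numpy as np


def low_dim_affinities(Y):
    """Return (Q, W): Q[i, j] = q_ij and W[i, j] = (1 + ||y_i - y_j||^2)^-1."""
    sq_norms = np.sum(Y ** 2, axis=1)
    D = np.maximum(sq_norms[:, None] + sq_norms[None, :] - 2.0 * Y @ Y.T, 0.0)
    W = 1.0 / (1.0 + D)
    np.fill_diagonal(W, 0.0)  # the normalizing sum runs over k != l only
    return W / W.sum(), W


# Toy map: 6 points in 2 dimensions.
rng = np.random.default_rng(0)
Y_demo = rng.normal(size=(6, 2))
Q_demo, W_demo = low_dim_affinities(Y_demo)
```

Note that the normalization is over all pairs at once, so `Q_demo` sums to 1 as a whole matrix rather than row by row.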

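Here a Student t-distribution is used to measure the similarity between points in the low-dimensional space. Finally, the KL divergence measures how far Q is from P: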

$$C = KL(P \parallel Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}$$
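A small sketch of this cost, assuming P and Q are given as matrices with zero diagonals that each sum to 1. The helper name `kl_divergence` and the `eps` guard against division by zero are choices made for this example, not part of the original formula.

```python
import numpy as np


def kl_divergence(P, Q, eps=1e-12):
    """KL(P || Q) summed over the entries where p_ij > 0."""
    mask = P > 0  # terms with p_ij = 0 contribute nothing to the sum
    return float(np.sum(P[mask] * np.log(P[mask] / np.maximum(Q[mask], eps))))


# Two toy pairwise distributions over 3 points (zero diagonal, entries sum to 1).
P_demo = np.array([[0.0, 1.0, 1.0],
                   [1.0, 0.0, 1.0],
                   [1.0, 1.0, 0.0]]) / 6.0
Q_demo = np.array([[0.0, 0.3, 0.1],
                   [0.3, 0.0, 0.1],
                   [0.1, 0.1, 0.0]])

cost_same = kl_divergence(P_demo, P_demo)  # identical distributions
cost_diff = kl_divergence(P_demo, Q_demo)  # mismatched distributions
```

The divergence is zero when the two distributions coincide and positive otherwise, which is why minimizing it pulls Q toward P.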

Gradient descent is then used to minimize the KL divergence. The gradient is:

$$\frac{dC}{dy_i} = 4 \sum_j (p_{ij} - q_{ij})(y_i - y_j)\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}$$
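One way to gain confidence in the gradient formula is to compare it against a finite-difference approximation of the KL cost. The sketch below does that on a tiny random problem; all function names are invented for the illustration, and the random P is just a symmetric matrix normalized to sum to 1, not one built from real data.

```python
import numpy as np


def q_and_weights(Y):
    """Return (Q, W) with W[i, j] = (1 + ||y_i - y_j||^2)^-1 and Q = W / sum(W)."""
    diff = Y[:, None, :] - Y[None, :, :]
    W = 1.0 / (1.0 + np.sum(diff ** 2, axis=2))
    np.fill_diagonal(W, 0.0)
    return W / W.sum(), W


def kl_cost(P, Y):
    Q, _ = q_and_weights(Y)
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / Q[mask])))


def tsne_gradient(P, Y):
    """dC/dy_i = 4 * sum_j (p_ij - q_ij) (y_i - y_j) (1 + ||y_i - y_j||^2)^-1."""
    Q, W = q_and_weights(Y)
    coeff = (P - Q) * W
    # sum_j coeff_ij (y_i - y_j) = (row sum of coeff) * y_i - sum_j coeff_ij y_j
    return 4.0 * (np.diag(coeff.sum(axis=1)) - coeff) @ Y


def numerical_gradient(P, Y, h=1e-6):
    """Central finite differences of kl_cost with respect to every entry of Y."""
    grad = np.zeros_like(Y)
    for idx in np.ndindex(*Y.shape):
        step = np.zeros_like(Y)
        step[idx] = h
        grad[idx] = (kl_cost(P, Y + step) - kl_cost(P, Y - step)) / (2.0 * h)
    return grad


rng = np.random.default_rng(0)
A = rng.random((4, 4))
A = A + A.T                 # symmetric, like p_ij
np.fill_diagonal(A, 0.0)
P_demo = A / A.sum()        # sums to 1 over all pairs
Y_demo = rng.normal(size=(4, 2))

analytic = tsne_gradient(P_demo, Y_demo)
numeric = numerical_gradient(P_demo, Y_demo)
```

The two arrays should agree up to finite-difference error. The rows of the analytic gradient also sum to zero, reflecting the fact that shifting the whole map does not change the cost.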

On almost all of the datasets in the paper, t-SNE performs better than Sammon mapping, Isomap, and Locally Linear Embedding.

2.4 Rationale

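Why these distributions: the paper starts with a Gaussian kernel and then switches to a heavy-tailed t-distribution. With the t-distribution, $(1 + \lVert y_i - y_j \rVert^2)^{-1}$ is, for large distances, inversely proportional to the square of $\lVert y_i - y_j \rVert$ in the low-dimensional space, which lets dissimilar objects be separated more cleanly.
Choosing $\sigma_i$ in the Gaussian kernel: each point i has its own $\sigma_i$, and its value is set through the perplexity. The wiki and the paper have the full details; briefly, perplexity is defined as $Perp(P_i) = 2^{H(P_i)}$, where $H(P_i)$ is the Shannon entropy of $P_i$, that is, $H(P_i) = -\sum_j p_{j \mid i} \log_2 p_{j \mid i}$. It can be interpreted as the effective number of neighbors.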
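The perplexity-based choice of $\sigma_i$ can be sketched as a binary search for a single point, as below. The function names, the search bounds, the iteration count, and the use of a geometric midpoint are all assumptions made for this example; the text above only says that a binary search is used.

```python
import numpy as np


def perplexity(sq_dists, sigma):
    """Perp(P_i) = 2^H(P_i) for squared distances from point i to the others."""
    logits = -sq_dists / (2.0 * sigma ** 2)
    p = np.exp(logits - logits.max())  # shift by the max for numerical stability
    p /= p.sum()
    p = p[p > 0]                       # treat 0 * log2(0) as 0
    entropy = -np.sum(p * np.log2(p))
    return 2.0 ** entropy


def find_sigma(sq_dists, target_perp, lo=1e-10, hi=1e10, n_iter=100):
    """Binary search on sigma; perplexity does not decrease as sigma grows."""
    for _ in range(n_iter):
        mid = np.sqrt(lo * hi)         # geometric midpoint, since sigma spans decades
        if perplexity(sq_dists, mid) > target_perp:
            hi = mid                   # too many effective neighbors: shrink sigma
        else:
            lo = mid                   # too few effective neighbors: grow sigma
    return np.sqrt(lo * hi)


# Toy example: squared distances from one point to 50 others, target perplexity 10.
rng = np.random.default_rng(0)
sq_dists_demo = rng.random(50) * 10.0 + 0.1
sigma_demo = find_sigma(sq_dists_demo, target_perp=10.0)
perp_demo = perplexity(sq_dists_demo, sigma_demo)
```

A very small $\sigma_i$ concentrates all the probability on the nearest neighbor (perplexity near 1), while a very large one spreads it evenly (perplexity near the number of other points), so the target value is bracketed and the search converges.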

3. Algorithm

Simple version of t-Distributed Stochastic Neighbor Embedding

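Data: $X = x_1, \dots, x_n$
Cost function parameter: perplexity Perp
Optimization parameters: number of iterations T, learning rate η, momentum α(t)
Result: low-dimensional data representation $Y^{(T)} = y_1, \dots, y_n$
begin
- compute the conditional probabilities $p_{j \mid i}$ for the given Perp (see the formula above)
- set $p_{ij} = (p_{j \mid i} + p_{i \mid j}) / (2n)$
- randomly initialize Y from $N(0, 10^{-4} I)$
- for t = 1 to T, do the following:
  - compute the low-dimensional $q_{ij}$ (see the formula above)
  - compute the gradient $dC/dY$ (see the formula above)
  - set $Y^{(t)} = Y^{(t-1)} - \eta \, dC/dY + \alpha(t) \left(Y^{(t-1)} - Y^{(t-2)}\right)$
- end
end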
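The steps above can be put together into a compact NumPy sketch. It is meant only to make the loop concrete, under several assumptions that are not in the algorithm listing: the function names, the learning rate `eta=1.0` (picked for a tiny toy dataset), the momentum schedule (0.5 for the first 250 iterations, then 0.8), and the search bounds for $\sigma_i$. The update subtracts the gradient, since the cost is being minimized. Refinements used by practical implementations are left out.

```python
import numpy as np


def sq_dists(A):
    """Squared Euclidean distances between all pairs of rows of A."""
    s = np.sum(A ** 2, axis=1)
    return np.maximum(s[:, None] + s[None, :] - 2.0 * A @ A.T, 0.0)


def joint_p(X, perp):
    """p_ij from conditional p_{j|i}, with sigma_i found by binary search."""
    n = X.shape[0]
    D = sq_dists(X)
    C = np.zeros((n, n))
    for i in range(n):
        others = np.arange(n) != i
        lo, hi = 1e-10, 1e10
        for _ in range(60):
            sigma = np.sqrt(lo * hi)
            logits = -D[i, others] / (2.0 * sigma ** 2)
            p = np.exp(logits - logits.max())
            p /= p.sum()
            entropy = -np.sum(p[p > 0] * np.log2(p[p > 0]))
            if 2.0 ** entropy > perp:
                hi = sigma
            else:
                lo = sigma
        C[i, others] = p
    return (C + C.T) / (2.0 * n)


def simple_tsne(X, perp=5.0, d=2, T=300, eta=1.0, seed=0):
    rng = np.random.default_rng(seed)
    P = joint_p(X, perp)
    mask = P > 0
    Y = rng.normal(scale=1e-2, size=(X.shape[0], d))  # N(0, 1e-4 I)
    Y_prev = Y.copy()
    costs = []
    for t in range(1, T + 1):
        # Low-dimensional affinities q_ij.
        W = 1.0 / (1.0 + sq_dists(Y))
        np.fill_diagonal(W, 0.0)
        Q = W / W.sum()
        costs.append(float(np.sum(P[mask] * np.log(P[mask] / Q[mask]))))
        # Gradient dC/dY.
        coeff = (P - Q) * W
        grad = 4.0 * (np.diag(coeff.sum(axis=1)) - coeff) @ Y
        # Gradient step with momentum (schedule is an assumption).
        alpha = 0.5 if t < 250 else 0.8
        Y_new = Y - eta * grad + alpha * (Y - Y_prev)
        Y_prev, Y = Y, Y_new
    return Y, costs


# Toy data: two well-separated Gaussian blobs in 10 dimensions.
rng = np.random.default_rng(0)
X_toy = np.vstack([rng.normal(0.0, 1.0, size=(20, 10)),
                   rng.normal(8.0, 1.0, size=(20, 10))])
Y_toy, costs = simple_tsne(X_toy)
```

`costs` records the KL divergence at each iteration, so plotting it is a quick way to check that the optimization is making progress.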

4. Trying it out in Python

# Authors: Fabian Pedregosa <fabian.pedregosa@inria.fr>
#          Olivier Grisel <olivier.grisel@ensta.org>
#          Mathieu Blondel <mathieu@mblondel.org>
#          Gael Varoquaux
# License: BSD 3 clause (C) INRIA 2011

from time import time

import numpy as np
import matplotlib.pyplot as plt
from matplotlib import offsetbox
# sklearn.lda has been removed from scikit-learn; LDA now lives in
# sklearn.discriminant_analysis.
from sklearn import manifold, datasets, decomposition, discriminant_analysis

digits = datasets.load_digits(n_class=6)
X = digits.data
y = digits.target
n_samples, n_features = X.shape


# ----------------------------------------------------------------------
# Scale and visualize the embedding vectors
def plot_embedding(X, title=None):
    # Rescale every coordinate to [0, 1] so the text labels fit the axes.
    x_min, x_max = np.min(X, 0), np.max(X, 0)
    X = (X - x_min) / (x_max - x_min)

    plt.figure()
    ax = plt.subplot(111)
    for i in range(X.shape[0]):
        plt.text(X[i, 0], X[i, 1], str(digits.target[i]),
                 color=plt.cm.Set1(y[i] / 10.),
                 fontdict={'weight': 'bold', 'size': 9})

    if hasattr(offsetbox, 'AnnotationBbox'):
        # only print thumbnails with matplotlib > 1.0
        shown_images = np.array([[1., 1.]])  # just something big
        for i in range(digits.data.shape[0]):
            dist = np.sum((X[i] - shown_images) ** 2, 1)
            if np.min(dist) < 4e-3:
                # don't show points that are too close
                continue
            shown_images = np.r_[shown_images, [X[i]]]
            imagebox = offsetbox.AnnotationBbox(
                offsetbox.OffsetImage(digits.images[i], cmap=plt.cm.gray_r),
                X[i])
            ax.add_artist(imagebox)
    plt.xticks([]), plt.yticks([])
    if title is not None:
        plt.title(title)


# ----------------------------------------------------------------------
# Plot images of the digits
n_img_per_row = 20
img = np.zeros((10 * n_img_per_row, 10 * n_img_per_row))
for i in range(n_img_per_row):
    ix = 10 * i + 1
    for j in range(n_img_per_row):
        iy = 10 * j + 1
        img[ix:ix + 8, iy:iy + 8] = X[i * n_img_per_row + j].reshape((8, 8))

plt.imshow(img, cmap=plt.cm.binary)
plt.xticks([])
plt.yticks([])
plt.title('A selection from the 64-dimensional digits dataset')

# ----------------------------------------------------------------------
# Projection on to the first 2 principal components
print("Computing PCA projection")
t0 = time()
X_pca = decomposition.TruncatedSVD(n_components=2).fit_transform(X)
plot_embedding(X_pca,
               "Principal Components projection of the digits (time %.2fs)" %
               (time() - t0))

# ----------------------------------------------------------------------
# Projection on to the first 2 linear discriminant components
print("Computing LDA projection")
X2 = X.copy()
X2.flat[::X.shape[1] + 1] += 0.01  # Make X invertible
t0 = time()
X_lda = discriminant_analysis.LinearDiscriminantAnalysis(
    n_components=2).fit_transform(X2, y)
plot_embedding(X_lda,
               "Linear Discriminant projection of the digits (time %.2fs)" %
               (time() - t0))

# ----------------------------------------------------------------------
# t-SNE embedding of the digits dataset
print("Computing t-SNE embedding")
tsne = manifold.TSNE(n_components=2, init='pca', random_state=0)
t0 = time()
X_tsne = tsne.fit_transform(X)
plot_embedding(X_tsne,
               "t-SNE embedding of the digits (time %.2fs)" %
               (time() - t0))

plt.show()

Appendix: for more on Manifold Learning, see the sklearn documentation.

0 0