Learning the LargeVis Visualization Technique

Source: Internet. Editor: 程序博客网. Time: 2024/04/30 23:07

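Visualizing large graphs has long been a key technique in big-data visualization. Many approaches exist, but a technique called LargeVis appeared this year, so this post reproduces and studies it.

A lot of background theory comes first, such as basic dimensionality reduction, SNE, and the t-SNE visualization algorithm. All of that is skipped here; readers interested in the theory can refer to: https://bindog.github.io/blog/2016/06/04/from-sne-to-tsne-to-largevis/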

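The LargeVis visualization algorithm

Although t-SNE and its improved variants are widely used, they have two shortcomings. First, t-SNE's efficiency drops markedly on large-scale, high-dimensional data (this includes the improved variants). Second, t-SNE's parameters are sensitive to the dataset: after painstakingly tuning them on one dataset to get a good visualization, we find that they do not carry over to another dataset and have to spend a lot of time searching for suitable parameters again.

The basic idea of LargeVis, proposed by Jian Tang, is broadly the same as that of the improved t-SNE algorithms.

LargeVis uses several very elegant optimization techniques, including ideas from his earlier work on LINE. Let us look at each of these refreshing techniques in turn.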

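1) Efficient kNN graph construction

In the improved t-SNE algorithms, the high-dimensional similarity only considers the few neighbors closest to x_i, which essentially amounts to building a kNN graph. Maaten used VP-trees to build an exact kNN graph, but efficiency remained a concern. LargeVis takes a cleverer route: rather than aiming for the final result in one step, it approximates first and then improves accuracy.

As noted in the background material, there are generally three classes of methods for building a kNN graph, and LargeVis combines the first and the third. Concretely, the first step uses random projection trees to obtain a space partition and, on that basis, finds the k nearest neighbors of each point. This gives an initial kNN graph that need not be fully accurate. The second step follows the idea that "a neighbor of my neighbor is probably also my neighbor": a neighbor-exploring pass looks for potential neighbors, computes the distances from the current point to its neighbors and to its neighbors' neighbors, pushes them into a min-heap, and keeps the k nodes with the smallest distances as the k nearest neighbors. The result is an accurate kNN graph.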
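The second step can be sketched in a few lines of Python. This is only an illustration of the "neighbors of neighbors" idea under simplifying assumptions: the function name `explore_neighbors`, the list-of-lists graph representation, and the use of squared Euclidean distance are choices made for this example, not the data structures of the LargeVis source code.

```python
import heapq
import numpy as np

def explore_neighbors(X, knn, k):
    """One round of neighbor exploring on an approximate kNN graph.

    X   : (n, d) array of points
    knn : list of lists, knn[i] holds the current neighbor ids of point i
    k   : number of neighbors to keep per point
    """
    refined = []
    for i in range(len(X)):
        # Candidates: my neighbors plus my neighbors' neighbors.
        candidates = set(knn[i])
        for j in knn[i]:
            candidates.update(knn[j])
        candidates.discard(i)
        # Min-heap keyed by squared Euclidean distance to point i.
        heap = [(float(np.sum((X[i] - X[c]) ** 2)), c) for c in candidates]
        heapq.heapify(heap)
        refined.append([heapq.heappop(heap)[1] for _ in range(min(k, len(heap)))])
    return refined

# Five points on a line with a deliberately poor initial graph (k = 2).
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
initial = [[1, 4], [0, 2], [1, 3], [2, 4], [3, 0]]
refined = explore_neighbors(X, initial, k=2)
print(refined)
```

Repeating the call a few times on its own output corresponds to running several propagation rounds.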

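2) Low-dimensional visualization algorithm

For the low-dimensional layout, t-SNE's approach is to keep the high-dimensional distance distribution P as close as possible to the low-dimensional distance distribution Q, writing the cost function as a KL divergence and taking its gradient. Efficiency is a serious problem, though; otherwise Maaten would not have put so much effort into speeding up the gradient computation with various tree algorithms.

Recall that word2vec uses two models, CBOW and Skip-gram, along with a number of optimization techniques, one of which is negative sampling. The situation in a network is analogous (following the Skip-gram model): treat the current center point as the target word and its neighbor nodes as the words that appear in the context window. A center point and one of its neighbors then form a positive sample, while a center point and a non-neighbor form a negative sample.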
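Written out, the positive/negative sample idea leads to an objective of roughly the following form. This is my recollection of the formulation in the LargeVis paper rather than something stated in this post, so treat the exact choice of f and of the noise distribution P_n as assumptions.

```latex
% Probability that an edge between points i and j is observed in the low-dimensional layout
P(e_{ij} = 1) = f\big(\lVert \vec{y}_i - \vec{y}_j \rVert\big), \qquad f(x) = \frac{1}{1 + x^2}

% Objective with negative sampling:
%   E       = edge set of the kNN graph,   w_{ij} = edge weight
%   M       = number of negative samples,  \gamma = weight of negative edges
%   P_n(j)  = noise distribution from which negative nodes are drawn
O = \sum_{(i,j) \in E} w_{ij} \Big( \log P(e_{ij} = 1)
    + \sum_{k=1}^{M} \mathbb{E}_{j_k \sim P_n(j)} \, \gamma \log\big(1 - P(e_{i j_k} = 1)\big) \Big)
```

The first term pulls a center point toward its neighbor; the second pushes it away from the M sampled non-neighbors.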

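In addition to negative sampling and edge sampling, LargeVis trains with asynchronous stochastic gradient descent. This is very effective on sparse graphs, because the two nodes joined by edges sampled in different threads rarely coincide, so threads almost never conflict. In terms of time complexity, each stochastic gradient step costs O(sM), where M is the number of negative samples and s is the dimensionality of the low-dimensional space (2 or 3). The number of steps is usually proportional to the number of nodes N, so the total time complexity is O(sMN). LargeVis's time complexity is therefore linear in the number of nodes in the network.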
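To make the O(sM) cost per step concrete, here is a minimal single-threaded sketch of edge sampling plus negative sampling. It assumes the similarity function f(d) = 1/(1 + d^2), draws negatives uniformly, updates only the center point, and clips the gradient for stability; all of these are simplifications chosen for illustration, and the function names are hypothetical rather than taken from the LargeVis code.

```python
import numpy as np

def edge_objective(yi, yj, negs, gamma=7.0, eps=1e-8):
    """log f(|yi-yj|) + gamma * sum_k log(1 - f(|yi-yk|)), with f(d) = 1/(1+d^2)."""
    d2 = np.sum((yi - yj) ** 2)
    value = -np.log1p(d2)
    for yk in negs:
        d2k = np.sum((yi - yk) ** 2)
        value += gamma * (np.log(d2k + eps) - np.log1p(d2k))
    return value

def grad_center(yi, yj, negs, gamma=7.0, eps=1e-8):
    """Gradient of edge_objective with respect to the center point yi."""
    diff = yi - yj
    grad = -2.0 * diff / (1.0 + diff @ diff)              # attraction toward the neighbor
    for yk in negs:                                       # M terms, each O(s)
        dk = yi - yk
        d2k = dk @ dk
        grad += gamma * 2.0 * dk / ((d2k + eps) * (1.0 + d2k))  # repulsion
    return grad

def layout(edges, weights, n_nodes, out_dim=2, n_neg=5, gamma=7.0,
           n_steps=5000, lr0=1.0, clip=5.0, seed=0):
    rng = np.random.default_rng(seed)
    Y = rng.normal(scale=0.1, size=(n_nodes, out_dim))
    prob = np.asarray(weights, dtype=float)
    prob /= prob.sum()
    for t in range(n_steps):
        lr = lr0 * (1.0 - t / n_steps)                    # linearly decaying step size
        i, j = edges[rng.choice(len(edges), p=prob)]      # edge sampling by weight
        drawn = rng.integers(0, n_nodes, size=n_neg)      # uniform negatives (assumption)
        neg_ids = np.array([k for k in drawn if k not in (i, j)], dtype=int)
        g = grad_center(Y[i], Y[j], Y[neg_ids], gamma)
        Y[i] += lr * np.clip(g, -clip, clip)              # gradient ascent on the objective
    return Y

# Two disconnected pairs; every undirected edge is listed in both directions.
Y = layout([(0, 1), (1, 0), (2, 3), (3, 2)], [1.0, 1.0, 1.0, 1.0], n_nodes=4, n_steps=500)
print(Y)
```

Each step touches one sampled edge and at most `n_neg` negatives in `out_dim` dimensions, which is where the O(sM) per-step cost comes from.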

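In terms of visualization quality, LargeVis is about on par with t-SNE and slightly better on some datasets, but in terms of training time LargeVis is far more efficient than t-SNE.

1. Summary

From SNE to t-SNE to LargeVis: SNE laid a very solid foundation but left behind the thorny crowding problem. t-SNE neatly solved crowding with the t-distribution and used several tree algorithms to improve efficiency. LargeVis builds on the improved t-SNE algorithms and borrows newer optimization techniques such as random projection trees, negative sampling, and edge sampling (which is in essence also a form of negative sampling), bringing the training time complexity down to linear. In an era when representation learning and deep learning are so popular, any classic model or method may turn out to be surprisingly useful in another field.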

I. Experimental validation and analysis

1. Data and code download

Code download: https://github.com/lferry007/LargeVis

Data download:

2. readme.md

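The code is 21 MB in total, which is fairly large; it is the authors' official release. It can be used to visualize large-scale, high-dimensional data and now supports both high-dimensional feature vectors and networks. It also includes an efficient implementation of kNN graph construction (finding the k nearest neighbors).

Contact the author: Jian Tang, tangjianpku@gmail.com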

1) Install: dependencies

Both C++ source codes and Python wrapper are provided on Linux, OS X and Windows. To install the package, external packages are required, including [GSL (GNU Scientific Library)](http://www.gnu.org/software/gsl/) on Linux and OS X or [BOOST](http://www.boost.org/) on Windows for generating random numbers.

2) Linux installation

Compile the source files via:

g++ LargeVis.cpp main.cpp -o LargeVis -lm -pthread -lgsl -lgslcblas -Ofast -march=native -ffast-math

To install the Python wrapper, modify ```setup.py``` to make sure that the GSL path is correctly set and then run ```sudo python setup.py install```.

3) Windows installation

To compile the source files, use Microsoft Visual Studio, where you need to set the BOOST path.

To install the Python wrapper, modify ```setup.py``` to make sure that the BOOST path is correctly set and then run ```python setup.py install```.

4) Input file format:

Feature vectors:

The format of the input file should be as follows: the first line specifies the number of feature vectors and the dimensionality (500 vectors with 10 dimensions in the following example), and each of the next 500 lines describes one feature vector with 10 float numbers.

Note that the lines carry no node IDs.

```
500 10
1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
...
1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
```
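A file in this layout is easy to produce from a NumPy array. The helper below is a small sketch; the function name `write_feature_file`, the temporary file name, and the six-decimal formatting are choices made for this example and are not part of LargeVis.

```python
import os
import tempfile
import numpy as np

def write_feature_file(path, vectors):
    """Write vectors as: header line 'N dim', then one space-separated row per vector."""
    vectors = np.asarray(vectors, dtype=float)
    n, dim = vectors.shape
    with open(path, "w") as f:
        f.write(f"{n} {dim}\n")                      # header: count and dimensionality
        for row in vectors:
            f.write(" ".join(f"{v:.6f}" for v in row) + "\n")  # no node ID column

# Hypothetical output location used only for this demonstration.
demo_path = os.path.join(tempfile.gettempdir(), "largevis_demo_vectors.txt")
write_feature_file(demo_path, np.ones((500, 10)))
```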

Network edge file: directed, weighted graph

For networks, each line of the input file is a DIRECTED edge. For each undirected edge, users must use TWO DIRECTED edges to represent it. For example,

```
0 1 2.5
1 0 2.5
2 5 4.5
5 2 4.5
3 10 3.0
...
495 498 1.5
```
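Since every undirected edge has to appear twice, it is convenient to generate the file from an undirected edge list. The following is a minimal sketch; `write_edge_file` and the temporary file name are hypothetical names used only here.

```python
import os
import tempfile

def write_edge_file(path, undirected_edges):
    """Write each undirected (u, v, weight) edge as two directed lines."""
    with open(path, "w") as f:
        for u, v, w in undirected_edges:
            f.write(f"{u} {v} {w}\n")   # u -> v
            f.write(f"{v} {u} {w}\n")   # v -> u, same weight

# Hypothetical output location used only for this demonstration.
edge_path = os.path.join(tempfile.gettempdir(), "largevis_demo_edges.txt")
write_edge_file(edge_path, [(0, 1, 2.5), (2, 5, 4.5)])
```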

5) Run-time parameters

For the C++ executable,

```
./LargeVis -input -output
```

or for Python,

```
python LargeVis_run.py -input -output
```

Other optional parameters:

* `-fea`: Specify whether the input file is high-dimensional feature vectors (1) or networks (0). Default is 1.

// Selects whether the input is feature vectors (1) or a network (0).

* `-threads`: Number of threads. Default is 8.

// Thread count.

* `-outdim`: The lower dimensionality LargeVis learns for visualization (usually 2 or 3). Default is 2.

// Dimensionality of the output layout: 2 or 3.

* `-samples`: Number of edge samples for graph layout (in millions). Default is set to ```data size / 100``` (million).

* `-prop`: Number of times for neighbor propagations in the stage of K-NNG construction, usually less than 3. Default is 3.

// Number of neighbor-propagation rounds. It affects the accuracy of the kNN neighbors and is a trade-off between time and accuracy.

* `-alpha`: Initial learning rate. Default is 1.0.

// Learning rate, used by gradient descent.

* `-trees`: Number of random-projection trees used for constructing K-NNG. 50 is sufficient for most cases unless you are dealing with very large datasets (e.g. data size over 5 million), and fewer trees are suitable for smaller datasets. Default is set according to the data size.

// Number of random projection trees used to initialize the kNN graph; choose it according to dataset size.

* `-neg`: Number of negative samples used for negative sampling. Default is 5.

// Number of negative samples; 5 is enough.

* `-neigh`: Number of neighbors (K) in K-NNG, which is usually set as three times of perplexity. Default is 150.

// The value of K in the kNN graph.

* `-gamma`: The weights assigned to negative edges. Default is 7.

// Weight given to negatively sampled edges.

* `-perp`: The perplexity used for deciding edge weights in K-NNG. Default is 50.
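When runs are scripted, the flags above can be assembled programmatically. The sketch below only builds the argument list; the helper name `largevis_command` and the file names are hypothetical, while the flag names and values are the ones listed above.

```python
import shlex

def largevis_command(input_path, output_path, executable="./LargeVis", **options):
    """Build an argv list such as ['./LargeVis', '-input', ..., '-neigh', '150']."""
    argv = [executable, "-input", input_path, "-output", output_path]
    for name, value in options.items():
        argv += [f"-{name}", str(value)]   # e.g. threads=8 becomes '-threads', '8'
    return argv

# File names here are placeholders; the option values are the documented defaults.
cmd = largevis_command("vectors.txt", "layout_2d.txt",
                       fea=1, outdim=2, threads=8, neg=5, gamma=7, perp=50, neigh=150)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once the executable has been built.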

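3. Running the code

We ran the experiment on Windows:

1) Compile and build the Windows version

In VS2015, create a new project -> Visual C++ -> Empty Project.

Import LargeVis.cpp and main.cpp, then compile the files one after another.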

Error 1:

Severity: Error
Code: C1083
Description: Cannot open include file: "boost/thread.hpp": No such file or directory
Project: LargeVis
File: d:\liuzhengming\论文学习\20170416largevis可视化技术学习\largevis-master\windows\largevis.h

Cause: the Boost path has not been specified.

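Analysis: Boost is a portable, open-source collection of C++ libraries that complements the standard library and is one of the engines driving C++ standardization.

Download Boost: http://www.boost.org/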

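2) Installing and using Boost:

http://blog.csdn.net/lp310018931/article/details/47791143

Steps for adding it to VS:

a) Unpack Boost and double-click bootstrap.bat to generate bjam.exe.

b) Double-click bjam.exe directly, or run the following command:

bjam --toolset=msvc --build-type=complete stage

c) The build creates two folders, bin.v2 and stage, under the Boost root directory. bin.v2 holds intermediate files, about 2.7 GB in size, and can simply be deleted. stage holds the generated dll and lib files.

d) VS configuration (View - Property Pages)

View -> Property Manager -> current project -> Debug|Win32 -> double-click Microsoft.Cpp.Win32.user

In the property dialog that opens:

Common Properties -> VC++ Directories: "Include Directories": the Boost root directory, e.g. D:\Program Files (x86)\boost1.63.0

"Library Directories": the library directory under stage, e.g. D:\Program Files (x86)\boost1.63.0\stage\lib

Common Properties -> Linker -> General: "Additional Library Directories": same as "Library Directories" above, e.g. D:\Program Files (x86)\boost1.63.0\stage\lib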

Error 2:

Severity: Error
Code: C4996
Description: 'fopen': This function or variable may be unsafe. Consider using fopen_s instead. To disable deprecation, use _CRT_SECURE_NO_WARNINGS. See online help for details.
Project: LargeVis
File: d:\liuzhengming\论文学习\20170416largevis可视化技术学习\largevis-master\windows\annoy\annoylib.h
Line: 307

Fixing C4996:

http://blog.csdn.net/yf210yf/article/details/7854042

Project | Properties | Configuration Properties | C/C++ | Command Line | Additional Options: add /D"_CRT_SECURE_NO_DEPRECATE" (enter the option in full).

Reference: https://www.zhihu.com/question/36580981

Error 3:

Severity: Error
Code: LNK2019
Description: unresolved external symbol "void * __cdecl mmap(void *,unsigned int,int,int,int,__int64)" (?mmap@@YAPAXPAXIHHH_J@Z) referenced in function "public: virtual bool __thiscall AnnoyIndex<int,float,struct Euclidean,struct Kiss64Random>::load(char const *)" (?load@?$AnnoyIndex@HMUEuclidean@@UKiss64Random@@@@UAE_NPBD@Z)
Project: LargeVis
File: C:\Users\ndscbigdata4\Documents\Visual Studio 2015\Projects\Project1\Project1\LargeVis.obj

Fix: add the four files in the ANNOY folder to the project.

Compile again to produce the Release x86 LargeVis.exe.

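3) Installing the Python wrapper: building the LargeVis Python module

Install (build) the LargeVis module for Python:

First change the Boost paths as follows:

include_dirs = ['D:/Program Files (x86)/boost1.63.0'],

library_dirs = ['D:/Program Files (x86)/boost1.63.0/stage/lib'],

Note: use '/' in these paths, not '\'. The problems described later were probably all caused by this.

Install: python setup.py install

Run pip list to check whether the LargeVis module was installed successfully.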
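The same check can be made from Python itself. This small sketch assumes the wrapper installs a module importable as `LargeVis`, which is an assumption based on the package name; the helper `is_installed` is a hypothetical name.

```python
import importlib.util

def is_installed(module_name):
    """Return True if Python can locate a top-level module with this name."""
    return importlib.util.find_spec(module_name) is not None

# 'LargeVis' is the assumed module name of the wrapper.
print(is_installed("LargeVis"))
```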

If you did not run into any problems, skip the notes that follow.

Error 4: error: Unable to find vcvarsall.bat

References:

http://www.educity.cn/wenda/356834.html

http://blog.csdn.net/secretx/article/details/17472107

Edit <Python install directory>\Lib\distutils\msvc9compiler.py, line 243:

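toolskey = "VS140COMNTOOLS"

The number encodes the installed Visual Studio version: VS2015 is version 14.0 and sets VS140COMNTOOLS, while VS2013 is version 12.0 and sets VS120COMNTOOLS. Use the value that matches the version installed on your machine.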

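Error 5: IndentationError: unindent does not match any outer indentation level

Reference:

http://www.crifan.com/python_syntax_error_indentationerror/comment-page-1/

The whitespace in the edited Python source is misaligned. In Notepad++, View -> Show Symbol -> Show White Space and TAB displays all whitespace characters.

Find the offending line and compare it with its neighbors to see what went wrong.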

Error 6:

LargeVis.h(10): fatal error C1083: Cannot open include file: "boost/thread.hpp": No such file or directory

The Boost path was not found. I tried countless approaches without success, and the eventual cause left me thoroughly humbled.

The path D:\Program Files (x86)\boost1.63.0 is not recognized.

It has to be changed to: D:/Program Files (x86)/boost1.63.0

include_dirs = ['D:/Program Files (x86)/boost1.63.0'],

library_dirs = ['D:/Program Files (x86)/boost1.63.0/stage/lib'],
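One plausible explanation for the backslash problem is Python's string escaping: in an ordinary string literal `\b` is the backspace character, so a path segment that starts with `\boost` is silently corrupted. This is my own reading and was not confirmed during the troubleshooting above. The sketch uses a shortened, hypothetical path to show the effect and two ways around it.

```python
from pathlib import PureWindowsPath

plain = 'D:\boost1.63.0'      # "\b" is parsed as a backspace character
raw = r'D:\boost1.63.0'       # raw string: backslashes are kept literally
forward = 'D:/boost1.63.0'    # forward slashes need no escaping

print(repr(plain))                                         # reveals the embedded control character
print('\x08' in plain, '\x08' in raw)                      # backspace present only in the plain literal
print(PureWindowsPath(raw) == PureWindowsPath(forward))    # both spell the same Windows path
```

Either a raw string or forward slashes therefore gives `setup.py` the intended directory.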

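End of the error notes.

4) Trial run

Error 7: runtime error

python LargeVis_run.py -input mnist_vec784D/mnist_vec784D.txt -output mnist_vec784D/mnist_vec2D.txt -threads 16

Then python.exe crashed:

This application has requested the Runtime to terminate it in an unusual way.

Please contact the application's support team for more information.

My guess is that the cause is non-conforming code.

We then tried a run on 10 nodes with 256-dimensional vectors. It took quite a while, with the following result:

Visualizations differ from dataset to dataset, but the authors' own figures are very good: the nodes can be seen clearly grouped around their cluster centers.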

0 0

LargeVis可视化技术学习

Largevis可视化算法

1) 高效KNN图构建算法

2) 低维空间可视化算法

1. 总结

一、 实验验证与分析

1. 数据与代码下载

2. readme.md

1) ##Install依赖包

2) ####Linux安装

3) ####Windows安装

4) 输入文件格式：

5) 运行参数分析

1. 代码仿真

1) 编译，生成window版

2) boost安装使用方法介绍：

3) 安装Python wrapper：生成LargeVis-Python模块

4) 代码试运行

一、实验验证与分析