word2vec: A Beginner's Tutorial

Source: Internet  Editor: 程序博客网  Time: 2024/05/20 07:15

The full word2vec installation process is as follows:

1. Download the word2vec source code

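Note: many articles tell you to run svn checkout http://word2vec.googlecode.com/svn/trunk/. But given the state of the network in China, you know how that goes.
So we take a different route:
https://github.com/dav/word2vec
This is the GitHub address. The GFW showed its strength once again: git clone could not pull the repository down either. Hard not to curse at that point.
In the end I downloaded the zip package directly from the GitHub page and then copied it to the server with scp.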
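
The zip route can be sketched roughly as follows. The archive URL follows GitHub's usual pattern for branch archives, and the branch name master is inferred from the word2vec-master directory that appears in the next step. The function names are made up for illustration, and the scp destination is a placeholder you have to supply yourself.

```shell
# Sketch: fetch the repository as a zip instead of using git clone.

# Build the GitHub branch-archive URL from owner, repo and branch.
archive_url() {
    printf 'https://github.com/%s/%s/archive/refs/heads/%s.zip\n' "$1" "$2" "$3"
}

# Download the archive on a machine that can reach GitHub, then copy it
# to the server. $1 is an scp destination such as user@host:dir
# (a placeholder, not a value given in the original post).
fetch_and_copy() {
    local remote="$1"
    local zip="word2vec-master.zip"
    curl -L -o "$zip" "$(archive_url dav word2vec master)" || return 1
    scp "$zip" "$remote"
}

# Example (not executed here):
#   fetch_and_copy webopa@hive001:~/
# and then, on the server:
#   unzip word2vec-master.zip
```

Nothing is downloaded until fetch_and_copy is called, so the snippet can be sourced safely before deciding where the archive should go.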

2. Take a look at what is inside

[webopa@hive001 word2vec-master]$ tree -L 1
.
├── bin
├── data
├── LICENSE
├── README.md
├── scripts
└── src

4 directories, 2 files

3. cd into src and build with make

Note: the Makefile in the zip package downloaded from GitHub reads as follows:

SCRIPTS_DIR=../scripts
BIN_DIR=../bin

CC = gcc
#The -Ofast might not work with older versions of gcc; in that case, use -O2
CFLAGS = -lm -pthread -O2 -Wall -funroll-loops

all: word2vec word2phrase distance word-analogy compute-accuracy

word2vec : word2vec.c
	$(CC) word2vec.c -o ${BIN_DIR}/word2vec $(CFLAGS)
word2phrase : word2phrase.c
	$(CC) word2phrase.c -o ${BIN_DIR}/word2phrase $(CFLAGS)
distance : distance.c
	$(CC) distance.c -o ${BIN_DIR}/distance $(CFLAGS)
word-analogy : word-analogy.c
	$(CC) word-analogy.c -o ${BIN_DIR}/word-analogy $(CFLAGS)
compute-accuracy : compute-accuracy.c
	$(CC) compute-accuracy.c -o ${BIN_DIR}/compute-accuracy $(CFLAGS)
	chmod +x ${SCRIPTS_DIR}/*.sh

clean:
	pushd ${BIN_DIR} && rm -rf word2vec word2phrase distance word-analogy compute-accuracy; popd

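Many articles online say to change the optimization flag that follows -pthread in CFLAGS (line 6 of the Makefile) to -O2. The version downloaded from GitHub already uses -O2, so no change is needed.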
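
If you are curious whether your compiler would accept -Ofast, the flag the Makefile comment mentions, a small probe along these lines may help. This is only a sketch: pick_opt_flag is a hypothetical helper, not part of word2vec, and it simply tries to compile an empty program with -Ofast and falls back to -O2 when that fails.

```shell
# Print -Ofast if the given compiler accepts it, otherwise print -O2.
# $1 is the compiler command (defaults to gcc).
pick_opt_flag() {
    local cc="${1:-gcc}" tmp
    # Compile into a throwaway directory so nothing is left behind.
    tmp="$(mktemp -d)" || { echo "-O2"; return; }
    if printf 'int main(void) { return 0; }\n' \
        | "$cc" -Ofast -x c -c -o "$tmp/probe.o" - >/dev/null 2>&1; then
        echo "-Ofast"
    else
        echo "-O2"
    fi
    rm -rf "$tmp"
}

# Example (run inside src, not executed here): a variable given on the
# make command line overrides the CFLAGS assignment in the Makefile.
#   make CFLAGS="-lm -pthread $(pick_opt_flag) -Wall -funroll-loops"
```

Sticking with the -O2 already in the Makefile is the safe default; the probe is only useful if you want to experiment.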

4. After make completes successfully, the previously empty bin directory contains the following files:

[webopa@hive001 word2vec-master]$ tree bin
bin
├── compute-accuracy
├── distance
├── word2phrase
├── word2vec
└── word-analogy

0 directories, 5 files

These are the binaries we just compiled.
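
If tree is not installed, the same check can be done with a few lines of shell. The sketch below uses a hypothetical helper, check_bins, which looks for the five targets named in the Makefile and reports any that are missing or not executable. It assumes you run it from the word2vec-master directory unless you pass another path.

```shell
# Report which of the five expected tools exist and are executable.
# $1 is the bin directory (defaults to ./bin).
# The return status is the number of missing tools (0 means all present).
check_bins() {
    local bin_dir="${1:-bin}" missing=0 name
    for name in word2vec word2phrase distance word-analogy compute-accuracy; do
        if [ -x "$bin_dir/$name" ]; then
            echo "ok       $name"
        else
            echo "missing  $name"
            missing=$((missing + 1))
        fi
    done
    return "$missing"
}

# Example (not executed here):
#   check_bins bin || echo "build is incomplete, re-run make in src"
```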

5. Run a demo

cd into scripts and take a look at demo-classes.sh

DATA_DIR=../data
SRC_DIR=../src
BIN_DIR=../bin

TEXT_DATA=$DATA_DIR/text8
CLASSES_DATA=$DATA_DIR/classes.txt

pushd ${SRC_DIR} && make; popd


if [ ! -e $CLASSES_DATA ]; then

  if [ ! -e $TEXT_DATA ]; then
    wget http://mattmahoney.net/dc/text8.zip -O $DATA_DIR/text8.gz
    gzip -d $DATA_DIR/text8.gz -f
  fi
  echo -----------------------------------------------------------------------------------------------------
  echo -- Training vectors...
  time $BIN_DIR/word2vec -train $TEXT_DATA -output $CLASSES_DATA -cbow 0 -size 200 -window 5 -negative 0 -hs 1 -sample 1e-3 -threads 12 -classes 500

fi

sort $CLASSES_DATA -k 2 -n > $DATA_DIR/classes.sorted.txt
echo The word classes were saved to file $DATA_DIR/classes.sorted.txt

The first time this script runs, it downloads the text8.gz file.
Once that file has finished downloading, run the script again:
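
Before the second run it may be worth confirming that the download really finished, since the script only checks that the text8 file exists and would happily train on a truncated one. The sketch below uses a hypothetical helper, file_has_size. The default of 100000000 bytes is the size usually documented for text8 and is an assumption here, not something this post states.

```shell
# Succeed only if FILE exists and is exactly EXPECTED bytes long.
# $1 = file, $2 = expected size in bytes (defaults to 100000000, the
# commonly documented size of text8; treat that number as an assumption).
file_has_size() {
    local file="$1" expected="${2:-100000000}" actual
    [ -f "$file" ] || return 1
    actual=$(( $(wc -c < "$file") ))   # arithmetic drops any padding wc adds
    [ "$actual" -eq "$expected" ]
}

# Example (run inside scripts, not executed here):
#   file_has_size ../data/text8 || rm -f ../data/text8   # force a fresh download
```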

[webopa@hive001 scripts]$ ./demo-classes.sh
~/lei.wang/word2vec/word2vec-master/src ~/lei.wang/word2vec/word2vec-master/scripts
gcc word2vec.c -o ../bin/word2vec -lm -pthread -O2 -Wall -funroll-loops
gcc word2phrase.c -o ../bin/word2phrase -lm -pthread -O2 -Wall -funroll-loops
gcc distance.c -o ../bin/distance -lm -pthread -O2 -Wall -funroll-loops
gcc word-analogy.c -o ../bin/word-analogy -lm -pthread -O2 -Wall -funroll-loops
gcc compute-accuracy.c -o ../bin/compute-accuracy -lm -pthread -O2 -Wall -funroll-loops
chmod +x ../scripts/*.sh
~/lei.wang/word2vec/word2vec-master/scripts
-----------------------------------------------------------------------------------------------------
-- Training vectors...
Starting training using file ../data/text8
Vocab size: 71291
Words in train file: 16718843
Alpha: 0.000122  Progress: 99.58%  Words/thread/sec: 14.48k

real    2m52.606s
user    20m24.567s
sys     0m1.328s
The word classes were saved to file ../data/classes.sorted.txt
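
To get a feel for what ends up in classes.sorted.txt, here is a small sketch on made-up data. With -classes, word2vec writes one "word class_id" pair per line, and the script's sort groups words by that id. The sample words and ids below are invented for illustration, and sort_by_class and words_in_class are hypothetical helpers rather than part of the word2vec scripts.

```shell
# Made-up sample in the "word class_id" layout (not real training output).
sample="$(mktemp)"
printf '%s\n' 'paris 12' 'apple 3' 'london 12' 'pear 3' 'seven 40' > "$sample"

# Same idea as the sort in demo-classes.sh: numeric sort on the second field,
# so words that share a class id end up next to each other.
sort_by_class() {
    sort -k 2 -n "$1"
}

# Print only the words assigned to one class id.
# $1 = classes file, $2 = class id
words_in_class() {
    awk -v id="$2" '$2 == id { print $1 }' "$1"
}

sort_by_class "$sample"
words_in_class "$sample" 12
```

On the real output you would point words_in_class at ../data/classes.sorted.txt and try a few ids between 0 and 499 to see which words were clustered together.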