Hands-on word2vec with Python


1. Introduction

    Here I use the word2vec implementation provided by Python's gensim module; implementations in other languages also exist. In short, word2vec is a tool that converts words into vectors a computer can compute with; the distance or angle between vectors reflects their closeness in the vector space, which in turn approximates the semantic similarity of the text. The simplest and most common representation in NLP is, of course, the bag-of-words model: each vector has one dimension per word in the vocabulary of the entire corpus, set to 1 if that word occurs in the document and 0 otherwise, so the resulting huge matrix is extremely sparse. Although this performs reasonably well on many tasks, it has a serious shortcoming: it cannot capture the context in which words appear in Chinese or English text, so similarity computed from it is bound to be inaccurate.
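To make the sparsity point concrete, here is a minimal sketch of the bag-of-words representation; the two documents and the helper function are made up for illustration:

```python
# Toy corpus, made up for illustration.
docs = [
    "the cat sat on the mat",
    "the dog ate my homework",
]
# The vocabulary spans the whole corpus, so every vector has one
# dimension per distinct word.
vocab = sorted({w for d in docs for w in d.split()})

def bow_vector(doc, vocab):
    """1 if the word occurs in the document, else 0."""
    words = set(doc.split())
    return [1 if w in words else 0 for w in vocab]

# Each document becomes a mostly-zero vector; with a real corpus the
# vocabulary (and hence the zero count) would be enormous.
print(vocab)
for d in docs:
    print(bow_vector(d, vocab))
```

With a realistic vocabulary of tens of thousands of words, almost every entry of every vector is 0 — exactly the sparsity the paragraph above describes.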

    The later introduction of the LDA topic-modeling algorithm improved this situation considerably, because it no longer represents a document as a plain bag of words but as a probability distribution over topics used as the feature vector. If you want to learn more about it, a quick search will turn up plenty of material.

    To install gensim, simply run:

    pip install gensim

    or, if you have Anaconda installed: conda install gensim

2. Using word2vec

    Word vectors are generally learned as a by-product of training a language model; the two are inseparable. Commonly used features in such learning include word frequency, tf-idf values, part of speech, and so on.
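As a concrete example of one of these features, here is a stdlib-only sketch of the tf-idf value; the toy documents are made up for illustration:

```python
import math

# Toy tokenized documents, made up for illustration.
docs = [
    ["the", "cat", "sat"],
    ["the", "dog"],
    ["the", "cat", "ate"],
]

def tf_idf(term, doc, docs):
    """Term frequency in one document times inverse document frequency."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in docs if term in d)  # documents containing the term
    idf = math.log(len(docs) / df)
    return tf * idf

# "the" occurs in every document, so its idf (and hence tf-idf) is 0;
# "cat" occurs in only two of the three, so it gets a positive weight.
print(tf_idf("the", docs[0], docs))
print(tf_idf("cat", docs[0], docs))
```

The weighting downplays words that appear everywhere and emphasizes words that distinguish one document from another.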

    gensim's input is a large corpus for training the language model: a txt file containing the text after word segmentation, with tokens separated by spaces. Naturally, the larger the training corpus, the more accurate the resulting model; with only a small amount of text the results will be poor. Here I found online a corpus file, big.txt, which had been used to train a Bayesian spelling-correction model; a link to this txt file is given at the end of the post.
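That file format — one pre-tokenized, space-separated sentence per line — can be produced with a few lines of plain Python. A sketch; the sample sentences and the crude regex tokenizer are my own, not from the post:

```python
import re

# Made-up sample text; a real training corpus should be far larger.
raw_docs = [
    "Word2vec learns dense vectors from raw text.",
    "Similar words end up close together in vector space.",
]

def tokenize(text):
    """Crude tokenizer: lowercase runs of letters, digits, apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())

# One space-separated sentence per line, the format gensim expects.
with open("corpus.txt", "w") as f:
    for doc in raw_docs:
        f.write(" ".join(tokenize(doc)) + "\n")

print(open("corpus.txt").read())
```

For Chinese text the tokenize step would instead be a word segmenter, since words are not delimited by spaces to begin with.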

    I ran a quick experiment with it and found it quite interesting; I will probably use it often in later work. The program and its output are below:

#!/usr/bin/env python
# encoding: utf-8
'''A quick test of gensim's word2vec.'''
import logging

import gensim

logging.basicConfig(format='%(asctime)s:%(levelname)s: %(message)s', level=logging.INFO)

sentences = gensim.models.word2vec.Text8Corpus("big.txt")  # load the corpus
# Train 200-dimensional vectors; the defaults are the CBOW architecture
# (sg=0) and window=5, as the training log below confirms.
# (In gensim >= 4.0 the parameter is named `vector_size` instead of `size`,
# and similarity queries live on `model.wv`.)
model = gensim.models.word2vec.Word2Vec(sentences, size=200)

print('------------------------------ model ------------------------------')
print(model)
print('---------------- similarity(remains, remarkable) ----------------')
print(model.similarity("remains", "remarkable"))
print('---------------- top 50 words most similar to "actor" ----------------')
actor_model = model.most_similar("actor", topn=50)  # the 50 most similar words
print(actor_model)
for word, score in actor_model:
    print(word, score)
The output is:

2017-05-25 15:45:14,659:INFO: collecting all words and their counts
2017-05-25 15:45:14,661:INFO: PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2017-05-25 15:45:15,072:INFO: collected 81421 word types from a corpus of 1095776 raw words and 110 sentences
2017-05-25 15:45:15,072:INFO: Loading a fresh vocabulary
2017-05-25 15:45:15,191:INFO: min_count=5 retains 17465 unique words (21% of original 81421, drops 63956)
2017-05-25 15:45:15,191:INFO: min_count=5 leaves 998994 word corpus (91% of original 1095776, drops 96782)
2017-05-25 15:45:15,310:INFO: deleting the raw counts dictionary of 81421 items
2017-05-25 15:45:15,322:INFO: sample=0.001 downsamples 42 most-common words
2017-05-25 15:45:15,323:INFO: downsampling leaves estimated 746384 word corpus (74.7% of prior 998994)
2017-05-25 15:45:15,323:INFO: estimated required memory for 17465 words and 200 dimensions: 36676500 bytes
2017-05-25 15:45:15,372:INFO: resetting layer weights
2017-05-25 15:45:15,688:INFO: training model with 3 workers on 17465 vocabulary and 200 features, using sg=0 hs=0 sample=0.001 negative=5 window=5
2017-05-25 15:45:16,703:INFO: PROGRESS: at 14.18% examples, 518551 words/s, in_qsize 5, out_qsize 0
2017-05-25 15:45:17,727:INFO: PROGRESS: at 28.73% examples, 521294 words/s, in_qsize 5, out_qsize 0
2017-05-25 15:45:18,731:INFO: PROGRESS: at 42.73% examples, 524322 words/s, in_qsize 6, out_qsize 0
2017-05-25 15:45:19,744:INFO: PROGRESS: at 57.09% examples, 525288 words/s, in_qsize 5, out_qsize 0
2017-05-25 15:45:20,753:INFO: PROGRESS: at 71.82% examples, 527862 words/s, in_qsize 6, out_qsize 0
2017-05-25 15:45:21,759:INFO: PROGRESS: at 86.00% examples, 527994 words/s, in_qsize 6, out_qsize 1
2017-05-25 15:45:22,728:INFO: worker thread finished; awaiting finish of 2 more threads
2017-05-25 15:45:22,740:INFO: worker thread finished; awaiting finish of 1 more threads
2017-05-25 15:45:22,741:INFO: worker thread finished; awaiting finish of 0 more threads
2017-05-25 15:45:22,741:INFO: training on 5478880 raw words (3730989 effective words) took 7.1s, 529131 effective words/s
------------------------------ model ------------------------------
Word2Vec(vocab=17465, size=200, alpha=0.025)
---------------- similarity(remains, remarkable) ----------------
0.921331645403
---------------- top 50 words most similar to "actor" ----------------
2017-05-25 15:45:22,742:INFO: precomputing L2-norms of word weight vectors
[(u'21', 0.9900971055030823), (u'14', 0.988671600818634), (u'61', 0.9884299635887146), (u'herb', 0.9877204895019531), (u'tastes', 0.9873628616333008), (u'120', 0.9872018098831177), (u'John,', 0.9868811368942261), (u'roughly', 0.9868083000183105), (u'158,', 0.9866986274719238), (u'privately', 0.9865480065345764), (u'th', 0.9864736795425415), (u'policy,', 0.9864664077758789), (u'kick', 0.9863194823265076), (u'3,', 0.9862992167472839), (u'd', 0.986174464225769), (u'scatter', 0.9861586689949036), (u'decline', 0.9861509799957275), (u'chronic,', 0.9860124588012695), (u'Fort,', 0.9857746958732605), (u"Roylott's", 0.9856908321380615), (u'adenoma,', 0.9856142401695251), (u'olive', 0.9855972528457642), (u'punch', 0.9855948686599731), (u'courtesy', 0.9853811264038086), (u'451', 0.9853434562683105), (u'adverse', 0.9853295683860779), (u'plausible', 0.9851090312004089), (u'sack', 0.9850969314575195), (u'154', 0.9850865006446838), (u'412', 0.9850860834121704), (u'368', 0.9850307106971741), (u'classical', 0.9850298166275024), (u'thorough', 0.9849967360496521), (u'27', 0.9849128723144531), (u'Harrison', 0.9849109053611755), (u'437', 0.9848912954330444), (u'James,', 0.9848843812942505), (u'121', 0.9848609566688538), (u"Klapp's", 0.9848318099975586), (u'87', 0.984637975692749), (u'republican', 0.9846112132072449), (u'supplement', 0.9845895767211914), (u'isolation', 0.9845013618469238), (u'blast', 0.984484851360321), (u'Suction', 0.9844322204589844), (u'pile', 0.9843543767929077), (u'secular', 0.9843310117721558), (u'94', 0.9842190742492676), (u'convict', 0.9842174649238586), (u'147', 0.9842002391815186)]
21 0.990097105503
14 0.988671600819
61 0.988429963589
herb 0.987720489502
tastes 0.987362861633
120 0.987201809883
John, 0.986881136894
roughly 0.986808300018
158, 0.986698627472
privately 0.986548006535
th 0.986473679543
policy, 0.986466407776
kick 0.986319482327
3, 0.986299216747
d 0.986174464226
scatter 0.986158668995
decline 0.986150979996
chronic, 0.986012458801
Fort, 0.985774695873
Roylott's 0.985690832138
adenoma, 0.98561424017
olive 0.985597252846
punch 0.98559486866
courtesy 0.985381126404
451 0.985343456268
adverse 0.985329568386
plausible 0.9851090312
sack 0.985096931458
154 0.985086500645
412 0.985086083412
368 0.985030710697
classical 0.985029816628
thorough 0.98499673605
27 0.984912872314
Harrison 0.984910905361
437 0.984891295433
James, 0.984884381294
121 0.984860956669
Klapp's 0.984831809998
87 0.984637975693
republican 0.984611213207
supplement 0.984589576721
isolation 0.984501361847
blast 0.98448485136
Suction 0.984432220459
pile 0.984354376793
secular 0.984331011772
94 0.984219074249
convict 0.984217464924
147 0.984200239182
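For reference, the similarity score that model.similarity() returns is the cosine similarity of the two words' vectors. A stdlib-only sketch of the computation, with made-up 3-dimensional vectors:

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Parallel vectors score 1.0; orthogonal vectors score 0.0.
print(cosine([1.0, 2.0, 0.0], [2.0, 4.0, 0.0]))
print(cosine([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))
```

Note that many of the top-"actor" neighbors above are numbers and punctuation-glued tokens with very high scores — a symptom of big.txt being far too small (and too crudely tokenized) for reliable word vectors, which echoes the earlier warning about corpus size.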

big.txt is stored on Baidu cloud:

http://pan.baidu.com/s/1c2DzqYs
