word2vec Python interface: installation and usage


https://github.com/danielfrg/word2vec

Installation

I recommend the Anaconda Python distribution

pip install word2vec

Wheel: Wheel packages for OS X and Windows are provided on PyPI on a best-effort basis. The code is quite easy to compile, so consider using --no-use-wheel on Linux and OS X.

Linux: There is no wheel support for Linux, so you have to compile the C code. The only requirement is gcc. You can override the compilation flags if needed: CFLAGS='-march=corei7' pip install word2vec

Windows: Very experimental support based on this win32 port

%load_ext autoreload
%autoreload 2

word2vec

This notebook is equivalent to demo-word.sh, demo-analogy.sh, demo-phrases.sh, and demo-classes.sh from Google.

Training

Download some data, for example: http://mattmahoney.net/dc/text8.zip

In [2]:
import word2vec

Run word2phrase to group similar words, for example turning "Los Angeles" into the single token "Los_Angeles".

In [3]:
word2vec.word2phrase('/Users/drodriguez/Downloads/text8', '/Users/drodriguez/Downloads/text8-phrases', verbose=True)
[u'word2phrase', u'-train', u'/Users/drodriguez/Downloads/text8', u'-output', u'/Users/drodriguez/Downloads/text8-phrases', u'-min-count', u'5', u'-threshold', u'100', u'-debug', u'2']
Starting training using file /Users/drodriguez/Downloads/text8
Words processed: 17000K     Vocab size: 4399K
Vocab size (unigrams + bigrams): 2419827
Words in train file: 17005206

This will create a text8-phrases file that we can use as a better input for word2vec. Note that you could easily skip this previous step and use the original data as input for word2vec.
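If you want to skip word2phrase, a minimal sketch of training directly on the raw corpus (same paths and size as the calls in this notebook; this variant is not part of the original demo):

import word2vec
# train on the raw text8 file instead of text8-phrases; multi-word tokens such as
# "los_angeles" will then not appear in the vocabulary
word2vec.word2vec('/Users/drodriguez/Downloads/text8', '/Users/drodriguez/Downloads/text8.bin', size=100, verbose=True)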

Train the model using the word2phrase output.

In [4]:
word2vec.word2vec('/Users/drodriguez/Downloads/text8-phrases', '/Users/drodriguez/Downloads/text8.bin', size=100, verbose=True)
Starting training using file /Users/drodriguez/Downloads/text8-phrases
Vocab size: 98331
Words in train file: 15857306
Alpha: 0.000002  Progress: 100.03%  Words/thread/sec: 286.52k

That generated a text8.bin file containing the word vectors in a binary format.

Do the clustering of the vectors based on the trained model.

In [5]:
word2vec.word2clusters('/Users/drodriguez/Downloads/text8', '/Users/drodriguez/Downloads/text8-clusters.txt', 100, verbose=True)
Starting training using file /Users/drodriguez/Downloads/text8
Vocab size: 71291
Words in train file: 16718843
Alpha: 0.000002  Progress: 100.02%  Words/thread/sec: 287.55k

That created a text8-clusters.txt file with the cluster assignment for every word in the vocabulary.

Predictions

In [1]:
import word2vec

Import the word2vec binary file created above

In [2]:
model = word2vec.load('/Users/drodriguez/Downloads/text8.bin')

We can take a look at the vocabulary as a numpy array

In [3]:
model.vocab
Out[3]:
array([u'</s>', u'the', u'of', ..., u'dakotas', u'nias', u'burlesques'],
      dtype='<U78')

Or take a look at the whole matrix

In [4]:
model.vectors.shape
Out[4]:
(98331, 100)
In [5]:
model.vectors
Out[5]:
array([[ 0.14333282,  0.15825513, -0.13715845, ...,  0.05456942,
         0.10955409,  0.00693387],
       [ 0.1220774 ,  0.04939618,  0.09545057, ..., -0.00804222,
        -0.05441621, -0.10076696],
       [ 0.16844609,  0.03734054,  0.22085373, ...,  0.05854521,
         0.04685341,  0.02546694],
       ...,
       [-0.06760896,  0.03737842,  0.09344187, ...,  0.14559349,
        -0.11704484, -0.05246212],
       [ 0.02228479, -0.07340827,  0.15247506, ...,  0.01872172,
        -0.18154132, -0.06813737],
       [ 0.02778879, -0.06457976,  0.07102411, ..., -0.00270281,
        -0.0471223 , -0.135444  ]])

We can retrieve the vector of individual words

In [6]:
model['dog'].shape
Out[6]:
(100,)
In [7]:
model['dog'][:10]
Out[7]:
array([ 0.05753701,  0.0585594 ,  0.11341395,  0.02016246,  0.11514406,
        0.01246986,  0.00801256,  0.17529851,  0.02899276,  0.0203866 ])

We can do simple queries to retrieve words similar to "socks" based on cosine similarity:

In [8]:
indexes, metrics = model.cosine('socks')
indexes, metrics
Out[8]:
(array([20002, 28915, 30711, 33874, 27482, 14631, 22992, 24195, 25857, 23705]),
 array([ 0.8375354 ,  0.83590846,  0.82818749,  0.82533614,  0.82278399,
         0.81476386,  0.8139092 ,  0.81253798,  0.8105933 ,  0.80850171]))

This returned a tuple with 2 items:

  1. numpy array with the indexes of the similar words in the vocabulary
  2. numpy array with cosine similarity to each word

It's possible to get the words for those indexes

In [9]:
model.vocab[indexes]
Out[9]:
array([u'hairy', u'pumpkin', u'gravy', u'nosed', u'plum', u'winged',
       u'bock', u'petals', u'biscuits', u'striped'],
      dtype='<U78')
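The metrics are plain cosine similarities between the stored vectors. As a sanity check, here is a minimal sketch (not part of the original notebook) that recomputes the first metric with numpy:

import numpy as np

# cosine similarity between "socks" and its top match, computed by hand
v_socks = model['socks']
v_top = model.vectors[indexes[0]]
np.dot(v_socks, v_top) / (np.linalg.norm(v_socks) * np.linalg.norm(v_top))
# this should be close to metrics[0] (~0.84 in the output above)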

There is a helper function to create a combined response: a numpy record array

In [10]:
model.generate_response(indexes, metrics)
Out[10]:
rec.array([(u'hairy', 0.8375353970603848), (u'pumpkin', 0.8359084628493809),
       (u'gravy', 0.8281874915608026), (u'nosed', 0.8253361379785071),
       (u'plum', 0.8227839904046932), (u'winged', 0.8147638561412592),
       (u'bock', 0.8139092031538545), (u'petals', 0.8125379796045767),
       (u'biscuits', 0.8105933044655644), (u'striped', 0.8085017054444408)],
      dtype=[(u'word', '<U78'), (u'metric', '<f8')])
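Because this is a numpy record array, the columns can also be accessed by field name (the names 'word' and 'metric' come from the dtype above); a small usage sketch, not in the original notebook:

response = model.generate_response(indexes, metrics)
response['word']    # just the similar words
response['metric']  # just the cosine similarities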

It is easy to make that numpy array a pure Python response:

In [11]:
model.generate_response(indexes, metrics).tolist()
Out[11]:
[(u'hairy', 0.8375353970603848), (u'pumpkin', 0.8359084628493809), (u'gravy', 0.8281874915608026), (u'nosed', 0.8253361379785071), (u'plum', 0.8227839904046932), (u'winged', 0.8147638561412592), (u'bock', 0.8139092031538545), (u'petals', 0.8125379796045767), (u'biscuits', 0.8105933044655644), (u'striped', 0.8085017054444408)]

Phrases

Since we trained the model with the output of word2phrase, we can ask for the similarity of "phrases":

In [12]:
indexes, metrics = model.cosine('los_angeles')
model.generate_response(indexes, metrics).tolist()
Out[12]:
[(u'san_francisco', 0.886558000570455), (u'san_diego', 0.8731961018831669), (u'seattle', 0.8455603712285231), (u'las_vegas', 0.8407843553947962), (u'miami', 0.8341796009062884), (u'detroit', 0.8235412519780195), (u'cincinnati', 0.8199138493085706), (u'st_louis', 0.8160655356728751), (u'chicago', 0.8156786240847214), (u'california', 0.8154244925085712)]
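Phrase tokens produced by word2phrase are ordinary vocabulary entries, so every lookup shown earlier works on them as well; for example (an illustrative line, not in the original notebook):

model['los_angeles'].shape  # (100,), the same shape as any single-word vector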

Analogies

It's possible to do more complex queries like analogies, such as: king - man + woman = queen. This method returns the same as cosine: the indexes of the words in the vocabulary and the metrics.

In [13]:
indexes, metrics = model.analogy(pos=['king', 'woman'], neg=['man'], n=10)
indexes, metrics
Out[13]:
(array([1087, 1145, 7523, 3141, 6768, 1335, 8419, 1826,  648, 1426]),
 array([ 0.2917969 ,  0.27353295,  0.26877692,  0.26596514,  0.26487509,
         0.26428581,  0.26315492,  0.26261258,  0.26136635,  0.26099078]))
In [14]:
model.generate_response(indexes, metrics).tolist()
Out[14]:
[(u'queen', 0.2917968955611075), (u'prince', 0.27353295205311695), (u'empress', 0.2687769174818083), (u'monarch', 0.2659651399832089), (u'regent', 0.26487508713026797), (u'wife', 0.2642858109968327), (u'aragon', 0.2631549214361766), (u'throne', 0.26261257728511833), (u'emperor', 0.2613663460665488), (u'bishop', 0.26099078142148696)]
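Conceptually, the analogy query is just vector arithmetic followed by a cosine ranking over the whole vocabulary. A rough sketch of that idea with numpy (a simplified illustration, not the library's actual implementation):

import numpy as np

# build the query vector: king - man + woman
query = model['king'] - model['man'] + model['woman']
query /= np.linalg.norm(query)

# rank every vocabulary vector by cosine similarity to the query
scores = model.vectors.dot(query) / np.linalg.norm(model.vectors, axis=1)
top = np.argsort(-scores)[:15]
model.vocab[top]  # "queen" should show up near the top, after the query words themselves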

Clusters

In [15]:
clusters = word2vec.load_clusters('/Users/drodriguez/Downloads/text8-clusters.txt')

We can get the cluster number for individual words

In [16]:
clusters['dog']
Out[16]:
11

We can get all the words grouped in a specific cluster

In [17]:
clusters.get_words_on_cluster(90).shape
Out[17]:
(221,)
In [18]:
clusters.get_words_on_cluster(90)[:10]
Out[18]:
array(['along', 'together', 'associated', 'relationship', 'deal',
       'combined', 'contact', 'connection', 'bond', 'respect'], dtype=object)
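Combining the two lookups above, we can ask which words share a cluster with a given word; a small sketch that only uses the calls already shown (not part of the original notebook):

dog_cluster = clusters['dog']                    # 11 in the output above
clusters.get_words_on_cluster(dog_cluster)[:10]  # some of the words assigned to that cluster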

We can add the clusters to the word2vec model and generate a response that includes the clusters

In [19]:
model.clusters = clusters
In [20]:
indexes, metrics = model.analogy(pos=['paris', 'germany'], neg=['france'], n=10)
In [21]:
model.generate_response(indexes, metrics).tolist()
Out[21]:
[(u'berlin', 0.32333651414395953, 20), (u'munich', 0.28851564633559, 20), (u'vienna', 0.2768927258877336, 12), (u'leipzig', 0.2690537010929304, 91), (u'moscow', 0.26531859560322785, 74), (u'st_petersburg', 0.259534503067277, 61), (u'prague', 0.25000637367753303, 72), (u'dresden', 0.2495974800117785, 71), (u'bonn', 0.24403155303236473, 8), (u'frankfurt', 0.24199720792200027, 31)]