Word vectors: loading word2vec and GloVe


1 Google used word2vec to pretrain 300-dimensional word vectors on a news corpus, distributed as GoogleNews-vectors-negative300.bin. The file is 3.39 GB after decompression.


It can be loaded with gensim, but the machine needs enough memory to hold it.

# Load the word vectors pretrained by Google
import gensim

model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
print(model['love'])
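To make the memory requirement concrete, here is a minimal sketch that estimates the size of the vector matrix. It assumes the GoogleNews model holds 3,000,000 entries stored as 4-byte floats, which gives about 3.35 GiB for the vectors alone. The helper names `estimate_vector_bytes` and `load_top_vectors` are made up for this illustration; the `limit` argument of `load_word2vec_format` is the gensim parameter that reads only the first N vectors in the file.

```python
def estimate_vector_bytes(num_words, dim, bytes_per_value=4):
    """Rough size of the dense vector matrix, assuming float32 values."""
    return num_words * dim * bytes_per_value


def load_top_vectors(path, limit):
    """Load only the first `limit` vectors of a word2vec binary file."""
    # Imported here so the estimate above can be used without gensim installed.
    from gensim.models import KeyedVectors
    return KeyedVectors.load_word2vec_format(path, binary=True, limit=limit)


# Assumed vocabulary size of the GoogleNews model: 3,000,000 entries.
full_size = estimate_vector_bytes(3_000_000, 300)
partial_size = estimate_vector_bytes(500_000, 300)
print("full model:    %.2f GiB" % (full_size / 2**30))
print("first 500,000: %.2f GiB" % (partial_size / 2**30))
```

If memory is tight, something like `load_top_vectors('GoogleNews-vectors-negative300.bin', 500000)` keeps only the first 500,000 entries of the file, at the cost of dropping the remaining words.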


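2 Word vectors pretrained with GloVe can also be loaded with gensim, but one extra step is needed before loading. The code below serves as a reference.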

The 300-dimensional GloVe vectors take up 5.25 GB.

# To open GloVe vectors with gensim, a header line must be added at the top of the file:
# <number of words> <vector dimension>
import gensim
import shutil
from sys import platform


# Count the lines in the file, which equals the number of words
def getFileLineNums(filename):
    count = 0
    with open(filename, 'r', encoding='utf-8') as f:
        for line in f:
            count += 1
    return count


# Open the word-vector file on Linux or Windows and add one line at the start
def prepend_line(infile, outfile, line):
    with open(infile, 'r', encoding='utf-8') as old:
        with open(outfile, 'w', encoding='utf-8') as new:
            new.write(str(line) + "\n")
            shutil.copyfileobj(old, new)


def prepend_slow(infile, outfile, line):
    with open(infile, 'r', encoding='utf-8') as fin:
        with open(outfile, 'w', encoding='utf-8') as fout:
            fout.write(line + "\n")
            for row in fin:
                fout.write(row)


def load(filename):
    num_lines = getFileLineNums(filename)
    gensim_file = 'glove_model.txt'
    gensim_first_line = "{} {}".format(num_lines, 300)
    # Prepends the line.
    if platform == "linux" or platform == "linux2":
        prepend_line(filename, gensim_file, gensim_first_line)
    else:
        prepend_slow(filename, gensim_file, gensim_first_line)
    # Return the loaded model so the caller can use it
    model = gensim.models.KeyedVectors.load_word2vec_format(gensim_file)
    return model


model = load('glove.840B.300d.txt')
The generated glove_model.txt is a model file that gensim can open directly.
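The header step can also be sketched without hardcoding the dimension. The version below is only an illustration: the function name `add_word2vec_header` is made up, and it assumes a UTF-8 text file in which the token on the first line contains no spaces, so the dimension can be read off that line.

```python
import shutil


def add_word2vec_header(glove_path, out_path):
    """Write a copy of a GloVe text file with a "<count> <dim>" header line."""
    count = 0
    dim = None
    # First pass: count the lines and infer the dimension from the first line.
    with open(glove_path, "r", encoding="utf-8") as src:
        for line in src:
            if dim is None:
                # One token followed by `dim` space-separated numbers.
                dim = len(line.rstrip("\n").split(" ")) - 1
            count += 1
    if dim is None:
        raise ValueError("empty vector file: " + glove_path)
    # Second pass: write the header, then copy the original content unchanged.
    with open(glove_path, "r", encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        dst.write("{} {}\n".format(count, dim))
        shutil.copyfileobj(src, dst)
    return count, dim
```

Calling `add_word2vec_header('glove.840B.300d.txt', 'glove_model.txt')` would produce the same kind of file as above. gensim also ships a conversion helper, `gensim.scripts.glove2word2vec`, and from gensim 4.0 on `load_word2vec_format` accepts `no_header=True`, so with a recent version the GloVe file may be loadable without writing a second copy.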


