Word Vector Source Code Analysis: (5.3) ngram2vec Source Code Analysis: corpus2vocab

Source: Internet | Published by: odds ends相机数据 | Editor: 程序博客网 | Time: 2024/06/05 06:06

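In the hyperwords toolkit we saw that a few lines of Python are enough to build a vocabulary, which is far more convenient than the several hundred lines of C it takes otherwise. However, hyperwords builds the vocabulary without considering how much memory the machine has, so it can easily run out of memory. Both word2vec and GloVe have a ReduceVocab-style function that throws away some low-frequency words while the vocabulary is being built. The resulting vocabulary is incomplete, but in practice this causes no problems. Although word2vec and GloVe have ReduceVocab, they do not monitor memory, so we have to rely on experience to decide when ReduceVocab should run.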

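corpus2vocab in ngram2vec monitors memory, which lets us obtain a vocabulary that is as complete as possible within a limited amount of memory. The code reduces the vocabulary when memory is about to run out. In our experience this mechanism matters, for example when we introduce ngram features or other, more complex features: the vocabulary then becomes very large, far beyond what fits in memory. In that situation we need to make full use of the memory we have while removing low-frequency words as accurately as possible. Let's look at the code, which takes ngram features into account. The getNgram function reads an n-tuple (ngram) starting at a given position in a sentence.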

def getNgram(tokens, pos, gram): # uni: gram=1, bi: gram=2, tri: gram=3
    # tokens is the list of words, pos is the position in the list to read from,
    # gram is the order of the n-gram to read
    if pos < 0: # the n-gram must not run out of bounds
        return None
    if pos + gram > len(tokens):
        return None
    token = tokens[pos] # first word of the n-gram
    for i in xrange(1, gram): # remaining words of the n-gram, joined with "@$"
        token = token + "@$" + tokens[pos + i]
    return token
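
To make the behavior concrete, here is a small sketch in Python 3 (the original code is Python 2). The name `get_ngram` and the example sentence are our own, chosen for illustration; the slicing and `join` are just a compact way of writing the same concatenation. It lists every unigram and bigram that can be read from a short sentence.

```python
def get_ngram(tokens, pos, gram):
    """Return the n-gram of order `gram` starting at `pos`, or None if out of range."""
    if pos < 0 or pos + gram > len(tokens):
        return None
    # Join the words with the same "@$" separator used by getNgram.
    return "@$".join(tokens[pos:pos + gram])


tokens = "the quick brown fox".split()
ngrams = []
for pos in range(len(tokens)):
    for gram in range(1, 3):  # unigrams and bigrams
        token = get_ngram(tokens, pos, gram)
        if token is not None:
            ngrams.append(token)

print(ngrams)
# ['the', 'the@$quick', 'quick', 'quick@$brown', 'brown', 'brown@$fox', 'fox']
```

Note that the last position yields only a unigram, because a bigram starting there would run past the end of the list.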

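Next, the variables in the main function. getsizeof() returns the memory occupied by a variable, but for a data structure like a dictionary, the memory occupied by the strings it holds is not included in getsizeof(vocab), so the memory used by the strings has to be tracked separately.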

from sys import getsizeof # needed by the counting loop

ngram = int(args['--ngram']) # highest order of ngram to consider
memory_size = float(args['--memory_size']) * 1000**3 # memory budget in bytes, 1 GB = 1000**3 bytes
min_count = int(args['--min_count']) # threshold for low-frequency words
vocab = {} # vocabulary (stored by dictionary)
# remove low-frequency words when memory is insufficient:
# the first time memory fills up, drop words with count 1; the next time, count 2, and so on
reduce_thr = 1
memory_size_used = 0 # estimated size of memory used by keys & values in dictionary (not including the dictionary itself)
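
The point about getsizeof() can be checked with a short experiment. This is only a sketch in Python 3, assuming CPython; the dictionaries `small` and `large` are made up for the demonstration. Both hold the same number of keys, but the keys in `large` are much longer.

```python
from sys import getsizeof

small = {"w%d" % i: 1 for i in range(1000)}
large = {("w%d" % i) * 50: 1 for i in range(1000)}  # same number of keys, much longer strings

# The size reported for the container depends on the number of entries,
# not on how long the key strings are.
same_container_size = getsizeof(small) == getsizeof(large)
print(same_container_size)

# The strings themselves have to be added up separately.
key_bytes_small = sum(getsizeof(k) for k in small)
key_bytes_large = sum(getsizeof(k) for k in large)
print(key_bytes_small, key_bytes_large)
```

This is why the code keeps a separate running total, memory_size_used, next to getsizeof(vocab).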

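Now the vocabulary is built. This is somewhat more complex than hyperwords because it handles running out of memory, but the code is still much simpler than the C equivalent.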

with open(args['<corpus>']) as f:
    tokens_num = 0
    print str(tokens_num/1000**2) + "M tokens processed."
    for line in f:
        # "\x1b[1A" moves the cursor up one line, so the progress counter is overwritten in place
        print "\x1b[1A" + str(tokens_num/1000**2) + "M tokens processed."
        tokens = line.strip().split() # split the line into tokens
        tokens_num += len(tokens)
        for pos in xrange(len(tokens)): # visit every position in the line
            for gram in xrange(1, ngram+1): # consider every ngram order at this position
                token = getNgram(tokens, pos, gram) # read the ngram
                if token is None :
                    continue
                if token not in vocab : # a new ngram is added to the vocabulary, which may exhaust memory
                    memory_size_used += getsizeof(token)
                    vocab[token] = 1
                    # reduce vocabulary when memory is insufficient; 0.8 is an empirical value
                    # total usage = memory used by the strings + memory used by the dictionary
                    if memory_size_used + getsizeof(vocab) > memory_size * 0.8:
                        # raise the threshold by 1; the filter below uses >=,
                        # so words with count reduce_thr-1 or lower are deleted
                        reduce_thr += 1
                        vocab_size = len(vocab)
                        # the reduce step is a single line of Python: throw away low-frequency words
                        vocab = {w: c for w, c in vocab.iteritems() if c >= reduce_thr}
                        # estimate the size of memory used again, since many words were removed
                        memory_size_used *= float(len(vocab)) / vocab_size
                else:
                    vocab[token] += 1 # already in the vocabulary: increase its count by 1
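
The loop above can be tried on a toy corpus. The following is a minimal sketch in Python 3, not the toolkit's code: the function name `build_vocab`, the `memory_budget` and `safety` parameters, and the toy corpus are all assumptions made for illustration. It follows the same steps: count ngrams, track the bytes held by the key strings, and prune with a rising threshold whenever the estimate exceeds 80% of the budget.

```python
from sys import getsizeof


def build_vocab(lines, ngram, memory_budget, safety=0.8):
    """Count n-grams, pruning rare entries whenever the size estimate exceeds the budget."""
    vocab = {}
    reduce_thr = 1   # after a pruning, only entries with count >= reduce_thr remain
    key_bytes = 0    # running estimate of the bytes held by the key strings
    for line in lines:
        tokens = line.split()
        for pos in range(len(tokens)):
            for gram in range(1, ngram + 1):
                if pos + gram > len(tokens):
                    break  # longer n-grams at this position would also run out of bounds
                token = "@$".join(tokens[pos:pos + gram])
                if token in vocab:
                    vocab[token] += 1
                    continue
                vocab[token] = 1
                key_bytes += getsizeof(token)
                if key_bytes + getsizeof(vocab) > memory_budget * safety:
                    reduce_thr += 1
                    old_size = len(vocab)
                    vocab = {w: c for w, c in vocab.items() if c >= reduce_thr}
                    # Rescale the estimate by the fraction of entries that survived.
                    key_bytes *= len(vocab) / old_size
    return vocab, reduce_thr


# Toy corpus: two frequent words followed by 400 words that each occur once.
lines = ["a b a b a b"] * 50 + ["x%d y%d" % (i, i) for i in range(200)]
vocab, thr = build_vocab(lines, ngram=1, memory_budget=4000)
print(vocab["a"], vocab["b"], len(vocab), thr)
```

With such a tiny budget the vocabulary is pruned several times. The frequent words keep their exact counts, while most of the words seen only once are discarded. The rescaling of the byte estimate is approximate, since it assumes the surviving keys have the same average length as the removed ones.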

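Finally, low-frequency words are removed from the vocabulary, which is then sorted and written to disk. The code is concise. If memory is too small, reduce_thr may end up larger than the preset min_count. Nothing can be done about that: the available memory has already been used as fully as possible.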

vocab = {w: c for w, c in vocab.iteritems() if c >= min_count} # remove low-frequency words by pre-specified threshold
vocab = sorted(vocab.iteritems(), key=lambda item: item[1], reverse=True) # sort vocabulary by frequency in descending order; this may use more memory
save_count_vocabulary(args['<output>'], vocab) # write the vocabulary through the shared interface
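
The final step can also be sketched in Python 3. The function names `finalize_vocab` and `write_vocab` are hypothetical, and the "word count" line format is an assumption for illustration only; we have not shown the actual format used by save_count_vocabulary here. The sketch shows that sorting turns the dictionary into a list of (word, count) pairs, which is what gets written.

```python
import io


def finalize_vocab(vocab, min_count):
    """Drop entries below min_count and return (word, count) pairs, most frequent first."""
    kept = {w: c for w, c in vocab.items() if c >= min_count}
    return sorted(kept.items(), key=lambda item: item[1], reverse=True)


def write_vocab(out, pairs):
    """Hypothetical writer: one 'word count' pair per line."""
    for word, count in pairs:
        out.write("%s %d\n" % (word, count))


pairs = finalize_vocab({"the": 5, "the@$cat": 2, "cat": 3, "rare": 1}, min_count=2)
buf = io.StringIO()  # stands in for the output file
write_vocab(buf, pairs)
print(buf.getvalue())
```

Because sorted() builds a new list alongside the filtered dictionary, this step temporarily needs extra memory, which matches the remark in the code comment.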
