NLP06 - A Brief Analysis of Gensim Source Code [Dictionary]




Abstract: An analysis of the dictionary source code in Gensim, covering the Dictionary class and the HashDictionary class, comparing their similarities and differences, with a focus on the doc2bow function, interpreted line by line.

0. Prerequisites

NLP05 - Gensim Source Code [Packages and Interfaces]: http://blog.csdn.net/ld326/article/details/78379449

1. Example

```python
from gensim import corpora

texts = [['human', 'interface', 'computer'],
         ['survey', 'user', 'computer', 'system', 'response', 'time'],
         ['eps', 'user', 'interface', 'system'],
         ['system', 'human', 'system', 'eps'],
         ['user', 'response', 'time'],
         ['trees'],
         ['graph', 'trees'],
         ['graph', 'minors', 'trees'],
         ['graph', 'minors', 'survey']]

print('dictionary:')
dictionary = corpora.Dictionary(texts)
print(dictionary.token2id)

print('hash dictionary:')
hashDic = corpora.HashDictionary(texts)
print(hashDic.token2id)
```

Output:

```
dictionary:
{'trees': 9, 'minors': 11, 'user': 4, 'eps': 8, 'system': 7, 'interface': 2, 'time': 3, 'human': 0, 'computer': 1, 'response': 5, 'survey': 6, 'graph': 10}
hash dictionary:
{'trees': 23844, 'minors': 15001, 'user': 12736, 'eps': 31049, 'system': 5798, 'interface': 12466, 'time': 29104, 'human': 31002, 'computer': 10608, 'response': 5232, 'survey': 28591, 'graph': 18451}
```

2. Dictionaries

There are two dictionary implementations; both source files live in the corpora package. See the file structure in http://blog.csdn.net/ld326/article/details/78379449.

2.1 Dictionary

Essentially a list of <token id, term frequency> 2-tuples; the token ids are assigned as consecutive integer indices of a dict.

2.2 HashDictionary

Also essentially a list of <token id, term frequency> 2-tuples, but the token ids are computed by a user-specified hash function rather than consecutive indices.

3. Examining the dictionary classes

In fact, neither the Dictionary class nor the HashDictionary class takes much code to implement.

3.1 The Dictionary class

"""This module implements the concept of Dictionary -- a mapping between words and their integer ids.Dictionaries can be created from a corpus and can later be pruned according to document frequency (removing (un)common words via the :func:`Dictionary.filter_extremes` method), save/loaded from disk (via :func:`Dictionary.save` and :func:`Dictionary.load` methods), merged with other dictionary (:func:`Dictionary.merge_with`) etc."""

It implements the mapping between words and word ids, i.e. the concept of a dictionary. The main operations (sketched below) are:

Pruning: Dictionary.filter_extremes()
Saving: Dictionary.save()
Loading: Dictionary.load()
Merging: Dictionary.merge_with()
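
A minimal sketch of those four operations; the file path, thresholds and the second dictionary are illustrative, not from the original post:

```python
from gensim import corpora

texts = [['human', 'interface', 'computer'],
         ['eps', 'user', 'interface', 'system']]
dictionary = corpora.Dictionary(texts)

# Prune: keep only tokens appearing in at least 2 documents
# and in no more than 50% of all documents
dictionary.filter_extremes(no_below=2, no_above=0.5)

# Save / load round-trip
dictionary.save('/tmp/example.dict')
dictionary = corpora.Dictionary.load('/tmp/example.dict')

# Merge: absorb another dictionary's vocabulary into this one
other = corpora.Dictionary([['graph', 'minors', 'trees']])
dictionary.merge_with(other)
```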

3.2 The HashDictionary class
```python
"""
This module implements the "hashing trick" -- http://en.wikipedia.org/wiki/Hashing-Trick --
a mapping between words and their integer ids using a fixed, static mapping. The
static mapping has a constant memory footprint, regardless of the number of word-types
(features) in your corpus, so it's suitable for processing extremely large corpora.

The ids are computed as hash(word) % id_range, where hash is a user-configurable
function (adler32 by default). Using HashDictionary, new words can be represented
immediately, without an extra pass through the corpus to collect all the ids first.
This is another advantage: HashDictionary can be used with non-repeatable (once-only)
streams of documents.

A disadvantage of HashDictionary is that, unlike plain :class:`Dictionary`, several
words may map to the same id, causing hash collisions. The word<->id mapping is no
longer a bijection.
"""
```
Comparing the Dictionary class with the HashDictionary class (summarizing the two docstrings above; the original post showed this as a figure):

- Dictionary: ids are assigned incrementally as new words are seen, the word <-> id mapping is a bijection, and memory grows with the vocabulary size.
- HashDictionary: ids are computed as hash(word) % id_range, several words may collide on one id, the memory footprint is constant, and new words can be represented immediately, even on once-only document streams.
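
The non-bijection is easy to observe in practice. A small sketch (the tiny id_range is deliberate: with 12 unique tokens and only 8 possible ids, collisions are guaranteed):

```python
from gensim import corpora

texts = [['human', 'interface', 'computer'],
         ['survey', 'user', 'computer', 'system', 'response', 'time'],
         ['graph', 'minors', 'trees', 'eps']]

# 12 unique tokens squeezed into 8 possible ids
small = corpora.HashDictionary(texts, id_range=8)

# With debug=True (the default), id2token records every token mapped to each id;
# any set holding more than one token is a hash collision.
for h, tokens in sorted(small.id2token.items()):
    if len(tokens) > 1:
        print(h, tokens)
```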

4. The __init__() method

Dictionary class:

```python
def __init__(self, documents=None, prune_at=2000000)
```

HashDictionary class:

```python
def __init__(self, documents=None, id_range=32000, myhash=zlib.adler32, debug=True)
```

Compared with the former, the latter additionally takes a hash function (myhash), which you can specify yourself.
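
Swapping in a different hash is just a constructor argument. A sketch, with zlib.crc32 standing in for any user-supplied function mapping bytes to an integer:

```python
import zlib
from gensim import corpora

texts = [['human', 'interface', 'computer']]

# Default: adler32 with ids in [0, 32000)
d1 = corpora.HashDictionary(texts)
print(d1.token2id)

# Custom: crc32 with a smaller id range
d2 = corpora.HashDictionary(texts, id_range=1024, myhash=zlib.crc32)
print(d2.token2id)
```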

5. The core function: doc2bow

Both classes expose it with the same signature and docstring:

```python
def doc2bow(self, document, allow_update=False, return_missing=False):
    """
    Convert `document` (a list of words) into the bag-of-words format = list
    of `(token_id, token_count)` 2-tuples. Each word is assumed to be a
    **tokenized and normalized** string (either unicode or utf8-encoded).
    No further preprocessing is done on the words in `document`; apply
    tokenization, stemming etc. before calling this method.

    If `allow_update` is set, then also update dictionary in the process:
    create ids for new words. At the same time, update document frequencies --
    for each word appearing in this document, increase its document frequency
    (`self.dfs`) by one.

    If `allow_update` is **not** set, this function is `const`, aka read-only.
    """
```

It converts `document` (a list of words) into the bag-of-words form, i.e. a list of `(token_id, token_count)` 2-tuples. Below is the HashDictionary version, annotated:

```python
# Purpose: convert a document into its bag-of-words representation
def doc2bow(self, document, allow_update=False, return_missing=False):
    # Input:  document -- a list of word strings;
    #         allow_update -- whether to update the dictionary;
    #         return_missing -- whether to also return unmatched words.
    # Output: result, the bag-of-words form of document; if return_missing
    #         is True, the missing dict is returned as well.
    result = {}
    missing = {}
    # Sort the document as a preprocessing step before counting words
    document = sorted(document)  # convert the input to plain list (needed below)
    # Count how many times each word appears in the input document
    for word_norm, group in itertools.groupby(document):
        frequency = len(list(group))  # how many times does this word appear in the input document
        # Map the word to its hash-based id
        tokenid = self.restricted_hash(word_norm)
        # Update the term frequency for this id
        result[tokenid] = result.get(tokenid, 0) + frequency
        if self.debug:
            # increment document count for each unique token that appeared in the document
            self.dfs_debug[word_norm] = self.dfs_debug.get(word_norm, 0) + 1

    # Update the dictionary statistics
    if allow_update or self.allow_update:
        # total number of documents seen
        self.num_docs += 1
        # total number of processed words (with repetitions)
        self.num_pos += len(document)
        # total number of ids counted per document (deduplicated within each document)
        self.num_nnz += len(result)
        if self.debug:
            # increment document count for each unique tokenid that appeared in the document
            # done here, because several words may map to the same tokenid
            for tokenid in iterkeys(result):
                self.dfs[tokenid] = self.dfs.get(tokenid, 0) + 1

    # return tokenids, in ascending id order
    result = sorted(iteritems(result))
    if return_missing:
        return result, missing
    else:
        return result
```
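
Continuing from the snippet in section 1, the behaviour can be checked directly; the expected outputs below (shown as comments) follow from the token2id mappings printed there:

```python
# Dictionary: 'human' -> 0, 'computer' -> 1
print(dictionary.doc2bow(['human', 'computer', 'computer']))
# [(0, 1), (1, 2)]

# Unknown words are silently dropped unless return_missing is set
print(dictionary.doc2bow(['human', 'foo'], return_missing=True))
# ([(0, 1)], {'foo': 1})

# HashDictionary: same call, but the ids come from the hash function
print(hashDic.doc2bow(['human', 'computer']))
# [(10608, 1), (31002, 1)]
```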

In addition, it relies on a hashing helper:

```python
# Compute the hash-based id; myhash is assigned in the constructor,
# so you can define and choose a suitable function yourself.
def restricted_hash(self, token):
    """
    Calculate id of the given token. Also keep track of what words were mapped
    to what ids, for debugging reasons.
    """
    h = self.myhash(utils.to_utf8(token)) % self.id_range
    if self.debug:
        self.token2id[token] = h
        self.id2token.setdefault(h, set()).add(token)
    return h
```
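
The ids in the section 1 output can be reproduced by hand with the defaults (adler32 and id_range=32000):

```python
import zlib

# hash(word) % id_range, using the default hash and range
print(zlib.adler32(b'trees') % 32000)  # 23844, matching hashDic.token2id['trees']
```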

6. Postscript

The dictionary implementation involves no particularly complex algorithms; the code is short and quite readable.

[Author: happyprince, http://blog.csdn.net/ld326/article/details/78386012]
