NLP06-Gensim Source Code Walkthrough [Dictionary]
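Abstract: This post analyzes the dictionary source code in Gensim. It covers the Dictionary class and the HashDictionary class, compares their similarities and differences, and focuses on the doc2bow function, reading through it line by line.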
0. Prerequisites
NLP05-Gensim Source Code [Packages and Interfaces]: http://blog.csdn.net/ld326/article/details/78379449
1. Example
from gensim import corpora

texts = [['human', 'interface', 'computer'],
         ['survey', 'user', 'computer', 'system', 'response', 'time'],
         ['eps', 'user', 'interface', 'system'],
         ['system', 'human', 'system', 'eps'],
         ['user', 'response', 'time'],
         ['trees'],
         ['graph', 'trees'],
         ['graph', 'minors', 'trees'],
         ['graph', 'minors', 'survey']]

print('dictionary:')
dictionary = corpora.Dictionary(texts)
print(dictionary.token2id)

print('hash dictionary:')
hashDic = corpora.HashDictionary(texts)
print(hashDic.token2id)

Output:
dictionary:
{'trees': 9, 'minors': 11, 'user': 4, 'eps': 8, 'system': 7, 'interface': 2, 'time': 3, 'human': 0, 'computer': 1, 'response': 5, 'survey': 6, 'graph': 10}
hash dictionary:
{'trees': 23844, 'minors': 15001, 'user': 12736, 'eps': 31049, 'system': 5798, 'interface': 12466, 'time': 29104, 'human': 31002, 'computer': 10608, 'response': 5232, 'survey': 28591, 'graph': 18451}
2. Dictionaries
There are two dictionary implementations. Both source files live in the corpora package; see the file layout described in http://blog.csdn.net/ld326/article/details/78379449.
2.1 Dictionary
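Its main output is a list of <token id, token count> 2-tuples. Token ids are assigned sequentially: each new token receives the next index in the token2id dict.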
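A minimal sketch of sequential id assignment, assuming each unseen token simply receives the current size of the mapping as its id. This is plain Python for illustration, not the gensim code; `assign_ids` is a hypothetical helper name, and visiting tokens in sorted order within a document is an assumption of this sketch.

```python
def assign_ids(documents):
    """Map each distinct token to the next free integer id (hypothetical helper)."""
    token2id = {}
    for document in documents:
        # Visit tokens in sorted order so the ids do not depend on the
        # order of tokens inside a document (an assumption of this sketch).
        for token in sorted(set(document)):
            if token not in token2id:
                token2id[token] = len(token2id)  # next free index
    return token2id


texts = [['human', 'interface', 'computer'], ['user', 'computer', 'system']]
token2id = assign_ids(texts)
print(token2id)
```

Because ids are handed out in order of first appearance, the ids always form the dense range 0..n-1, but the whole mapping has to be kept in memory.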
2.2 HashDictionary
Its main output is also a list of <token id, token count> 2-tuples, but token ids are computed by applying a specified hash function to the token.
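The hashed id can be reproduced with the standard library alone. The sketch below assumes the formula adler32(utf-8 bytes of the token) % id_range with id_range=32000; `hashed_id` is a hypothetical helper, not a gensim function.

```python
import zlib


def hashed_id(token, id_range=32000, myhash=zlib.adler32):
    """Return myhash(utf-8 bytes of token) % id_range (hypothetical helper)."""
    return myhash(token.encode('utf-8')) % id_range


for word in ['human', 'trees', 'graph']:
    print(word, hashed_id(word))
```

Under these assumptions the values agree with the hash dictionary output in section 1, for example 31002 for 'human' and 23844 for 'trees'.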
3. Examining the dictionaries
Neither the Dictionary class nor the HashDictionary class takes much code to implement.
3.1 The Dictionary class
"""
This module implements the concept of Dictionary -- a mapping between words and
their integer ids.

Dictionaries can be created from a corpus and can later be pruned according to
document frequency (removing (un)common words via the :func:`Dictionary.filter_extremes` method),
save/loaded from disk (via :func:`Dictionary.save` and :func:`Dictionary.load` methods), merged
with other dictionary (:func:`Dictionary.merge_with`) etc.
"""
It implements the dictionary concept, that is, a mapping between words and word ids.
Prune: Dictionary.filter_extremes()
Save: Dictionary.save()
Load: Dictionary.load()
Merge: Dictionary.merge_with()
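To make "pruned according to document frequency" concrete, here is a rough sketch of the idea in plain Python. It is not the gensim implementation: `prune_by_doc_freq` is a hypothetical helper, the argument names no_below and no_above are borrowed from filter_extremes, and the default values are chosen only for the toy data.

```python
def prune_by_doc_freq(documents, no_below=2, no_above=0.5):
    """Keep tokens found in at least `no_below` documents and in at most
    a fraction `no_above` of all documents (illustrative sketch)."""
    dfs = {}
    for document in documents:
        for token in set(document):  # count each token once per document
            dfs[token] = dfs.get(token, 0) + 1
    max_docs = no_above * len(documents)
    return sorted(t for t, df in dfs.items() if no_below <= df <= max_docs)


texts = [['graph', 'trees'], ['graph', 'minors', 'trees'],
         ['graph', 'minors', 'survey'], ['graph', 'user']]
kept = prune_by_doc_freq(texts, no_below=2, no_above=0.5)
print(kept)
```

Here 'graph' is dropped for being too common (it appears in every document), while 'survey' and 'user' are dropped for being too rare.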
3.2 The HashDictionary class
"""
This module implements the "hashing trick" <http://en.wikipedia.org/wiki/Hashing-Trick>_ --
a mapping between words and their integer ids using a fixed, static mapping. The
static mapping has a constant memory footprint, regardless of the number of word-types (features)
in your corpus, so it's suitable for processing extremely large corpora.

The ids are computed as `hash(word) % id_range`, where `hash` is a user-configurable
function (adler32 by default). Using HashDictionary, new words can be represented immediately,
without an extra pass through the corpus to collect all the ids first. This is another
advantage: HashDictionary can be used with non-repeatable (once-only) streams of documents.

A disadvantage of HashDictionary is that, unlike plain :class:`Dictionary`, several words may map
to the same id, causing hash collisions. The word<->id mapping is no longer a bijection.
"""
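The collision behaviour described in the docstring is easy to observe by shrinking id_range. The sketch below is plain Python, not gensim code; `group_by_hashed_id` is a hypothetical helper, and id_range=8 is chosen deliberately small so that the 12 example words cannot all receive distinct ids.

```python
import zlib
from collections import defaultdict


def group_by_hashed_id(tokens, id_range):
    """Group tokens by adler32(token) % id_range (illustrative sketch)."""
    buckets = defaultdict(set)
    for token in tokens:
        buckets[zlib.adler32(token.encode('utf-8')) % id_range].add(token)
    return buckets


words = ['human', 'interface', 'computer', 'survey', 'user', 'system',
         'response', 'time', 'eps', 'trees', 'graph', 'minors']
buckets = group_by_hashed_id(words, id_range=8)
# Ids shared by more than one word are hash collisions.
collisions = {i: sorted(ws) for i, ws in buckets.items() if len(ws) > 1}
print(len(buckets), 'ids for', len(words), 'words')
print(collisions)
```

The number of distinct ids can never exceed id_range, which is where the constant memory footprint comes from; the price is that an id can no longer be mapped back to a single word.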
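Comparing the Dictionary class with the HashDictionary class:
4. The __init__() method
Dictionary class
def __init__(self, documents=None, prune_at=2000000)
HashDictionary class
def __init__(self, documents=None, id_range=32000, myhash=zlib.adler32, debug=True)
Compared with the first signature, the second adds a hash function parameter, myhash, which you can specify yourself. It also takes id_range and debug in place of prune_at.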
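As an illustration of swapping the hash function, the sketch below uses a small stand-in class that only mimics the id_range and myhash parameters. `TinyHashDict` and `byte_sum` are hypothetical names, not part of gensim, and the sketch assumes that any callable taking bytes and returning a non-negative integer can serve as myhash.

```python
import zlib


def byte_sum(data):
    """Hypothetical toy hash: the sum of the byte values."""
    return sum(data)


class TinyHashDict:
    """Stand-in that only mimics the id_range and myhash parameters."""

    def __init__(self, id_range=32000, myhash=zlib.adler32):
        self.id_range = id_range
        self.myhash = myhash

    def restricted_hash(self, token):
        # Same formula as in the docstring above: hash(word) % id_range.
        return self.myhash(token.encode('utf-8')) % self.id_range


default_id = TinyHashDict().restricted_hash('graph')
crc_id = TinyHashDict(myhash=zlib.crc32).restricted_hash('graph')
toy_id = TinyHashDict(myhash=byte_sum).restricted_hash('graph')
print(default_id, crc_id, toy_id)
```

A weak hash such as byte_sum maps all anagrams to the same id, so the choice of function directly affects how many collisions occur.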
5. The core function doc2bow
The signature is the same in both classes:
def doc2bow(self, document, allow_update=False, return_missing=False)
"""
Convert `document` (a list of words) into the bag-of-words format = list
of `(token_id, token_count)` 2-tuples. Each word is assumed to be a
**tokenized and normalized** string (either unicode or utf8-encoded). No further preprocessing
is done on the words in `document`; apply tokenization, stemming etc. before
calling this method.

If `allow_update` is set, then also update dictionary in the process: create
ids for new words. At the same time, update document frequencies -- for
each word appearing in this document, increase its document frequency (`self.dfs`)
by one.

If `allow_update` is **not** set, this function is `const`, aka read-only.
"""
It converts document into bag-of-words form, i.e. a list of (token_id, token_count) 2-tuples. The annotated listing that follows is the HashDictionary version, as its call to restricted_hash shows.
# Purpose: convert a document to its bag-of-words representation
def doc2bow(self, document, allow_update=False, return_missing=False):
    # Input: document -- a list of tokens; allow_update -- whether to update the dictionary;
    #        return_missing -- whether to also return the tokens that were not matched
    # Output: result, the bag-of-words form of document, as a list of
    #         (token_id, count) tuples sorted by id. If return_missing is True,
    #         the missing variable is returned as well.
    result = {}
    missing = {}
    # Sort the document as a preprocessing step before counting tokens
    document = sorted(document)  # convert the input to plain list (needed below)
    # Count how many times each token occurs in the input document
    for word_norm, group in itertools.groupby(document):
        frequency = len(list(group))  # how many times does this word appear in the input document
        # Convert the token to its hashed id
        tokenid = self.restricted_hash(word_norm)
        # Update the token count
        result[tokenid] = result.get(tokenid, 0) + frequency
        if self.debug:
            # increment document count for each unique token that appeared in the document
            self.dfs_debug[word_norm] = self.dfs_debug.get(word_norm, 0) + 1

    # Update the dictionary statistics
    if allow_update or self.allow_update:
        # Number of documents seen by the dictionary
        self.num_docs += 1
        # Total number of tokens over all documents (duplicates included)
        self.num_pos += len(document)
        # Total number of tokens over all documents (distinct ids per document)
        self.num_nnz += len(result)
        if self.debug:
            # increment document count for each unique tokenid that appeared in the document
            # done here, because several words may map to the same tokenid
            for tokenid in iterkeys(result):
                self.dfs[tokenid] = self.dfs.get(tokenid, 0) + 1

    # return tokenids, in ascending id order
    # Sort by id before returning
    result = sorted(iteritems(result))
    if return_missing:
        return result, missing
    else:
        return result
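In the listing above, missing is never filled, since every token hashes to some id. For contrast, here is a rough sketch of the same sort-then-groupby counting over a plain sequential-id mapping, where unknown tokens can actually be reported. It is an illustration under my own assumptions, not the gensim Dictionary code; `doc2bow_sketch` is a hypothetical name.

```python
import itertools


def doc2bow_sketch(token2id, document, allow_update=False, return_missing=False):
    """Toy bag-of-words conversion over a plain dict (not the gensim code)."""
    # Sort first so that equal tokens are adjacent, as groupby requires.
    counts = {token: len(list(group))
              for token, group in itertools.groupby(sorted(document))}
    missing = {}
    for token in sorted(counts):
        if token not in token2id:
            if allow_update:
                token2id[token] = len(token2id)  # assign the next free id
            else:
                missing[token] = counts[token]   # report it, leave the dict alone
    # Only tokens with a known id end up in the bag-of-words result.
    bow = sorted((token2id[t], n) for t, n in counts.items() if t in token2id)
    return (bow, missing) if return_missing else bow


doc = ['human', 'computer', 'human', 'graph']
read_only = doc2bow_sketch({'human': 0, 'computer': 1}, doc, return_missing=True)
print(read_only)

growing = {'human': 0, 'computer': 1}
updated = doc2bow_sketch(growing, doc, allow_update=True)
print(updated, growing)
```

With allow_update left at False the mapping is untouched and 'graph' is returned as missing; with allow_update=True it receives the next id and appears in the result.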
In addition, the code above relies on a hash function:
# Compute the hashed id. The myhash function is assigned in the constructor,
# so you can define or choose a suitable function yourself.
def restricted_hash(self, token):
    """
    Calculate id of the given token. Also keep track of what words were mapped
    to what ids, for debugging reasons.
    """
    h = self.myhash(utils.to_utf8(token)) % self.id_range
    if self.debug:
        self.token2id[token] = h
        self.id2token.setdefault(h, set()).add(token)
    return h
6. Afterword
The dictionary implementation involves no complicated algorithms and not much code, so it is fairly easy to read.
[Author: happyprince, http://blog.csdn.net/ld326/article/details/78386012]