NLP05 - Gensim Source Code [Packages and Interfaces]



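Summary: A high-level look at the file structure and interfaces of the gensim package, to get an intuitive feel for what the gensim source contains. This is the first step toward understanding the Gensim source. It covers the file structure, the core interfaces, the Corpora module, the Models module, the Similarity module, the Parsing module, scripts, sklearn integration, summarization and keywords, unit tests, and topic coherence.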

0. File structure

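Opening the gensim package shows the following directory structure:
[Figure: directory listing of the gensim package]
The package is divided into modules for corpora, models, and so on. In addition, interfaces.py holds the core interfaces, matutils.py the math utilities, and utils.py the shared helper functions. nosy.py is not important; it watches .py files for modifications.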

1. Gensim core interfaces [interfaces.py]


1.1 CorpusABC

Interface (abstract base class) for corpora. A corpus is simply an iterable, where each iteration step yields one document:

>>> for doc in corpus:
>>>     # do something with the doc...

A document is a sequence of (fieldId, fieldValue) 2-tuples:

>>> for attr_id, attr_value in doc:
>>>     # do something with the attribute
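
As a rough illustration of this contract, the sketch below builds a corpus-like iterable in plain Python. The class name `TinyCorpus` and the sample sentences are hypothetical, and the class does not subclass gensim's CorpusABC; it only mimics the shape of what iteration yields.

```python
from collections import Counter


class TinyCorpus:
    """Hypothetical stand-in for a corpus: iterating yields one document at a time."""

    def __init__(self, texts):
        self.texts = texts
        # Give each distinct token an integer id, in order of first appearance.
        self.token2id = {}
        for text in texts:
            for token in text.split():
                self.token2id.setdefault(token, len(self.token2id))

    def __iter__(self):
        for text in self.texts:
            counts = Counter(self.token2id[token] for token in text.split())
            # A document is a sorted list of (field_id, field_value) 2-tuples.
            yield sorted(counts.items())

    def __len__(self):
        return len(self.texts)


corpus = TinyCorpus(["human computer interface", "computer system computer"])
docs = list(corpus)
```

Because `__iter__` starts from the stored texts each time, the corpus can be walked more than once, which is what lets a streamed corpus be passed through several processing steps.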

1.2 SimilarityABC

Abstract interface for similarity searches over a corpus.
In all instances, there is a corpus against which we want to perform the similarity search.
For each similarity search, the input is a document and the output is its similarities to the individual corpus documents.
Similarity queries are realized by calling self[query_document].
There is also a convenience wrapper, where iterating over self yields the similarities of each document in the corpus against the whole corpus (i.e., the query is each corpus document in turn).
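
The two access patterns described here (bracket lookup with a query, and iteration over the index itself) can be sketched in plain Python. `TinySimilarity` and `cosine` are hypothetical names, cosine similarity is an assumed choice of measure, and nothing here is taken from gensim's own classes.

```python
import math


def cosine(doc_a, doc_b):
    """Cosine similarity between two sparse documents given as (id, value) lists."""
    a, b = dict(doc_a), dict(doc_b)
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class TinySimilarity:
    """Hypothetical index: self[query] returns one similarity per corpus document."""

    def __init__(self, corpus):
        self.corpus = list(corpus)

    def __getitem__(self, query):
        return [cosine(query, doc) for doc in self.corpus]

    def __iter__(self):
        # Convenience wrapper: each corpus document is used as the query in turn.
        for doc in self.corpus:
            yield self[doc]


index = TinySimilarity([[(0, 1.0), (1, 1.0)], [(1, 2.0)], [(2, 3.0)]])
sims = index[[(1, 1.0)]]   # similarities of one query to every indexed document
matrix = list(index)       # every indexed document queried against the index
```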

1.3 TransformationABC

Interface for transformations. A 'transformation' is any object which accepts a sparse document via the dictionary notation [] and returns another sparse document in its stead.
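
A minimal sketch of the bracket convention, assuming a transformation that rescales a document to unit Euclidean length. `UnitLengthTransform` is a hypothetical class written for this illustration and is not part of gensim.

```python
import math


class UnitLengthTransform:
    """Hypothetical transformation: scales a sparse document to unit Euclidean length."""

    def __getitem__(self, doc):
        norm = math.sqrt(sum(value * value for _, value in doc))
        if norm == 0.0:
            return []
        return [(field_id, value / norm) for field_id, value in doc]


transform = UnitLengthTransform()
unit_doc = transform[[(0, 3.0), (4, 4.0)]]   # sparse document in, sparse document out
```

Using `[]` instead of a named method means transformations can be chained by nesting the brackets, with the output of one serving as the input of the next.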

2. Corpora module

This package contains implementations of various streaming corpus I/O formats.
In the class hierarchy, each subclass can be viewed as one storage format for a corpus:
[Figure: class hierarchy of the corpora package]

3. Models module

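This package contains algorithms for extracting document representations from their raw bag-of-word counts.
File structure of the models package:
[Figure: files in the models package]
Inheritance relationships between the classes:
[Figure: class hierarchy of the models package]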

4. Similarity module

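This package contains implementations of pairwise similarity queries.
It has only two files: docsim.py and index.py.
The classes in docsim.py all inherit from the SimilarityABC interface.
Class diagram of the Similarity module:
[Figure: class diagram of the Similarity module]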

5. Parsing module

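This package contains functions to preprocess raw text.
It contains two files:
preprocessing.py: document preprocessing, such as stop-word removal and case normalization.
porter.py: the Porter Stemming Algorithm, from the paper
Porter, 1980, An algorithm for suffix stripping, Program, Vol. 14, no. 3, pp 130-137.
More on the algorithm: http://www.tartarus.org/~martin/PorterStemmer
Stemming strips endings such as plurals and verb suffixes to reduce a word to its stem, which need not be a dictionary word (see "poni" below). For example: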

"""Get rid of plurals and -ed or -ing. E.g.,
   caresses  ->  caress
   ponies    ->  poni
   ties      ->  ti
   caress    ->  caress
   cats      ->  cat
   feed      ->  feed
   agreed    ->  agree
   disabled  ->  disable
   matting   ->  mat
   mating    ->  mate
   meeting   ->  meet
   milling   ->  mill
   messing   ->  mess
   meetings  ->  meet
"""
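
The plural half of these examples follows four suffix rules from Porter's paper (step 1a), tried longest suffix first. The sketch below covers only those rules; the function name `strip_plural` is hypothetical, the code is not copied from porter.py, and the -ed and -ing cases are left out because they need extra checks on the remaining stem.

```python
def strip_plural(word):
    """Apply Porter's plural rules (step 1a), longest matching suffix first."""
    if word.endswith("sses"):
        return word[:-2]   # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]   # ponies -> poni
    if word.endswith("ss"):
        return word        # caress -> caress
    if word.endswith("s"):
        return word[:-1]   # cats -> cat
    return word


examples = ["caresses", "ponies", "ties", "caress", "cats"]
stems = [strip_plural(word) for word in examples]
```

The order of the checks matters: testing for a bare "s" first would turn "caress" into "cares".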

6. scripts

This is a collection of scripts that make processing and format conversion convenient, for example:

glove2word2vec.py converts the GloVe vectors format to the word2vec text format:

USAGE: $ python -m gensim.scripts.glove2word2vec --input <GloVe vector file> --output <Word2vec vector file>
Where:
    <GloVe vector file>: Input GloVe .txt file
    <Word2vec vector file>: Desired name of output Word2vec .txt file

word2vec2tensor converts a word2vec model to tensor form:

USAGE: $ python -m gensim.scripts.word2vec2tensor --input <Word2Vec model file> --output <TSV tensor filename prefix> [--binary] <Word2Vec binary flag>
Where:
    <Word2Vec model file>: Input Word2Vec model
    <TSV tensor filename prefix>: 2D tensor TSV output file name prefix
    <Word2Vec binary flag>: Set True if Word2Vec model is binary. Defaults to False.
Output:
    The script will create two TSV files. A 2d tensor format file, and a Word Embedding metadata file. Both files will
    use the --output file name as prefix
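
To show what the GloVe conversion amounts to: the word2vec text format starts with a header line giving the number of vectors and their dimension, which a GloVe .txt file lacks. The sketch below adds that header in plain Python. The function name `glove_to_word2vec_text` is hypothetical, the whole file is read into memory for simplicity, and this is an illustration of the idea rather than the script's actual code.

```python
def glove_to_word2vec_text(glove_path, word2vec_path):
    """Copy GloVe vectors to a new file, prefixed with a '<count> <dim>' header."""
    with open(glove_path, encoding="utf-8") as src:
        lines = [line for line in src if line.strip()]
    num_vectors = len(lines)
    # Each line is '<word> <v1> ... <vd>', so the dimension is the field count minus one.
    dim = len(lines[0].split()) - 1 if lines else 0
    with open(word2vec_path, "w", encoding="utf-8") as dst:
        dst.write(f"{num_vectors} {dim}\n")
        dst.writelines(lines)
    return num_vectors, dim
```

For real vector files, which can be very large, counting the lines in one pass and copying them in a second would avoid holding everything in memory.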

7. sklearn integration

scikit-learn wrappers for gensim: SklearnWrapperLdaModel and SklearnWrapperLsiModel.

8. summarization

8.1 Keywords

def keywords(text, ratio=0.2, words=None, split=False, scores=False, pos_filter=['NN', 'JJ'], lemmatize=False, deacc=True)
Keyword extraction is computed on a graph.

8.2 Summarization

def summarize(text, ratio=0.2, word_count=None, split=False)
This mainly uses the TextRank algorithm, which is also computed on a graph.
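
To give a feel for the graph computation, here is a minimal TextRank-style sketch: PageRank power iteration over an undirected weighted graph. The function name `textrank_scores`, the damping factor 0.85, the fixed iteration count, and the toy graph are all assumptions made for this illustration, and the code is not gensim's implementation. It uses the normalized form of the update, so the scores sum to 1.

```python
def textrank_scores(edges, damping=0.85, iterations=50):
    """Score nodes of an undirected weighted graph with PageRank-style power iteration."""
    neighbours = {}
    for (a, b), weight in edges.items():
        neighbours.setdefault(a, {})[b] = weight
        neighbours.setdefault(b, {})[a] = weight
    scores = {node: 1.0 / len(neighbours) for node in neighbours}
    for _ in range(iterations):
        updated = {}
        for node in neighbours:
            # Each neighbour passes on its score in proportion to the edge weight.
            incoming = sum(
                scores[other] * weight / sum(neighbours[other].values())
                for other, weight in neighbours[node].items()
            )
            updated[node] = (1.0 - damping) / len(neighbours) + damping * incoming
        scores = updated
    return scores


# Hypothetical word co-occurrence graph; "topic" is linked to every other word.
edges = {("topic", "model"): 1.0, ("topic", "corpus"): 1.0, ("topic", "vector"): 1.0}
scores = textrank_scores(edges)
best = max(scores, key=scores.get)
```

For keywords the nodes would be words, and for summaries they would be sentences; in both cases the highest-scoring nodes are the ones kept.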

8.3 Related data structures and algorithms

BM25 [bm25.py]
TextRank algorithm
Graph [commons.py, graph.py]
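
Of the items listed above, BM25 is the most self-contained, so a short sketch of its scoring formula follows. The function name `bm25_scores`, the parameter values k1=1.5 and b=0.75, and the "+1" form of the IDF term are assumptions chosen for this illustration; they are not read from bm25.py.

```python
import math
from collections import Counter


def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Score each tokenised document against the query with Okapi BM25."""
    num_docs = len(documents)
    avg_len = sum(len(doc) for doc in documents) / num_docs
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    scores = []
    for doc in documents:
        term_freq = Counter(doc)
        length_norm = k1 * (1.0 - b + b * len(doc) / avg_len)
        score = 0.0
        for term in query:
            if term not in term_freq:
                continue
            n = doc_freq[term]
            # Assumption: the "+1" IDF variant, which keeps every weight positive.
            idf = math.log(1.0 + (num_docs - n + 0.5) / (n + 0.5))
            tf = term_freq[term]
            score += idf * tf * (k1 + 1.0) / (tf + length_norm)
        scores.append(score)
    return scores


documents = [["topic", "model", "topic"], ["word", "vector"], ["topic", "vector", "corpus"]]
scores = bm25_scores(["topic"], documents)
```

In a summarizer, scores of this kind can serve as edge weights between sentences, with each sentence treated as a query against the others.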

9. Unit tests

10. Topic coherence

Topic models have evaluation measures of their own. Related material on this:

What is Topic Coherence?
https://rare-technologies.com/what-is-topic-coherence/

Exploring the Space of Topic Coherence Measures
http://svn.aksw.org/papers/2015/WSDM_Topic_Evaluation/public.pdf

Evaluating topic coherence measures
https://mimno.infosci.cornell.edu/nips2013ws/nips2013tm_submission_7.pdf

Topic Coherence To Evaluate Topic Models
http://qpleple.com/topic-coherence-to-evaluate-topic-models/

A demonstration of topic coherence:
https://nbviewer.jupyter.org/github/dsquareindia/gensim/blob/280375fe14adea67ce6384ba7eabf362b05e6029/docs/notebooks/topic_coherence_tutorial.ipynb

Topic mining and classification based on semantic coherence
http://blog.csdn.net/shirdrn/article/details/7076505

[Author: happyprince, http://blog.csdn.net/ld326/article/details/78379449]
