NLP07 - A Brief Look at the Gensim Source Code [MmCorpus & SvmLightCorpus]
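Abstract: This post reads through the source code of MmCorpus and SvmLightCorpus to see in what form a corpus is saved, and introduces the matrix storage formats involved along the way.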
1. MmCorpus
1.1 Introduction to MM
MM is a storage format for matrices: the Matrix Market File Format.
《Text File Formats》
http://math.nist.gov/MatrixMarket/formats.html
《The Matrix Market File Format》
http://people.sc.fsu.edu/~jburkardt/data/mm/mm.html
Characteristics of an MM file:
● ASCII format;
● allow comment lines, which begin with a percent sign;
● use a “coordinate” format for sparse matrices;
● use an “array” format for general dense matrices;
A file in the Matrix Market format comprises four parts:
1. Header line: contains an identifier, and four text fields;
2. Comment lines: allow a user to store information and comments;
3. Size line: specifies the number of rows and columns, and the number of nonzero elements;
4. Data lines: specify the location of the matrix entries (implicitly or explicitly) and their values.
Coordinate Format - sparse matrices;
Array Format - dense matrices;
The same matrix can be converted from one format to the other.
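As a minimal sketch of that conversion, the pure-Python helpers below write one small matrix in both formats and read the coordinate form back. The function names (`dense_to_coordinate`, `dense_to_array`, `coordinate_to_dense`) and the 2x3 sample matrix are assumptions made for illustration; they are not part of gensim or of the Matrix Market tools. The sketch only handles the "real general" case.

```python
def dense_to_coordinate(matrix):
    """Return Matrix Market 'coordinate' text for a dense list-of-rows matrix."""
    rows = len(matrix)
    cols = len(matrix[0]) if rows else 0
    # Only nonzero entries are listed; MM indices start from 1.
    entries = [(i + 1, j + 1, v)
               for i, row in enumerate(matrix)
               for j, v in enumerate(row) if v != 0]
    lines = ["%%MatrixMarket matrix coordinate real general",
             "%d %d %d" % (rows, cols, len(entries))]
    lines += ["%d %d %s" % e for e in entries]
    return "\n".join(lines) + "\n"


def dense_to_array(matrix):
    """Return Matrix Market 'array' text: every entry, column by column."""
    rows = len(matrix)
    cols = len(matrix[0]) if rows else 0
    lines = ["%%MatrixMarket matrix array real general",
             "%d %d" % (rows, cols)]
    for j in range(cols):
        for i in range(rows):
            lines.append("%s" % matrix[i][j])
    return "\n".join(lines) + "\n"


def coordinate_to_dense(text):
    """Rebuild the dense matrix from 'coordinate' text."""
    # Skip the header and comment lines (they start with a percent sign).
    lines = [ln for ln in text.splitlines()
             if ln.strip() and not ln.startswith("%")]
    rows, cols, _nnz = (int(x) for x in lines[0].split())
    dense = [[0.0] * cols for _ in range(rows)]
    for ln in lines[1:]:
        i, j, v = ln.split()
        dense[int(i) - 1][int(j) - 1] = float(v)
    return dense


sample = [[1.0, 0.0, 0.0],
          [0.0, 0.0, 2.5]]
print(dense_to_coordinate(sample))
print(dense_to_array(sample))
```

The coordinate text lists only the two nonzero entries, while the array text lists all six values, which is why the coordinate form suits sparse data such as bag-of-words corpora.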
1.2 gensim example
Demo:
from gensim import corpora

texts = [['human', 'interface', 'computer'],
         ['survey', 'user', 'computer', 'system', 'response', 'time'],
         ['eps', 'user', 'interface', 'system'],
         ['system', 'human', 'system', 'eps'],
         ['user', 'response', 'time'],
         ['trees'],
         ['graph', 'trees'],
         ['graph', 'minors', 'trees'],
         ['graph', 'minors', 'survey']]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
corpora.MmCorpus.serialize('tmp/deerwester.mm', corpus)
The file written by gensim stores a 9x12 matrix (9 documents by 12 terms) with 28 nonzero entries in total.
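To make the size line concrete, here is a rough sketch that parses coordinate-format text back into bag-of-words documents and computes the matrix density. `read_mm_as_bow` and `density_percent` are hypothetical helpers written for this note, and the sample text is a made-up 2x3 matrix, not the actual contents of deerwester.mm.

```python
def read_mm_as_bow(text):
    """Parse MM coordinate text into (num_docs, num_terms, num_nnz, docs)."""
    lines = [ln for ln in text.splitlines()
             if ln.strip() and not ln.startswith("%")]
    # The first non-comment line is the size line: rows, columns, nonzeros.
    num_docs, num_terms, num_nnz = (int(x) for x in lines[0].split())
    docs = [[] for _ in range(num_docs)]
    for ln in lines[1:]:
        docno, termid, weight = ln.split()
        # Convert the 1-based MM indices back to 0-based ids.
        docs[int(docno) - 1].append((int(termid) - 1, float(weight)))
    return num_docs, num_terms, num_nnz, docs


def density_percent(num_docs, num_terms, num_nnz):
    """Share of matrix cells that are nonzero, as a percentage."""
    return 100.0 * num_nnz / (num_docs * num_terms)


sample = """%%MatrixMarket matrix coordinate real general
2 3 3
1 1 1
1 3 2
2 2 1
"""
print(read_mm_as_bow(sample))
print("%.3f%%" % density_percent(9, 12, 28))
```

For the 9x12 matrix with 28 nonzero entries, the density is 28 / 108, a little under 26%.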
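Tracing the calls in the source code:
MmCorpus is a subclass of IndexedCorpus. Saving an MmCorpus is mainly done by MmCorpus calling MmWriter. This can be viewed as converting a two-dimensional array into coordinate form, that is, saving it as a sparse matrix.
MmCorpus implements save_corpus(), which calls MmWriter.write_corpus; the latter is a static method.
The MmWriter.write_corpus method: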
# Write the data to disk in MM format [static method of MmWriter]
def write_corpus(fname, corpus, progress_cnt=1000, index=False, num_terms=None, metadata=False):
    """
    Save the vector space representation of an entire corpus to disk.
    """
    # create the MmWriter object
    mw = MmWriter(fname)

    # write empty headers to the file (with enough space to be overwritten later)
    # this writes 50 spaces followed by a newline, leaving a placeholder line
    mw.write_headers(-1, -1, -1)  # will print 50 spaces followed by newline on the stats line

    # calculate necessary header info (nnz elements, num terms, num docs) while writing out vectors
    # the header info consists of the number of nonzero elements, terms and documents
    _num_terms, num_nnz = 0, 0
    docno, poslast = -1, -1
    offsets = []
    # check whether the corpus has a metadata attribute
    if hasattr(corpus, 'metadata'):
        orig_metadata = corpus.metadata
        corpus.metadata = metadata
        if metadata:
            docno2metadata = {}
    else:
        metadata = False
    # iterate over the 2-D array whose elements are <term id, term frequency> pairs,
    # e.g. [[<id1, freq>, <id2, freq>, ...], [<id3, freq>, <id2, freq>, ...], ...]
    for docno, doc in enumerate(corpus):
        if metadata:
            bow, data = doc
            docno2metadata[docno] = data
        else:
            bow = doc
        if docno % progress_cnt == 0:
            logger.info("PROGRESS: saving document #%i" % docno)
        if index:
            posnow = mw.fout.tell()
            if posnow == poslast:
                offsets[-1] = -1
            offsets.append(posnow)
            poslast = posnow
        # write the vector, saved as: coordinate 1, coordinate 2, nonzero value
        max_id, veclen = mw.write_vector(docno, bow)
        _num_terms = max(_num_terms, 1 + max_id)
        num_nnz += veclen
    if metadata:
        utils.pickle(docno2metadata, fname + '.metadata.cpickle')
        corpus.metadata = orig_metadata

    num_docs = docno + 1
    num_terms = num_terms or _num_terms

    if num_docs * num_terms != 0:
        logger.info("saved %ix%i matrix, density=%.3f%% (%i/%i)" % (
            num_docs, num_terms,
            100.0 * num_nnz / (num_docs * num_terms),
            num_nnz,
            num_docs * num_terms))

    # now write proper headers, by seeking and overwriting the spaces written earlier
    # this fills in the line that was left blank at the start
    mw.fake_headers(num_docs, num_terms, num_nnz)

    mw.close()
    if index:
        return offsets
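The placeholder-header trick used by write_corpus can be sketched on its own. The code below is not gensim's implementation; `write_mm`, `HEADER` and `PLACEHOLDER_WIDTH` are names assumed for this illustration, and the 50-byte width simply mirrors the comment in the source above. It writes a blank stats line, streams the vectors in a single pass while counting, then seeks back and overwrites the blanks with the real counts.

```python
import os
import tempfile

HEADER = b"%%MatrixMarket matrix coordinate real general\n"
PLACEHOLDER_WIDTH = 50  # assumed width of the blank stats line


def write_mm(path, corpus):
    """Write bag-of-words documents as MM coordinate text in one pass."""
    num_terms, num_nnz, docno = 0, 0, -1
    with open(path, "wb") as fout:
        fout.write(HEADER)
        stats_pos = fout.tell()
        # The counts are unknown at this point, so reserve a blank line.
        fout.write(b" " * PLACEHOLDER_WIDTH + b"\n")
        for docno, bow in enumerate(corpus):
            for termid, weight in sorted(bow):
                line = "%i %i %s\n" % (docno + 1, termid + 1, weight)
                fout.write(line.encode("utf8"))
                num_terms = max(num_terms, termid + 1)
                num_nnz += 1
        num_docs = docno + 1
        stats = ("%i %i %i" % (num_docs, num_terms, num_nnz)).encode("utf8")
        if len(stats) > PLACEHOLDER_WIDTH:
            raise ValueError("stats line does not fit in the placeholder")
        # Seek back and overwrite the leading spaces; the rest stay as padding.
        fout.seek(stats_pos)
        fout.write(stats)
    return num_docs, num_terms, num_nnz


tiny_corpus = [[(0, 1), (2, 2)], [(1, 1)]]
tiny_path = os.path.join(tempfile.mkdtemp(), "tiny.mm")
print(write_mm(tiny_path, tiny_corpus))
```

The benefit of this design is that the corpus only needs to be iterated once, which matters when it is a stream too large to hold in memory. The cost is a stats line padded with trailing spaces.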
Inside write_corpus, each record is saved by calling the write_vector method of the MmWriter class.
# each vector is converted into this coordinate form [MmWriter method]
for termid, weight in vector:  # write term ids in sorted order
    self.fout.write(utils.to_utf8("%i %i %s\n" % (docno + 1, termid + 1, weight)))  # +1 because MM format starts counting from 1
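When index is true, write_corpus also records the file position at which each document starts. A simplified sketch of why such offsets are useful is given below: with them, a single document can be read without scanning the whole file. `write_with_offsets` and `read_doc` are hypothetical helpers, and unlike the gensim code above this sketch does not mark empty documents with -1.

```python
import io


def write_with_offsets(fout, corpus):
    """Write coordinate lines and return the byte offset of each document."""
    offsets = []
    for docno, bow in enumerate(corpus):
        offsets.append(fout.tell())  # position where this document begins
        for termid, weight in bow:
            line = "%i %i %s\n" % (docno + 1, termid + 1, weight)
            fout.write(line.encode("utf8"))
    return offsets


def read_doc(fin, offsets, docno):
    """Jump to one document and read only the lines that belong to it."""
    fin.seek(offsets[docno])
    doc = []
    for line in fin:
        d, t, w = line.split()
        if int(d) != docno + 1:  # reached the next document
            break
        doc.append((int(t) - 1, float(w)))
    return doc


buf = io.BytesIO()
doc_offsets = write_with_offsets(buf, [[(0, 1), (2, 2)], [], [(1, 1)]])
print(doc_offsets)
print(read_doc(buf, doc_offsets, 2))
```

An empty document gets the same offset as the document after it, so `read_doc` returns an empty list for it.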
2. SvmLightCorpus
For more about SVMlight and its file format, see: http://svmlight.joachims.org/
To save the corpus above in svmlight form, add this line to the test code:
corpora.SvmLightCorpus.serialize('tmp/deerwester.svm', corpus)
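Result:
0 1:1 2:1 3:1
0 1:1 4:1 5:1 6:1 7:1 8:1
0 2:1 6:1 8:1 9:1
0 3:1 8:2 9:1
0 5:1 6:1 7:1
0 10:1
0 10:1 11:1
0 10:1 11:1 12:1
0 4:1 11:1 12:1
The output looks very much like the bag-of-words representation, except that the term ids here start from 1, whereas the bag-of-words ids start from 0.
The leading "0" is the default label. This file format is meant for storing data for classification; when no class is specified, all the vectors end up in the same class.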
with utils.smart_open(fname, 'wb') as fout:
    for docno, doc in enumerate(corpus):
        label = labels[docno] if labels else 0  # target class is 0 by default
        offsets.append(fout.tell())
        fout.write(utils.to_utf8(SvmLightCorpus.doc2line(doc, label)))
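The doc2line method:
pairs = ' '.join("%i:%s" % (termid + 1, termval) for termid, termval in doc)  # +1 to convert 0-base to 1-base
return "%s %s\n" % (label, pairs)
Within a line, the label gives the class of the vector that follows. The line number identifies the document, i.e. the row. In each pair, the first number is the column index and the second is the value.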
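The line layout can be checked with a small round trip. The two functions below are a standalone sketch modelled on the doc2line excerpt above; `bow_to_svmlight_line` and `svmlight_line_to_bow` are names assumed here, and the parser ignores optional parts of the SVMlight format such as qid fields and trailing comments.

```python
def bow_to_svmlight_line(bow, label=0):
    """Format one bag-of-words document as an svmlight line."""
    # +1 converts the 0-based term ids to 1-based feature ids.
    pairs = " ".join("%i:%s" % (termid + 1, value) for termid, value in bow)
    return "%s %s\n" % (label, pairs)


def svmlight_line_to_bow(line):
    """Parse an svmlight line back into (bow, label)."""
    parts = line.split()
    label = parts[0]
    bow = []
    for part in parts[1:]:
        termid, value = part.split(":")
        bow.append((int(termid) - 1, float(value)))
    return bow, label


line = bow_to_svmlight_line([(2, 1), (7, 2), (8, 1)])
print(line, end="")
print(svmlight_line_to_bow(line))
print(bow_to_svmlight_line([(0, 1)], label=1), end="")
```

The first call reproduces the fourth line of the result above (`0 3:1 8:2 9:1`), and passing a label other than the default changes only the first field of the line.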
[Author: happyprince, http://blog.csdn.net/ld326/article/details/78396982]