NLP07 - A Brief Analysis of the Gensim Source Code [MmCorpus & SvmLightCorpus]



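Abstract: This post looks at the source code of MmCorpus and SvmLightCorpus to see in what form a corpus is saved. It reviews the relevant matrix storage formats and reads through the code that writes them.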

1. MmCorpus

1.1 Introduction to MM

MM is a file format for storing matrices: the Matrix Market File Format.
"Text File Formats"
http://math.nist.gov/MatrixMarket/formats.html
"The Matrix Market File Format"
http://people.sc.fsu.edu/~jburkardt/data/mm/mm.html

Characteristics of MM files, as listed in "The Matrix Market File Format":
● ASCII format;
● allow comment lines, which begin with a percent sign;
● use a “coordinate” format for sparse matrices;
● use an “array” format for general dense matrices;
A file in the Matrix Market format comprises four parts:
1. Header line: contains an identifier, and four text fields;
2. Comment lines: allow a user to store information and comments;
3. Size line: specifies the number of rows and columns, and the number of nonzero elements;
4. Data lines: specify the location of the matrix entries (implicitly or explicitly) and their values.
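
The four parts listed above can be illustrated with a small reader. This is a minimal sketch in plain Python, not gensim's reader: the function name parse_mm_coordinate and the sample text are made up for illustration, and only the coordinate format is handled.

```python
def parse_mm_coordinate(text):
    """Parse Matrix Market 'coordinate' text into (rows, cols, entries)."""
    lines = text.splitlines()
    header = lines[0].split()
    # Header line: an identifier followed by four text fields.
    if header[0] != "%%MatrixMarket" or len(header) != 5:
        raise ValueError("not a Matrix Market header: %r" % lines[0])
    if header[2] != "coordinate":
        raise ValueError("this sketch only handles the coordinate format")
    # Comment lines begin with a percent sign; skip them and blank lines.
    body = [ln for ln in lines[1:] if ln.strip() and not ln.startswith("%")]
    # Size line: number of rows, columns and non-zero entries.
    rows, cols, nnz = (int(x) for x in body[0].split())
    entries = []
    for ln in body[1:]:
        i, j, value = ln.split()
        entries.append((int(i), int(j), float(value)))  # indices are 1-based
    if len(entries) != nnz:
        raise ValueError("size line announces %d entries, found %d" % (nnz, len(entries)))
    return rows, cols, entries


# Hand-written sample, not taken from the article.
SAMPLE = """%%MatrixMarket matrix coordinate real general
% a hand-written 2x3 example
2 3 3
1 1 1.0
1 3 2.0
2 2 3.0
"""

rows, cols, entries = parse_mm_coordinate(SAMPLE)
print(rows, cols, entries)
```

The check against the size line is a cheap way to detect a truncated file.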

Coordinate Format - sparse matrices;
Array Format - dense matrices;
An example of converting between the two follows:
[Figure: example of converting between the coordinate and array formats]
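
The relationship between the two layouts can also be sketched in plain Python. The helper names below are hypothetical and the 2x3 matrix is made up; the only format facts relied on are that coordinate entries are 1-based (row, column, value) triples and that the array format lists values column by column.

```python
def dense_to_coordinate(matrix):
    """Return (rows, cols, entries) with 1-based (row, col, value) triples."""
    rows = len(matrix)
    cols = len(matrix[0]) if matrix else 0
    entries = [(i + 1, j + 1, v)
               for i, row in enumerate(matrix)
               for j, v in enumerate(row)
               if v != 0]
    return rows, cols, entries


def coordinate_to_dense(rows, cols, entries):
    """Rebuild the dense matrix; positions that are not listed stay zero."""
    matrix = [[0] * cols for _ in range(rows)]
    for i, j, v in entries:
        matrix[i - 1][j - 1] = v
    return matrix


def dense_to_array_values(matrix):
    """Values in the order the array format lists them: column by column."""
    return [row[j] for j in range(len(matrix[0])) for row in matrix]


dense = [[1, 0, 2],
         [0, 3, 0]]
rows, cols, entries = dense_to_coordinate(dense)
print(entries)                       # [(1, 1, 1), (1, 3, 2), (2, 2, 3)]
print(dense_to_array_values(dense))  # [1, 0, 0, 3, 2, 0]
```

The coordinate form stores 3 triples for this matrix while the array form stores all 6 values, which is why sparse data such as bag-of-words vectors suits the coordinate form.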

1.2 A gensim example

Demo:

from gensim import corpora

# nine small tokenized documents
texts = [['human', 'interface', 'computer'],
         ['survey', 'user', 'computer', 'system', 'response', 'time'],
         ['eps', 'user', 'interface', 'system'],
         ['system', 'human', 'system', 'eps'],
         ['user', 'response', 'time'],
         ['trees'],
         ['graph', 'trees'],
         ['graph', 'minors', 'trees'],
         ['graph', 'minors', 'survey']]

# map each token to an integer id, then turn each document into (id, count) pairs
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# write the corpus in Matrix Market format (the tmp/ directory must already exist)
corpora.MmCorpus.serialize('tmp/deerwester.mm', corpus)

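Contents of the file written by Gensim:
[Figure: contents of tmp/deerwester.mm]
The file stores a 9x12 matrix (9 documents by 12 terms) with 28 non-zero entries in total.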
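
The three numbers on the size line can be checked by hand. The sketch below avoids gensim entirely: token2id is a simple stand-in for corpora.Dictionary that assigns ids in order of first appearance (gensim may assign different ids), and the Counter expression is a stand-in for doc2bow. The counts do not depend on which id a token receives.

```python
from collections import Counter

texts = [['human', 'interface', 'computer'],
         ['survey', 'user', 'computer', 'system', 'response', 'time'],
         ['eps', 'user', 'interface', 'system'],
         ['system', 'human', 'system', 'eps'],
         ['user', 'response', 'time'],
         ['trees'],
         ['graph', 'trees'],
         ['graph', 'minors', 'trees'],
         ['graph', 'minors', 'survey']]

# Stand-in for corpora.Dictionary (id assignment here is an assumption).
token2id = {}
for text in texts:
    for token in text:
        token2id.setdefault(token, len(token2id))

# Stand-in for dictionary.doc2bow: sorted (term id, count) pairs per document.
corpus = [sorted(Counter(token2id[t] for t in text).items()) for text in texts]

num_docs = len(corpus)                     # rows of the matrix
num_terms = len(token2id)                  # columns of the matrix
num_nnz = sum(len(doc) for doc in corpus)  # stored (non-zero) entries
print(num_docs, num_terms, num_nnz)        # 9 12 28
```

The fourth document contains 'system' twice, so it contributes three entries rather than four, one of them with count 2.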
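Call hierarchy in the source:
[Figure: call hierarchy of MmCorpus in the source]
MmCorpus is a subclass of IndexedCorpus. Saving an MmCorpus is mainly done by MmCorpus calling MmWriter. This can be viewed as converting a two-dimensional array into coordinate form, that is, storing it as a sparse matrix.
Below is save_corpus() as implemented by MmCorpus. It calls MmWriter.write_corpus, which is a static method:
[Figure: source of MmCorpus.save_corpus()]
The MmWriter.write_corpus method: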

# Write the data to disk in MM format (static method of MmWriter)
def write_corpus(fname, corpus, progress_cnt=1000, index=False, num_terms=None, metadata=False):
    """
    Save the vector space representation of an entire corpus to disk.
    """
    # create the MmWriter object
    mw = MmWriter(fname)

    # write empty headers to the file (with enough space to be overwritten later)
    mw.write_headers(-1, -1, -1)  # will print 50 spaces followed by newline on the stats line

    # calculate necessary header info (nnz elements, num terms, num docs) while writing out vectors
    _num_terms, num_nnz = 0, 0
    docno, poslast = -1, -1
    offsets = []

    # check whether the corpus has a metadata attribute
    if hasattr(corpus, 'metadata'):
        orig_metadata = corpus.metadata
        corpus.metadata = metadata
        if metadata:
            docno2metadata = {}
    else:
        metadata = False

    # iterate over the documents; each one is a list of (term id, term frequency) pairs,
    # e.g. [[(id1, freq), (id2, freq), ...], [(id3, freq), (id2, freq), ...], ...]
    for docno, doc in enumerate(corpus):
        if metadata:
            bow, data = doc
            docno2metadata[docno] = data
        else:
            bow = doc
        if docno % progress_cnt == 0:
            logger.info("PROGRESS: saving document #%i" % docno)
        if index:
            posnow = mw.fout.tell()
            if posnow == poslast:
                offsets[-1] = -1
            offsets.append(posnow)
            poslast = posnow
        # write the vector as lines of: row index, column index, non-zero value
        max_id, veclen = mw.write_vector(docno, bow)
        _num_terms = max(_num_terms, 1 + max_id)
        num_nnz += veclen

    if metadata:
        utils.pickle(docno2metadata, fname + '.metadata.cpickle')
        corpus.metadata = orig_metadata

    num_docs = docno + 1
    num_terms = num_terms or _num_terms

    if num_docs * num_terms != 0:
        logger.info("saved %ix%i matrix, density=%.3f%% (%i/%i)" % (
            num_docs, num_terms,
            100.0 * num_nnz / (num_docs * num_terms),
            num_nnz,
            num_docs * num_terms))

    # now write proper headers, by seeking and overwriting the spaces written earlier
    mw.fake_headers(num_docs, num_terms, num_nnz)

    mw.close()
    if index:
        return offsets

Each individual record is saved by calling the write_vector method of the MmWriter class.

# Each vector entry is written in coordinate form (method of the MmWriter class)
for termid, weight in vector:  # write term ids in sorted order
    self.fout.write(utils.to_utf8("%i %i %s\n" % (docno + 1, termid + 1, weight)))  # +1 because MM format starts counting from 1
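
The reason write_corpus writes a blank size line first is that the number of documents, terms and non-zero entries is only known after the whole corpus has been streamed once. The following is a simplified sketch of that reserve-then-overwrite idea using an in-memory buffer. It is not gensim's implementation: write_mm, HEADER_LINE and PLACEHOLDER_WIDTH are names chosen for this example, the header text is assumed, and the width of 50 is taken from the comment in the code above.

```python
import io

HEADER_LINE = b"%%MatrixMarket matrix coordinate real general\n"  # assumed header text
PLACEHOLDER_WIDTH = 50  # assumed, from the "50 spaces" comment in write_corpus


def write_mm(fout, corpus):
    """Write a bag-of-words corpus in coordinate form in a single pass."""
    fout.write(HEADER_LINE)
    stats_pos = fout.tell()
    # Reserve the size line: blanks now, real numbers once they are known.
    fout.write(b" " * PLACEHOLDER_WIDTH + b"\n")

    num_docs, num_terms, num_nnz = 0, 0, 0
    for docno, bow in enumerate(corpus):
        for termid, weight in sorted(bow):
            # +1 because MM indices start at 1
            fout.write(("%i %i %s\n" % (docno + 1, termid + 1, weight)).encode("utf8"))
            num_terms = max(num_terms, termid + 1)
            num_nnz += 1
        num_docs = docno + 1

    stats = ("%i %i %i" % (num_docs, num_terms, num_nnz)).encode("utf8")
    if len(stats) > PLACEHOLDER_WIDTH:
        raise ValueError("size line does not fit in the reserved space")
    # Seek back and overwrite the start of the blank line.
    fout.seek(stats_pos)
    fout.write(stats)
    fout.seek(0, io.SEEK_END)


buf = io.BytesIO()
write_mm(buf, [[(0, 1), (2, 1)], [(1, 2)]])
lines = buf.getvalue().decode("utf8").splitlines()
print(lines[1].strip())  # 2 3 3
print(lines[2:])         # ['1 1 1', '1 3 1', '2 2 2']
```

Because the placeholder has a fixed width, overwriting it never shifts the data lines that follow, so the corpus only needs to be iterated once.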

2. SvmLightCorpus

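See the following link for more about the SVMlight format that SvmLightCorpus uses: http://svmlight.joachims.org/
To save the corpus above in svmlight form, add this line to the test code: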

corpora.SvmLightCorpus.serialize('tmp/deerwester.svm', corpus)
Result:
0 1:1 2:1 3:1
0 1:1 4:1 5:1 6:1 7:1 8:1
0 2:1 6:1 8:1 9:1
0 3:1 8:2 9:1
0 5:1 6:1 7:1
0 10:1
0 10:1 11:1
0 10:1 11:1 12:1
0 4:1 11:1 12:1

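The output looks very similar to the bag-of-words representation. The difference is that ids here start at 1, whereas bag-of-words ids start at 0.
The leading "0" is the default label. This file format was designed to store data for classification; when no class is specified, every vector gets label 0, so all vectors end up in the same class.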

with utils.smart_open(fname, 'wb') as fout:
    for docno, doc in enumerate(corpus):
        label = labels[docno] if labels else 0  # target class is 0 by default
        offsets.append(fout.tell())
        fout.write(utils.to_utf8(SvmLightCorpus.doc2line(doc, label)))

The doc2line method:

pairs = ' '.join("%i:%s" % (termid + 1, termval) for termid, termval in doc)  # +1 to convert 0-base to 1-base
return "%s %s\n" % (label, pairs)

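Within a line, label is the class of the vector that follows it. The line number identifies the document, so it serves as the row index; in each pair, the number before the colon is the column index and the number after it is the value.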
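
The line layout can be checked with a small round trip. In the sketch below, doc2line follows the two lines of source quoted above, while line2doc is a simplified inverse written for this example: it assumes a line contains only a label and id:value pairs, and it does not handle comments or any other SVMlight extras.

```python
def doc2line(doc, label=0):
    """Format one bag-of-words document as an SVMlight line."""
    # +1 converts 0-based term ids to 1-based feature ids
    pairs = " ".join("%i:%s" % (termid + 1, termval) for termid, termval in doc)
    return "%s %s\n" % (label, pairs)


def line2doc(line):
    """Simplified inverse of doc2line: return (doc, label) with 0-based ids."""
    parts = line.split()
    label = parts[0]
    doc = []
    for pair in parts[1:]:
        fid, value = pair.split(":")
        doc.append((int(fid) - 1, float(value)))
    return doc, label


line = doc2line([(2, 1), (7, 2), (8, 1)])
print(line, end="")  # 0 3:1 8:2 9:1
doc, label = line2doc(line)
print(doc, label)
```

The example vector reproduces the fourth line of the result shown earlier, where the repeated token gives the pair 8:2.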

[Author: happyprince, http://blog.csdn.net/ld326/article/details/78396982]