NLP08 - Gensim Source Notes [ShardedCorpus & UciCorpus & LowCorpus]




Abstract: This post examines how three corpus classes — ShardedCorpus, UciCorpus, and LowCorpus — store their data, approaching each through its on-disk corpus format.

1. ShardedCorpus

ShardedCorpus stores its data in separate files called "shards". This is a compromise between speed (keeping the whole dataset in memory) and memory footprint (keeping the data on disk and reading from it on demand). Persistence is done using the standard gensim load/save methods. You can use ShardedCorpus to serialize your data just like any other gensim corpus that implements serialization. However, because the data is saved as numpy 2-dimensional ndarrays (or scipy sparse matrices), you need to supply the dimension of your data to the corpus; for word-frequency vectors, this is typically the vocabulary size.

Example:

```python
from gensim import corpora
from gensim.corpora.sharded_corpus import ShardedCorpus

texts = [['human', 'interface', 'computer'],
         ['survey', 'user', 'computer', 'system', 'response', 'time'],
         ['eps', 'user', 'interface', 'system'],
         ['system', 'human', 'system', 'eps'],
         ['user', 'response', 'time'],
         ['trees'],
         ['graph', 'trees'],
         ['graph', 'minors', 'trees'],
         ['graph', 'minors', 'survey']]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
ShardedCorpus.serialize('tmp/deerwester.shard', corpus, dim=12)
```

Serialization result:

Running this generates several files named like deerwester.shard.XXXX; each is one shard. All shards must be of the same size. The shards can be re-sized (which is essentially a re-serialization into new-size shards), but note that this operation temporarily takes twice as much disk space, because the old shards are not deleted until the new shards are safely in place. Since the corpus needs the data serialized in order to operate, it serializes the data right away on initialization. Instead of calling `ShardedCorpus.serialize()`, you can just initialize and use the corpus right away:
```python
>>> corpus = ShardedCorpus(output_prefix, corpus, dim=1000)
>>> batch = corpus[100:150]
```
```python
# Once the data is loaded, setting this flag makes the corpus return
# documents in gensim's standard sparse format for downstream computation
# (sh_corpus is a ShardedCorpus initialized as above):
sh_corpus.gensim = True
# Slicing yields an iterable batch of documents:
batch01 = sh_corpus[0:9]
for data in batch01:
    print(data)
```

This access pattern suits workflows where the data has to live on disk: by the time you index into the corpus, the data has already been serialized.

```python
corpus = ShardedCorpus('tmp/deerwester.shard', corpus, dim=num)
batch = corpus[0:9]
print(batch)
```

Why does initialization behave this way? The original post traces the call flow with a sequence diagram; the `ShardedCorpus` constructor tells the same story directly:

```python
def __init__(self, output_prefix, corpus, dim=None,
             shardsize=4096, overwrite=False, sparse_serialization=False,
             sparse_retrieval=False, gensim=False):
    """
    :type output_prefix: str
    :param output_prefix: Path prefix under which the shard files are stored.
    :type corpus: gensim.interfaces.CorpusABC
    :param corpus: The source corpus from which to build the dataset.
    :type dim: int
    :param dim: Dimension of the data; typically the vocabulary size.
    :type shardsize: int
    :param shardsize: How many data points (documents) go into one shard.
    :type overwrite: bool
    :param overwrite: If True, any existing serialized data is overwritten.
    :type sparse_serialization: bool
    :param sparse_serialization: If set, save the data in a sparse format
        (CSR matrix).
    :type sparse_retrieval: bool
    :param sparse_retrieval: If set, retrieval returns sparse data
        (scipy CSR matrices); otherwise it returns ndarrays.
    :type gensim: bool
    :param gensim: If set, convert the output to gensim's sparse vector
        format (lists of (id, value) 2-tuples).
    """
    if (not os.path.isfile(output_prefix)) or overwrite:
        self.init_shards(output_prefix, corpus, shardsize)
        self.save()
    else:
        self.init_by_clone()

The constructor handles two cases: when no serialized data exists yet, or `overwrite` is set, it calls `init_shards()` to build the shards; otherwise it calls `init_by_clone()` to load the data that is already there.
For `init_shards()`, two examples are worth working through. The first involves this utility function from `gensim.utils`:

`def chunkize_serial(iterable, chunksize, as_numpy=False)`

This function is assigned to the module-level name `grouper`, so external callers invoke it as `grouper(iterable, chunksize)`. It splits `iterable` into groups of `chunksize` items each. For example:

```python
from pprint import pprint
from gensim import corpora
from gensim.utils import grouper

texts = [['human', 'interface', 'computer'],
         ['survey', 'user', 'computer', 'system', 'response', 'time'],
         ['eps', 'user', 'interface', 'system'],
         ['system', 'human', 'system', 'eps'],
         ['user', 'response', 'time'],
         ['trees'],
         ['graph', 'trees'],
         ['graph', 'minors', 'trees'],
         ['graph', 'minors', 'survey']]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
print(corpus)
pprint(list(grouper(corpus, 3)))
```

Output:

```
Original corpus:
[[(0, 1), (1, 1), (2, 1)],
 [(1, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)],
 [(2, 1), (3, 1), (5, 1), (8, 1)],
 [(0, 1), (5, 2), (8, 1)],
 [(3, 1), (6, 1), (7, 1)],
 [(9, 1)],
 [(9, 1), (10, 1)],
 [(9, 1), (10, 1), (11, 1)],
 [(4, 1), (10, 1), (11, 1)]]
Grouped (chunksize=3):
[[[(0, 1), (1, 1), (2, 1)],
  [(1, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)],
  [(2, 1), (3, 1), (5, 1), (8, 1)]],
 [[(0, 1), (5, 2), (8, 1)], [(3, 1), (6, 1), (7, 1)], [(9, 1)]],
 [[(9, 1), (10, 1)], [(9, 1), (10, 1), (11, 1)], [(4, 1), (10, 1), (11, 1)]]]
```
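For intuition, the grouping behavior can be approximated in a few lines of plain Python. This is a simplified sketch, not gensim's actual `chunkize_serial` implementation (which also supports an `as_numpy` mode):

```python
def grouper(iterable, chunksize):
    # Yield successive lists of up to `chunksize` items from `iterable`;
    # the final group may be smaller if the items don't divide evenly.
    it = iter(iterable)
    while True:
        chunk = []
        for _ in range(chunksize):
            try:
                chunk.append(next(it))
            except StopIteration:
                break
        if not chunk:
            return
        yield chunk

print(list(grouper(range(7), 3)))  # [[0, 1, 2], [3, 4, 5], [6]]
```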

Next, consider the following code:

```python
from pprint import pprint
import numpy
import gensim

current_shard = numpy.zeros((2, 12), dtype=float)
print('Initial current_shard:')
print(current_shard)
dict_data = dict([(1, 1), (2, 1), (3, 2)])
print('dict built from a list of pairs: dict([(1,1),(2,1),(3,2)]) ->')
pprint(dict_data)
g1 = list(gensim.matutils.itervalues(dict_data))
print('dict values as a list: list(gensim.matutils.itervalues(dict_data)) ->', g1)
l1 = list(dict_data)
print('dict keys as a list: list(dict_data) ->', l1)
current_shard[0][l1] = g1
print('current_shard after assignment:', current_shard)
```

Output:

```
Initial current_shard:
[[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]]
dict built from a list of pairs: dict([(1,1),(2,1),(3,2)]) ->
{1: 1, 2: 1, 3: 2}
dict values as a list: list(gensim.matutils.itervalues(dict_data)) -> [1, 1, 2]
dict keys as a list: list(dict_data) -> [1, 2, 3]
current_shard after assignment: [[ 0.  1.  1.  2.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]]
```
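The key step above is NumPy fancy indexing: a gensim-style sparse document (a list of (id, value) pairs) can be written into a dense shard row in a single assignment. The sketch below wraps the idea in `fill_row`, a hypothetical helper name (not part of gensim):

```python
import numpy as np

def fill_row(row, doc):
    # doc is a gensim-style sparse document: a list of (word_id, value) pairs.
    ids = [word_id for word_id, _ in doc]
    vals = [value for _, value in doc]
    row[ids] = vals  # fancy indexing: assign all positions in one step
    return row

shard = np.zeros((2, 12), dtype=float)
fill_row(shard[0], [(0, 1), (5, 2), (8, 1)])
print(shard[0])  # row 0 now holds 1.0, 2.0, 1.0 at columns 0, 5, 8
```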

In summary, this class exists to spread serialized data across multiple files, which makes working with large datasets feasible.

Finally, putting the corpus to use:

```python
from gensim import corpora
from gensim.corpora.sharded_corpus import ShardedCorpus

texts = [['human', 'interface', 'computer'],
         ['survey', 'user', 'computer', 'system', 'response', 'time'],
         ['eps', 'user', 'interface', 'system'],
         ['system', 'human', 'system', 'eps'],
         ['user', 'response', 'time'],
         ['trees'],
         ['graph', 'trees'],
         ['graph', 'minors', 'trees'],
         ['graph', 'minors', 'survey']]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
num = len(dictionary.keys())
corpus = ShardedCorpus('tmp/deerwester.shard', corpus, dim=num,
                       shardsize=2, overwrite=True)
batch = corpus[2:5]
print(batch)
```

After this runs, five shard files are generated (nine documents at `shardsize=2`), and the corpus can be accessed by index or slice.
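The file count follows directly from the serialization parameters: the number of shards is the document count divided by `shardsize`, rounded up.

```python
import math

n_docs, shardsize = 9, 2
n_shards = math.ceil(n_docs / shardsize)  # the last shard holds the remainder
print(n_shards)  # 5
```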

2. UciCorpus (the UCI format)

The UCI repository is a widely used collection of standard test datasets:
http://archive.ics.uci.edu/ml/index.php
Its Bag-of-Words storage format:

```
D
W
NNZ
docID wordID count
docID wordID count
...
docID wordID count
```

The first three lines are header values: D is the number of documents, W is the vocabulary size, and NNZ is the number of `docID wordID count` triples (non-zero entries) that follow.

A demonstration:

```python
from gensim import corpora

texts = [['human', 'interface', 'computer'],
         ['survey', 'user', 'computer', 'system', 'response', 'time'],
         ['eps', 'user', 'interface', 'system'],
         ['system', 'human', 'system', 'eps'],
         ['user', 'response', 'time'],
         ['trees'],
         ['graph', 'trees'],
         ['graph', 'minors', 'trees'],
         ['graph', 'minors', 'survey']]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
corpora.UciCorpus.serialize('tmp/deerwester.uci', corpus)

The resulting file:

```
9
12
28
1 1 1
1 2 1
1 3 1
2 2 1
2 4 1
2 5 1
2 6 1
2 7 1
2 8 1
3 3 1
3 4 1
3 6 1
3 9 1
4 1 1
4 6 2
4 9 1
5 4 1
5 7 1
5 8 1
6 10 1
7 10 1
7 11 1
8 10 1
8 11 1
8 12 1
9 5 1
9 11 1
9 12 1
```
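To make the layout concrete, the triples can be read back with a few lines of Python. `parse_uci` below is a hypothetical helper for illustration only; in practice gensim's own `UciCorpus` reader does this work:

```python
def parse_uci(lines):
    # First three lines are the header: D (documents), W (vocabulary size),
    # NNZ (number of non-zero docID/wordID entries).
    it = iter(lines)
    num_docs, num_words, nnz = int(next(it)), int(next(it)), int(next(it))
    docs = {}
    # Each remaining line is a "docID wordID count" triple (IDs are 1-based).
    for _ in range(nnz):
        doc_id, word_id, count = map(int, next(it).split())
        docs.setdefault(doc_id, []).append((word_id, count))
    return num_docs, num_words, docs

sample = ['3', '4', '4', '1 1 1', '1 2 1', '2 3 2', '3 4 1']
num_docs, num_words, docs = parse_uci(sample)
print(docs[2])  # [(3, 2)]
```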

3. LowCorpus

The LowCorpus format:
It is the input format of GibbsLDA++, where each document is a plain list of words.
Quoting the description at http://gibbslda.sourceforge.net/#3.2_Input_Data_Format:
Both training/estimation data and new (unseen) data share the same format, shown below:

```
[M]
[document1]
[document2]
...
[documentM]
```

The first line, [M], is the total number of documents. Each remaining line [documenti] is the i-th document in the dataset, which consists of a sequence of words:
[documenti] = [wordi1] [wordi2] ... [wordiNi]
where each [wordij] is a text string, and words are separated by blank characters.
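Writing a corpus in this layout is straightforward; the sketch below serializes a small token list into the GibbsLDA++ shape described above (plain Python, no library helpers):

```python
texts = [['human', 'interface', 'computer'],
         ['graph', 'trees']]

# First line: M, the document count; then each document as one
# space-joined line of words.
lines = [str(len(texts))] + [' '.join(doc) for doc in texts]
low_format = '\n'.join(lines)
print(low_format)
```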

The LowCorpus constructor:

```python
def __init__(self, fname, id2word=None, line2words=split_on_space):
    """
    Initialize the corpus from a file.

    `id2word` and `line2words` are optional parameters.

    If provided, `id2word` is a dictionary mapping between word_ids
    (integers) and words (strings). If not provided, the mapping is
    constructed from the documents.

    `line2words` is a function which converts lines into tokens.
    Defaults to simple splitting on spaces; in other words, the default
    tokenizer just splits each line on whitespace.
    """
```
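Any callable that turns a line into a token list can be passed as `line2words`. The sketch below mimics the default behavior (the real default, `split_on_space`, lives in gensim's lowcorpus module):

```python
def split_on_space(line):
    # Mimic the default line2words tokenizer: split a line on spaces,
    # dropping empty tokens produced by repeated or surrounding whitespace.
    return [word for word in line.strip().split(' ') if word]

print(split_on_space(' human interface  computer '))  # ['human', 'interface', 'computer']
```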

[Author: happyprince, http://blog.csdn.net/ld326/article/details/78427162]
