NLP10 - A Brief Look at the Gensim Source Code [LsiModel]



Abstract: After reading through the whole code, the overall idea is clear. The LSI model is essentially an SVD, truncated to a TSVD (truncated singular value decomposition). Two algorithms are used: a randomized two-stage approximation algorithm, and the Lanczos algorithm implemented in SVDLIBC. There are also two computation modes: single-machine and distributed. The distributed mode relies on the Pyro4 framework: a dispatcher is implemented first, and the dispatcher drives the worker nodes, each of which behaves like a single machine again. This implementation is also serial, visiting the worker nodes one by one, so no parallelism is exploited. Two pieces of code are not analysed in depth: the merge() function of the Projection object, and stochastic_svd(), which implements the randomized SVD.
Before reading the code, it helps to understand LSI at a high level first: http://blog.csdn.net/ld326/article/details/78461778
If things are already clear after that, there is no need to read on.
For the distributed computation part, a small example gives an intuitive feel for the code:
http://blog.csdn.net/ld326/article/details/78467885

0 Introductory example

from gensim import corpora
from gensim import models
# import logging
# logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)


def get_corpus_dictionary():
    documents = ["Human machine interface for lab abc computer applications",
                 "A survey of user opinion of computer system response time",
                 "The EPS user interface management system",
                 "System and human system engineering testing of EPS",
                 "Relation of user perceived response time to error measurement",
                 "The generation of random binary unordered trees",
                 "The intersection graph of paths in trees",
                 "Graph minors IV Widths of trees and well quasi ordering",
                 "Graph minors A survey"]
    stoplist = set('for a of the and to in'.split())
    texts = [[word for word in document.lower().split() if word not in stoplist]
             for document in documents]

    from collections import defaultdict
    frequency = defaultdict(int)
    for text in texts:
        for token in text:
            frequency[token] += 1
    texts = [[token for token in text if frequency[token] > 1]
             for text in texts]

    dictionary = corpora.Dictionary(texts)
    corpus = [dictionary.doc2bow(text) for text in texts]
    print('Original texts:')
    for text in texts:
        print(text)
    return corpus, dictionary


corpus, dictionary = get_corpus_dictionary()
print('=================dictionary=============')
print('word id -> number of documents containing that word (dfs):', dictionary.dfs)
print('word -> id mapping (token2id):', dictionary.token2id)
print('id -> word mapping (id2token):', dictionary.id2token)
print('number of documents processed (num_docs):', dictionary.num_docs)
print('total token count without deduplication (num_pos):', dictionary.num_pos)
print('tokens deduplicated within each document but not across documents, i.e. non-zero entries in the BOW matrix (num_nnz):', dictionary.num_nnz)
print('=================dictionary=============')
print('Original bag-of-words representation:')
for c in corpus:
    print(c)

tfidf = models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]
print('Transform the whole corpus:')
for doc in corpus_tfidf:
    print(doc)

lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=2)  # initialize an LSI transformation
corpus_lsi = lsi[corpus_tfidf]  # create a double wrapper over the original corpus: bow->tfidf->fold-in-lsi
print('LSI model:')
lsi.print_topics(2)
for doc in corpus_lsi:  # both bow->tfidf and tfidf->lsi transformations are actually executed here, on the fly
    print(doc)

Result

Original texts:
['human', 'interface', 'computer']
['survey', 'user', 'computer', 'system', 'response', 'time']
['eps', 'user', 'interface', 'system']
['system', 'human', 'system', 'eps']
['user', 'response', 'time']
['trees']
['graph', 'trees']
['graph', 'minors', 'trees']
['graph', 'minors', 'survey']
=================dictionary=============
word id -> number of documents containing that word (dfs): {0: 2, 1: 2, 2: 2, 3: 2, 4: 2, 5: 3, 6: 3, 7: 2, 8: 2, 9: 3, 10: 3, 11: 2}
word -> id mapping (token2id): {'user': 6, 'human': 1, 'response': 7, 'minors': 11, 'computer': 2, 'trees': 9, 'survey': 3, 'graph': 10, 'eps': 8, 'interface': 0, 'time': 4, 'system': 5}
id -> word mapping (id2token): {}
number of documents processed (num_docs): 9
total token count without deduplication (num_pos): 29
tokens deduplicated within each document but not across documents, i.e. non-zero entries in the BOW matrix (num_nnz): 28
=================dictionary=============
Original bag-of-words representation:
[(0, 1), (1, 1), (2, 1)]
[(2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)]
[(0, 1), (5, 1), (6, 1), (8, 1)]
[(1, 1), (5, 2), (8, 1)]
[(4, 1), (6, 1), (7, 1)]
[(9, 1)]
[(9, 1), (10, 1)]
[(9, 1), (10, 1), (11, 1)]
[(3, 1), (10, 1), (11, 1)]
Transform the whole corpus:
[(0, 0.5773502691896257), (1, 0.5773502691896257), (2, 0.5773502691896257)]
[(2, 0.44424552527467476), (3, 0.44424552527467476), (4, 0.44424552527467476), (5, 0.3244870206138555), (6, 0.3244870206138555), (7, 0.44424552527467476)]
[(0, 0.5710059809418182), (5, 0.4170757362022777), (6, 0.4170757362022777), (8, 0.5710059809418182)]
[(1, 0.49182558987264147), (5, 0.7184811607083769), (8, 0.49182558987264147)]
[(4, 0.6282580468670046), (6, 0.45889394536615247), (7, 0.6282580468670046)]
[(9, 1.0)]
[(9, 0.7071067811865475), (10, 0.7071067811865475)]
[(9, 0.5080429008916749), (10, 0.5080429008916749), (11, 0.695546419520037)]
[(3, 0.6282580468670046), (10, 0.45889394536615247), (11, 0.6282580468670046)]
LSI model:
[(0, 0.066007833960902054), (1, -0.52007033063618524)]
[(0, 0.19667592859142241), (1, -0.76095631677000564)]
[(0, 0.089926399724461856), (1, -0.72418606267525076)]
[(0, 0.075858476521779475), (1, -0.63205515860034256)]
[(0, 0.10150299184979916), (1, -0.57373084830029641)]
[(0, 0.70321089393783209), (1, 0.16115180214025548)]
[(0, 0.87747876731198438), (1, 0.16758906864659107)]
[(0, 0.90986246868185905), (1, 0.14086553628718695)]
[(0, 0.6165825350569285), (1, -0.05392907566389589)]

1 The LsiModel constructor

It mainly sets up the model parameters.

1.1 Function signature and parameter description

def __init__(self, corpus=None, num_topics=200, id2word=None, chunksize=20000, decay=1.0, distributed=False, onepass=True, power_iters=P2_EXTRA_ITERS, extra_samples=P2_EXTRA_DIMS)

corpus: the training corpus; if a corpus is given it is used directly, otherwise documents can be supplied later via add_documents().
num_topics: the number of topics (dimensions) after the decomposition.
onepass: whether the stochastic algorithm makes a single pass over the input; onepass=True selects the one-pass algorithm, otherwise the multi-pass algorithm is used.
id2word: mapping from word id to word.
chunksize: the number of documents trained at a time. The size of chunksize is a trade-off between speed and memory; in a distributed setting each chunk is sent to a different worker node.
decay: when new documents arrive, controls whether the model leans more towards the old corpus or the newly supplied one.
distributed: whether to use distributed computation.
power_iters, extra_samples: algorithm parameters that affect the accuracy of the stochastic multi-pass algorithm, which is used either internally (onepass=True) or as the front-end algorithm (onepass=False).
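To make the parameters concrete, here is a hedged, self-contained sketch of constructing the model with all of them spelled out. The tiny corpus is made up for illustration; apart from num_topics, the values simply restate the library defaults at the time of writing (power_iters=2 and extra_samples=100 correspond to P2_EXTRA_ITERS and P2_EXTRA_DIMS):

from gensim import corpora, models

# a tiny illustrative corpus; in the example of section 0 this would be corpus_tfidf
texts = [['human', 'interface', 'computer'], ['graph', 'minors', 'trees']]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# num_topics=2 for the toy data; the remaining values restate the defaults
lsi = models.LsiModel(corpus=corpus, id2word=dictionary,
                      num_topics=2, chunksize=20000, decay=1.0,
                      distributed=False, onepass=True,
                      power_iters=2, extra_samples=100)
print(lsi.print_topics(2))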

1.2 The relationship between distributed and onepass

if distributed:
    if not onepass:
        logger.warning("forcing the one-pass algorithm for distributed LSA")
        onepass = True

If the computation is distributed, onepass is forced to True, i.e. only the one-pass algorithm is used.

1.3 Building the id2word mapping

if self.id2word is None:
    logger.warning("no word id mapping provided; initializing from corpus, assuming identity")
    self.id2word = utils.dict_from_corpus(corpus)
    self.num_terms = len(self.id2word)
else:
    self.num_terms = 1 + (max(self.id2word.keys()) if self.id2word else -1)
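A small sketch of the else branch, showing why num_terms comes from the largest key rather than the dictionary length: term ids may have gaps (this toy id2word is hypothetical):

# hypothetical mapping with a gap in the ids
id2word = {0: 'interface', 1: 'human', 5: 'system'}
num_terms = 1 + (max(id2word.keys()) if id2word else -1)
print(num_terms)  # 6, not 3: the term-document matrix needs a row for every id up to 5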

1.4 Initializing the projection object

self.projection = Projection(self.num_terms, self.num_topics, power_iters=self.power_iters, extra_dims=self.extra_samples)

At this point docs=None, so this only initializes an empty projection object that does not do any work yet.

1.5 Distributed computation

If distributed computation is enabled, the dispatcher is initialized. Gensim's distributed computation relies on the Pyro4 package; a simple example (http://blog.csdn.net/ld326/article/details/78467885) gives an intuitive feel for how Pyro4 runs. The code with which Gensim requests computation from the dispatcher server is as follows:

import Pyro4
# connect to the gensim.lsi_dispatcher proxy and obtain the dispatcher object
dispatcher = Pyro4.Proxy('PYRONAME:gensim.lsi_dispatcher')
# initialize the dispatcher object
dispatcher.initialize(id2word=self.id2word, num_topics=num_topics,
                      chunksize=chunksize, decay=decay,
                      power_iters=self.power_iters, extra_samples=self.extra_samples,
                      distributed=False, onepass=onepass)
# store the dispatcher and the number of worker nodes
self.dispatcher = dispatcher
self.numworkers = len(dispatcher.getworkers())

1.6 Feeding the corpus into the SVD model

if corpus is not None:
    self.add_documents(corpus)

The LsiModel constructor only computes the relevant parameters, in preparation for the SVD updates that follow, whether on a single machine or distributed.

2. The Projection class

The constructor above creates a Projection object, so let us look at what the Projection class does.

2.1 Constructor and parameters

def __init__(self, m, k, docs=None, use_svdlibc=False, power_iters=P2_EXTRA_ITERS, extra_dims=P2_EXTRA_DIMS)

m: the number of words in the dictionary, len(self.id2word).
k: the number of topics, supplied by the user.
docs: the list of documents; if this is None, the (U, S) projection cannot be built yet.
use_svdlibc: whether to use the SVDLIBC package: http://tedlab.mit.edu/~dr/SVDLIBC/
power_iters, extra_dims: the two algorithm parameters.

2.2 Code walkthrough (mainly the decomposition of the docs matrix)

# docs must not be None
if docs is not None:
    # given a job `docs`, compute its decomposition
    if not use_svdlibc:
        # stochastic SVD via stochastic_svd(); see the paper "Finding Structure
        # with Randomness: Probabilistic Algorithms for Constructing Approximate
        # Matrix Decompositions", https://arxiv.org/pdf/0909.4061.pdf
        u, s = stochastic_svd(
            docs, k, chunksize=sys.maxsize,
            num_terms=m, power_iters=self.power_iters,
            extra_dims=self.extra_dims)
    else:
        # the sparsesvd module must be installed; the exception handling is
        # removed here to keep the listing short
        import sparsesvd
        if not scipy.sparse.issparse(docs):
            # convert to sparse form if the matrix is not sparse yet
            docs = matutils.corpus2csc(docs)
        # call sparsesvd to compute the SVD
        ut, s, vt = sparsesvd.sparsesvd(docs, k + 30)
        # keep ut and s, drop vt, because this is a truncated SVD (TSVD)
        u = ut.T
        del ut, vt
        # re-check and update k so that only the informative part of the
        # spectrum is kept; the returned value never exceeds k
        k = clip_spectrum(s**2, self.k)
    # keep u and s
    self.u = u[:, :k].copy()
    self.s = s[:k].copy()
else:
    self.u, self.s = None, None
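Conceptually, what the Projection keeps is the truncated pair (U_k, S_k); the right singular vectors are discarded. A minimal dense NumPy sketch of that idea, for illustration only (the gensim code works on sparse, streamed input):

import numpy as np

m, n, k = 12, 9, 2                         # terms, documents, topics
docs = np.random.rand(m, n)                # dense term-document matrix, purely illustrative
u, s, vt = np.linalg.svd(docs, full_matrices=False)
u_k, s_k = u[:, :k].copy(), s[:k].copy()   # what Projection.u / Projection.s correspond to
# vt is thrown away: LSI only needs the term-topic space (U, S);
# document vectors are folded in later through the model's [] transformation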

3 add_documents(): updating the SVD from a corpus

Purpose: update the singular value decomposition to take into account a new corpus of documents.

3.1 Function signature and parameters

def add_documents(self, corpus, chunksize=None, decay=None)

corpus: the new corpus.
chunksize: the number of documents trained at a time. Training proceeds in chunks of chunksize documents; its size is a trade-off between increased speed (bigger chunksize) and lower memory footprint (smaller chunksize). If distributed mode is on, each chunk is sent to a different worker/computer.
decay: controls whether the updated model leans more towards the old corpus or the newly supplied one. Setting decay < 1.0 causes re-orientation towards new data trends in the input document stream, by giving less emphasis to old observations. This allows LSA to gradually "forget" old observations (documents) and give more preference to new ones.

3.2 Code walkthrough

# check whether the corpus is already a sparse matrix
if not scipy.sparse.issparse(corpus):
    if not self.onepass:
        # we are allowed multiple passes over the input => use a faster,
        # randomized two-pass algorithm
        update = Projection(self.num_terms, self.num_topics, None)
        update.u, update.s = stochastic_svd(
            corpus, self.num_topics,
            num_terms=self.num_terms, chunksize=chunksize,
            extra_dims=self.extra_samples, power_iters=self.power_iters)
        # merge the two SVDs
        self.projection.merge(update, decay=decay)
        # number of documents processed
        self.docs_processed += len(corpus) if hasattr(corpus, '__len__') else 0
    else:
        # the one-pass algorithm
        doc_no = 0
        # distributed variant of the one-pass algorithm
        if self.dispatcher:
            # reset the dispatcher
            self.dispatcher.reset()
        # iterate over the corpus in chunks
        for chunk_no, chunk in enumerate(utils.grouper(corpus, chunksize)):
            # number of non-zero entries in this chunk
            nnz = sum(len(doc) for doc in chunk)
            # build the job as a sparse matrix
            job = matutils.corpus2csc(chunk, num_docs=len(chunk), num_terms=self.num_terms, num_nnz=nnz)
            del chunk
            doc_no += job.shape[1]
            if self.dispatcher:
                # distributed: put the job on the queue so any worker can process it
                self.dispatcher.putjob(job)
                del job
            else:
                # serial processing: there is only one worker, process the job directly
                update = Projection(self.num_terms, self.num_topics, job, extra_dims=self.extra_samples, power_iters=self.power_iters)
                del job
                # merge the result once the chunk has been processed
                self.projection.merge(update, decay=decay)
                del update
                logger.info("processed documents up to #%s", doc_no)
                self.print_topics(5)
        # wait for all workers to finish (distributed version only)
        # putjob above only pushes the chunks onto the queue; when getstate()
        # is called, all worker nodes are visited, their results merged one by
        # one, and the merged projection is returned
        if self.dispatcher:
            self.projection = self.dispatcher.getstate()
        self.docs_processed += doc_no
else:
    # the corpus is already sparse: process it directly
    update = Projection(self.num_terms, self.num_topics, corpus.tocsc(), extra_dims=self.extra_samples, power_iters=self.power_iters)
    self.projection.merge(update, decay=decay)
    self.docs_processed += corpus.shape[1]
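A hedged usage sketch of this update path, continuing the example from section 0 (new_texts is made up for illustration): the new documents are converted with the existing dictionary and tfidf model, then folded into the LSI model, with decay < 1.0 giving more weight to the newer material:

# hypothetical extra documents, reusing dictionary, tfidf and lsi from section 0
new_texts = [['graph', 'survey', 'minors'], ['human', 'system', 'interface']]
new_corpus = [dictionary.doc2bow(text) for text in new_texts]

lsi.add_documents(tfidf[new_corpus], chunksize=100, decay=0.9)
print(lsi.print_topics(2))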

4. The dispatcher in distributed mode

The dispatcher is registered under the name gensim.lsi_dispatcher; the object is an instance of the Dispatcher class in models/lsi_dispatcher.py. This class manages and schedules the workers; the actual worker nodes are registered with Pyro4 under names prefixed with gensim.lsi_worker.

4.1 Registering the Dispatcher object

utils.pyro_daemon('gensim.lsi_dispatcher', Dispatcher(maxsize=maxsize))

The registration code is the same as in the small example from the previous post, so it is skipped here.

4.2 The initialize() function

Purpose: locate all available worker nodes and store their proxies for subsequent RMI calls.

# locate all available workers and store their proxies, for subsequent RMI calls
# dictionary of available worker nodes
self.workers = {}
# get the name server
with utils.getNS() as ns:
    # keep a proxy to ourselves
    self.callback = Pyro4.Proxy('PYRONAME:gensim.lsi_dispatcher')  # = self
    # iterate over all worker nodes whose names start with gensim.lsi_worker
    for name, uri in iteritems(ns.list(prefix='gensim.lsi_worker')):
        try:
            # obtain a proxy for the worker at this uri
            worker = Pyro4.Proxy(uri)
            # assign the worker an index
            workerid = len(self.workers)
            # initialize the worker node
            worker.initialize(workerid, dispatcher=self.callback, **model_params)
            # store the worker under its index
            self.workers[workerid] = worker
        except Pyro4.errors.PyroError:
            ns.remove(name)

5. The worker nodes in distributed mode

The workers are registered under names prefixed with gensim.lsi_worker; each object is an instance of the Worker class in models/lsi_worker.py. The registration code is:

utils.pyro_daemon('gensim.lsi_worker', Worker(), random_suffix=True)

5.1 Initialization

During initialization the worker mainly handles its parameters and a process lock, and creates the model object it will use for computation. As shown below, it creates an lsimodel.LsiModel object, which brings us back to the code analysed above. This also makes clear why onepass had to be forced earlier, so that the settings do not contradict each other; on the worker, LsiModel then runs its non-distributed computation.

self.model = lsimodel.LsiModel(**model_params)

Note: although the same model code is used in both places, at runtime they differ: this instance runs in the remote service, while the earlier one runs on the client.

5.2 The worker

The other methods of a worker are invoked by the dispatcher. Each worker holds an lsimodel model that it calls directly when computing; the core of the computation is the Projection object.
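To make the calling pattern concrete, here is a stripped-down Pyro4 sketch of the worker shape (the class, its methods and the job format are hypothetical toys, not gensim's actual lsi_worker API):

import Pyro4

@Pyro4.expose
class ToyWorker:
    """Toy stand-in for a worker: holds its own state and processes jobs on request."""
    def initialize(self, workerid, **model_params):
        self.workerid = workerid
        self.result = 0

    def process(self, job):
        # stand-in for the real work (building a Projection from a chunk and merging it)
        self.result += sum(job)

    def getstate(self):
        return self.result

# gensim registers its real Worker with:
#   utils.pyro_daemon('gensim.lsi_worker', Worker(), random_suffix=True)
# and the dispatcher then drives it over RMI via Pyro4.Proxy(uri)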

Summary:
After reading through the whole code, the overall idea is clear:
The LSI model is essentially an SVD. The SVD is implemented with two algorithms: a randomized two-stage approximation algorithm, and the Lanczos algorithm implemented in SVDLIBC. There are two computation modes, single-machine and distributed. The distributed mode relies on the Pyro4 framework: a dispatcher is implemented first, and the dispatcher drives the worker nodes, each of which behaves like a single machine again. This implementation is also serial, visiting the worker nodes one by one, so no parallelism is exploited.

Open issues:
{1} The merge() function of the Projection object needs further study.
{2} stochastic_svd(), which implements the randomized SVD, has not been examined in depth.
{3} The individual methods involved in the distributed computation are not analysed here.

6. Related material

6.1 Paper 1

Fast and Faster: A Comparison of Two Streamed Matrix Decomposition Algorithms
http://nlp.fi.muni.cz/~xrehurek/nips/rehurek_nips.pdf
Institution: Masaryk University, Czech Republic
Abstract:
With the explosion of the size of digital dataset, the limiting factor for decomposition algorithms is the number of passes over the input, as the input is often stored out-of-core or even off-site. Moreover, we're only interested in algorithms that operate in constant memory w.r.t. the input size, so that arbitrarily large input can be processed. In this paper, we present a practical comparison of two such algorithms: a distributed method that operates in a single pass over the input vs. a streamed two-pass stochastic algorithm. The experiments track the effect of distributed computing, oversampling and memory trade-offs on the accuracy and performance of the two algorithms. To ensure meaningful results, we choose the input to be a real dataset, namely the whole of the English Wikipedia, in the application settings of Latent Semantic Analysis.

6.2 Paper 2

Reading: FINDING STRUCTURE WITH RANDOMNESS: PROBABILISTIC ALGORITHMS FOR CONSTRUCTING APPROXIMATE MATRIX DECOMPOSITIONS
https://arxiv.org/pdf/0909.4061.pdf
Abstract:
Low-rank matrix approximations, such as the truncated singular value decomposition and the rank-revealing QR decomposition, play a central role in data analysis and scientific computing. This work surveys and extends recent research which demonstrates that randomization offers a powerful tool for performing low-rank matrix approximation. These techniques exploit modern computational architectures more fully than classical methods and open the possibility of dealing with truly massive data sets.
This paper presents a modular framework for constructing randomized algorithms that compute partial matrix decompositions. These methods use random sampling to identify a subspace that captures most of the action of a matrix. The input matrix is then compressed—either explicitly or implicitly—to this subspace, and the reduced matrix is manipulated deterministically to obtain the desired low-rank factorization. In many cases, this approach beats its classical competitors in terms of accuracy, speed, and robustness. These claims are supported by extensive numerical experiments and a detailed error analysis.

6.3 Slides for the paper

Paper: FINDING STRUCTURE WITH RANDOMNESS: PROBABILISTIC ALGORITHMS FOR CONSTRUCTING APPROXIMATE MATRIX DECOMPOSITIONS
Slides: http://www.doc88.com/p-9793455943477.html
Truncated singular value decomposition means keeping only the first k components of the SVD, also called k-SVD. The result is an approximation, but it is much cheaper to compute.
The k-SVD is computed with a two-stage randomized algorithm, which the slides illustrate with figures. In outline:
Stage A: find the range. Compute a matrix Q with orthonormal columns such that A ≈ QQ*A.
Stage B: construct the factorization. QQ* is the orthogonal projection onto the range of Q; the small reduced matrix B = Q*A is then decomposed deterministically.
The slides end with a comparison of the time complexity of the classical and randomized algorithms.
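A minimal in-memory sketch of this two-stage scheme (the streamed, chunked version lives in stochastic_svd(); this plain NumPy version is for illustration only, with the roles of power_iters and extra_samples marked):

import numpy as np

def randomized_svd(A, k, extra_dims=10, power_iters=2):
    m, n = A.shape
    # Stage A: find an orthonormal Q whose range approximates the range of A
    omega = np.random.randn(n, k + extra_dims)       # oversampled random test matrix
    Y = A @ omega
    for _ in range(power_iters):                     # power iterations sharpen the spectrum
        Y = A @ (A.T @ Y)
    Q, _ = np.linalg.qr(Y)                           # A ~ Q Q* A
    # Stage B: deterministic SVD of the small matrix B = Q* A
    B = Q.T @ A
    u_b, s, vt = np.linalg.svd(B, full_matrices=False)
    u = Q @ u_b                                      # lift the left factors back to the original space
    return u[:, :k], s[:k], vt[:k, :]

A = np.random.rand(50, 40)
u, s, vt = randomized_svd(A, k=5)
err = np.linalg.norm(A - (u * s) @ vt) / np.linalg.norm(A)
print('relative error of the rank-5 approximation: %.3f' % err)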

6.4 Other

An sklearn example applying k-SVD (truncated SVD) to document clustering:
http://scikit-learn.org/stable/auto_examples/text/document_clustering.html#example-text-document-clustering-py

[Author: happyprince, http://blog.csdn.net/ld326/article/details/78474013]
