NLP02 - Gensim Corpora and Vector Spaces



Abstract: Notes from studying the official Gensim documentation on corpora and vector spaces, with translated excerpts and a record of hands-on practice.

Gensim documentation: "Corpora and Vector Spaces"
Source: https://radimrehurek.com/gensim/tut1.html

The following notes were made while reading the document:

This tutorial is also available as a Jupyter Notebook:
https://github.com/piskvorky/gensim/blob/develop/docs/notebooks/Corpora_and_Vector_Spaces.ipynb

If you want to see logging events, enable logging first:

>>> import logging
>>> logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

From Strings to Vectors

This time, let’s start from documents represented as strings:

>>> from gensim import corpora
>>>
>>> documents = ["Human machine interface for lab abc computer applications",
>>>              "A survey of user opinion of computer system response time",
>>>              "The EPS user interface management system",
>>>              "System and human system engineering testing of EPS",
>>>              "Relation of user perceived response time to error measurement",
>>>              "The generation of random binary unordered trees",
>>>              "The intersection graph of paths in trees",
>>>              "Graph minors IV Widths of trees and well quasi ordering",
>>>              "Graph minors A survey"]

This is a tiny corpus of nine documents, each consisting of only a single sentence.

First, let’s tokenize the documents, remove common words (using a toy stoplist) as well as words that appear only once in the corpus:

>>> # remove common words and tokenize
>>> stoplist = set('for a of the and to in'.split())
>>> texts = [[word for word in document.lower().split() if word not in stoplist]
>>>          for document in documents]
>>>
>>> # remove words that appear only once
>>> from collections import defaultdict
>>> frequency = defaultdict(int)
>>> for text in texts:
>>>     for token in text:
>>>         frequency[token] += 1
>>>
>>> texts = [[token for token in text if frequency[token] > 1]
>>>          for text in texts]
>>>
>>> from pprint import pprint  # pretty-printer
>>> pprint(texts)
[['human', 'interface', 'computer'],
 ['survey', 'user', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'system'],
 ['system', 'human', 'system', 'eps'],
 ['user', 'response', 'time'],
 ['trees'],
 ['graph', 'trees'],
 ['graph', 'minors', 'trees'],
 ['graph', 'minors', 'survey']]

To convert documents to vectors, we’ll use a document representation called bag-of-words. In this representation, each document is represented by one vector where each vector element represents a question-answer pair, in the style of:

“How many times does the word system appear in the document? Once.”

It is advantageous to represent the questions only by their (integer) ids. The mapping between the questions and ids is called a dictionary:

>>> dictionary = corpora.Dictionary(texts)
>>> dictionary.save('/tmp/deerwester.dict')  # store the dictionary, for future reference
>>> print(dictionary)
Dictionary(12 unique tokens)

Here we assigned a unique integer id to all words appearing in the corpus with the gensim.corpora.dictionary.Dictionary class. This sweeps across the texts, collecting word counts and relevant statistics. In the end, we see there are twelve distinct words in the processed corpus, which means each document will be represented by twelve numbers (i.e., by a 12-D vector). To see the mapping between words and their ids:

>>> print(dictionary.token2id)
{'minors': 11, 'graph': 10, 'system': 5, 'trees': 9, 'eps': 8, 'computer': 0,
 'survey': 4, 'user': 7, 'human': 1, 'time': 6, 'interface': 2, 'response': 3}
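
To make the 12-D picture concrete, here is a small sketch using gensim's matutils.sparse2full helper, which expands a sparse bag-of-words vector into the dense vector it stands for (the two-word example document is made up for illustration):

>>> from gensim import matutils
>>> dense_vec = matutils.sparse2full(dictionary.doc2bow("human computer".split()), length=12)
>>> print(list(dense_vec))  # 1.0 at the ids of "human" and "computer", 0.0 everywhere else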

To actually convert tokenized documents to vectors:

>>> new_doc = "Human computer interaction"
>>> new_vec = dictionary.doc2bow(new_doc.lower().split())
>>> print(new_vec)  # the word "interaction" does not appear in the dictionary and is ignored
[(0, 1), (1, 1)]

The function doc2bow() simply counts the number of occurrences of each distinct word, converts the word to its integer word id and returns the result as a sparse vector. The sparse vector [(0, 1), (1, 1)] therefore reads: in the document “Human computer interaction”, the words computer (id 0) and human (id 1) appear once each; the other ten dictionary words appear (implicitly) zero times.

>>> corpus = [dictionary.doc2bow(text) for text in texts]
>>> corpora.MmCorpus.serialize('/tmp/deerwester.mm', corpus)  # store to disk, for later use
>>> print(corpus)
[(0, 1), (1, 1), (2, 1)]
[(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)]
[(2, 1), (5, 1), (7, 1), (8, 1)]
[(1, 1), (5, 2), (8, 1)]
[(3, 1), (6, 1), (7, 1)]
[(9, 1)]
[(9, 1), (10, 1)]
[(9, 1), (10, 1), (11, 1)]
[(4, 1), (10, 1), (11, 1)]

By now it should be clear that the vector feature with id=10 stands for the question “How many times does the word graph appear in the document?” and that the answer is “zero” for the first six documents and “one” for the remaining three.
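
This is easy to verify directly; a quick sketch against the corpus list built above:

>>> graph_id = dictionary.token2id['graph']  # 10 in this run
>>> print([dict(doc).get(graph_id, 0) for doc in corpus])
[0, 0, 0, 0, 0, 0, 1, 1, 1]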

Corpus Streaming – One Document at a Time

Note that corpus above resides fully in memory, as a plain Python list. In this simple example it doesn’t matter much, but just to make things clear, let’s assume there are millions of documents in the corpus. Storing all of them in RAM won’t do. Instead, let’s assume the documents are stored in a file on disk, one document per line. Gensim only requires that a corpus must be able to return one document vector at a time:

>>> class MyCorpus(object):
>>>     def __iter__(self):
>>>         for line in open('mycorpus.txt'):
>>>             # assume there's one document per line, tokens separated by whitespace
>>>             yield dictionary.doc2bow(line.lower().split())

Download the sample mycorpus.txt file here: https://radimrehurek.com/gensim/mycorpus.txt. The assumption that each document occupies one line in a single file is not important; you can mold the __iter__ function to fit your input format, whatever it is: walking directories, parsing XML, accessing the network… Just parse your input to retrieve a clean list of tokens in each document, then convert the tokens via a dictionary to their ids and yield the resulting sparse vector inside __iter__. A directory-walking variant is sketched after this paragraph.
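
For instance, a minimal sketch of the walking-directories case. The class name DirCorpus and its constructor arguments are made up for illustration and are not part of gensim; the only contract gensim relies on is that iterating over the object yields one sparse vector per document:

>>> import os
>>>
>>> class DirCorpus(object):
...     def __init__(self, dirname, dictionary):
...         self.dirname = dirname
...         self.dictionary = dictionary
...     def __iter__(self):
...         # treat every *.txt file below dirname as one document
...         for root, dirs, files in os.walk(self.dirname):
...             for fname in sorted(files):
...                 if fname.endswith('.txt'):
...                     with open(os.path.join(root, fname)) as f:
...                         yield self.dictionary.doc2bow(f.read().lower().split())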

>>> corpus_memory_friendly = MyCorpus()  # doesn't load the corpus into memory!
>>> print(corpus_memory_friendly)
<__main__.MyCorpus object at 0x10d5690>

Corpus is now an object. We didn’t define any way to print it, so print just outputs the address of the object in memory, which is not very useful. To see the constituent vectors, let’s iterate over the corpus and print each document vector (one at a time):

>>> for vector in corpus_memory_friendly:  # load one vector into memory at a time
...     print(vector)
[(0, 1), (1, 1), (2, 1)]
[(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)]
[(2, 1), (5, 1), (7, 1), (8, 1)]
[(1, 1), (5, 2), (8, 1)]
[(3, 1), (6, 1), (7, 1)]
[(9, 1)]
[(9, 1), (10, 1)]
[(9, 1), (10, 1), (11, 1)]
[(4, 1), (10, 1), (11, 1)]

Although the output is the same as for the plain Python list, the corpus is now much more memory friendly, because at most one vector resides in RAM at a time. Your corpus can now be as large as you want.

Similarly, to construct the dictionary without loading all texts into memory:

>>> from six import iteritems
>>> # collect statistics about all tokens
>>> dictionary = corpora.Dictionary(line.lower().split() for line in open('mycorpus.txt'))
>>> # remove stop words and words that appear only once
>>> stop_ids = [dictionary.token2id[stopword] for stopword in stoplist
>>>             if stopword in dictionary.token2id]
>>> once_ids = [tokenid for tokenid, docfreq in iteritems(dictionary.dfs) if docfreq == 1]
>>> dictionary.filter_tokens(stop_ids + once_ids)  # remove stop words and words that appear only once
>>> dictionary.compactify()  # remove gaps in id sequence after words that were removed
>>> print(dictionary)
Dictionary(12 unique tokens)

And that is all there is to it! At least as far as bag-of-words representation is concerned. Of course, what we do with such a corpus is another question; it is not at all clear how counting the frequency of distinct words could be useful. As it turns out, it isn’t, and we will need to apply a transformation on this simple representation first, before we can use it to compute any meaningful document vs. document similarities. Transformations are covered in the next tutorial, but before that, let’s briefly turn our attention to corpus persistency.

Corpus Formats

There exist several file formats for serializing a Vector Space corpus (~sequence of vectors) to disk. Gensim implements them via the streaming corpus interface mentioned earlier: documents are read from (resp. stored to) disk in a lazy fashion, one document at a time, without the whole corpus being read into main memory at once.

One of the more notable file formats is the Matrix Market format. To save a corpus in the Matrix Market format:

>>> # create a toy corpus of 2 documents, as a plain Python list
>>> corpus = [[(1, 0.5)], []]  # make one document empty, for the heck of it
>>>
>>> corpora.MmCorpus.serialize('/tmp/corpus.mm', corpus)

Other formats include Joachim’s SVMlight format, Blei’s LDA-C format and GibbsLDA++ format.

>>> corpora.SvmLightCorpus.serialize('/tmp/corpus.svmlight', corpus)
>>> corpora.BleiCorpus.serialize('/tmp/corpus.lda-c', corpus)
>>> corpora.LowCorpus.serialize('/tmp/corpus.low', corpus)

Conversely, to load a corpus iterator from a Matrix Market file:

>>> corpus = corpora.MmCorpus('/tmp/corpus.mm')

Corpus objects are streams, so typically you won’t be able to print them directly:

>>> print(corpus)
MmCorpus(2 documents, 2 features, 1 non-zero entries)

Instead, there are two ways to view the contents of a corpus:

>>> # one way of printing a corpus: load it entirely into memory
>>> print(list(corpus))  # calling list() will convert any sequence to a plain Python list
[[(1, 0.5)], []]

or

>>> # another way of doing it: print one document at a time, making use of the streaming interface
>>> for doc in corpus:
...     print(doc)
[(1, 0.5)]
[]

The second way is obviously more memory-friendly, but for testing and development purposes, nothing beats the simplicity of calling list(corpus).

To save the same Matrix Market document stream in Blei’s LDA-C format:

>>> corpora.BleiCorpus.serialize('/tmp/corpus.lda-c', corpus)

In this way, gensim can also be used as a memory-efficient I/O format conversion tool: just load a document stream using one format and immediately save it in another format. Adding new formats is dead easy; check out the code for the SVMlight corpus for an example.
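
For example, a minimal sketch of such a conversion, reusing the /tmp/corpus.mm file created above (the output path is made up for illustration):

>>> # stream documents out of the Matrix Market file straight into SVMlight format;
>>> # at most one document is held in memory at any time
>>> corpus = corpora.MmCorpus('/tmp/corpus.mm')
>>> corpora.SvmLightCorpus.serialize('/tmp/corpus_converted.svmlight', corpus)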

Compatibility with NumPy and SciPy

Gensim also contains efficient utility functions to help converting from/to numpy matrices:

>>> import gensim
>>> import numpy as np
>>> numpy_matrix = np.random.randint(10, size=[5, 2])  # random matrix as an example
>>> corpus = gensim.matutils.Dense2Corpus(numpy_matrix)
>>> number_of_corpus_features = numpy_matrix.shape[0]  # terms are rows, documents are columns
>>> numpy_matrix = gensim.matutils.corpus2dense(corpus, num_terms=number_of_corpus_features)

and from/to scipy.sparse matrices:

>>> import scipy.sparse
>>> scipy_sparse_matrix = scipy.sparse.random(5, 2)  # random sparse matrix as example
>>> corpus = gensim.matutils.Sparse2Corpus(scipy_sparse_matrix)
>>> scipy_csc_matrix = gensim.matutils.corpus2csc(corpus)
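
As a quick sanity check, a sketch of the round trip; num_terms is passed explicitly here, since at the default density the random matrix may contain no non-zero entries from which the term count could be inferred:

>>> scipy_csc_matrix = gensim.matutils.corpus2csc(corpus, num_terms=5)
>>> print(np.allclose(scipy_sparse_matrix.toarray(), scipy_csc_matrix.toarray()))
True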

Hands-on Practice

1. Corpus-to-vector: first example

Code:

from pprint import pprint
from gensim import corpora

# list of documents
documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time and time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]
print('Original documents:')
pprint(documents)

# stoplist; remove the stop words
stoplist = set('for a of the and to in'.split())
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in documents]

# remove words that appear only once
from collections import defaultdict
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1
texts = [[token for token in text if frequency[token] > 1]
         for text in texts]

# show the result
print('Processed corpus:')
pprint(texts)

# build and save the dictionary (the mapping between words and integer ids)
dictionary = corpora.Dictionary(texts)
dictionary.save('tmp/deerwester.dict')  # store the dictionary, for future reference
print('Corpus dictionary:')
print(dictionary)
print(dictionary.token2id)

# apply the dictionary to convert a new document to a vector;
# the word "interaction" is not in the dictionary and is ignored
new_doc = "Human computer interaction"
new_vec = dictionary.doc2bow(new_doc.lower().split())
print('Vector for new document %s:' % new_doc)
pprint(new_vec)

# convert the whole corpus to vectors
corpus = [dictionary.doc2bow(text) for text in texts]
print('doc2bow:')
pprint(corpus)

Output:

Original documents:
['Human machine interface for lab abc computer applications',
 'A survey of user opinion of computer system response time and time',
 'The EPS user interface management system',
 'System and human system engineering testing of EPS',
 'Relation of user perceived response time to error measurement',
 'The generation of random binary unordered trees',
 'The intersection graph of paths in trees',
 'Graph minors IV Widths of trees and well quasi ordering',
 'Graph minors A survey']
Processed corpus:
[['human', 'interface', 'computer'],
 ['survey', 'user', 'computer', 'system', 'response', 'time', 'time'],
 ['eps', 'user', 'interface', 'system'],
 ['system', 'human', 'system', 'eps'],
 ['user', 'response', 'time'],
 ['trees'],
 ['graph', 'trees'],
 ['graph', 'minors', 'trees'],
 ['graph', 'minors', 'survey']]
Corpus dictionary:
Dictionary(12 unique tokens: ['interface', 'eps', 'computer', 'system', 'minors']...)
{'interface': 2, 'eps': 8, 'computer': 1, 'system': 5, 'minors': 11, 'response': 4,
 'human': 0, 'user': 3, 'trees': 9, 'graph': 10, 'time': 6, 'survey': 7}
Vector for new document Human computer interaction:
[(0, 1), (1, 1)]
doc2bow:
[[(0, 1), (1, 1), (2, 1)],
 [(1, 1), (3, 1), (4, 1), (5, 1), (6, 2), (7, 1)],
 [(2, 1), (3, 1), (5, 1), (8, 1)],
 [(0, 1), (5, 2), (8, 1)],
 [(3, 1), (4, 1), (6, 1)],
 [(9, 1)],
 [(9, 1), (10, 1)],
 [(9, 1), (10, 1), (11, 1)],
 [(7, 1), (10, 1), (11, 1)]]

2. File-based processing, second example (use this approach for large datasets)

from gensim import corpora
from six import iteritems

# build the dictionary by streaming the corpus from disk, one line at a time
stoplist = set('for a of the and to in'.split())
dictionary = corpora.Dictionary(line.lower().split() for line in open('tmp/mycorpus.txt'))
stop_ids = [dictionary.token2id[stopword] for stopword in stoplist
            if stopword in dictionary.token2id]
once_ids = [tokenid for tokenid, docfreq in iteritems(dictionary.dfs) if docfreq == 1]
# remove stop words and words that appear only once
dictionary.filter_tokens(stop_ids + once_ids)
# remove the gaps in the id sequence left behind by the removed words
dictionary.compactify()
print('Corpus dictionary:')
print(dictionary)
print(dictionary.token2id)

# stream the corpus one document at a time
class MyCorpus(object):
    def __iter__(self):
        for line in open('tmp/mycorpus.txt'):
            # assume one document per line, tokens separated by whitespace
            yield dictionary.doc2bow(line.lower().split())

    # __len__ is not required for plain iteration, but some consumers may
    # call len(corpus), so it is provided here (mycorpus.txt has 9 documents)
    def __len__(self):
        return 9

# the corpus is never loaded into memory as a whole
corpus_memory_friendly = MyCorpus()
print('doc2bow iterator:')
print(corpus_memory_friendly)

# load one vector into memory at a time
print('Processed corpus:')
for vector in corpus_memory_friendly:
    print(vector)

# save in several formats
# Matrix Market format
corpora.MmCorpus.serialize('tmp/mycorpus.mm', corpus_memory_friendly)
# Joachim's SVMlight format
corpora.SvmLightCorpus.serialize('tmp/mycorpus.svmlight', corpus_memory_friendly)
# Blei's LDA-C format
corpora.BleiCorpus.serialize('tmp/mycorpus.lda-c', corpus_memory_friendly)
# GibbsLDA++ format
corpora.LowCorpus.serialize('tmp/mycorpus.low', corpus_memory_friendly)

# load the data back
print('Loading MmCorpus:')
corpus = corpora.MmCorpus('tmp/mycorpus.mm')
for doc in corpus:
    print(doc)
print('Loading SvmLightCorpus:')
corpus = corpora.SvmLightCorpus('tmp/mycorpus.svmlight')
for doc in corpus:
    print(doc)
print('Loading BleiCorpus:')
corpus = corpora.BleiCorpus('tmp/mycorpus.lda-c')
for doc in corpus:
    print(doc)
print('Loading LowCorpus:')
corpus = corpora.LowCorpus('tmp/mycorpus.low')
for doc in corpus:
    print(doc)

# after loading, the stream can be saved again in any other format
corpora.MmCorpus.serialize('tmp/mycorpus02.mm', corpus)

Output:

Corpus dictionary:
Dictionary(12 unique tokens: ['trees', 'system', 'graph', 'user', 'time']...)
{'trees': 0, 'system': 6, 'graph': 2, 'user': 3, 'time': 4, 'human': 5, 'computer': 1,
 'response': 7, 'interface': 8, 'survey': 9, 'minors': 10, 'eps': 11}
doc2bow iterator:
<__main__.MyCorpus object at 0x00000000052036A0>
Processed corpus:
[(1, 1), (5, 1), (8, 1)]
[(1, 1), (3, 1), (4, 2), (6, 1), (7, 1), (9, 1)]
[(3, 1), (6, 1), (8, 1), (11, 1)]
[(5, 1), (6, 2), (11, 1)]
[(3, 1), (4, 1), (7, 1)]
[(0, 1)]
[(0, 1), (2, 1)]
[(0, 1), (2, 1), (10, 1)]
[(2, 1), (9, 1), (10, 1)]
Loading MmCorpus:
[(1, 1.0), (5, 1.0), (8, 1.0)]
[(1, 1.0), (3, 1.0), (4, 2.0), (6, 1.0), (7, 1.0), (9, 1.0)]
[(3, 1.0), (6, 1.0), (8, 1.0), (11, 1.0)]
[(5, 1.0), (6, 2.0), (11, 1.0)]
[(3, 1.0), (4, 1.0), (7, 1.0)]
[(0, 1.0)]
[(0, 1.0), (2, 1.0)]
[(0, 1.0), (2, 1.0), (10, 1.0)]
[(2, 1.0), (9, 1.0), (10, 1.0)]
Loading SvmLightCorpus:
[(1, 1.0), (5, 1.0), (8, 1.0)]
[(1, 1.0), (3, 1.0), (4, 2.0), (6, 1.0), (7, 1.0), (9, 1.0)]
[(3, 1.0), (6, 1.0), (8, 1.0), (11, 1.0)]
[(5, 1.0), (6, 2.0), (11, 1.0)]
[(3, 1.0), (4, 1.0), (7, 1.0)]
[(0, 1.0)]
[(0, 1.0), (2, 1.0)]
[(0, 1.0), (2, 1.0), (10, 1.0)]
[(2, 1.0), (9, 1.0), (10, 1.0)]
Loading BleiCorpus:
[(1, 1.0), (5, 1.0), (8, 1.0)]
[(1, 1.0), (3, 1.0), (4, 2.0), (6, 1.0), (7, 1.0), (9, 1.0)]
[(3, 1.0), (6, 1.0), (8, 1.0), (11, 1.0)]
[(5, 1.0), (6, 2.0), (11, 1.0)]
[(3, 1.0), (4, 1.0), (7, 1.0)]
[(0, 1.0)]
[(0, 1.0), (2, 1.0)]
[(0, 1.0), (2, 1.0), (10, 1.0)]
[(2, 1.0), (9, 1.0), (10, 1.0)]
Loading LowCorpus:
[(1, 1), (7, 1), (10, 1)]
[(1, 1), (5, 1), (6, 2), (8, 1), (9, 1), (11, 1)]
[(5, 1), (8, 1), (10, 1), (3, 1)]
[(7, 1), (8, 2), (3, 1)]
[(5, 1), (6, 1), (9, 1)]
[(0, 1)]
[(0, 1), (4, 1)]
[(0, 1), (4, 1), (2, 1)]
[(4, 1), (11, 1), (2, 1)]

[Author: happyprince, http://blog.csdn.net/ld326/article/details/78353338]
