Corpora and Vector Spaces

>>> import logging
>>> logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)


From Strings to Vectors

>>> from gensim import corpora
>>>
>>> documents = ["Human machine interface for lab abc computer applications",
...              "A survey of user opinion of computer system response time",
...              "The EPS user interface management system",
...              "System and human system engineering testing of EPS",
...              "Relation of user perceived response time to error measurement",
...              "The generation of random binary unordered trees",
...              "The intersection graph of paths in trees",
...              "Graph minors IV Widths of trees and well quasi ordering",
...              "Graph minors A survey"]


>>> # remove common words and tokenize
>>> stoplist = set('for a of the and to in'.split())
>>> texts = [[word for word in document.lower().split() if word not in stoplist]
...          for document in documents]
>>>
>>> # remove words that appear only once
>>> from collections import defaultdict
>>> frequency = defaultdict(int)
>>> for text in texts:
...     for token in text:
...         frequency[token] += 1
>>>
>>> texts = [[token for token in text if frequency[token] > 1]
...          for text in texts]
>>>
>>> from pprint import pprint  # pretty-printer
>>> pprint(texts)
[['human', 'interface', 'computer'],
 ['survey', 'user', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'system'],
 ['system', 'human', 'system', 'eps'],
 ['user', 'response', 'time'],
 ['trees'],
 ['graph', 'trees'],
 ['graph', 'minors', 'trees'],
 ['graph', 'minors', 'survey']]

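The approach described below is a common one, called bag-of-words; keep in mind, though, that different application domains call for different features.
To convert the documents to vectors, we use the bag-of-words document representation: each document becomes a vector of word counts, and word order is ignored.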

>>> dictionary = corpora.Dictionary(texts)
>>> dictionary.save('/tmp/deerwester.dict')  # store the dictionary, for future reference
>>> print(dictionary)
Dictionary(12 unique tokens)

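Here we assigned a unique integer id to every word that appears in the corpus, using the gensim.corpora.dictionary.Dictionary class. To see the mapping between words and their ids: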

>>> print(dictionary.token2id)
{'minors': 11, 'graph': 10, 'system': 5, 'trees': 9, 'eps': 8, 'computer': 0,
 'survey': 4, 'user': 7, 'human': 1, 'time': 6, 'interface': 2, 'response': 3}


>>> new_doc = "Human computer interaction"
>>> new_vec = dictionary.doc2bow(new_doc.lower().split())
>>> print(new_vec)  # the word "interaction" does not appear in the dictionary and is ignored
[(0, 1), (1, 1)]

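The doc2bow() method simply counts the number of occurrences of each distinct word, converts each word to its integer id, and returns the result as a sparse vector. The sparse vector [(0, 1), (1, 1)] therefore reads: in the document "Human computer interaction", the words "computer" (id 0) and "human" (id 1) each appear once; every other word in the dictionary appears zero times.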
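To make the counting step concrete, here is a minimal pure-Python sketch of the same idea. It is not gensim's implementation: `simple_doc2bow` is a hypothetical helper, and the `token2id` mapping is copied by hand from the dictionary printed above.

```python
from collections import Counter

# Token -> id mapping, copied by hand from the dictionary printed above.
token2id = {
    'computer': 0, 'human': 1, 'interface': 2, 'response': 3,
    'survey': 4, 'system': 5, 'time': 6, 'user': 7,
    'eps': 8, 'trees': 9, 'graph': 10, 'minors': 11,
}


def simple_doc2bow(tokens, token2id):
    """Hypothetical helper: count known tokens, return sorted (id, count) pairs."""
    # Tokens missing from the mapping (e.g. "interaction") are silently skipped.
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


new_vec = simple_doc2bow("Human computer interaction".lower().split(), token2id)
print(new_vec)  # [(0, 1), (1, 1)]
```

Only the ids with a non-zero count are stored, which is what makes the vector sparse.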

>>> corpus = [dictionary.doc2bow(text) for text in texts]
>>> corpora.MmCorpus.serialize('/tmp/', corpus)  # store to disk, for later use
>>> print(corpus)
[(0, 1), (1, 1), (2, 1)]
[(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)]
[(2, 1), (5, 1), (7, 1), (8, 1)]
[(1, 1), (5, 2), (8, 1)]
[(3, 1), (6, 1), (7, 1)]
[(9, 1)]
[(9, 1), (10, 1)]
[(9, 1), (10, 1), (11, 1)]
[(4, 1), (10, 1), (11, 1)]

Corpus Streaming – One Document at a Time

>>> class MyCorpus(object):
...     def __iter__(self):
...         for line in open('mycorpus.txt'):
...             # assume there's one document per line, tokens separated by whitespace
...             yield dictionary.doc2bow(line.lower().split())


>>> corpus_memory_friendly = MyCorpus()  # doesn't load the corpus into memory!
>>> print(corpus_memory_friendly)
<__main__.MyCorpus object at 0x10d5690>


>>> for vector in corpus_memory_friendly:  # load one vector into memory at a time
...     print(vector)
[(0, 1), (1, 1), (2, 1)]
[(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)]
[(2, 1), (5, 1), (7, 1), (8, 1)]
[(1, 1), (5, 2), (8, 1)]
[(3, 1), (6, 1), (7, 1)]
[(9, 1)]
[(9, 1), (10, 1)]
[(9, 1), (10, 1), (11, 1)]
[(4, 1), (10, 1), (11, 1)]
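The streaming pattern itself does not depend on gensim. The sketch below is a self-contained illustration, with a hypothetical `LineCorpus` class and a throwaway two-line file standing in for mycorpus.txt. It shows why the corpus is an object with `__iter__` rather than a bare generator: the file is reopened on each pass, so the corpus can be iterated more than once.

```python
import os
import tempfile


class LineCorpus:
    """Hypothetical streaming corpus: one whitespace-tokenized document per line."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Reopen the file on every pass so the corpus can be iterated repeatedly.
        with open(self.path, encoding='utf-8') as handle:
            for line in handle:
                yield line.lower().split()


# Write a tiny two-document file so the sketch is self-contained.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False,
                                 encoding='utf-8') as tmp:
    tmp.write("Human machine interface\nGraph minors A survey\n")
    corpus_path = tmp.name

stream = LineCorpus(corpus_path)
first_pass = list(stream)
second_pass = list(stream)  # works again, unlike a one-shot generator
os.remove(corpus_path)

print(first_pass)
print(first_pass == second_pass)  # True
```

Only one line is held in memory at a time during iteration; `list()` is used here purely to display the result.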


>>> from six import iteritems
>>> # collect statistics about all tokens
>>> dictionary = corpora.Dictionary(line.lower().split() for line in open('mycorpus.txt'))
>>> # remove stop words and words that appear only once
>>> stop_ids = [dictionary.token2id[stopword] for stopword in stoplist
...             if stopword in dictionary.token2id]
>>> once_ids = [tokenid for tokenid, docfreq in iteritems(dictionary.dfs) if docfreq == 1]
>>> dictionary.filter_tokens(stop_ids + once_ids)  # remove stop words and words that appear only once
>>> dictionary.compactify()  # remove gaps in id sequence after words that were removed
>>> print(dictionary)
Dictionary(12 unique tokens)
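The filtering and compaction steps can be sketched in plain Python as follows. This is an illustration only, not what gensim does internally: the ids it assigns are alphabetical and will not match the ids shown earlier, and `compact_token2id` is a name made up for this example.

```python
from collections import Counter

documents = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user perceived response time to error measurement",
    "The generation of random binary unordered trees",
    "The intersection graph of paths in trees",
    "Graph minors IV Widths of trees and well quasi ordering",
    "Graph minors A survey",
]
stoplist = set('for a of the and to in'.split())

# Document frequency: in how many documents does each token appear?
doc_freq = Counter()
for document in documents:
    doc_freq.update(set(document.lower().split()))

# Keep tokens that are not stop words and occur in more than one document.
kept = sorted(t for t, df in doc_freq.items() if t not in stoplist and df > 1)

# "Compactify": assign contiguous ids 0..n-1 to the surviving tokens.
compact_token2id = {token: new_id for new_id, token in enumerate(kept)}
print(len(compact_token2id))  # 12
```

Note that the filter uses document frequency (the number of documents containing a word), which for this small corpus selects the same 12 words as the earlier "appears only once" filter.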


Corpus Formats
One of the more notable file formats is the Matrix Market format. To save a corpus in the Matrix Market format:

>>> # create a toy corpus of 2 documents, as a plain Python list
>>> corpus = [[(1, 0.5)], []]  # make one document empty, for the heck of it
>>>
>>> corpora.MmCorpus.serialize('/tmp/', corpus)

Other formats include Joachims' SVMlight format, Blei's LDA-C format and the GibbsLDA++ format:

>>> corpora.SvmLightCorpus.serialize('/tmp/corpus.svmlight', corpus)
>>> corpora.BleiCorpus.serialize('/tmp/corpus.lda-c', corpus)
>>> corpora.LowCorpus.serialize('/tmp/corpus.low', corpus)

Conversely, to load a corpus iterator from a Matrix Market file:

>>> corpus = corpora.MmCorpus('/tmp/')


>>> print(corpus)
MmCorpus(2 documents, 2 features, 1 non-zero entries)


>>> # one way of printing a corpus: load it entirely into memory
>>> print(list(corpus))  # calling list() will convert any sequence to a plain Python list
[[(1, 0.5)], []]


>>> # another way of doing it: print one document at a time, making use of the streaming interface
>>> for doc in corpus:
...     print(doc)
[(1, 0.5)]
[]
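For readers unfamiliar with the Matrix Market coordinate layout, the following sketch renders the toy corpus as text in that layout. It assumes that documents map to rows and term ids to columns, both 1-based; `to_matrix_market` is a hypothetical helper, and the exact bytes gensim writes may differ.

```python
def to_matrix_market(corpus):
    """Hypothetical helper: render a sparse corpus as Matrix Market coordinate text.

    Assumes documents map to rows and term ids to columns, both 1-based.
    """
    entries = [
        (doc_no + 1, term_id + 1, weight)
        for doc_no, doc in enumerate(corpus)
        for term_id, weight in doc
    ]
    num_docs = len(corpus)
    # Number of features = highest term id + 1 (0 for a corpus with no entries).
    num_terms = 1 + max((term_id for doc in corpus for term_id, _ in doc), default=-1)
    lines = [
        "%%MatrixMarket matrix coordinate real general",
        f"{num_docs} {num_terms} {len(entries)}",  # rows, columns, non-zeros
    ]
    lines += [f"{row} {col} {value}" for row, col, value in entries]
    return "\n".join(lines)


toy_corpus = [[(1, 0.5)], []]
mm_text = to_matrix_market(toy_corpus)
print(mm_text)
```

The size line of the output, `2 2 1`, corresponds to the "2 documents, 2 features, 1 non-zero entries" summary printed for the loaded corpus.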

To save the same Matrix Market document stream in Blei's LDA-C format:

>>> corpora.BleiCorpus.serialize('/tmp/corpus.lda-c', corpus)

Compatibility with NumPy and SciPy
Gensim also contains efficient utility functions for converting from and to NumPy matrices:

>>> import gensim
>>> import numpy as np
>>> numpy_matrix = np.random.randint(10, size=[5,2])  # random matrix as an example
>>> corpus = gensim.matutils.Dense2Corpus(numpy_matrix)
>>> numpy_matrix = gensim.matutils.corpus2dense(corpus, num_terms=number_of_corpus_features)

Conversion from and to scipy.sparse matrices works similarly:

>>> import scipy.sparse
>>> scipy_sparse_matrix = scipy.sparse.random(5,2)  # random sparse matrix as example
>>> corpus = gensim.matutils.Sparse2Corpus(scipy_sparse_matrix)
>>> scipy_csc_matrix = gensim.matutils.corpus2csc(corpus)
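The orientation of these conversions is easy to get wrong: the dense matrix has one row per term and one column per document. The dependency-free sketch below illustrates that layout with nested lists instead of NumPy arrays; `sparse_to_dense` and `dense_to_sparse` are hypothetical helpers written for this example, not gensim functions.

```python
def sparse_to_dense(corpus, num_terms):
    """Hypothetical helper: expand a sparse corpus into a num_terms x num_docs grid."""
    dense = [[0.0] * len(corpus) for _ in range(num_terms)]
    for doc_no, doc in enumerate(corpus):
        for term_id, weight in doc:
            dense[term_id][doc_no] = float(weight)
    return dense


def dense_to_sparse(dense):
    """Inverse direction: each column becomes one document of (term_id, weight) pairs."""
    num_docs = len(dense[0]) if dense else 0
    return [
        [(term_id, row[doc_no]) for term_id, row in enumerate(dense) if row[doc_no] != 0]
        for doc_no in range(num_docs)
    ]


toy_corpus = [[(1, 0.5)], []]
dense = sparse_to_dense(toy_corpus, num_terms=2)
print(dense)                   # [[0.0, 0.0], [0.5, 0.0]]
print(dense_to_sparse(dense))  # [[(1, 0.5)], []]
```

The number of terms has to be passed in explicitly because a sparse corpus does not record how many features exist beyond the ids that actually occur.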

For a complete reference, see the API documentation.


