scikit-learn: 0.3. Extracting features from text files (tf, tf-idf) and training a classifier

Source: Internet | Published by: dsa数据 | Editor: 程序博客网 | Time: 2024/06/06 02:30

The previous post covered how to load the data.

This post follows: http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html

It mainly covers the following parts:

Extracting features from text files

Training a classifier

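Before running a model, the contents of the text files must be converted into numerical feature vectors. The common choices are tf and tf-idf.

1. tf:

The first issue to deal with is that these are high-dimensional sparse datasets. scipy.sparse matrices are designed for exactly this, and scikit-learn has built-in support for these structures.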
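
To make the sparse idea concrete, here is a minimal sketch using scipy.sparse directly. The small matrix is made up for illustration and is not the data from this post; the point is only that a CSR matrix keeps the logical shape while storing just the non-zero entries.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical 3x5 count matrix that is mostly zeros (illustration only)
dense = np.array([
    [1, 0, 0, 2, 0],
    [0, 0, 0, 0, 0],
    [0, 3, 0, 0, 0],
])
sparse = csr_matrix(dense)

print(sparse.shape)  # same logical shape as the dense array
print(sparse.nnz)    # number of explicitly stored (non-zero) entries
print(sparse)        # printed as (row, column) value entries
```

This is the same "(row, column) value" layout that shows up when the count matrix is printed in this post.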

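[python]
from sklearn.feature_extraction.text import CountVectorizer

# rawData is the dataset loaded in the previous post
count_vect = CountVectorizer()
# Learn the vocabulary and build the document-term count matrix in one step
X_train_counts = count_vect.fit_transform(rawData.data)

X_train_counts
Out[43]:
<6x11 sparse matrix of type '<type 'numpy.int64'>'
    with 18 stored elements in Compressed Sparse Row format>

X_train_counts.shape
Out[44]: (6, 11)

# Column index assigned to each word in the learned vocabulary
print(count_vect.vocabulary_.get(u'like'))
print(count_vect.vocabulary_.get(u'good'))
3
1

# Non-zero entries, shown as (document index, word index) and the count
print(X_train_counts)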
  (0, 8)        1  
  (0, 0)        1  
  (0, 3)        1  
  (1, 8)        1  
  (1, 3)        1  
  (1, 10)       1  
  (1, 9)        1  
  (2, 8)        1  
  (2, 4)        1  
  (3, 8)        1  
  (3, 6)        1  
  (3, 1)        1  
  (4, 8)        1  
  (4, 2)        1  
  (5, 8)        1  
  (5, 1)        1  
  (5, 5)        1  
  (5, 7)        1  
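
The index pairs above are hard to read without the vocabulary. As a rough sketch, the following maps column indices back to words. The three sentences are a made-up toy corpus, not the files used in this post. Note that with the default settings CountVectorizer only keeps tokens of two or more characters, so a word like "i" does not get a column.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, not the data used in this post
docs = ["i like this movie", "this movie is good", "i like good food"]

vect = CountVectorizer()
counts = vect.fit_transform(docs)

# vocabulary_ maps word -> column index; invert it to read columns back as words
index_to_word = {idx: word for word, idx in vect.vocabulary_.items()}

for row, doc in enumerate(docs):
    dense_row = counts[row].toarray().ravel()
    words = {index_to_word[col]: int(n) for col, n in enumerate(dense_row) if n > 0}
    print(doc, "->", words)
```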

2. tf-idf:

[python]
from sklearn.feature_extraction.text import TfidfTransformer

tfidf_transformer = TfidfTransformer()
# Fit the idf weights on the count matrix and transform it in one step
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape
Out[53]: (6, 11)

X_train_tfidf
Out[54]:
<6x11 sparse matrix of type '<type 'numpy.float64'>'
    with 18 stored elements in Compressed Sparse Row format>

print(X_train_tfidf)
  (0, 3)        0.599738830611  
  (0, 0)        0.731376058697  
  (0, 8)        0.324657351406  
  (1, 9)        0.590335838052  
  (1, 10)       0.590335838052  
  (1, 3)        0.484083832074  
  (1, 8)        0.262049690228  
  (2, 4)        0.913996360826  
  (2, 8)        0.405722383406  
  (3, 1)        0.599738830611  
  (3, 6)        0.731376058697  
  (3, 8)        0.324657351406  
  (4, 2)        0.913996360826  
  (4, 8)        0.405722383406  
  (5, 7)        0.590335838052  
  (5, 5)        0.590335838052  
  (5, 1)        0.484083832074  
  (5, 8)        0.262049690228  
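
These numbers can be checked by hand. The sketch below assumes TfidfTransformer's default settings (smooth_idf=True, norm='l2'), where idf(t) = ln((1 + n) / (1 + df(t))) + 1 and each row is then scaled to unit Euclidean length. It recomputes row 2, which contains one occurrence each of word 8 (present in all 6 documents) and word 4 (present only in that document). The document frequencies are read off the count matrix printed earlier.

```python
import math

n_docs = 6             # rows in the count matrix
df = {8: 6, 4: 1}      # document frequency of each word, read off the count matrix
tf = {8: 1, 4: 1}      # raw counts in row 2

# Smoothed idf, assuming smooth_idf=True (the TfidfTransformer default)
idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in df}

# tf * idf, then L2-normalise the row
raw = {w: tf[w] * idf[w] for w in tf}
norm = math.sqrt(sum(v * v for v in raw.values()))
tfidf = {w: v / norm for w, v in raw.items()}

print(tfidf)
```

The result agrees with the (2, 8) and (2, 4) entries in the output above: the word that appears in every document gets the smaller weight, and the rare word gets the larger one.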

3. Training a classifier:

Using naive Bayes as an example:

[python]
from sklearn.naive_bayes import MultinomialNB

# Fit a multinomial naive Bayes model on the tf-idf features and their labels
clf = MultinomialNB().fit(X_train_tfidf, rawData.target)
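
To see what the classifier does on something small enough to inspect, here is a minimal sketch with a hand-made count matrix. The four "documents", the four-word vocabulary and the labels are all hypothetical and are not taken from rawData.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical word counts: rows are documents, columns are 4 vocabulary words.
# Class 0 documents mostly use words 0-1, class 1 documents mostly use words 2-3.
X = np.array([
    [2, 1, 0, 0],
    [3, 0, 0, 0],
    [0, 0, 1, 2],
    [0, 0, 2, 1],
])
y = np.array([0, 0, 1, 1])

clf_toy = MultinomialNB().fit(X, y)

# Two unseen documents, one leaning towards each class
X_unseen = np.array([
    [1, 1, 0, 0],
    [0, 0, 1, 1],
])
predicted = clf_toy.predict(X_unseen)
print(predicted)
```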

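4. Prediction:

When new documents arrive, they must go through exactly the same feature extraction process. The difference is that we call transform instead of fit_transform on the transformers, because they have already been fit on the training set: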

[python]
# clf, count_vect and tfidf_transformer are the objects fitted above
docs_new = ['i like this', 'haha, start.']
X_new_counts = count_vect.transform(docs_new)
X_new_tfidf = tfidf_transformer.transform(X_new_counts)
predicted = clf.predict(X_new_tfidf)

[python]
for doc, category in zip(docs_new, predicted):
    print('%r => %s' % (doc, rawData.target_names[category]))
'i like this' => category_2_folder  
'haha, start.' => category_1_folder  
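
Why transform rather than fit_transform matters can be seen in a small sketch. The documents below are made up for illustration, and the variable names are not the ones used above. Calling transform keeps the columns learned on the training documents and silently drops unseen words, while refitting on the new documents would produce a matrix with a different vocabulary that the classifier could not use.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical documents, for illustration only
train_docs = ["like this movie", "good movie", "bad food"]
new_docs = ["like this unseenword"]

vect = CountVectorizer()
tfidf_tr = TfidfTransformer()
X_train = tfidf_tr.fit_transform(vect.fit_transform(train_docs))

# transform reuses the vocabulary and idf weights learned on train_docs
X_new = tfidf_tr.transform(vect.transform(new_docs))
print(X_train.shape, X_new.shape)  # both have the same number of columns

# Refitting on the new documents builds a separate, incompatible vocabulary
refit = CountVectorizer().fit_transform(new_docs)
print(refit.shape)
```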

So even this simple prediction looks fairly accurate.
