sklearn CountVectorizer / TfidfVectorizer / TfidfTransformer explained

Source: Internet  Published by: 程序员过关  Editor: 程序博客网  Time: 2024/06/05 19:05
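  • sklearn CountVectorizer explained
from sklearn.feature_extraction.text import CountVectorizer

texts = ["dog cat fish", "dog cat cat", "fish bird", "bird"]
cv = CountVectorizer()
cv_fit = cv.fit_transform(texts)  # SciPy sparse matrix of token counts

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(cv.get_feature_names_out().tolist())
print(cv_fit.toarray())  # dense view of the counts
print(cv_fit)            # sparse view: (row, column) count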
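  • The result is returned as a sparse matrix. The output shows the feature names, the dense array, and the sparse entries as (row, column) count: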
['bird', 'cat', 'dog', 'fish']
[[0 1 1 1]
 [0 2 1 0]
 [1 0 0 1]
 [1 0 0 0]]
  (0, 3)    1
  (0, 1)    1
  (0, 2)    1
  (1, 1)    2
  (1, 2)    1
  (2, 0)    1
  (2, 3)    1
  (3, 0)    1
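To make the counting step concrete, here is a minimal pure-Python sketch of what the default CountVectorizer does to these four texts: lowercase, keep tokens of two or more word characters, sort the vocabulary, then count. The helper name `count_matrix` is made up for illustration and is not part of sklearn; the regular expression mirrors sklearn's documented default `token_pattern`.

```python
import re
from collections import Counter

# Mirrors sklearn's default token_pattern: tokens of 2+ word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def count_matrix(texts):
    """Hypothetical helper: return (sorted vocabulary, dense count rows)."""
    tokenized = [TOKEN_PATTERN.findall(t.lower()) for t in texts]
    vocab = sorted({tok for doc in tokenized for tok in doc})
    rows = []
    for doc in tokenized:
        counts = Counter(doc)  # missing terms count as 0
        rows.append([counts[term] for term in vocab])
    return vocab, rows

texts = ["dog cat fish", "dog cat cat", "fish bird", "bird"]
vocab, rows = count_matrix(texts)
print(vocab)
for row in rows:
    print(row)
```

This should print the same vocabulary and the same four count rows as the dense array above. What it does not reproduce is the sparse storage that sklearn returns.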
  • sklearn TfidfTransformer explained
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

texts = ["dog cat fish", "dog cat cat", "dog fish", "dog pig pig 中国"]
cv = CountVectorizer()
cv_fit = cv.fit_transform(texts)  # raw term counts

transformer = TfidfTransformer()
tfidf = transformer.fit_transform(cv_fit)  # reweight the counts by tf-idf
tfidf.toarray()
array([[ 0.64043405,  0.42389674,  0.64043405,  0.        ,  0.        ],
       [ 0.94936136,  0.31418628,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.55193942,  0.83388421,  0.        ,  0.        ],
       [ 0.        ,  0.22726773,  0.        ,  0.8710221 ,  0.43551105]])
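These numbers can be reproduced by hand. With its default settings (smooth_idf=True, norm='l2'), TfidfTransformer computes idf(t) = ln((1 + n) / (1 + df(t))) + 1, multiplies each raw count by that weight, and then scales every row to unit L2 length. The sketch below follows that recipe in plain Python. The helper name `tfidf_rows` is hypothetical, and the count rows are typed in by hand from the texts above, with columns assumed to be in the order cat, dog, fish, pig, 中国.

```python
import math

def tfidf_rows(count_rows):
    """Hypothetical helper: smoothed idf weighting plus L2 row normalization."""
    n_docs = len(count_rows)
    n_terms = len(count_rows[0])
    # Document frequency: number of documents that contain each term.
    df = [sum(1 for row in count_rows if row[j] > 0) for j in range(n_terms)]
    # Smoothed idf, as used by TfidfTransformer when smooth_idf=True.
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    result = []
    for row in count_rows:
        weighted = [c * w for c, w in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in weighted))
        result.append([v / norm if norm else 0.0 for v in weighted])
    return result

# Counts for ["dog cat fish", "dog cat cat", "dog fish", "dog pig pig 中国"],
# columns ordered as: cat, dog, fish, pig, 中国.
counts = [
    [1, 1, 1, 0, 0],
    [2, 1, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 1, 0, 2, 1],
]
for row in tfidf_rows(counts):
    print([round(v, 8) for v in row])
```

Note that "dog" appears in every document, so its idf is exactly 1 and it ends up with the smallest weight in each row where another term is present.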
  • sklearn TfidfVectorizer explained
  • TfidfVectorizer does the same job as the four lines below, i.e. CountVectorizer followed by TfidfTransformer:
cv = CountVectorizer()
cv_fit = cv.fit_transform(texts)
transformer = TfidfTransformer()
tfidf = transformer.fit_transform(cv_fit)
  • Here is the code, using TfidfVectorizer:
from sklearn.feature_extraction.text import TfidfVectorizer

tv = TfidfVectorizer(max_features=100,
                     ngram_range=(1, 1),
                     stop_words='english')
X_description = tv.fit_transform(texts)
print(X_description.toarray())
[[ 0.64043405  0.42389674  0.64043405  0.          0.        ]
 [ 0.94936136  0.31418628  0.          0.          0.        ]
 [ 0.          0.55193942  0.83388421  0.          0.        ]
 [ 0.          0.22726773  0.          0.8710221   0.43551105]]
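  • The output is exactly the same as the result above: none of the words in texts is an English stop word, and max_features=100 is larger than the five-term vocabulary, so neither option changes anything here.
  • ngram_range=(1, 1) can also be changed, for example to (2, 3), which extracts both 2-grams and 3-grams; use (2, 2) for 2-grams only.
  • The only built-in stop_words string is "english"; for other languages, pass your own list of stop words instead.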
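To see what ngram_range and stop_words do to a single document, here is a rough pure-Python sketch. It assumes, as sklearn's word analyzer does, that stop words are dropped first and that n-grams are then built from the remaining tokens, joined by a single space. `word_ngrams` is an illustration-only helper written for this post, not an sklearn function.

```python
def word_ngrams(tokens, ngram_range=(1, 1), stop_words=None):
    """Illustrative helper: filter stop words, then build word n-grams."""
    if stop_words:
        tokens = [t for t in tokens if t not in stop_words]
    min_n, max_n = ngram_range
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens across the document.
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

tokens = "dog cat fish".split()
print(word_ngrams(tokens, ngram_range=(1, 1)))
print(word_ngrams(tokens, ngram_range=(2, 3)))
print(word_ngrams(tokens, ngram_range=(1, 2), stop_words={"cat"}))
```

With (2, 3) the document yields two 2-grams and one 3-gram and no single words. In sklearn itself the same behaviour is requested with arguments such as ngram_range=(2, 3) and stop_words=['cat'].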