Computing tf-idf term weights with scikit-learn
Source: Internet | Editor: 程序博客网 | Date: 2024/04/28 19:29
Computing tf-idf term weights with the scikit-learn package mainly relies on two classes: CountVectorizer and TfidfTransformer.
CountVectorizer
TfidfTransformer
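TfidfVectorizer: counting plus tf-idf weighting and normalization in a single step, equivalent to CountVectorizer followed by TfidfTransformer (idf is included by default)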
vectorizer = CountVectorizer()  # converts the texts into a term-count matrix; element a[i][j] is the count of word j in document i
count = vectorizer.fit_transform(corpus)  # convert the texts into the term-count matrix
transformer = TfidfTransformer()  # computes the tf-idf weight of every word
tfidf = transformer.fit_transform(count)  # compute tf-idf
TfidfVec = TfidfVectorizer()
count2 = TfidfVec.fit_transform(corpus)
# coding:utf-8 (this header lets the source file contain Chinese comments)
# coding: utf-8
__author__ = "liuxuejiang"
# import jieba
# import jieba.posseg as pseg
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

if __name__ == "__main__":
    corpus = [
        'Today the weather is sunny',                  # document 1 after segmentation, words separated by spaces
        'Sunny day weather is suitable to exercise ',  # document 2 after segmentation
        'I ate a Hotdog',                              # document 3 after segmentation
    ]
    vectorizer = CountVectorizer()  # converts the texts into a term-count matrix; a[i][j] is the count of word j in document i
    count = vectorizer.fit_transform(corpus)  # convert the texts into the term-count matrix
    print(vectorizer.vocabulary_)
    # All words in the bag-of-words model (get_feature_names() before scikit-learn 1.0)
    word = vectorizer.get_feature_names_out().tolist()
    print(word)
    print(count)
    print(count.todense())  # show the term-count matrix
    transformer = TfidfTransformer()  # computes the tf-idf weight of every word
    tfidf = transformer.fit_transform(count)  # compute tf-idf
    print(tfidf)
    weight = tfidf.toarray()  # dense tf-idf matrix; a[i][j] is the tf-idf weight of word j in document i
    print(weight)
    # Print the weights per document: the outer loop walks the documents,
    # the inner loop walks the word weights of one document
    for i in range(len(weight)):
        print("------- tf-idf weights of the words in document", i + 1, "------")
        for j in range(len(word)):
            print(word[j], weight[i][j])
    TfidfVec = TfidfVectorizer()
    count2 = TfidfVec.fit_transform(corpus)
    print("-------- using TfidfVectorizer() directly -------")
    print(count2.todense())

Output (recorded under Python 2, hence the u'' prefixes):

{u'ate': 0, u'is': 4, u'sunny': 6, u'to': 8, u'weather': 10, u'today': 9, u'the': 7, u'suitable': 5, u'day': 1, u'exercise': 2, u'hotdog': 3}
[u'ate', u'day', u'exercise', u'hotdog', u'is', u'suitable', u'sunny', u'the', u'to', u'today', u'weather']
  (0, 6)    1
  (0, 4)    1
  (0, 10)   1
  (0, 7)    1
  (0, 9)    1
  (1, 2)    1
  (1, 8)    1
  (1, 5)    1
  (1, 1)    1
  (1, 6)    1
  (1, 4)    1
  (1, 10)   1
  (2, 3)    1
  (2, 0)    1
[[0 0 0 0 1 0 1 1 0 1 1]
 [0 1 1 0 1 1 1 0 1 0 1]
 [1 0 0 1 0 0 0 0 0 0 0]]
  (0, 9)    0.517419943932
  (0, 7)    0.517419943932
  (0, 10)   0.393511204094
  (0, 4)    0.393511204094
  (0, 6)    0.393511204094
  (1, 10)   0.317570180428
  (1, 4)    0.317570180428
  (1, 6)    0.317570180428
  (1, 1)    0.417566623878
  (1, 5)    0.417566623878
  (1, 8)    0.417566623878
  (1, 2)    0.417566623878
  (2, 0)    0.707106781187
  (2, 3)    0.707106781187
[[ 0.          0.          0.          0.          0.3935112   0.          0.3935112   0.51741994  0.          0.51741994  0.3935112 ]
 [ 0.          0.41756662  0.41756662  0.          0.31757018  0.41756662  0.31757018  0.          0.41756662  0.          0.31757018]
 [ 0.70710678  0.          0.          0.70710678  0.          0.          0.          0.          0.          0.          0.        ]]
------- tf-idf weights of the words in document 1 ------
ate 0.0
day 0.0
exercise 0.0
hotdog 0.0
is 0.393511204094
suitable 0.0
sunny 0.393511204094
the 0.517419943932
to 0.0
today 0.517419943932
weather 0.393511204094
------- tf-idf weights of the words in document 2 ------
ate 0.0
day 0.417566623878
exercise 0.417566623878
hotdog 0.0
is 0.317570180428
suitable 0.417566623878
sunny 0.317570180428
the 0.0
to 0.417566623878
today 0.0
weather 0.317570180428
------- tf-idf weights of the words in document 3 ------
ate 0.707106781187
day 0.0
exercise 0.0
hotdog 0.707106781187
is 0.0
suitable 0.0
sunny 0.0
the 0.0
to 0.0
today 0.0
weather 0.0
[[ 0.          0.          0.          0.          0.3935112   0.          0.3935112   0.51741994  0.          0.51741994  0.3935112 ]
 [ 0.          0.41756662  0.41756662  0.          0.31757018  0.41756662  0.31757018  0.          0.41756662  0.          0.31757018]
 [ 0.70710678  0.          0.          0.70710678  0.          0.          0.          0.          0.          0.          0.        ]]
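The weights in this output can be checked by hand. The sketch below is plain Python, not scikit-learn's implementation, and it assumes the default settings: lowercasing, tokens of two or more word characters, the smoothed idf ln((1 + n) / (1 + df)) + 1, and L2 normalization of each document vector. The names tokenize and tfidf_by_hand are made up for this illustration.

```python
import math
import re
from collections import Counter

corpus = [
    "Today the weather is sunny",
    "Sunny day weather is suitable to exercise ",
    "I ate a Hotdog",
]

# Assumption: mimic CountVectorizer's defaults, i.e. lowercase the text and
# keep only tokens of two or more word characters (so "I" and "a" are dropped).
TOKEN = re.compile(r"\b\w\w+\b")


def tokenize(text):
    return TOKEN.findall(text.lower())


def tfidf_by_hand(docs):
    counts = [Counter(tokenize(d)) for d in docs]
    n_docs = len(docs)
    # Document frequency: the number of documents each word occurs in.
    df = Counter(word for c in counts for word in c)
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1.
    idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in df}
    weights = []
    for c in counts:
        raw = {w: tf * idf[w] for w, tf in c.items()}
        # L2 norm of the document vector (1.0 for an empty document).
        norm = math.sqrt(sum(v * v for v in raw.values())) or 1.0
        weights.append({w: v / norm for w, v in raw.items()})
    return weights


weights = tfidf_by_hand(corpus)
for i, doc_weights in enumerate(weights, start=1):
    print("document", i)
    for w in sorted(doc_weights):
        print(" ", w, round(doc_weights[w], 8))
```

Words that occur in a single document (today, the) get a larger idf than words shared by two documents (is, sunny, weather), which is why they carry the higher weights in document 1.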
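The Chinese case
Chinese word segmentation is done with jieba; install the jieba package.
1 Install the scikit-learn package
3 Using jieba for segmentation is very simple. The key statements (just a quick trial here, not tuned for quality) give the following output: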
对 p
这 r
句 q
话 n
进行 v
分词 n
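One detail is worth checking before passing segmented Chinese text to CountVectorizer. Its default token_pattern, r"(?u)\b\w\w+\b", keeps only tokens of two or more characters, so single-character words such as 对, 这, 句 and 话 would be dropped. The sketch below reproduces that pattern with the standard re module rather than scikit-learn itself. The sample string is assembled from the words listed above, and the wider pattern r"(?u)\b\w+\b" is only one possible value that could be passed as token_pattern.

```python
import re

# Words taken from the segmentation output above, joined by spaces
segmented = "对 这 句 话 进行 分词"

# The default token_pattern of CountVectorizer: two or more word characters
default_pattern = re.compile(r"(?u)\b\w\w+\b")
# A wider pattern that also keeps single-character tokens
wide_pattern = re.compile(r"(?u)\b\w+\b")

kept_by_default = default_pattern.findall(segmented)
kept_by_wide = wide_pattern.findall(segmented)

print(kept_by_default)
print(kept_by_wide)
```

Whether single-character words should be kept depends on the task; many of them are function words that a stop-word list would remove anyway.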
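4 Computing tf-idf term weights with scikit-learn mainly relies on two classes, CountVectorizer and TfidfTransformer; see the scikit-learn documentation for details.
A simple program then prints its output one word per line, in the format: word tf-idf weight

Computing simple word counts with scikit-learn
CountVectorizer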
from sklearn.feature_extraction.text import CountVectorizer

texts = ["dog cat fish", "dog cat cat", "fish bird", "bird"]
cv = CountVectorizer()
cv_fit = cv.fit_transform(texts)
print(cv.vocabulary_)
# Output recorded under Python 2: {u'bird': 0, u'cat': 1, u'dog': 2, u'fish': 3}
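Here cv.vocabulary_ is a dict whose keys are the words (features) that were found and whose values are column indices. That is why the values are 0, 1, 2, 3: they are positions, not counts and not a frequency ranking.
To get the counts you need to work with the cv_fit object.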
from sklearn.feature_extraction.text import CountVectorizer

texts = ["dog cat fish", "dog cat cat", "fish bird", "bird"]
cv = CountVectorizer()
cv_fit = cv.fit_transform(texts)
# get_feature_names_out() replaces get_feature_names() from scikit-learn 1.0 on
print(cv.get_feature_names_out().tolist())
print(cv_fit.toarray())
# ['bird', 'cat', 'dog', 'fish']
# [[0 1 1 1]
#  [0 2 1 0]
#  [1 0 0 1]
#  [1 0 0 0]]
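Each row of the array is one of your original documents (strings), each column is a feature (word), and each element is the count of that word in that document. Summing each column gives the correct totals: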
print(cv_fit.toarray().sum(axis=0))
# [2 3 2 2]
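To make the link between the vocabulary indices and the count matrix concrete, here is a plain-Python sketch that rebuilds both for the same four texts. It is an illustration rather than scikit-learn's code, and it assumes the default behaviour: lowercasing, tokens of two or more word characters, and column indices assigned in sorted word order.

```python
import re

texts = ["dog cat fish", "dog cat cat", "fish bird", "bird"]

# Assumption: same tokenization as CountVectorizer's defaults
token = re.compile(r"\b\w\w+\b")
tokenized = [token.findall(t.lower()) for t in texts]

# Map each distinct word to a column index, in sorted order
all_words = sorted({w for doc in tokenized for w in doc})
vocabulary = {w: i for i, w in enumerate(all_words)}

# One row per document, one column per word
matrix = [[0] * len(vocabulary) for _ in texts]
for row, doc in zip(matrix, tokenized):
    for w in doc:
        row[vocabulary[w]] += 1

# Column sums give the total count of each word over all documents
totals = [sum(col) for col in zip(*matrix)]

print(vocabulary)
print(matrix)
print(totals)
```

The value stored for each word in the vocabulary is only its column position; the counts live in the matrix, and the column sums give the totals per word.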
Honestly, though, I would suggest using collections.Counter or something from NLTK, unless you have a specific reason to use scikit-learn, because it will be simpler.
import itertools
from collections import Counter


def build_vocab(sentences):
    """
    Builds a vocabulary mapping from word to index based on the sentences.
    Each sentence is a list of words.
    Returns vocabulary mapping and inverse vocabulary mapping.
    """
    # Count every word across all sentences
    word_counts = Counter(itertools.chain(*sentences))
    # Mapping from index to word, most frequent word first
    vocabulary_inv = [x[0] for x in word_counts.most_common()]
    # Mapping from word to index
    vocabulary = {x: i for i, x in enumerate(vocabulary_inv)}
    return [vocabulary, vocabulary_inv]


'''
import collections
sentence = ["i", "love", "mom", "mom", "loves", "me"]
collections.Counter(sentence)
>>> Counter({'i': 1, 'love': 1, 'loves': 1, 'me': 1, 'mom': 2})
'''