Text classification with scikit-learn
Source: Internet  Editor: 程序博客网  Time: 2024/05/29 08:34
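1. Data source
The data is already labeled; see SMS Spam Collection v.1 for a detailed description. It can be downloaded from GitHub, where the data sits under chapter 4. Each row has two comma-separated columns: the first is the class (label) and the second is the text.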
    sms = pd.read_csv(filename, sep=',', header=0, names=['label','text'])
    sms.head

    Out[5]:
    <bound method DataFrame.head of
        label  text
    0   ham    Go until jurong point, crazy.. Available only ...
    1   ham    Ok lar... Joking wif u oni...
    2   spam   Free entry in 2 a wkly comp to win FA Cup fina...
    3   ham    U dun say so early hor... U c already then say...
    4   ham    Nah I don't think he goes to usf, he lives aro...
    5   spam   FreeMsg Hey there darling it's been 3 week's n...
    6   ham    Even my brother is not like to speak with me. ...
    7   ham    As per your request 'Melle Melle (Oru Minnamin...
    8   spam   WINNER!! As a valued network customer you have...
    9   spam   Had your mobile 11 months or more? U R entitle...
    10  ham    I'm gonna be home soon and i don't want to tal...

2. Data preparation
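There are 5574 rows in total. 500 rows are drawn at random as the test set and the rest serve as the training set; a function was defined for this purpose. Running it revealed a small problem: it does not yield 500 rows but a few fewer. The likely cause is that the generated random numbers contain duplicates. Here n is the number of rows to draw and size is the number of rows in the whole data set.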
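The shortfall described above comes from drawing indices with replacement. A minimal sketch of one way to avoid it, assuming a hypothetical helper named `split_mask` with an optional `seed` parameter (neither is part of the original code): `random.sample` draws without replacement, so exactly n positions are marked.

```python
import random

def split_mask(n, size, seed=None):
    """Return a list of `size` zeros and ones with exactly `n` ones.

    Hypothetical replacement for the article's randomSequence helper.
    """
    rng = random.Random(seed)
    # random.sample picks n distinct indices, so none can repeat
    chosen = rng.sample(range(size), n)
    mask = [0] * size
    for i in chosen:
        mask[i] = 1
    return mask

mask = split_mask(500, 5574, seed=0)
print(sum(mask))  # 500 rows marked as test data
```

Because the mask has the same 0/1 layout as the original sequence, it could be used with the same boolean-mask selection on the DataFrame.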
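    def randomSequence(n, size):
        # 0 marks a training row, 1 marks a test row
        result = [0 for i in range(size)]
        for i in range(n):
            # indices can repeat, so fewer than n entries may end up set to 1
            x = random.randrange(0, size-1, 1)
            result[x] = 1
        return result

3. Feature extraction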
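Before a classification algorithm can be called, the text content has to be converted into features. The scikit-learn classes CountVectorizer and TfidfTransformer handle the feature extraction. The test set reuses the vocabulary produced from the training set.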
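To make the shared-vocabulary idea concrete, here is a small sketch on made-up sentences (the texts and variable names are illustrative assumptions, not the SMS data). The vectorizer and the TF-IDF transformer are fitted on the training texts only, and the test texts are passed through `transform`, so both matrices end up with the same columns.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Toy texts, assumed purely for illustration
train_texts = ["free prize call now", "are you home now", "win a free prize"]
test_texts = ["free call", "unseen words only"]

# Learn the vocabulary and the IDF weights from the training texts
vectorizer = CountVectorizer()
counts_train = vectorizer.fit_transform(train_texts)
tfidf = TfidfTransformer()
tfidf_train = tfidf.fit_transform(counts_train)

# Reuse both fitted objects for the test texts; words that are not
# in the training vocabulary are simply ignored
counts_test = vectorizer.transform(test_texts)
tfidf_test = tfidf.transform(counts_test)

print(sorted(vectorizer.vocabulary_))
print(tfidf_train.shape, tfidf_test.shape)
```

The second test sentence contains only words outside the training vocabulary, so its row in `counts_test` is all zeros.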
4. Complete code
    # -*- coding: utf-8 -*-
    import random
    import pandas as pd
    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
    from sklearn.naive_bayes import MultinomialNB

    # Generate the random 0/1 sequence that selects training and test rows
    def randomSequence(n, size):
        result = [0 for i in range(size)]
        for i in range(n):
            # indices can repeat, so fewer than n rows may be marked
            x = random.randrange(0, size-1, 1)
            result[x] = 1
        return result

    if __name__ == '__main__':
        # Read the data
        filename = 'data/sms_spam.csv'
        sms = pd.read_csv(filename, sep=',', header=0, names=['label','text'])

        # Split into training and test sets
        size = len(sms)
        sequence = randomSequence(500, size)
        sms_train_mask = [sequence[i]==0 for i in range(size)]
        sms_train = sms[sms_train_mask]
        sms_test_mask = [sequence[i]==1 for i in range(size)]
        sms_test = sms[sms_test_mask]

        # Convert the text into TF-IDF vectors
        train_labels = sms_train['label'].values
        train_features = sms_train['text'].values
        count_v1 = CountVectorizer(stop_words='english', max_df=0.5, decode_error='ignore')
        counts_train = count_v1.fit_transform(train_features)
        #print(count_v1.get_feature_names_out())
        #repr(counts_train.shape)
        tfidftransformer = TfidfTransformer()
        tfidf_train = tfidftransformer.fit(counts_train).transform(counts_train)

        test_labels = sms_test['label'].values
        test_features = sms_test['text'].values
        # The test vectorizer shares the vocabulary learned from the training set
        count_v2 = CountVectorizer(vocabulary=count_v1.vocabulary_, stop_words='english', max_df=0.5, decode_error='ignore')
        counts_test = count_v2.fit_transform(test_features)
        # Reuse the IDF weights fitted on the training data instead of refitting on the test set
        tfidf_test = tfidftransformer.transform(counts_test)

        # Train
        clf = MultinomialNB(alpha=0.01)
        clf.fit(tfidf_train, train_labels)

        # Predict
        predict_result = clf.predict(tfidf_test)
        #print(predict_result)

        # Accuracy
        correct = [test_labels[i]==predict_result[i] for i in range(len(predict_result))]
        r = len(predict_result)
        t = correct.count(True)
        f = correct.count(False)
        print(r, t, f, t/float(r))

The code above uses the naive Bayes classification algorithm; other algorithms can be substituted.
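As a rough illustration of swapping the algorithm, the sketch below wraps the same two feature-extraction steps in a scikit-learn pipeline and fits it once with MultinomialNB and once with LinearSVC. The helper name `build_model`, the toy messages, and the choice of LinearSVC are assumptions made for this example; the original article only uses naive Bayes.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def build_model(classifier):
    """Chain counting, TF-IDF weighting and a classifier (hypothetical helper)."""
    return make_pipeline(CountVectorizer(), TfidfTransformer(), classifier)

# Toy messages, assumed purely for illustration
texts = [
    "free prize call now",
    "win cash prize today",
    "claim your free entry now",
    "are you coming home tonight",
    "see you at lunch tomorrow",
    "ok I will call you later",
]
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

results = {}
for clf in (MultinomialNB(alpha=0.01), LinearSVC()):
    model = build_model(clf)
    model.fit(texts, labels)
    # Only the final step of the pipeline differs between the two runs
    results[type(clf).__name__] = model.predict(["free cash prize waiting"])[0]

print(results)
```

Keeping the feature steps inside the pipeline means the classifier can be exchanged by changing a single argument.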
Results
    runfile('E:/MyProject/_python/ScikitLearn/NaiveBayes.py', wdir='E:/MyProject/_python/ScikitLearn')
    (476, 468, 8, 0.9831932773109243)
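The four printed values are the number of test rows, the number of correct predictions, the number of wrong predictions, and the accuracy. The same kind of summary can be obtained from `sklearn.metrics`; the sketch below uses a handful of made-up labels (an assumption for illustration, not the actual predictions) and also rechecks the arithmetic of the reported ratio.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels, assumed purely for illustration
true_labels = ["ham", "spam", "ham", "ham"]
predicted = ["ham", "spam", "spam", "ham"]

# Fraction of predictions that match the true labels
acc = accuracy_score(true_labels, predicted)
print(acc)  # 0.75

# Rows are true classes, columns are predicted classes, in the given label order
cm = confusion_matrix(true_labels, predicted, labels=["ham", "spam"])
print(cm.tolist())  # [[2, 1], [0, 1]]

# The reported accuracy is the ratio of correct predictions to test rows
reported = 468 / 476
print(round(reported, 4))  # 0.9832
```

A confusion matrix is often more informative than a single accuracy figure for spam filtering, since it separates ham wrongly flagged as spam from spam that slipped through.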