Two-Class Classifier Practice 1
Source: Internet. Editor: 程序博客网. Time: 2024/06/08 17:27
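import pandas as pd
import nltk


# Feature extractor: one boolean "contains(word)" feature per vocabulary word
def document_features(document, word_features):
    # Lowercase the tokens so they can match the lowercased vocabulary built below
    document_words = set(w.lower() for w in document)
    features = {}
    for word in word_features:
        features["contains(%s)" % word] = (word in document_words)
    return features


# Read the Excel file
documents0 = []
documents1 = []
df = pd.read_excel("window regulator01.xlsx")
# print(df.head())

# rows[0] is the first column of each row (the text that gets tokenized)
Nodf = df[df.categories == 0]
# print(len(Nodf))
for rows in Nodf.values:
    documents0.append((nltk.word_tokenize(rows[0]), "0"))

Yesdf = df[df.categories == 1]
# print(len(Yesdf))
for rows in Yesdf.values:
    documents1.append((nltk.word_tokenize(rows[0]), "1"))

# Analyze the whole text and take 2000 words as features
sentences = ""  # join every title into one string
titles = df["title"]
for title in titles:
    sentences = sentences + " " + title
words = nltk.word_tokenize(sentences)
print(len(words) / len(titles))  # average number of words per title
all_words = nltk.FreqDist(w.lower() for w in words)  # word frequencies after lowercasing
# Intended: the 2000 most frequent words in the corpus.
# Note: in NLTK 3, keys() is not sorted by frequency, so this slice takes the
# first 2000 distinct words seen; the second version uses most_common() instead.
word_features = list(all_words.keys())[:2000]
print(word_features)

# Train and test the classifier
# "0" = not a window regulator sample, "1" = window regulator sample
featuresets0 = [(document_features(d, word_features), c) for (d, c) in documents0]
featuresets1 = [(document_features(d, word_features), c) for (d, c) in documents1]
train_set = featuresets0[:2000]
train_set.extend(featuresets1[:2000])
print(len(train_set))
test_set = featuresets0[2000:]
test_set.extend(featuresets1[2000:])
print(len(test_set))
classifier = nltk.NaiveBayesClassifier.train(train_set)  # train the classifier
print(nltk.classify.accuracy(classifier, test_set))  # classification accuracy
# Most informative features found by the classifier
# (the method prints its own output and returns None, so no print() around it)
classifier.show_most_informative_features(20)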
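The line `list(all_words.keys())[:2000]` is worth a closer look. In NLTK 3, `FreqDist` subclasses `collections.Counter`, so `keys()` returns words in first-seen order rather than by frequency. The sketch below uses `Counter` directly as a stand-in so it runs without NLTK; the toy token list is made up for illustration.

```python
from collections import Counter

# Toy tokens (hypothetical): "regulator" is the most frequent word,
# but it is not among the first two distinct words seen.
tokens = ["front", "left", "window", "regulator",
          "window", "regulator", "regulator"]
freq = Counter(tokens)

# Slicing keys() follows insertion (first-seen) order, not frequency.
first_seen = list(freq.keys())[:2]

# most_common(n) returns (word, count) pairs sorted by descending count.
by_count = [word for word, _ in freq.most_common(2)]

print(first_seen)
print(by_count)
```

With real titles, the two selections can differ a lot, which is presumably why the feature list is built differently in the next version.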
Second version
import pandas as pd
import nltk


# Feature extractor: one boolean "contains(word)" feature per vocabulary word
def document_features(document, word_features):
    # Lowercase the tokens so they can match the lowercased vocabulary built below
    document_words = set(w.lower() for w in document)
    features = {}
    for word in word_features:
        features["contains(%s)" % word] = (word in document_words)
    return features


# Read the Excel file
documents0 = []
documents1 = []
df = pd.read_excel("window regulator01.xlsx")

# rows[0] is the first column of each row (the text that gets tokenized)
Nodf = df[df.categories == 0]
# print(len(Nodf))
for rows in Nodf.values:
    documents0.append((nltk.word_tokenize(rows[0]), "0"))

Yesdf = df[df.categories == 1]
# print(len(Yesdf))
for rows in Yesdf.values:
    documents1.append((nltk.word_tokenize(rows[0]), "1"))

# Analyze the whole text and use the most frequent words as features
sentences = ""  # join every title into one string
titles = df["title"]
for title in titles:
    sentences = sentences + " " + title
words = nltk.word_tokenize(sentences)
print(len(words) / len(titles))  # average number of words per title
all_words = nltk.FreqDist(w.lower() for w in words)  # word frequencies after lowercasing
print("Vocabulary size = %d" % len(all_words))
common = all_words.most_common(500)  # the n most frequent words as (word, count) tuples
# Use the n most frequent words in the corpus as features
word_features = [word for word, _ in common]  # first element of each tuple is the word
print("Number of features = %d" % len(word_features))
print(word_features)
# all_words.plot(50)

# Train and test the classifier
# "0" = not a window regulator sample, "1" = window regulator sample
featuresets0 = [(document_features(d, word_features), c) for (d, c) in documents0]
featuresets1 = [(document_features(d, word_features), c) for (d, c) in documents1]
selectsample = 2000
train_set = featuresets0[:selectsample]
train_set.extend(featuresets1[:selectsample])
print("Training samples = %d" % len(train_set))
test_set = featuresets0[selectsample:]
test_set.extend(featuresets1[selectsample:])
print("Test samples = %d" % len(test_set))
print("Training...")
classifier = nltk.NaiveBayesClassifier.train(train_set)  # train the classifier
print("Testing...")
print("Accuracy = %f" % nltk.classify.accuracy(classifier, test_set))  # classification accuracy
print("Most informative features")
# The method prints its own output and returns None, so no print() around it
classifier.show_most_informative_features(20)
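To see what the feature extractor hands to the classifier, it can be run on a single toy title without Excel or NLTK. This is only a sketch under assumptions: the three-word vocabulary and the title are made up, whitespace splitting stands in for `nltk.word_tokenize`, and the extractor lowercases tokens so they can match a lowercased vocabulary.

```python
def document_features(document, word_features):
    """Map a tokenized title to boolean "contains(word)" features."""
    # Lowercase tokens so "Window" matches the vocabulary entry "window".
    document_words = {w.lower() for w in document}
    return {"contains(%s)" % w: (w in document_words) for w in word_features}


# Hypothetical vocabulary; the scripts above build it from the corpus.
word_features = ["window", "regulator", "mirror"]

# Whitespace split as a stand-in for nltk.word_tokenize.
title = "Front Left Window Regulator".split()

features = document_features(title, word_features)
print(features)
```

Each title becomes a dictionary with one entry per vocabulary word, and title words outside the vocabulary (here "Front" and "Left") are simply ignored.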