Applying Five-Fold Cross-Validation (5-fold cross-validation) to a Weibo Text Classification Task
1. Definition of Cross-Validation
Cross-validation is a statistical method for training and evaluating classifier performance. The basic idea is to partition the original dataset into parts, using one part as the training set and another as the test set: the classifier is first trained on the training set, and the resulting model is then evaluated on the test set, with the test performance serving as the classifier's evaluation metric. k-fold cross-validation splits the original dataset into k parts, takes k-1 of them as the training set and the remaining one as the test set, and repeats this k times so that each part serves as the test set exactly once.
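As a concrete illustration of the rotation, here is a minimal sketch using scikit-learn's `KFold` on ten toy samples (the script in section 3 builds its folds by hand instead of using this helper):

```python
from sklearn.model_selection import KFold

data = list(range(10))  # ten toy samples
kf = KFold(n_splits=5, shuffle=True, random_state=1)
for fold, (train_idx, test_idx) in enumerate(kf.split(data)):
    # each of the 5 rounds holds out a different fifth of the data as the test set
    print fold, "train:", train_idx, "test:", test_idx
```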
2. Approach
In the earlier Weibo text classification task, the training/test split was a random 70/30 split. To apply the idea of cross-validation, the dataset split was reworked as follows (a library-based sketch follows the list):
- Randomly split the samples labeled "male" and the samples labeled "female" into five equal parts each (this keeps each subset's class distribution consistent with the full dataset).
- Concatenate each "male" subset with a "female" subset, pairwise, to obtain five subsets with matching distributions.
- Use four of the subsets as the training set to train an SVM classifier, then classify the remaining subset as the test set to obtain predictions.
- After five rounds, feed the true labels (five groups) and the predictions (five groups) into the classifier evaluation function to obtain the evaluation metrics.
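The first two steps are exactly stratified sampling, which a library splitter can also perform; here is a minimal sketch with scikit-learn's `StratifiedKFold`, where the `user_ids` and the male/female counts are hypothetical stand-ins for the real Weibo lists:

```python
from sklearn.model_selection import StratifiedKFold
import numpy

user_ids = numpy.array(['u%03d' % i for i in range(399)])  # hypothetical user IDs
labels = numpy.array([1] * 200 + [0] * 199)                # hypothetical labels: 1 = male, 0 = female
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
for train_idx, test_idx in skf.split(user_ids, labels):
    # every fold preserves the male/female ratio of the full set
    print len(train_idx), "train,", len(test_idx), "test"
```

With these stand-in counts the folds come out as 319/80 (plus one 320/79), matching the sizes seen in the output in section 4.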
3. Python Code
```python
# -*- coding: utf-8 -*-
import warnings
warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim')
import jieba
import gensim
from gensim.models import word2vec
from gensim.models.word2vec import LineSentence
import sklearn
from sklearn import svm
from sklearn import preprocessing
from sklearn import metrics
import re
import os
import sys
import random
import numpy

reload(sys)
sys.setdefaultencoding('utf8')

# Extract the Chinese characters from a string
def get_chinese(str):
    line = str.strip().decode('utf-8', 'ignore')
    p2 = re.compile(ur'[^\u4e00-\u9fa5]')  # the Chinese code-point range is \u4e00 to \u9fa5
    zh = "".join(p2.split(line)).strip()
    return zh

# Build the user-ID/gender dictionary used for lookups
def get_sex_dic():
    number_sex_dic = {}
    print "Building the user-ID/gender dictionary..."
    male_txt = open(r'C:\Users\Administrator\Desktop\data\Data_weibo_male_female\GenderUserID\male.txt')
    female_txt = open(r'C:\Users\Administrator\Desktop\data\Data_weibo_male_female\GenderUserID\female.txt')
    male_all_lines = male_txt.readlines()
    female_all_lines = female_txt.readlines()
    for line in male_all_lines:
        number_sex_dic[line.replace('\n', '')] = 1
    for line in female_all_lines:
        number_sex_dic[line.replace('\n', '')] = 0
    print "User-ID/gender dictionary built!"
    return number_sex_dic

# word2vec helper: returns the document vector of each user's Weibo text, keyed by user ID for lookups
def get_wordvec(dirpath, process, group, stopkey_list):
    num_vec_dic = {}          # maps user IDs to the document vectors trained by word2vec
    i = 0                     # loop counter, can be used to cap the number of files read
    word2vec_text = ''        # Weibo text used to train the word vectors, one segmented line per user
    word2vec_list = []        # list of segmented Weibo texts used to compute document vectors
    line_vec_list = []        # list of document vectors
    line_number_list = group  # list of user IDs
    stop_num = 0
    # walk the folder and extract the Chinese characters from each user's txt file
    print process, ": segmenting..."
    for number in line_number_list:
        file_content = ''
        # read the file content
        all_lines = open(os.path.join(dirpath, number + '.txt')).readlines()
        for line in all_lines:
            file_content += line
        file_content = file_content.replace('\n', '')  # collapse into a single line
        file_chinese = get_chinese(file_content)       # keep Chinese characters only
        participle_weibo = jieba.cut(file_chinese, cut_all=False)  # jieba segmentation
        participle_weibo = " ".join(participle_weibo)  # join the tokens with spaces
        # for m in range(0, len(stopkey_list)):
        #     if (' ' + stopkey_list[m] + ' ') in participle_weibo:
        #         stop_num += 1
        #         participle_weibo = participle_weibo.replace((' ' + stopkey_list[m] + ' '), ' ')
        word2vec_text += participle_weibo + '\n'  # accumulate the text string and the text list
        word2vec_list.append(participle_weibo.split(' '))
        i += 1  # advance the counter; uncomment below to cap the number of files read
        # if (i > 10): break
    # write the segmented text, one user per line, to a txt file
    text_file = open(os.path.join(r'C:\Users\Administrator\Desktop\data\Data_weibo_male_female\Weibos', process + '.txt'), 'w+')
    text_file.write(word2vec_text)
    text_file.close()
    print process, ": segmentation done!"
    # train the word vectors with gensim's word2vec model
    print process, ": training word vectors..."
    sentences = word2vec.LineSentence(os.path.join(r'C:\Users\Administrator\Desktop\data\Data_weibo_male_female\Weibos', process + '.txt'))
    model = word2vec.Word2Vec(sentences, size=5, min_count=1, window=6)
    print model
    print process, ": word vectors trained!"
    # average the word vectors of each line of text to get its document vector
    print process, ": building document vectors..."
    for m in range(0, len(word2vec_list)):
        wordvec_ave = [v / len(word2vec_list[m]) for v in sum(model[word2vec_list[m]])]
        line_vec_list.append(wordvec_ave)
    for i in range(0, len(line_number_list)):
        # build the dictionary for convenient lookups
        num_vec_dic[line_number_list[i]] = line_vec_list[i]
    print process, ": document vectors built!"
    return num_vec_dic

# Evaluation function: true = [true group 1, ..., true group N], predict = [predicted group 1, ..., predicted group N]
def evaluation(true, predict):
    num = len(true)  # number of groups
    (TP, FP, FN, TN) = ([0] * num for i in range(4))  # initialize the counters
    for m in range(0, len(true)):
        if (len(true[m]) != len(predict[m])):
            # mismatched sample counts clearly indicate an error
            print "The true and predicted results have different numbers of samples."
        else:
            for i in range(0, len(true[m])):  # count each group separately
                if (predict[m][i] == 1) and (true[m][i] == 1):
                    TP[m] += 1.0
                elif (predict[m][i] == 1) and (true[m][i] == 0):
                    FP[m] += 1.0
                elif (predict[m][i] == 0) and (true[m][i] == 1):
                    FN[m] += 1.0
                elif (predict[m][i] == 0) and (true[m][i] == 0):
                    TN[m] += 1.0
    # macro measures: compute each group's metrics first, then average them
    (accuracy_macro,
     precision1_macro, precision0_macro,
     recall1_macro, recall0_macro,
     F1_score1_macro, F1_score0_macro) = ([0] * num for i in range(7))
    for m in range(0, num):
        accuracy_macro[m] = (TP[m] + TN[m]) / (TP[m] + FP[m] + FN[m] + TN[m])
        if (TP[m] + FP[m] == 0):  # guard against zero denominators
            precision1_macro[m] = 0
        else:
            precision1_macro[m] = TP[m] / (TP[m] + FP[m])
        if (TN[m] + FN[m] == 0):
            precision0_macro[m] = 0
        else:
            precision0_macro[m] = TN[m] / (TN[m] + FN[m])
        if (TP[m] + FN[m] == 0):
            recall1_macro[m] = 0
        else:
            recall1_macro[m] = TP[m] / (TP[m] + FN[m])
        if (TN[m] + FP[m] == 0):
            recall0_macro[m] = 0
        else:
            recall0_macro[m] = TN[m] / (TN[m] + FP[m])
    macro_accuracy = numpy.mean(accuracy_macro)
    macro_precision1 = numpy.mean(precision1_macro)
    macro_precision0 = numpy.mean(precision0_macro)
    macro_recall1 = numpy.mean(recall1_macro)
    macro_recall0 = numpy.mean(recall0_macro)
    # F1 still follows the usual formula, computed from macro-P and macro-R
    if (macro_precision1 + macro_recall1 == 0):
        macro_F1_score1 = 0
    else:
        macro_F1_score1 = 2 * macro_precision1 * macro_recall1 / (macro_precision1 + macro_recall1)
    if (macro_precision0 + macro_recall0 == 0):
        macro_F1_score0 = 0
    else:
        macro_F1_score0 = 2 * macro_precision0 * macro_recall0 / (macro_precision0 + macro_recall0)
    print "%-20s" % "macro_accuracy", " :%.4f\n" % macro_accuracy, \
          "%-20s" % "macro_precision1", " :%.4f\n" % macro_precision1, \
          "%-20s" % "macro_precision0", " :%.4f\n" % macro_precision0, \
          "%-20s" % "macro_recall1", " :%.4f\n" % macro_recall1, \
          "%-20s" % "macro_recall0", " :%.4f\n" % macro_recall0, \
          "%-20s" % "macro_F1_score1", " :%.4f\n" % macro_F1_score1, \
          "%-20s" % "macro_F1_score0", " :%.4f\n" % macro_F1_score0
    # micro measures: compute the metrics from the means of TP, TN, FP, FN
    TPM = numpy.mean(TP)
    TNM = numpy.mean(TN)
    FPM = numpy.mean(FP)
    FNM = numpy.mean(FN)
    micro_accuracy = (TPM + TNM) / (TPM + FPM + FNM + TNM)
    if (TPM + FPM == 0):  # guard against zero denominators
        micro_precision1 = 0
    else:
        micro_precision1 = TPM / (TPM + FPM)
    if (TNM + FNM == 0):
        micro_precision0 = 0
    else:
        micro_precision0 = TNM / (TNM + FNM)
    if (TPM + FNM == 0):
        micro_recall1 = 0
    else:
        micro_recall1 = TPM / (TPM + FNM)
    if (TNM + FPM == 0):
        micro_recall0 = 0
    else:
        micro_recall0 = TNM / (TNM + FPM)
    # F1 again follows the usual formula, computed from micro-P and micro-R
    if (micro_precision1 + micro_recall1 == 0):
        micro_F1_score1 = 0
    else:
        micro_F1_score1 = 2 * micro_precision1 * micro_recall1 / (micro_precision1 + micro_recall1)
    if (micro_precision0 + micro_recall0 == 0):
        micro_F1_score0 = 0
    else:
        micro_F1_score0 = 2 * micro_precision0 * micro_recall0 / (micro_precision0 + micro_recall0)
    print "%-20s" % "micro_accuracy", " :%.4f\n" % micro_accuracy, \
          "%-20s" % "micro_precision1", " :%.4f\n" % micro_precision1, \
          "%-20s" % "micro_precision0", " :%.4f\n" % micro_precision0, \
          "%-20s" % "micro_recall1", " :%.4f\n" % micro_recall1, \
          "%-20s" % "micro_recall0", " :%.4f\n" % micro_recall0, \
          "%-20s" % "micro_F1_score1", " :%.4f\n" % micro_F1_score1, \
          "%-20s" % "micro_F1_score0", " :%.4f\n" % micro_F1_score0

# Main
if __name__ == "__main__":
    # build the user-ID/gender dictionary
    num_sex_dic = get_sex_dic()
    # load the stop words
    stopkey_list = []
    stopkey_words = open(r'C:\Users\Administrator\Desktop\stopkey.txt').readlines()
    for stopkey_word in stopkey_words:
        stopkey_list.append(stopkey_word.replace('\n', ''))
    # collect all user IDs and split them randomly into 5 groups, keeping the distribution consistent
    number_list_male = []
    number_list_female = []
    for key in num_sex_dic:
        if (num_sex_dic[key] == 1):
            number_list_male.append(key)
        elif (num_sex_dic[key] == 0):
            number_list_female.append(key)
    # number_list_male = number_list_male[:24]      # optionally cap the total sample count
    # number_list_female = number_list_female[:24]  # optionally cap the total sample count
    # to keep the folds' distribution consistent, sample from the male and female lists separately, then merge
    # male
    number_group0_male = random.sample(number_list_male, 40)  # group 0
    number_remain0_male = [i for i in number_list_male if i not in number_group0_male]
    number_group1_male = random.sample(number_remain0_male, 40)  # group 1
    number_remain1_male = [i for i in number_remain0_male if i not in number_group1_male]
    number_group2_male = random.sample(number_remain1_male, 40)  # group 2
    number_remain2_male = [i for i in number_remain1_male if i not in number_group2_male]
    number_group3_male = random.sample(number_remain2_male, 40)  # group 3
    number_group4_male = [i for i in number_remain2_male if i not in number_group3_male]  # group 4
    # female
    number_group0_female = random.sample(number_list_female, 40)  # group 0
    number_remain0_female = [i for i in number_list_female if i not in number_group0_female]
    number_group1_female = random.sample(number_remain0_female, 40)  # group 1
    number_remain1_female = [i for i in number_remain0_female if i not in number_group1_female]
    number_group2_female = random.sample(number_remain1_female, 40)  # group 2
    number_remain2_female = [i for i in number_remain1_female if i not in number_group2_female]
    number_group3_female = random.sample(number_remain2_female, 40)  # group 3
    number_group4_female = [i for i in number_remain2_female if i not in number_group3_female]  # group 4
    # merge
    number_group0 = number_group0_male + number_group0_female
    number_group1 = number_group1_male + number_group1_female
    number_group2 = number_group2_male + number_group2_female
    number_group3 = number_group3_male + number_group3_female
    number_group4 = number_group4_male + number_group4_female
    # SVM training input: five-fold cross-validation
    dir_path = r'C:\Users\Administrator\Desktop\data\Data_weibo_male_female\Weibos\all'
    true = []
    predict = []
    for t in range(5):
        # if (t == 1): break  # optionally limit the number of rounds
        # rebuilt each round so that one fold can be deleted and the rest concatenated
        number_group = [number_group0, number_group1, number_group2, number_group3, number_group4]
        print '******************Training round %d******************' % (t + 1)
        test_group = number_group[t]
        del(number_group[t])
        train_group = []
        for i in range(4):
            train_group += number_group[i]
        test_group.sort()
        train_group.sort()
        print len(test_group), "test"
        print len(train_group), "train"
        num_vec_dic_train = get_wordvec(dir_path, "train", train_group, stopkey_list)
        train_number = []
        train_vec = []
        for num in num_vec_dic_train:
            train_number.append(num)
            train_vec.append(num_vec_dic_train[num])
        train_vec = preprocessing.scale(train_vec)
        train_sex = []
        for i in range(0, len(train_number)):
            train_sex.append(num_sex_dic[train_number[i]])
        # train the SVM model
        print "Training the SVM model..."
        svm_model = svm.SVC(C=10.0, kernel='rbf', gamma='auto')
        svm_model.fit(train_vec, train_sex)
        print "SVM model trained!"
        # SVM test input
        num_vec_dic_test = get_wordvec(dir_path, "test", test_group, stopkey_list)
        test_number = []
        test_vec = []
        for num in num_vec_dic_test:
            test_number.append(num)
            test_vec.append(num_vec_dic_test[num])
        test_vec = preprocessing.scale(test_vec)
        test_sex = []
        for i in range(0, len(test_number)):
            test_sex.append(num_sex_dic[test_number[i]])
        print "SVM model classifying..."
        svm_pr_result = svm_model.predict(test_vec)
        print "SVM model classification done!"
        accuracy = metrics.accuracy_score(test_sex, svm_pr_result)
        print "Round %d accuracy:" % (t + 1), accuracy
        # collect the per-round results
        true.append(test_sex)
        predict.append(svm_pr_result)
        print '****************Training round %d finished****************' % (t + 1)
    # print the evaluation metrics
    evaluation(true, predict)
```
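As a sanity check on the hand-rolled `evaluation` function, the per-fold numbers can be compared against scikit-learn's own metric helpers (the script already imports `sklearn.metrics`); a minimal sketch for the first fold:

```python
from sklearn import metrics

# true[0] and predict[0] are the label lists collected for the first round
p, r, f1, support = metrics.precision_recall_fscore_support(
    true[0], predict[0], labels=[1, 0])
print "precision (class 1, class 0):", p
print "recall    (class 1, class 0):", r
print "F1        (class 1, class 0):", f1
```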
4. Output
Building the user-ID/gender dictionary...
User-ID/gender dictionary built!
******************Training round 1******************
80 test
319 train
train : segmenting...
Building prefix dict from the default dictionary ...
Loading model from cache c:\users\admini~1\appdata\local\temp\jieba.cache
Loading model cost 0.345 seconds.
Prefix dict has been built succesfully.
train : segmentation done!
train : training word vectors...
Word2Vec(vocab=361514, size=5, alpha=0.025)
train : word vectors trained!
train : building document vectors...
train : document vectors built!
Training the SVM model...
SVM model trained!
test : segmenting...
test : segmentation done!
test : training word vectors...
Word2Vec(vocab=171661, size=5, alpha=0.025)
test : word vectors trained!
test : building document vectors...
test : document vectors built!
SVM model classifying...
SVM model classification done!
Round 1 accuracy: 0.7625
****************Training round 1 finished****************
******************Training round 2******************
80 test
319 train
train : segmenting...
train : segmentation done!
train : training word vectors...
Word2Vec(vocab=364865, size=5, alpha=0.025)
train : word vectors trained!
train : building document vectors...
train : document vectors built!
Training the SVM model...
SVM model trained!
test : segmenting...
test : segmentation done!
test : training word vectors...
Word2Vec(vocab=165364, size=5, alpha=0.025)
test : word vectors trained!
test : building document vectors...
test : document vectors built!
SVM model classifying...
SVM model classification done!
Round 2 accuracy: 0.75
****************Training round 2 finished****************
******************Training round 3******************
80 test
319 train
train : segmenting...
train : segmentation done!
train : training word vectors...
Word2Vec(vocab=369177, size=5, alpha=0.025)
train : word vectors trained!
train : building document vectors...
train : document vectors built!
Training the SVM model...
SVM model trained!
test : segmenting...
test : segmentation done!
test : training word vectors...
Word2Vec(vocab=159058, size=5, alpha=0.025)
test : word vectors trained!
test : building document vectors...
test : document vectors built!
SVM model classifying...
SVM model classification done!
Round 3 accuracy: 0.6875
****************Training round 3 finished****************
******************Training round 4******************
80 test
319 train
train : segmenting...
train : segmentation done!
train : training word vectors...
Word2Vec(vocab=370229, size=5, alpha=0.025)
train : word vectors trained!
train : building document vectors...
train : document vectors built!
Training the SVM model...
SVM model trained!
test : segmenting...
test : segmentation done!
test : training word vectors...
Word2Vec(vocab=155387, size=5, alpha=0.025)
test : word vectors trained!
test : building document vectors...
test : document vectors built!
SVM model classifying...
SVM model classification done!
Round 4 accuracy: 0.5875
****************Training round 4 finished****************
******************Training round 5******************
79 test
320 train
train : segmenting...
train : segmentation done!
train : training word vectors...
Word2Vec(vocab=372872, size=5, alpha=0.025)
train : word vectors trained!
train : building document vectors...
train : document vectors built!
Training the SVM model...
SVM model trained!
test : segmenting...
test : segmentation done!
test : training word vectors...
Word2Vec(vocab=153069, size=5, alpha=0.025)
test : word vectors trained!
test : building document vectors...
test : document vectors built!
SVM model classifying...
SVM model classification done!
Round 5 accuracy: 0.632911392405
****************Training round 5 finished****************
macro_accuracy :0.6841
macro_precision1 :0.7037
macro_precision0 :0.6715
macro_recall1 :0.6379
macro_recall0 :0.7300
macro_F1_score1 :0.6692
macro_F1_score0 :0.6995
micro_accuracy :0.6842
micro_precision1 :0.7017
micro_precision0 :0.6697
micro_recall1 :0.6382
micro_recall0 :0.7300
micro_F1_score1 :0.6684
micro_F1_score0 :0.6986
Conclusion: with five-fold cross-validation the accuracy actually dropped, from the earlier 73% down to 68.5%, and round 4 reached only about 58%. Moreover, because the folds are drawn by random sampling, the results differ from run to run, with accuracy falling to roughly 60% at its worst, so something still seems to be wrong. Training the word vectors jointly, widening the word2vec window, and increasing the word2vec hidden-layer size (the vector dimensionality) were all tried, none with satisfactory results; the cause remains to be investigated.
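For reference, the "joint word vectors" remedy mentioned above amounts to training a single Word2Vec model on the combined segmented corpus and deriving both the train and the test document vectors from it, so that the two splits share one vector space; a minimal sketch under that assumption (the corpus filename is hypothetical):

```python
from gensim.models import word2vec

# one model over the combined corpus, instead of separate train/test models
sentences = word2vec.LineSentence('all_users_segmented.txt')  # hypothetical file: one segmented line per user
model = word2vec.Word2Vec(sentences, size=5, min_count=1, window=6)

def doc_vector(tokens):
    # same averaging as in get_wordvec, but always against the shared model
    return [v / len(tokens) for v in sum(model[tokens])]
```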