数据挖掘-基于贝叶斯算法及KNN算法的newsgroup18828文本分类器的JAVA实现（下）

来源：互联网发布：ubuntu 查看配置编辑：程序博客网时间：2024/06/06 19:51

本文接数据挖掘-基于贝叶斯算法及KNN算法的newsgroup18828文档分类器的JAVA实现（上）

(update 2012.12.28 关于本项目下载及运行的常见问题 FAQ见 newsgroup18828文本分类器、文本聚类器、关联分析频繁模式挖掘算法的Java实现工程下载及运行FAQ )

上文中描述了newsgroup18828文档集的预处理及贝叶斯算法的JAVA实现，下面我们来看看如何实现基于KNN算法的newsgroup文本分类器

1 KNN算法的描述

KNN算法描述如下：
STEP ONE:文本向量化表示,由特征词的TF*IDF值计算
STEP TWO:在新文本到达后，根据特征词确定新文本的向量
STEP THREE:在训练文本集中选出与新文本最相似的 K 个文本，相似度用向量夹角余弦度量，计算公式为：

其中，K 值的确定目前没有很好的方法，一般采用先定一个初始值，然后根据实验测试的结果调整 K 值
本项目中K取20

STEP FOUR:在新文本的 K 个邻居中，依次计算每类的权重，每类的权重等于K个邻居中属于该类的训练样本与测试样本的相似度之和。
STEP FIVE:比较类的权重，将文本分到权重最大的那个类别中。

2 文档TF-IDF计算及向量化表示

实现KNN算法首先要实现文档的向量化表示
计算特征词的TF*IDF，每个文档的向量由包含所有特征词的TF*IDF值组成，每一维对应一个特征词

TF及IDF的计算公式如下，分别为特征词的特征项频率和逆文档频率

文档向量计算类 ComputeWordsVector.java如下

package com.pku.yangliu;import java.io.BufferedReader;import java.io.File;import java.io.FileReader;import java.io.FileWriter;import java.io.IOException;import java.util.SortedMap;import java.util.Map;import java.util.Set;import java.util.TreeMap;import java.util.Iterator;/**计算文档的属性向量，将所有文档向量化 * */public class ComputeWordsVector {/**计算文档的TF属性向量,直接写成二维数组遍历形式即可，没必要递归 * @param strDir 处理好的newsgroup文件目录的绝对路径 * @param trainSamplePercent 训练样例集占每个类目的比例 * @param indexOfSample 测试样例集的起始的测试样例编号 * @param wordMap 属性词典map * @throws IOException  */public void computeTFMultiIDF(String strDir, double trainSamplePercent, int indexOfSample, Map<String, Double> iDFPerWordMap, Map<String, Double> wordMap) throws IOException{File fileDir = new File(strDir);String word;SortedMap<String,Double> TFPerDocMap = new TreeMap<String,Double>();//注意可以用两个写文件，一个专门写测试样例，一个专门写训练样例，用sampleType的值来表示String trainFileDir = "F:/DataMiningSample/docVector/wordTFIDFMapTrainSample"+indexOfSample;String testFileDir = "F:/DataMiningSample/docVector/wordTFIDFMapTestSample"+indexOfSample;FileWriter tsTrainWriter = new FileWriter(new File(trainFileDir));FileWriter tsTestWrtier = new FileWriter(new File(testFileDir));FileWriter tsWriter = tsTrainWriter;File[] sampleDir = fileDir.listFiles();for(int i = 0; i < sampleDir.length; i++){String cateShortName = sampleDir[i].getName();System.out.println("compute: " + cateShortName);File[] sample = sampleDir[i].listFiles();double testBeginIndex = indexOfSample*(sample.length * (1-trainSamplePercent));//测试样例的起始文件序号double testEndIndex = (indexOfSample+1)*(sample.length * (1-trainSamplePercent));//测试样例集的结束文件序号System.out.println("dirName_total length:"+sampleDir[i].getCanonicalPath()+"_"+sample.length);System.out.println(trainSamplePercent + " length:"+sample.length * trainSamplePercent +" testBeginIndex:"+testBeginIndex+" testEndIndex"+ testEndIndex);for(int j = 0;j < sample.length; j++){TFPerDocMap.clear();FileReader samReader = new FileReader(sample[j]);BufferedReader samBR = new BufferedReader(samReader);String fileShortName = sample[j].getName();Double wordSumPerDoc = 0.0;//计算每篇文档的总词数while((word = samBR.readLine()) != null){if(!word.isEmpty() && wordMap.containsKey(word)){//必须是属性词典里面的词，去掉的词不考虑wordSumPerDoc++;if(TFPerDocMap.containsKey(word)){Double count =  TFPerDocMap.get(word);TFPerDocMap.put(word, count + 1);}else {TFPerDocMap.put(word, 1.0);}}}//遍历一下当前文档的TFmap，除以文档的总词数换成词频,然后将词频乘以词的IDF，得到最终的特征权值，并且输出到文件//注意测试样例和训练样例写入的文件不同if(j >= testBeginIndex && j <= testEndIndex){tsWriter = tsTestWrtier;}else{tsWriter = tsTrainWriter;}Double wordWeight;Set<Map.Entry<String, Double>> tempTF = TFPerDocMap.entrySet();for(Iterator<Map.Entry<String, Double>> mt = tempTF.iterator(); mt.hasNext();){Map.Entry<String, Double> me = mt.next();//wordWeight =  (me.getValue() / wordSumPerDoc) * IDFPerWordMap.get(me.getKey());//这里IDF暂时设为1，具体的计算IDF算法改进和实现见我的博客中关于kmeans聚类的博文wordWeight =  (me.getValue() / wordSumPerDoc) * 1.0;TFPerDocMap.put(me.getKey(), wordWeight);}tsWriter.append(cateShortName + " ");String keyWord = fileShortName.substring(0,5);tsWriter.append(keyWord+ " ");Set<Map.Entry<String, Double>> tempTF2 = TFPerDocMap.entrySet();for(Iterator<Map.Entry<String, Double>> mt = tempTF2.iterator(); mt.hasNext();){Map.Entry<String, Double> ne = mt.next();tsWriter.append(ne.getKey() + " " + ne.getValue() + " ");}tsWriter.append("\n");tsWriter.flush();}}tsTrainWriter.close();tsTestWrtier.close();tsWriter.close();}/**统计每个词的总的出现次数，返回出现次数大于3次的词汇构成最终的属性词典 * @param strDir 处理好的newsgroup文件目录的绝对路径 * @throws IOException  */public SortedMap<String,Double> countWords(String strDir,Map<String, Double> wordMap) throws IOException{File sampleFile = new File(strDir);File [] sample = sampleFile.listFiles();String word;for(int i = 0; i < sample.length; i++){if(!sample[i].isDirectory()){if(sample[i].getName().contains("stemed")){FileReader samReader = new FileReader(sample[i]);BufferedReader samBR = new BufferedReader(samReader);while((word = samBR.readLine()) != null){if(!word.isEmpty() && wordMap.containsKey(word)){double count = wordMap.get(word) + 1;wordMap.put(word, count);}else {wordMap.put(word, 1.0);}}}}else countWords(sample[i].getCanonicalPath(),wordMap);}//只返回出现次数大于3的单词SortedMap<String,Double> newWordMap = new TreeMap<String,Double>();Set<Map.Entry<String,Double>> allWords = wordMap.entrySet();for(Iterator<Map.Entry<String,Double>> it = allWords.iterator(); it.hasNext();){Map.Entry<String, Double> me = it.next();if(me.getValue() >= 1){newWordMap.put(me.getKey(),me.getValue());}}return newWordMap;}/**打印属性词典 * @param SortedMap<String,Double> 属性词典 * @throws IOException  */void printWordMap(Map<String, Double> wordMap) throws IOException {// TODO Auto-generated method stubSystem.out.println("printWordMap");int countLine = 0;File outPutFile = new File("F:/DataMiningSample/docVector/allDicWordCountMap.txt");FileWriter outPutFileWriter = new FileWriter(outPutFile);Set<Map.Entry<String,Double>> allWords = wordMap.entrySet();for(Iterator<Map.Entry<String,Double>> it = allWords.iterator(); it.hasNext();){Map.Entry<String, Double> me = it.next();outPutFileWriter.write(me.getKey()+" "+me.getValue()+"\n");countLine++;}System.out.println("WordMap size" + countLine);}/**计算IDF，即属性词典中每个词在多少个文档中出现过 * @param SortedMap<String,Double> 属性词典 * @return 单词的IDFmap * @throws IOException  */SortedMap<String,Double> computeIDF(String string, Map<String, Double> wordMap) throws IOException {// TODO Auto-generated method stubFile fileDir = new File(string);String word;SortedMap<String,Double> IDFPerWordMap = new TreeMap<String,Double>();Set<Map.Entry<String, Double>> wordMapSet = wordMap.entrySet();for(Iterator<Map.Entry<String, Double>> pt = wordMapSet.iterator(); pt.hasNext();){Map.Entry<String, Double> pe = pt.next();Double coutDoc = 0.0;String dicWord = pe.getKey();File[] sampleDir = fileDir.listFiles();for(int i = 0; i < sampleDir.length; i++){File[] sample = sampleDir[i].listFiles();for(int j = 0;j < sample.length; j++){FileReader samReader = new FileReader(sample[j]);BufferedReader samBR = new BufferedReader(samReader);boolean isExited = false;while((word = samBR.readLine()) != null){if(!word.isEmpty() && word.equals(dicWord)){isExited = true;break;}}if(isExited) coutDoc++;}}//计算单词的IDFDouble IDF = Math.log(20000 / coutDoc) / Math.log(10);IDFPerWordMap.put(dicWord, IDF);}return IDFPerWordMap;}}

3 KNN算法的实现

KNN算法的实现要注意

(1)用TreeMap<String,TreeMap<String,Double>>保存测试集和训练集
(2)注意要以"类目_文件名"作为每个文件的key，才能避免同名不同内容的文件出现
(3)注意设置JM参数，否则会出现JAVA heap溢出错误
(4)本程序用向量夹角余弦计算相似度

KNN算法实现类 KNNClassifier.java如下

package com.pku.yangliu;import java.io.BufferedReader;import java.io.File;import java.io.FileReader;import java.io.FileWriter;import java.io.IOException;import java.util.Comparator;import java.util.HashMap;import java.util.Iterator;import java.util.Map;import java.util.Set;import java.util.TreeMap;/**KNN算法的实现类，本程序用向量夹角余弦计算相似度 * */public class KNNClassifier {/**用KNN算法对测试文档集分类,读取测试样例和训练样例集 * @param trainFiles 训练样例的所有向量构成的文件 * @param testFiles 测试样例的所有向量构成的文件 * @param kNNResultFile KNN分类结果文件路径 * @return double 分类准确率 * @throws IOException  */private double doProcess(String trainFiles, String testFiles,String kNNResultFile) throws IOException {// TODO Auto-generated method stub//首先读取训练样本和测试样本，用map<String,map<word,TF>>保存测试集和训练集，注意训练样本的类目信息也得保存，//然后遍历测试样本，对于每一个测试样本去计算它与所有训练样本的相似度，相似度保存入map<String,double>有//序map中去，然后取前K个样本，针对这k个样本来给它们所属的类目计算权重得分，对属于同一个类目的权重求和进而得到//最大得分的类目，就可以判断测试样例属于该类目下，K值可以反复测试，找到分类准确率最高的那个值//！注意要以"类目_文件名"作为每个文件的key，才能避免同名不同内容的文件出现//！注意设置JM参数，否则会出现JAVA heap溢出错误//！本程序用向量夹角余弦计算相似度File trainSamples = new File(trainFiles);BufferedReader trainSamplesBR = new BufferedReader(new FileReader(trainSamples));String line;String [] lineSplitBlock;Map<String,TreeMap<String,Double>> trainFileNameWordTFMap = new TreeMap<String,TreeMap<String,Double>> ();TreeMap<String,Double> trainWordTFMap = new TreeMap<String,Double>();while((line = trainSamplesBR.readLine()) != null){lineSplitBlock = line.split(" ");trainWordTFMap.clear();for(int i = 2; i < lineSplitBlock.length; i = i + 2){trainWordTFMap.put(lineSplitBlock[i], Double.valueOf(lineSplitBlock[i+1]));}TreeMap<String,Double> tempMap = new TreeMap<String,Double>();tempMap.putAll(trainWordTFMap);trainFileNameWordTFMap.put(lineSplitBlock[0]+"_"+lineSplitBlock[1], tempMap);}trainSamplesBR.close();File testSamples = new File(testFiles);BufferedReader testSamplesBR = new BufferedReader(new FileReader(testSamples));Map<String,Map<String,Double>> testFileNameWordTFMap = new TreeMap<String,Map<String,Double>> ();Map<String,String> testClassifyCateMap = new TreeMap<String, String>();//分类形成的<文件名，类目>对Map<String,Double> testWordTFMap = new TreeMap<String,Double>();while((line = testSamplesBR.readLine()) != null){lineSplitBlock = line.split(" ");testWordTFMap.clear();for(int i = 2; i < lineSplitBlock.length; i = i + 2){testWordTFMap.put(lineSplitBlock[i], Double.valueOf(lineSplitBlock[i+1]));}TreeMap<String,Double> tempMap = new TreeMap<String,Double>();tempMap.putAll(testWordTFMap);testFileNameWordTFMap.put(lineSplitBlock[0]+"_"+lineSplitBlock[1], tempMap);}testSamplesBR.close();//下面遍历每一个测试样例计算与所有训练样本的距离，做分类String classifyResult;FileWriter testYangliuWriter = new FileWriter(new File("F:/DataMiningSample/docVector/yangliuTest"));FileWriter KNNClassifyResWriter = new FileWriter(kNNResultFile);Set<Map.Entry<String,Map<String,Double>>> testFileNameWordTFMapSet = testFileNameWordTFMap.entrySet();for(Iterator<Map.Entry<String,Map<String,Double>>> it = testFileNameWordTFMapSet.iterator(); it.hasNext();){Map.Entry<String, Map<String,Double>> me = it.next();classifyResult = KNNComputeCate(me.getKey(), me.getValue(), trainFileNameWordTFMap, testYangliuWriter);KNNClassifyResWriter.append(me.getKey()+" "+classifyResult+"\n");KNNClassifyResWriter.flush();testClassifyCateMap.put(me.getKey(), classifyResult);}KNNClassifyResWriter.close();//计算分类的准确率double righteCount = 0;Set<Map.Entry<String, String>> testClassifyCateMapSet = testClassifyCateMap.entrySet();for(Iterator <Map.Entry<String, String>> it = testClassifyCateMapSet.iterator(); it.hasNext();){Map.Entry<String, String> me = it.next();String rightCate = me.getKey().split("_")[0];if(me.getValue().equals(rightCate)){righteCount++;}}testYangliuWriter.close();return righteCount / testClassifyCateMap.size();}/**对于每一个测试样本去计算它与所有训练样本的向量夹角余弦相似度 * 相似度保存入map<String,double>有序map中去，然后取前K个样本， * 针对这k个样本来给它们所属的类目计算权重得分，对属于同一个类 * 目的权重求和进而得到最大得分的类目，就可以判断测试样例属于该 * 类目下。K值可以反复测试，找到分类准确率最高的那个值 * @param testWordTFMap 当前测试文件的<单词,词频>向量 * @param trainFileNameWordTFMap 训练样本<类目_文件名,向量>Map * @param testYangliuWriter  * @return String K个邻居权重得分最大的类目 * @throws IOException  */private String KNNComputeCate(String testFileName,Map<String, Double> testWordTFMap,Map<String, TreeMap<String, Double>> trainFileNameWordTFMap, FileWriter testYangliuWriter) throws IOException {// TODO Auto-generated method stubHashMap<String,Double> simMap = new HashMap<String,Double>();//<类目_文件名,距离> 后面需要将该HashMap按照value排序double similarity;Set<Map.Entry<String,TreeMap<String,Double>>> trainFileNameWordTFMapSet = trainFileNameWordTFMap.entrySet();for(Iterator<Map.Entry<String,TreeMap<String,Double>>> it = trainFileNameWordTFMapSet.iterator(); it.hasNext();){Map.Entry<String, TreeMap<String,Double>> me = it.next();similarity = computeSim(testWordTFMap, me.getValue());simMap.put(me.getKey(),similarity);}//下面对simMap按照value排序ByValueComparator bvc = new ByValueComparator(simMap);TreeMap<String,Double> sortedSimMap = new TreeMap<String,Double>(bvc);sortedSimMap.putAll(simMap);//在disMap中取前K个最近的训练样本对其类别计算距离之和，K的值通过反复试验而得Map<String,Double> cateSimMap = new TreeMap<String,Double>();//K个最近训练样本所属类目的距离之和double K = 20;double count = 0;double tempSim;Set<Map.Entry<String, Double>> simMapSet = sortedSimMap.entrySet();for(Iterator<Map.Entry<String, Double>> it = simMapSet.iterator(); it.hasNext();){Map.Entry<String, Double> me = it.next();count++;String categoryName = me.getKey().split("_")[0];if(cateSimMap.containsKey(categoryName)){tempSim = cateSimMap.get(categoryName);cateSimMap.put(categoryName, tempSim + me.getValue());}else cateSimMap.put(categoryName, me.getValue());if (count > K) break;}//下面到cateSimMap里面把sim最大的那个类目名称找出来//testYangliuWriter.flush();//testYangliuWriter.close();double maxSim = 0;String bestCate = null;Set<Map.Entry<String, Double>> cateSimMapSet = cateSimMap.entrySet();for(Iterator<Map.Entry<String, Double>> it = cateSimMapSet.iterator(); it.hasNext();){Map.Entry<String, Double> me = it.next();if(me.getValue()> maxSim){bestCate = me.getKey();maxSim = me.getValue();}}return bestCate;}/**计算测试样本向量和训练样本向量的相似度 * @param testWordTFMap 当前测试文件的<单词,词频>向量 * @param trainWordTFMap 当前训练样本<单词,词频>向量 * @return Double 向量之间的相似度 以向量夹角余弦计算 * @throws IOException  */private double computeSim(Map<String, Double> testWordTFMap,Map<String, Double> trainWordTFMap) {// TODO Auto-generated method stubdouble mul = 0, testAbs = 0, trainAbs = 0;Set<Map.Entry<String, Double>> testWordTFMapSet = testWordTFMap.entrySet();for(Iterator<Map.Entry<String, Double>> it = testWordTFMapSet.iterator(); it.hasNext();){Map.Entry<String, Double> me = it.next();if(trainWordTFMap.containsKey(me.getKey())){mul += me.getValue()*trainWordTFMap.get(me.getKey());}testAbs += me.getValue() * me.getValue();}testAbs = Math.sqrt(testAbs);Set<Map.Entry<String, Double>> trainWordTFMapSet = trainWordTFMap.entrySet();for(Iterator<Map.Entry<String, Double>> it = trainWordTFMapSet.iterator(); it.hasNext();){Map.Entry<String, Double> me = it.next();trainAbs += me.getValue()*me.getValue();}trainAbs = Math.sqrt(trainAbs);return mul / (testAbs * trainAbs);}/**根据KNN算法分类结果文件生成正确类目文件，而正确率和混淆矩阵的计算可以复用贝叶斯算法类中的方法 * @param kNNRightFile 分类正确类目文件 * @param kNNResultFile 分类结果文件 * @throws IOException  */private void createRightFile(String kNNResultFile, String kNNRightFile) throws IOException {// TODO Auto-generated method stubString rightCate;FileReader fileR = new FileReader(kNNResultFile);FileWriter KNNRrightResult = new FileWriter(new File(kNNRightFile));BufferedReader fileBR = new BufferedReader(fileR);String line;String lineBlock[];while((line = fileBR.readLine()) != null){lineBlock = line.split(" ");rightCate = lineBlock[0].split("_")[0];KNNRrightResult.append(lineBlock[0]+" "+rightCate+"\n");}KNNRrightResult.flush();KNNRrightResult.close();}/** * @param args * @throws IOException  */public void KNNClassifierMain(String[] args) throws IOException {// TODO Auto-generated method stub//wordMap是所有属性词的词典<单词，在所有文档中出现的次数>double[] accuracyOfEveryExp = new double[10];double accuracyAvg,sum = 0;KNNClassifier knnClassifier = new KNNClassifier();NaiveBayesianClassifier nbClassifier = new NaiveBayesianClassifier();Map<String,Double> wordMap = new TreeMap<String,Double>();Map<String,Double> IDFPerWordMap = new TreeMap<String,Double>();ComputeWordsVector computeWV = new ComputeWordsVector();wordMap = computeWV.countWords("F:/DataMiningSample/processedSample_includeNotSpecial", wordMap);IDFPerWordMap = computeWV.computeIDF("F:/DataMiningSample/processedSampleOnlySpecial",wordMap);computeWV.printWordMap(wordMap);//首先生成KNN算法10次试验需要的文档TF矩阵文件for(int i = 0; i < 10; i++){computeWV.computeTFMultiIDF("F:/DataMiningSample/processedSampleOnlySpecial",0.9, i, IDFPerWordMap,wordMap);String trainFiles = "F:/DataMiningSample/docVector/wordTFIDFMapTrainSample"+i;String testFiles = "F:/DataMiningSample/docVector/wordTFIDFMapTestSample"+i;String kNNResultFile = "F:/DataMiningSample/docVector/KNNClassifyResult"+i;String kNNRightFile = "F:/DataMiningSample/docVector/KNNClassifyRight"+i;accuracyOfEveryExp[i] = knnClassifier.doProcess(trainFiles, testFiles, kNNResultFile);knnClassifier.createRightFile(kNNResultFile,kNNRightFile);accuracyOfEveryExp[i] = nbClassifier.computeAccuracy(kNNResultFile, kNNRightFile);//计算准确率复用贝叶斯算法中的方法sum += accuracyOfEveryExp[i];System.out.println("The accuracy for KNN Classifier in "+i+"th Exp is :" + accuracyOfEveryExp[i]);}accuracyAvg = sum / 10;System.out.println("The average accuracy for KNN Classifier in all Exps is :" + accuracyAvg);}//对HashMap按照value做排序static class ByValueComparator implements Comparator<Object> {HashMap<String, Double> base_map;public ByValueComparator(HashMap<String, Double> disMap) {this.base_map = disMap;}@Overridepublic int compare(Object o1, Object o2) {// TODO Auto-generated method stubString arg0 = o1.toString();String arg1 = o2.toString();if (!base_map.containsKey(arg0) || !base_map.containsKey(arg1)) {return 0;}if (base_map.get(arg0) < base_map.get(arg1)) {return 1;} else if (base_map.get(arg0) == base_map.get(arg1)) {return 0;} else {return -1;}}}}

分类器主类

package com.pku.yangliu;/**分类器主分类，依次执行数据预处理、朴素贝叶斯分类、KNN分类 * */public class ClassifierMain {public static void main(String[] args) throws Exception {// TODO Auto-generated method stubDataPreProcess DataPP = new DataPreProcess();NaiveBayesianClassifier nbClassifier = new NaiveBayesianClassifier();KNNClassifier knnClassifier = new KNNClassifier();//DataPP.BPPMain(args);nbClassifier.NaiveBayesianClassifierMain(args);knnClassifier.KNNClassifierMain(args);}}

5 KNN算法的分类结果

用混淆矩阵表示如下，第6次实验准确率达到82.10%

程序运行环境硬件环境：Intel Core 2 Duo CPU T5750 2GHZ, 2G内存，相同硬件环境计算和贝叶斯算法做对比

实验结果如上所示取出现次数大于等于4次的词共计30095个作为特征词： 10次交叉验证实验平均准确率78.19%，用时1h55min，10词实验准确率范围73.62%-82.10%，其中有3次实验准确率超过80%

6 朴素贝叶斯与KNN分类准确率对比
取出现次数大于等于4次的词共计30095个作为特征词，做10次交叉验证实验，朴素贝叶斯和KNN算法对Newsgroup文档分类结果对比：

点击打开链接

结论
分类准确率上，KNN算法更优
分类速度上，朴素贝叶斯算法更优