Parsing the Output of Mahout Naive Bayes Classification


Environment: Mahout 0.7, Hadoop 1.0.4.

To run this example, see: http://blog.csdn.net/fansy1990/article/details/11681565.

First, here is the raw data:

0.2,0.3,0.4:1
0.32,0.43,0.45:1
0.23,0.33,0.54:1
2.4,2.5,2.6:2
2.3,2.2,2.1:2
5.4,7.2,7.2:3
5.6,7,6:3
5.8,7.1,6.3:3
6,6,5.4:3
11,12,13:4

The first three columns are each sample's attributes; the last column is the sample's label, i.e. its class.

Here is the output of the run:
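To make the record layout concrete, here is a minimal sketch in plain Java that splits one line into its attribute values and its label. The class name LabeledRecord and the parse method are hypothetical helpers written for illustration; they are not part of Mahout or of the original job, and the separators "," and ":" are simply the ones visible in the sample data.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

/** Hypothetical helper (not Mahout code): one input line split into features and label. */
class LabeledRecord {
    final double[] features;
    final String label;

    LabeledRecord(double[] features, String label) {
        this.features = features;
        this.label = label;
    }

    /** Parses a line such as "0.2,0.3,0.4:1" using the given separators. */
    static LabeledRecord parse(String line, String vectorSep, String labelSep) {
        int cut = line.lastIndexOf(labelSep);
        if (cut < 0) {
            throw new IllegalArgumentException("no label separator in: " + line);
        }
        // Everything before the last label separator is the attribute list.
        String[] parts = line.substring(0, cut).split(Pattern.quote(vectorSep));
        double[] features = new double[parts.length];
        for (int i = 0; i < parts.length; i++) {
            features[i] = Double.parseDouble(parts[i].trim());
        }
        // Everything after it is the label.
        String label = line.substring(cut + labelSep.length()).trim();
        return new LabeledRecord(features, label);
    }

    public static void main(String[] args) {
        LabeledRecord r = parse("0.2,0.3,0.4:1", ",", ":");
        System.out.println(Arrays.toString(r.features) + " -> label " + r.label);
    }
}
```

The label is kept as a string here because nothing in the data format requires it to be numeric.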

mahout@ubuntu:~/hadoop-1.0.4/bin$ ./hadoop jar ../lib/mahout.jar mahout.fansy.bayes.BayesRunner -i /bayes/input/bayes.txt -o /bayes/output -scv , -scl : --tempDir /bayes/temp
Warning: $HADOOP_HOME is deprecated.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/mahout/hadoop-1.0.4/lib/mahout-examples-0.7-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/mahout/hadoop-1.0.4/lib/slf4j-log4j12-1.4.3.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
14/01/19 22:23:59 INFO common.AbstractJob: Command line arguments: {--endPhase=[2147483647], --input=[/bayes/input/bayes.txt], --output=[/bayes/output], --splitCharacterLabel=[:], --splitCharacterVector=[,], --startPhase=[0], --tempDir=[/bayes/temp]}
***********************************转换数据开始 (data conversion starts)
14/01/19 22:24:00 WARN fs.FileSystem: "ubuntu:9000" is a deprecated filesystem name. Use "hdfs://ubuntu:9000/" instead.
14/01/19 22:24:00 INFO common.AbstractJob: Command line arguments: {--endPhase=[2147483647], --input=[/bayes/input/bayes.txt], --output=[/bayes/output/transform], --splitCharacterLabel=[:], --splitCharacterVector=[,], --startPhase=[0], --tempDir=[temp]}
14/01/19 22:24:02 INFO input.FileInputFormat: Total input paths to process : 1
14/01/19 22:24:03 INFO util.NativeCodeLoader: Loaded the native-hadoop library
14/01/19 22:24:03 WARN snappy.LoadSnappy: Snappy native library not loaded
14/01/19 22:24:07 INFO mapred.JobClient: Running job: job_201401030100_0025
14/01/19 22:24:09 INFO mapred.JobClient:  map 0% reduce 0%
14/01/19 22:24:45 INFO mapred.JobClient:  map 100% reduce 0%
14/01/19 22:24:51 INFO mapred.JobClient: Job complete: job_201401030100_0025
14/01/19 22:24:51 INFO mapred.JobClient: Counters: 19
14/01/19 22:24:51 INFO mapred.JobClient:   Job Counters 
14/01/19 22:24:51 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=30542
14/01/19 22:24:51 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
14/01/19 22:24:51 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
14/01/19 22:24:51 INFO mapred.JobClient:     Launched map tasks=1
14/01/19 22:24:51 INFO mapred.JobClient:     Data-local map tasks=1
14/01/19 22:24:51 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
14/01/19 22:24:51 INFO mapred.JobClient:   File Output Format Counters 
14/01/19 22:24:51 INFO mapred.JobClient:     Bytes Written=520
14/01/19 22:24:51 INFO mapred.JobClient:   FileSystemCounters
14/01/19 22:24:51 INFO mapred.JobClient:     HDFS_BYTES_READ=240
14/01/19 22:24:51 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=21938
14/01/19 22:24:51 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=520
14/01/19 22:24:51 INFO mapred.JobClient:   File Input Format Counters 
14/01/19 22:24:51 INFO mapred.JobClient:     Bytes Read=135
14/01/19 22:24:51 INFO mapred.JobClient:   Map-Reduce Framework
14/01/19 22:24:51 INFO mapred.JobClient:     Map input records=10
14/01/19 22:24:51 INFO mapred.JobClient:     Physical memory (bytes) snapshot=66392064
14/01/19 22:24:51 INFO mapred.JobClient:     Spilled Records=0
14/01/19 22:24:51 INFO mapred.JobClient:     CPU time spent (ms)=1810
14/01/19 22:24:51 INFO mapred.JobClient:     Total committed heap usage (bytes)=15728640
14/01/19 22:24:51 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=974467072
14/01/19 22:24:51 INFO mapred.JobClient:     Map output records=10
14/01/19 22:24:51 INFO mapred.JobClient:     SPLIT_RAW_BYTES=105
***********************************写入indexLabel任务开始 (writing indexLabel starts)
labels number is : 4
***********************************BayesJob1开始执行 (BayesJob1 starts)
14/01/19 22:24:52 WARN fs.FileSystem: "ubuntu:9000" is a deprecated filesystem name. Use "hdfs://ubuntu:9000/" instead.
14/01/19 22:24:52 INFO common.AbstractJob: Command line arguments: {--endPhase=[2147483647], --input=[/bayes/output/transform], --labelIndex=[/bayes/output/labelIndex.bin], --output=[/bayes/output/job1], --startPhase=[0], --tempDir=[temp]}
14/01/19 22:24:52 INFO input.FileInputFormat: Total input paths to process : 1
14/01/19 22:24:53 INFO mapred.JobClient: Running job: job_201401030100_0026
14/01/19 22:24:54 INFO mapred.JobClient:  map 0% reduce 0%
14/01/19 22:26:11 INFO mapred.JobClient:  map 100% reduce 0%
14/01/19 22:26:41 INFO mapred.JobClient:  map 100% reduce 100%
14/01/19 22:26:47 INFO mapred.JobClient: Job complete: job_201401030100_0026
14/01/19 22:26:47 INFO mapred.JobClient: Counters: 29
14/01/19 22:26:47 INFO mapred.JobClient:   Job Counters 
14/01/19 22:26:47 INFO mapred.JobClient:     Launched reduce tasks=1
14/01/19 22:26:47 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=70509
14/01/19 22:26:47 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
14/01/19 22:26:47 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
14/01/19 22:26:47 INFO mapred.JobClient:     Launched map tasks=1
14/01/19 22:26:47 INFO mapred.JobClient:     Data-local map tasks=1
14/01/19 22:26:47 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=29595
14/01/19 22:26:47 INFO mapred.JobClient:   File Output Format Counters 
14/01/19 22:26:47 INFO mapred.JobClient:     Bytes Written=277
14/01/19 22:26:47 INFO mapred.JobClient:   FileSystemCounters
14/01/19 22:26:47 INFO mapred.JobClient:     FILE_BYTES_READ=162
14/01/19 22:26:47 INFO mapred.JobClient:     HDFS_BYTES_READ=780
14/01/19 22:26:47 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=45819
14/01/19 22:26:47 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=277
14/01/19 22:26:47 INFO mapred.JobClient:   File Input Format Counters 
14/01/19 22:26:47 INFO mapred.JobClient:     Bytes Read=520
14/01/19 22:26:47 INFO mapred.JobClient:   Map-Reduce Framework
14/01/19 22:26:47 INFO mapred.JobClient:     Map output materialized bytes=162
14/01/19 22:26:47 INFO mapred.JobClient:     Map input records=10
14/01/19 22:26:47 INFO mapred.JobClient:     Reduce shuffle bytes=0
14/01/19 22:26:47 INFO mapred.JobClient:     Spilled Records=8
14/01/19 22:26:47 INFO mapred.JobClient:     Map output bytes=370
14/01/19 22:26:47 INFO mapred.JobClient:     Total committed heap usage (bytes)=131207168
14/01/19 22:26:47 INFO mapred.JobClient:     CPU time spent (ms)=40250
14/01/19 22:26:47 INFO mapred.JobClient:     Combine input records=10
14/01/19 22:26:47 INFO mapred.JobClient:     SPLIT_RAW_BYTES=119
14/01/19 22:26:47 INFO mapred.JobClient:     Reduce input records=4
14/01/19 22:26:47 INFO mapred.JobClient:     Reduce input groups=4
14/01/19 22:26:47 INFO mapred.JobClient:     Combine output records=4
14/01/19 22:26:47 INFO mapred.JobClient:     Physical memory (bytes) snapshot=245063680
14/01/19 22:26:47 INFO mapred.JobClient:     Reduce output records=4
14/01/19 22:26:47 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=1956704256
14/01/19 22:26:47 INFO mapred.JobClient:     Map output records=10
***********************************BayesJob2开始执行 (BayesJob2 starts)
14/01/19 22:26:47 WARN fs.FileSystem: "ubuntu:9000" is a deprecated filesystem name. Use "hdfs://ubuntu:9000/" instead.
14/01/19 22:26:47 INFO common.AbstractJob: Command line arguments: {--endPhase=[2147483647], --input=[/bayes/output/job1], --labelNumber=[4], --output=[/bayes/output/job2], --startPhase=[0], --tempDir=[temp]}
14/01/19 22:26:47 INFO input.FileInputFormat: Total input paths to process : 1
14/01/19 22:26:48 INFO mapred.JobClient: Running job: job_201401030100_0027
14/01/19 22:26:49 INFO mapred.JobClient:  map 0% reduce 0%
14/01/19 22:27:04 INFO mapred.JobClient:  map 100% reduce 0%
14/01/19 22:27:13 INFO mapred.JobClient:  map 100% reduce 33%
14/01/19 22:27:19 INFO mapred.JobClient:  map 100% reduce 100%
14/01/19 22:27:24 INFO mapred.JobClient: Job complete: job_201401030100_0027
14/01/19 22:27:24 INFO mapred.JobClient: Counters: 29
14/01/19 22:27:24 INFO mapred.JobClient:   Job Counters 
14/01/19 22:27:24 INFO mapred.JobClient:     Launched reduce tasks=1
14/01/19 22:27:24 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=16104
14/01/19 22:27:24 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
14/01/19 22:27:24 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
14/01/19 22:27:24 INFO mapred.JobClient:     Launched map tasks=1
14/01/19 22:27:24 INFO mapred.JobClient:     Data-local map tasks=1
14/01/19 22:27:24 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=14798
14/01/19 22:27:24 INFO mapred.JobClient:   File Output Format Counters 
14/01/19 22:27:24 INFO mapred.JobClient:     Bytes Written=187
14/01/19 22:27:24 INFO mapred.JobClient:   FileSystemCounters
14/01/19 22:27:24 INFO mapred.JobClient:     FILE_BYTES_READ=91
14/01/19 22:27:24 INFO mapred.JobClient:     HDFS_BYTES_READ=391
14/01/19 22:27:24 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=44843
14/01/19 22:27:24 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=187
14/01/19 22:27:24 INFO mapred.JobClient:   File Input Format Counters 
14/01/19 22:27:24 INFO mapred.JobClient:     Bytes Read=277
14/01/19 22:27:24 INFO mapred.JobClient:   Map-Reduce Framework
14/01/19 22:27:24 INFO mapred.JobClient:     Map output materialized bytes=91
14/01/19 22:27:24 INFO mapred.JobClient:     Map input records=4
14/01/19 22:27:24 INFO mapred.JobClient:     Reduce shuffle bytes=91
14/01/19 22:27:24 INFO mapred.JobClient:     Spilled Records=4
14/01/19 22:27:24 INFO mapred.JobClient:     Map output bytes=81
14/01/19 22:27:24 INFO mapred.JobClient:     Total committed heap usage (bytes)=176033792
14/01/19 22:27:24 INFO mapred.JobClient:     CPU time spent (ms)=3770
14/01/19 22:27:24 INFO mapred.JobClient:     Combine input records=2
14/01/19 22:27:24 INFO mapred.JobClient:     SPLIT_RAW_BYTES=114
14/01/19 22:27:24 INFO mapred.JobClient:     Reduce input records=2
14/01/19 22:27:24 INFO mapred.JobClient:     Reduce input groups=2
14/01/19 22:27:24 INFO mapred.JobClient:     Combine output records=2
14/01/19 22:27:24 INFO mapred.JobClient:     Physical memory (bytes) snapshot=248565760
14/01/19 22:27:24 INFO mapred.JobClient:     Reduce output records=2
14/01/19 22:27:24 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=1957756928
14/01/19 22:27:24 INFO mapred.JobClient:     Map output records=2
***********************************写入bayesian model 任务开始 (writing the Bayesian model starts)
Write bayesian model to '/bayes/output/model/naiveBayesModel.bin'
***********************************分类任务开始 (classification starts)
14/01/19 22:27:24 WARN fs.FileSystem: "ubuntu:9000" is a deprecated filesystem name. Use "hdfs://ubuntu:9000/" instead.
14/01/19 22:27:24 INFO common.AbstractJob: Command line arguments: {--endPhase=[2147483647], --input=[/bayes/output/transform], --labelNumber=[4], --model=[/bayes/output/model], --output=[/bayes/output/classified], --startPhase=[0], --tempDir=[temp]}
14/01/19 22:27:25 INFO input.FileInputFormat: Total input paths to process : 1
14/01/19 22:27:25 INFO mapred.JobClient: Running job: job_201401030100_0028
14/01/19 22:27:26 INFO mapred.JobClient:  map 0% reduce 0%
14/01/19 22:27:40 INFO mapred.JobClient:  map 100% reduce 0%
14/01/19 22:27:45 INFO mapred.JobClient: Job complete: job_201401030100_0028
14/01/19 22:27:45 INFO mapred.JobClient: Counters: 19
14/01/19 22:27:45 INFO mapred.JobClient:   Job Counters 
14/01/19 22:27:45 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=15591
14/01/19 22:27:45 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
14/01/19 22:27:45 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
14/01/19 22:27:45 INFO mapred.JobClient:     Launched map tasks=1
14/01/19 22:27:45 INFO mapred.JobClient:     Data-local map tasks=1
14/01/19 22:27:45 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
14/01/19 22:27:45 INFO mapred.JobClient:   File Output Format Counters 
14/01/19 22:27:45 INFO mapred.JobClient:     Bytes Written=530
14/01/19 22:27:45 INFO mapred.JobClient:   FileSystemCounters
14/01/19 22:27:45 INFO mapred.JobClient:     HDFS_BYTES_READ=847
14/01/19 22:27:45 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=22519
14/01/19 22:27:45 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=530
14/01/19 22:27:45 INFO mapred.JobClient:   File Input Format Counters 
14/01/19 22:27:45 INFO mapred.JobClient:     Bytes Read=520
14/01/19 22:27:45 INFO mapred.JobClient:   Map-Reduce Framework
14/01/19 22:27:45 INFO mapred.JobClient:     Map input records=10
14/01/19 22:27:45 INFO mapred.JobClient:     Physical memory (bytes) snapshot=70705152
14/01/19 22:27:45 INFO mapred.JobClient:     Spilled Records=0
14/01/19 22:27:45 INFO mapred.JobClient:     CPU time spent (ms)=380
14/01/19 22:27:45 INFO mapred.JobClient:     Total committed heap usage (bytes)=15728640
14/01/19 22:27:45 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=974467072
14/01/19 22:27:45 INFO mapred.JobClient:     Map output records=10
14/01/19 22:27:45 INFO mapred.JobClient:     SPLIT_RAW_BYTES=119
***********************************打印测试信息开始 (printing test information starts)
14/01/19 22:27:46 INFO bayes.AnalyzeBayesModel: Standard NB Results: 
=======================================================
Summary
-------------------------------------------------------
Correctly Classified Instances          :          7        70%
Incorrectly Classified Instances        :          3        30%
Total Classified Instances              :         10
=======================================================
Confusion Matrix
-------------------------------------------------------
a    b    c    d    <--Classified as
3    0    0    0     |  3     a     = 1
0    1    0    1     |  2     b     = 2
1    1    2    0     |  4     c     = 3
0    0    0    1     |  1     d     = 4

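Look at the confusion matrix at the end. The first row shows that all 3 records labeled 1 (a) were classified as a. The second row shows that, of the 2 records labeled 2, one was classified correctly and one was classified as 4. The third row shows that, of the 4 records labeled 3, one was classified as 1, one as 2, and two as 3. The fourth row shows that the single record labeled 4 was classified as 4.

Putting this together, 3 of the 10 records were misclassified, which agrees with the Summary.

Next, the classified data is read back. The result is: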
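The consistency check between the confusion matrix and the Summary can be reproduced with a few lines of plain Java. This is only an illustrative sketch: the class ConfusionMatrixCheck and its methods are hypothetical names, and the matrix is copied by hand from the log above (rows are true labels, columns are predicted labels).

```java
/** Hypothetical check (not Mahout code): recomputes the Summary from the confusion matrix. */
class ConfusionMatrixCheck {

    /** Sum of the diagonal: records whose predicted class equals their true class. */
    static int correct(int[][] matrix) {
        int sum = 0;
        for (int i = 0; i < matrix.length; i++) {
            sum += matrix[i][i];
        }
        return sum;
    }

    /** Sum of all cells: total number of classified records. */
    static int total(int[][] matrix) {
        int sum = 0;
        for (int[] row : matrix) {
            for (int cell : row) {
                sum += cell;
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        int[][] matrix = {
            {3, 0, 0, 0},  // true label 1
            {0, 1, 0, 1},  // true label 2
            {1, 1, 2, 0},  // true label 3
            {0, 0, 0, 1},  // true label 4
        };
        int correct = correct(matrix);
        int total = total(matrix);
        System.out.println("correct=" + correct
                + " incorrect=" + (total - correct)
                + " total=" + total);
    }
}
```

With the matrix from this run the diagonal sums to 7 and the off-diagonal cells to 3, out of 10 records.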

key:1value:{0:-0.9648420640410421,1:-0.9887510598012988,2:-0.9825450990081924,3:-0.9751164124745173}  1
key:1value:{0:-1.3075365110427406,1:-1.3183347464017319,2:-1.311911294294109,3:-1.3105998253880884}   1
key:1value:{0:-1.1693011980970653,1:-1.2084735175349208,2:-1.2029164271675432,3:-1.1868650353368242}  1
key:2value:{0:-8.268914090881214,1:-8.239592165010823,2:-8.24988457628885,3:-8.239013935827634}       4  <--2
key:2value:{0:-7.335239784640972,1:-7.250841105209526,2:-7.275795216083463,3:-7.2793125913358425}     2
key:3value:{0:-21.62735541679608,1:-21.752523315628576,2:-21.647378264642075,3:-21.65117653755887}    1  <--3
key:3value:{0:-20.51606634444014,1:-20.434188569226844,2:-20.35905447985551,3:-20.437779899276315}    3
key:3value:{0:-21.16521415402051,1:-21.093355942427706,2:-21.02858358865721,3:-21.09072342236577}     3
key:3value:{0:-19.34824288117226,1:-19.11585382282511,2:-19.158537241429514,3:-19.19592701923623}     2  <--3
key:4value:{0:-39.52871529566568,1:-39.55004239205195,2:-39.55547612441188,3:-39.46710853846247}      4

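Everything after the closing } on each line was added by me: the predicted label and, for misclassified records, an arrow pointing to the true label. As the output shows, each record yields a 4-dimensional vector (its dimension equals the total number of labels), and this vector determines the record's label. How exactly is that decided? Look at the analyzeResults method in TestNaiveBayesDriver: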

private static void analyzeResults(Map<Integer, String> labelMap,
                                   SequenceFileDirIterable<Text, VectorWritable> dirIterable,
                                   ResultAnalyzer analyzer) {
  for (Pair<Text, VectorWritable> pair : dirIterable) {
    int bestIdx = Integer.MIN_VALUE;
    double bestScore = Long.MIN_VALUE;
    for (Vector.Element element : pair.getSecond().get()) {
      if (element.get() > bestScore) {
        bestScore = element.get();
        bestIdx = element.index();
      }
    }
    if (bestIdx != Integer.MIN_VALUE) {
      ClassifierResult classifierResult = new ClassifierResult(labelMap.get(bestIdx), bestScore);
      analyzer.addInstance(pair.getFirst().toString(), classifierResult);
    }
  }
}

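In other words, it takes the index of the largest value in the vector, and that index identifies the label assigned to the record. For example, the first record: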

value:{0:-0.9648420640410421,1:-0.9887510598012988,2:-0.9825450990081924,3:-0.9751164124745173}

The index of the maximum is 0, so this record is assigned to class 1 (the class label is the index plus 1).

Or take the 4th record:

key:2value:{0:-8.268914090881214,1:-8.239592165010823,2:-8.24988457628885,3:-8.239013935827634}       4  <--2

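Here the index of the maximum is 3, so the record is assigned to class 4. Its correct label is 2, so this record is misclassified.

Finally, here is code that classifies data using the trained model: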
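The same decision rule can be tried without Hadoop or Mahout on the classpath. The sketch below is plain Java under stated assumptions: ArgMaxLabel and its methods are hypothetical names, the scores are copied from the 4th record above, and the "index plus 1" mapping applies to this data set only (in general the label comes from the label index map).

```java
/** Hypothetical stand-alone version of the argmax rule (not Mahout code). */
class ArgMaxLabel {

    /** Returns the index of the largest score, or -1 for an empty array. */
    static int argMax(double[] scores) {
        int bestIdx = -1;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < scores.length; i++) {
            if (scores[i] > bestScore) {
                bestScore = scores[i];
                bestIdx = i;
            }
        }
        return bestIdx;
    }

    /** For this data set only: labels are 1..4, so label = index + 1. */
    static int labelFor(double[] scores) {
        return argMax(scores) + 1;
    }

    public static void main(String[] args) {
        // Scores of the 4th record (true label 2).
        double[] scores = {
            -8.268914090881214, -8.239592165010823,
            -8.24988457628885, -8.239013935827634
        };
        // Prints: index 3 -> label 4
        System.out.println("index " + argMax(scores) + " -> label " + labelFor(scores));
    }
}
```

The scores at index 1 and index 3 differ only in the fourth decimal place, which shows how narrow the margin behind this misclassification is.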

package mahout.fansy.bayes.classify;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.mahout.classifier.naivebayes.AbstractNaiveBayesClassifier;
import org.apache.mahout.classifier.naivebayes.NaiveBayesModel;
import org.apache.mahout.classifier.naivebayes.StandardNaiveBayesClassifier;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;

public class BayesClassifier {

    /**
     * @param args
     * @throws IOException
     */
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        conf.set("fs.default.name", "ubuntu:9000");
        conf.set("mapred.job.tracker", "ubuntu:9001");
        Path model = new Path("/bayes/output/model");
        // Vector vector = new DenseVector(3);
        // The first record of the sample data: 0.2,0.3,0.4
        Vector vector = new RandomAccessSparseVector(3);
        vector.set(0, 0.2);
        vector.set(1, 0.3);
        vector.set(2, 0.4);
        int result = new BayesClassifier().classify(conf, model, vector);
        System.out.println(result);
    }

    /**
     * get bayes model
     * @param conf
     * @param modelPath
     * @return
     * @throws IOException
     */
    public NaiveBayesModel getBayesModel(Configuration conf, Path modelPath) throws IOException {
        NaiveBayesModel model = NaiveBayesModel.materialize(modelPath, conf);
        return model;
    }

    /**
     * get classifier by bayes model
     * @param model
     * @return
     */
    public AbstractNaiveBayesClassifier getClassifier(NaiveBayesModel model) {
        AbstractNaiveBayesClassifier classifier = new StandardNaiveBayesClassifier(model);
        return classifier;
    }

    /**
     * classify the given vector
     * @param classifier
     * @param vector
     */
    public int classify(AbstractNaiveBayesClassifier classifier, Vector vector) {
        Vector result = classifier.classifyFull(vector);
        System.out.println(result);
        // Same rule as analyzeResults: return the index of the highest score.
        int bestIdx = Integer.MIN_VALUE;
        double bestScore = Long.MIN_VALUE;
        for (Vector.Element element : result) {
            if (element.get() > bestScore) {
                bestScore = element.get();
                bestIdx = element.index();
            }
        }
        return bestIdx;
    }

    /**
     * classify the vector
     * @param conf
     * @param model
     * @param vector
     * @return
     * @throws IOException
     */
    public int classify(Configuration conf, Path model, Vector vector) throws IOException {
        return this.classify(this.getClassifier(this.getBayesModel(conf, model)), vector);
    }
}

Running the program prints:

{0:-0.9648420640410421,1:-0.9887510598012988,2:-0.9825450990081924,3:-0.9751164124745173}
0

This matches the result parsed from the classified output above.

Share, grow, enjoy.

When reposting, please credit the blog: http://blog.csdn.net/fansy1990

0 2