[Gandalf] Testing Mahout 0.9 Patched with 1329-3.patch on Hadoop 2.2.0 Using Bayesian Text Classification
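Introduction
This post follows the previous article, "[Gandalf] Patching Mahout 0.9 to Support Hadoop 2.2.0":
http://blog.csdn.net/u010967382/article/details/39088035
With Mahout 0.9 patched and compiled successfully, we use Bayesian text classification to test its compatibility with Hadoop 2.2.0.
Reposting is welcome; please credit the source: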
http://blog.csdn.net/u010967382/article/details/39088285
Step 1: Upload all of the 20news files to HDFS
yarn@singletest:~/Mahout/mahout-distribution-0.7$ hadoop fs -ls /workspace/mahout/week4/data/20news
Found 2 items
drwxr-xr-x - yarn supergroup 0 2014-09-04 21:52 /workspace/mahout/week4/data/20news/20news-bydate-test
drwxr-xr-x - yarn supergroup 0 2014-09-04 21:57 /workspace/mahout/week4/data/20news/20news-bydate-train
Step 2: Create sequence files from the data
yarn@singletest:~/Mahout/mahout-distribution-0.7/bin$ ./mahout seqdirectory -i /workspace/mahout/week4/data/20news -o /workspace/mahout/week4/data/20news_seq
yarn@singletest:~/Mahout/mahout-distribution-0.7/bin$ hadoop fs -ls /workspace/mahout/week4/data/20news_seq
Found 1 items
-rw-r--r-- 1 yarn supergroup 37064977 2014-09-04 22:12 /workspace/mahout/week4/data/20news_seq/chunk-0
Step 3: Convert the sequence files to vectors
yarn@singletest:~/Mahout/mahout-distribution-0.7/bin$ ./mahout seq2sparse -i /workspace/mahout/week4/data/20news_seq/ -o /workspace/mahout/week4/data/20news_vectors -lnorm -nv -wt tfidf
yarn@singletest:~/Mahout/mahout-distribution-0.7/bin$ hadoop fs -ls /workspace/mahout/week4/data/20news_vectors
Found 7 items
drwxr-xr-x - yarn supergroup 0 2014-09-04 22:20 /workspace/mahout/week4/data/20news_vectors/df-count
-rw-r--r-- 1 yarn supergroup 1937084 2014-09-04 22:18 /workspace/mahout/week4/data/20news_vectors/dictionary.file-0
-rw-r--r-- 1 yarn supergroup 1890053 2014-09-04 22:20 /workspace/mahout/week4/data/20news_vectors/frequency.file-0
drwxr-xr-x - yarn supergroup 0 2014-09-04 22:19 /workspace/mahout/week4/data/20news_vectors/tf-vectors
drwxr-xr-x - yarn supergroup 0 2014-09-04 22:21 /workspace/mahout/week4/data/20news_vectors/tfidf-vectors
drwxr-xr-x - yarn supergroup 0 2014-09-04 22:18 /workspace/mahout/week4/data/20news_vectors/tokenized-documents
drwxr-xr-x - yarn supergroup 0 2014-09-04 22:18 /workspace/mahout/week4/data/20news_vectors/wordcount
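The -wt tfidf option asks seq2sparse to weight each term by TF-IDF. As a rough illustration of the idea only, the sketch below computes a textbook tf * log(N / df) weight for a tiny made-up corpus. The documents and the function name tfidf are hypothetical, and Mahout's own Java implementation may use a different formula and normalization.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    n_docs = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency: how many documents contain each term at least once.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts within this document
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors

# Hypothetical toy corpus, not taken from the 20news data.
docs = ["hadoop runs mahout", "mahout trains a model", "hadoop stores data"]
vectors = tfidf(docs)
print(vectors[0])
```

Terms that occur in fewer documents (such as "runs" here) receive a larger weight than terms shared across documents (such as "hadoop").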
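Step 4: Split the vectors into a training set and a test set
Parameters:
- -tr: output path for the training set
- -te: output path for the test set
- -rp: the percentage of the full data set to use as test data; the command below sets it to 20%.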
yarn@singletest:~/Mahout/mahout-distribution-0.7/bin$ ./mahout split -i /workspace/mahout/week4/data/20news_vectors/tfidf-vectors -tr /workspace/mahout/week4/data/train-vectors -te /workspace/mahout/week4/data/test-vectors -rp 20 -ow -seq -xm sequential
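To make the effect of -rp 20 concrete, here is a minimal sketch of a random percentage split. It assumes each item is sent to the test set independently with the given probability; this is an assumption for illustration, and Mahout's split job may select items differently. The function name random_split, the seed, and the document IDs are hypothetical.

```python
import random

def random_split(items, test_pct, seed=42):
    """Send each item to the test set with probability test_pct percent."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for item in items:
        if rng.random() * 100 < test_pct:
            test.append(item)
        else:
            train.append(item)
    return train, test

# Hypothetical document IDs standing in for the tfidf vectors.
doc_ids = list(range(1000))
train, test = random_split(doc_ids, test_pct=20)
print(len(train), len(test))
```

With this approach the test set is only approximately 20% of the data, and every item ends up in exactly one of the two sets.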
Step 5: Train the model
yarn@singletest:~/Mahout/mahout-distribution-0.9/bin$ ./mahout trainnb -i /workspace/mahout/week4/data/train-vectors -el -o /workspace/mahout/week4/nbmodel -li /workspace/mahout/week4/labindex -ow -c
View the generated label index:
yarn@singletest:~$ hadoop fs -text /workspace/mahout/week4/labindex
20news-bydate-test 0
20news-bydate-train 1
View the trained model:
yarn@singletest:~$ hadoop fs -ls /workspace/mahout/week4/nbmodel
Found 1 items
-rw-r--r-- 1 yarn supergroup 2437874 2014-09-05 23:09 /workspace/mahout/week4/nbmodel/naiveBayesModel.bin
Step 6: Test
yarn@singletest:~/Mahout/mahout-distribution-0.9/bin$ ./mahout testnb -i /workspace/mahout/week4/data/test-vectors -m /workspace/mahout/week4/nbmodel -l /workspace/mahout/week4/labindex -ow -o /workspace/mahout/week4/20news-test-result -c
Note: the input path given after -i for testing is the test set produced by the split in Step 4.
Test results:
14/09/05 23:18:09 INFO test.TestNaiveBayesDriver: Complementary Results:
=======================================================
Summary
-------------------------------------------------------
Correctly Classified Instances : 2887 74.9675%
Incorrectly Classified Instances : 964 25.0325%
Total Classified Instances : 3851
=======================================================
Confusion Matrix
-------------------------------------------------------
a b <--Classified as
1131 413 | 1544 a = 20news-bydate-test
551 1756 | 2307 b = 20news-bydate-train
=======================================================
Statistics
-------------------------------------------------------
Kappa 0.486
Accuracy 74.9675%
Reliability 49.7892%
Reliability (standard deviation) 0.4314
14/09/05 23:18:09 INFO driver.MahoutDriver: Program took 17504 ms (Minutes: 0.29173333333333334)
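As a sanity check on the summary, the accuracy and kappa figures can be recomputed from the confusion matrix printed above. The sketch below uses the standard definition of Cohen's kappa; whether Mahout computes its Kappa line in exactly this way is an assumption, although the numbers agree with the log. The function names are hypothetical.

```python
# Confusion matrix copied from the testnb output:
# rows are actual classes, columns are predicted classes.
matrix = [
    [1131, 413],   # a = 20news-bydate-test
    [551, 1756],   # b = 20news-bydate-train
]

def accuracy(m):
    """Fraction of instances on the diagonal (correctly classified)."""
    total = sum(sum(row) for row in m)
    correct = sum(m[i][i] for i in range(len(m)))
    return correct / total

def cohen_kappa(m):
    """Agreement beyond chance: (observed - expected) / (1 - expected)."""
    total = sum(sum(row) for row in m)
    observed = accuracy(m)
    row_totals = [sum(row) for row in m]
    col_totals = [sum(row[j] for row in m) for j in range(len(m))]
    # Chance agreement from the marginal totals of each class.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    return (observed - expected) / (1 - expected)

print(f"Accuracy: {accuracy(matrix):.4%}")   # Accuracy: 74.9675%
print(f"Kappa:    {cohen_kappa(matrix):.3f}")  # Kappa:    0.486
```

Both values match the Accuracy and Kappa lines in the Statistics section of the log: 2887 of 3851 instances lie on the diagonal.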