使用mahout下的朴素贝叶斯分类器对新闻分类

来源:互联网 发布:标准下载 知乎 编辑:程序博客网 时间:2024/05/12 20:47

转载地址:http://www.letiantian.me/2014-10-22-mahout-naive-bayes-newsgroups/




mahout版本是0.9;hadoop版本是1.2.1。

下载数据集20 newsgroups dataset,解压后得到20news-bydate目录:

$ cp -R 20news-bydate/*/* 20news-all 

20news-all目录下,一个子目录代表一个分类,每个分类下有多个文本文件,每个文本文件代表一个样本。

然后,复制到hdfs中:

$ hadoop  distcp file:///home/sunlt/20news-all 20news-all  

查看一下:

$ hadoop  dfs -ls                                              Warning: $HADOOP_HOME is deprecated.Found 6 items  ......drwxr-xr-x   - sunlt supergroup          0 2014-10-13 10:09 /user/sunlt/20news-all  ......

转换成的序列文件:


$ mahout seqdirectory \                 -i 20news-all \-o 20news-seq \-ow

转换成< Text, VectorWritable >的序列文件:


$ mahout seq2sparse -i 20news-seq -o 20news-vectors -lnorm  -nv -wt tfidf

此处使用了TF-IDF(词频-逆文档频率)生成向量,也可以指定为TF

更多参数,可以使用mahout seq2sparse --help查看。

拆分数据


将60%的数据用于训练,40%的数据用于测试。

$ mahout split \        -i 20news-vectors/tfidf-vectors \        --trainingOutput 20news-train-vectors \        --testOutput 20news-test-vectors  \        --randomSelectionPct 40 \        --overwrite --sequenceFiles -xm sequential

此时:

$ hadoop dfs -ls                                                         Warning: $HADOOP_HOME is deprecated.Found 10 items  drwxr-xr-x   - sunlt supergroup          0 2014-10-13 09:52 /user/sunlt/20_newsgroups  drwxr-xr-x   - sunlt supergroup          0 2014-10-13 10:09 /user/sunlt/20news-all  drwxr-xr-x   - sunlt supergroup          0 2014-10-13 10:28 /user/sunlt/20news-seq  drwxr-xr-x   - sunlt supergroup          0 2014-10-13 11:17 /user/sunlt/20news-test-vectors  drwxr-xr-x   - sunlt supergroup          0 2014-10-13 11:17 /user/sunlt/20news-train-vectors  drwxr-xr-x   - sunlt supergroup          0 2014-10-13 10:55 /user/sunlt/20news-vectors  

训练:


$ mahout trainnb \        -i 20news-train-vectors\        -el  \        -o model \        -li labelindex \        -ow \        -c

测试训练好的贝叶斯分类器:


$ mahout testnb \        -i 20news-test-vectors \        -m model \        -l labelindex \        -ow \        -o 20news-testing  \        -c

运行结果:

.......=======================================================Summary  -------------------------------------------------------Correctly Classified Instances          :       3023       93.7946%  Incorrectly Classified Instances        :        200        6.2054%  Total Classified Instances              :       3223=======================================================Confusion Matrix  -------------------------------------------------------a        b       c       d       e       f       g       h       i       <--Classified as  306      1       0       0       1       2       1       1       16       |  328     a     = alt.atheism  0        391     2       1       3       1       0       0       2        |  400     b     = comp.windows.x  0        7       345     6       12      12      2       5       1        |  390     c     = misc.forsale  0        1       2       399     2       1       1       0       0        |  406     d     = rec.motorcycles  0        5       10      2       371     3       0       3       1        |  395     e     = sci.electronics  0        5       0       1       4       371     1       2       1        |  385     f     = sci.med  1        3       0       2       1       1       317     8       0        |  333     g     = talk.politics.guns  1        0       0       1       2       3       11      311     5        |  334     h     = talk.politics.misc  22       1       2       1       0       0       8       6       212      |  252     i     = talk.religion.misc=======================================================Statistics  -------------------------------------------------------Kappa                                       0.9105  Accuracy                                   93.7946%  Reliability                                84.0504%  Reliability (standard deviation)            0.2984  

对角线代表预测正确的数量。

补充


导出数据:

$ mahout seqdumper -i 20news-testing/part-m-00000 -o ./20news_testing.res

查看:

$ gedit ./20news_testing.res 

也可以这样查看:

$ hadoop dfs -cat 20news-testing/part-m-00000

由于是sequence文件,查看的结果是乱码。

将hdfs中某个目录中的文件拷贝到本地:

$ hadoop dfs -copyToLocal 20news-vectors/* .

参考


Twenty Newsgroups Classification Example

贝叶斯分类算法示例


0 0
原创粉丝点击