Principles and Usage of the Bayes Classifier in Mahout 0.8
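1. How to use the Bayes Classifier in Mahout 0.8
The book Mahout in Action describes in detail how to run Bayes classification on 20newsgroups through the bin/mahout ... command line. However, the book uses version 0.5, which has many bugs and produces puzzling errors. The procedure there relies on three commands: prepare20newsgroups, trainclassifier, and testclassifier. Starting with version 0.7, Mahout removed the command-line API for these three commands, but it provides a shell script instead, located at $MAHOUT_HOME/examples/bin/classify-20newsgroups.sh. We first analyze this shell script and from it derive how to run the Bayes classifier from the command line in the newer versions.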
cd $MAHOUT_HOME
./examples/bin/classify-20newsgroups.sh
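If a network problem prevents 20news-bydate.tar.gz from being downloaded while the script runs, the later steps cannot proceed. In that case, download the file manually into the work directory (/tmp/mahout-work-${USER}, where USER is the current login user name) and run the shell script again.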
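A minimal sketch of the manual workaround, assuming the same work directory and download URL that the script itself uses. It only reports whether the archive is present and prints a curl command to run by hand; it does not download anything. The helper name `need_download` is hypothetical.

```shell
#!/usr/bin/env bash
# Work directory and archive location, mirroring the script's WORK_DIR.
WORK_DIR="/tmp/mahout-work-${USER:-$(id -un)}"
TARBALL="${WORK_DIR}/20news-bydate.tar.gz"
URL="http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz"

# need_download (hypothetical helper): succeeds when the archive is missing.
need_download() {
  [ ! -f "$1" ]
}

if need_download "$TARBALL"; then
  echo "Archive missing. Fetch it by hand, for example:"
  echo "  mkdir -p $WORK_DIR && curl $URL -o $TARBALL"
else
  echo "Archive found at $TARBALL; rerun classify-20newsgroups.sh"
fi
```

Because the script checks for the archive with `-f` before calling curl, a file placed at that path by hand is picked up on the next run.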
#!/bin/bash
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
#
# Downloads the 20newsgroups dataset, trains and tests a classifier.
#
# To run: change into the mahout directory and type:
# examples/bin/classify-20newsgroups.sh
if [ "$1" = "--help" ] || [ "$1" = "--?" ]; then
  echo "This script runs SGD and Bayes classifiers over the classic 20 News Groups."
  exit
fi
SCRIPT_PATH=${0%/*}
if [ "$0" != "$SCRIPT_PATH" ] && [ "$SCRIPT_PATH" != "" ]; then
  cd $SCRIPT_PATH
fi
START_PATH=`pwd`
WORK_DIR=/tmp/mahout-work-${USER}
algorithm=(cnaivebayes naivebayes sgd clean)
if [ -n "$1" ]; then
  choice=$1
else
  echo "Please select a number to choose the corresponding task to run"
  echo "1. ${algorithm[0]}"
  echo "2. ${algorithm[1]}"
  echo "3. ${algorithm[2]}"
  echo "4. ${algorithm[3]} -- cleans up the work area in $WORK_DIR"
  read -p "Enter your choice : " choice
fi
echo "ok. You chose $choice and we'll use ${algorithm[$choice-1]}"
alg=${algorithm[$choice-1]}
echo "creating work directory at ${WORK_DIR}"
mkdir -p ${WORK_DIR}
if [ ! -e ${WORK_DIR}/20news-bayesinput ]; then
  if [ ! -e ${WORK_DIR}/20news-bydate ]; then
    if [ ! -f ${WORK_DIR}/20news-bydate.tar.gz ]; then
      echo "Downloading 20news-bydate"
      curl http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz -o ${WORK_DIR}/20news-bydate.tar.gz
    fi
    mkdir -p ${WORK_DIR}/20news-bydate
    echo "Extracting..."
    cd ${WORK_DIR}/20news-bydate && tar xzf ../20news-bydate.tar.gz && cd .. && cd ..
  fi
fi
#echo $START_PATH
cd $START_PATH
cd ../..
set -e
if [ "x$alg" == "xnaivebayes" -o "x$alg" == "xcnaivebayes" ]; then
  c=""
  if [ "x$alg" == "xcnaivebayes" ]; then
    c=" -c"
  fi
  set -x
  echo "Preparing 20newsgroups data"
  rm -rf ${WORK_DIR}/20news-all
  mkdir ${WORK_DIR}/20news-all
  cp -R ${WORK_DIR}/20news-bydate/*/* ${WORK_DIR}/20news-all
  # Convert the 20newsgroups data into sequence files
  echo "Creating sequence files from 20newsgroups data"
  ./bin/mahout seqdirectory \
    -i ${WORK_DIR}/20news-all \
    -o ${WORK_DIR}/20news-seq -ow
  # Convert the sequence files of text into vectors
  echo "Converting sequence files to vectors"
  ./bin/mahout seq2sparse \
    -i ${WORK_DIR}/20news-seq \
    -o ${WORK_DIR}/20news-vectors -lnorm -nv -wt tfidf
  # Randomly split the vectors into a training set and a test set
  # (the message below says 80-20, but --randomSelectionPct 40 selects 40% as test data)
  echo "Creating training and holdout set with a random 80-20 split of the generated vector dataset"
  ./bin/mahout split \
    -i ${WORK_DIR}/20news-vectors/tfidf-vectors \
    --trainingOutput ${WORK_DIR}/20news-train-vectors \
    --testOutput ${WORK_DIR}/20news-test-vectors \
    --randomSelectionPct 40 --overwrite --sequenceFiles -xm sequential
  # Train the Naive Bayes model
  echo "Training Naive Bayes model"
  ./bin/mahout trainnb \
    -i ${WORK_DIR}/20news-train-vectors -el \
    -o ${WORK_DIR}/model \
    -li ${WORK_DIR}/labelindex \
    -ow $c
  # Test on the training data; the resulting error is the training error
  echo "Self testing on training set"
  ./bin/mahout testnb \
    -i ${WORK_DIR}/20news-train-vectors \
    -m ${WORK_DIR}/model \
    -l ${WORK_DIR}/labelindex \
    -ow -o ${WORK_DIR}/20news-testing $c
  # Test on the holdout set; the resulting error is the test error
  echo "Testing on holdout set"
  ./bin/mahout testnb \
    -i ${WORK_DIR}/20news-test-vectors \
    -m ${WORK_DIR}/model \
    -l ${WORK_DIR}/labelindex \
    -ow -o ${WORK_DIR}/20news-testing $c
elif [ "x$alg" == "xsgd" ]; then
  if [ ! -e "/tmp/news-group.model" ]; then
    echo "Training on ${WORK_DIR}/20news-bydate/20news-bydate-train/"
    ./bin/mahout org.apache.mahout.classifier.sgd.TrainNewsGroups ${WORK_DIR}/20news-bydate/20news-bydate-train/
  fi
  echo "Testing on ${WORK_DIR}/20news-bydate/20news-bydate-test/ with model: /tmp/news-group.model"
  ./bin/mahout org.apache.mahout.classifier.sgd.TestNewsGroups --input ${WORK_DIR}/20news-bydate/20news-bydate-test/ --model /tmp/news-group.model
elif [ "x$alg" == "xclean" ]; then
  rm -rf ${WORK_DIR}
  rm -rf /tmp/news-group.model
fi
# Remove the work directory
#
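The way the script turns a menu choice into a task name, and then into the optional -c flag, can be tried in isolation. The sketch below reuses the script's `algorithm` array; the helper names `pick_algorithm` and `extra_flag` are hypothetical and exist only for illustration. It needs bash because it uses arrays.

```shell
#!/usr/bin/env bash
# Same task list as in the script; array index 0 is menu entry 1.
algorithm=(cnaivebayes naivebayes sgd clean)

# pick_algorithm (hypothetical helper): map a 1-based menu choice to a task name.
pick_algorithm() {
  echo "${algorithm[$1-1]}"
}

# extra_flag (hypothetical helper): only cnaivebayes adds -c to trainnb/testnb.
extra_flag() {
  if [ "$1" = "cnaivebayes" ]; then
    echo "-c"
  fi
}

for choice in 1 2 3 4; do
  alg=$(pick_algorithm "$choice")
  echo "choice $choice -> $alg, extra flag: '$(extra_flag "$alg")'"
done
```

Since the script reads the choice from its first argument when one is given, running `./examples/bin/classify-20newsgroups.sh 1` should select cnaivebayes without showing the menu.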
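2. Naive Bayes classification example
From the analysis of the shell script above, we obtain the following way to run the Bayes classifier from the command line.
(1) Data preparation
① Extract 20news-bydate.tar.gz and copy the contents of all subfolders of 20news-bydate into 20news-all on HDFS with the following commands: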
bin/hadoop fs -mkdir 20news-all
bin/hadoop fs -put $DATA_HOME/20news-bydate/*/* 20news-all
② Create sequence files from the 20newsgroups data
bin/mahout seqdirectory -i 20news-all -o 20news-seq
③ Convert the sequence files to vectors
bin/mahout seq2sparse -i 20news-seq -o 20news-vectors -lnorm -nv -wt tfidf
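④ Split the vector dataset into training data and test data at random; --randomSelectionPct 40 selects about 40% of the vectors as test data and leaves about 60% for training
bin/mahout split -i 20news-vectors/tfidf-vectors --trainingOutput 20news-train-vectors --testOutput 20news-test-vectors --randomSelectionPct 40 --overwrite --sequenceFiles -xm sequential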
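As a rough check on the split, the expected sizes of the two sets can be estimated from the percentage. The sketch below assumes that --randomSelectionPct is the percentage of items selected as test data; the document count is a made-up figure for illustration, and because the selection is random the real counts will only be close to these. The helper name `split_sizes` is hypothetical.

```shell
#!/usr/bin/env bash
# split_sizes (hypothetical helper): estimate train/test sizes for a given
# document count and --randomSelectionPct value, using integer arithmetic.
split_sizes() {
  local total=$1
  local pct=$2
  local test_size=$(( total * pct / 100 ))
  local train_size=$(( total - test_size ))
  echo "train=$train_size test=$test_size"
}

# 10000 is an illustrative document count, not a measured one.
split_sizes 10000 40
```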
(2) Train the Naive Bayes model
bin/mahout trainnb -i 20news-train-vectors -el -o model -li labelindex -ow -c
(3) Test the Naive Bayes model on the training set
bin/mahout testnb -i 20news-train-vectors -m model -l labelindex -ow -o 20news-testing -c
(4) Evaluate the model's classification performance
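This step lists no command. Judging from the shell script analyzed earlier, it corresponds to running testnb on the held-out vectors rather than on the training vectors. The sketch below builds that command from the same relative paths used in the earlier steps and only prints it by default; `testnb_cmd` and the `DRY_RUN` switch are hypothetical names introduced for illustration.

```shell
#!/usr/bin/env bash
MAHOUT="${MAHOUT:-bin/mahout}"
DRY_RUN="${DRY_RUN:-1}"   # set DRY_RUN=0 to actually invoke Mahout

# testnb_cmd (hypothetical helper): print the testnb command line for one
# input directory, reusing the model and label index produced by trainnb.
testnb_cmd() {
  echo "$MAHOUT testnb -i $1 -m model -l labelindex -ow -o 20news-testing -c"
}

for input in 20news-train-vectors 20news-test-vectors; do
  cmd=$(testnb_cmd "$input")
  if [ "$DRY_RUN" = "1" ]; then
    echo "would run: $cmd"
  else
    $cmd   # unquoted on purpose so the string splits into arguments
  fi
done
```

Both runs write to 20news-testing with -ow, as in the original script, so the holdout run overwrites the output of the training-set run.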
(5) View the results by converting the sequence file to text
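This step also lists no command. One possible way to do it, offered as an assumption rather than the author's method, is Mahout's seqdumper utility, which prints the key/value pairs of a sequence file as text. The part file name and the local output file below are hypothetical; check the actual file names under 20news-testing first. The sketch only prints the command unless `DRY_RUN` is set to 0.

```shell
#!/usr/bin/env bash
MAHOUT="${MAHOUT:-bin/mahout}"
DRY_RUN="${DRY_RUN:-1}"                # set DRY_RUN=0 to actually invoke Mahout
RESULT="20news-testing/part-m-00000"   # assumed part file name
TEXT_OUT="20news-testing.txt"          # hypothetical local output file

# seqdumper_cmd (hypothetical helper): print the dump command line.
seqdumper_cmd() {
  echo "$MAHOUT seqdumper -i $1 -o $2"
}

cmd=$(seqdumper_cmd "$RESULT" "$TEXT_OUT")
if [ "$DRY_RUN" = "1" ]; then
  echo "would run: $cmd"
else
  # Dump the sequence file, then show the first lines of the text output.
  $cmd && head -n 20 "$TEXT_OUT"
fi
```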