Principles and Usage of the Bayes Classifier in Mahout 0.8

Source: Internet    Time: 2024/06/06 08:45

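1. Using the Bayes Classifier in Mahout 0.8

     The book Mahout in Action describes in detail how to run Bayes classification on 20newsgroups from the bin/mahout command line, but it uses version 0.5, which has many bugs and produces puzzling errors. The process mainly involves three commands: prepare20newsgroups, trainclassifier, and testclassifier. Starting with version 0.7, Mahout removed the command-line entry points for these three commands and provides a shell script instead, located at $MAHOUT_HOME/examples/bin/classify-20newsgroups.sh. The following first walks through this script and then derives from it how to run the Bayes classifier from the command line in the new version.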

cd $MAHOUT_HOME

./examples/bin/classify-20newsgroups.sh

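If a network problem prevents 20news-bydate.tar.gz from being downloaded while the script runs, so that the later steps cannot proceed, download the file manually into the work directory (/tmp/mahout-work-${USER}, where USER is the login name of the current user) and run the shell script again.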
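
The manual download can be sketched as follows. This is a minimal sketch, assuming curl is installed and the URL used by the script is still reachable; the function name fetch_20news and the -f and -L flags are additions for illustration and are not part of the original script.

```shell
#!/bin/bash
# Same work directory as the script; USER is the login name of the current user.
WORK_DIR=/tmp/mahout-work-${USER}
# URL taken from classify-20newsgroups.sh.
TARBALL_URL=http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz

# Hypothetical helper: download the archive only if it is not already present,
# mirroring the "[ ! -f ... ]" check in the script.
fetch_20news() {
  mkdir -p "${WORK_DIR}"
  if [ ! -f "${WORK_DIR}/20news-bydate.tar.gz" ]; then
    # -f: fail on HTTP errors, -L: follow redirects (both added here as an assumption)
    curl -fL "${TARBALL_URL}" -o "${WORK_DIR}/20news-bydate.tar.gz"
  fi
}

# Call fetch_20news, then rerun ./examples/bin/classify-20newsgroups.sh
```

Because the script skips both download and extraction when ${WORK_DIR}/20news-bydate already exists, an empty 20news-bydate directory or a partial archive left behind by a failed run may need to be removed before rerunning.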

#!/bin/bash
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

#
# Downloads the 20newsgroups dataset, trains and tests a classifier.
#
# To run:  change into the mahout directory and type:
# examples/bin/classify-20newsgroups.sh

 

if [ "$1" = "--help" ] || [ "$1" = "--?" ]; then
  echo "This script runs SGD and Bayes classifiers over the classic 20 News Groups."
  exit
fi

SCRIPT_PATH=${0%/*}
if [ "$0" != "$SCRIPT_PATH" ] && [ "$SCRIPT_PATH" != "" ]; then
  cd $SCRIPT_PATH
fi
START_PATH=`pwd`

WORK_DIR=/tmp/mahout-work-${USER}
algorithm=(cnaivebayes naivebayes sgd clean)
if [ -n "$1" ]; then
  choice=$1
else
  echo "Please select a number to choose the corresponding task to run"
  echo "1. ${algorithm[0]}"
  echo "2. ${algorithm[1]}"
  echo "3. ${algorithm[2]}"
  echo "4. ${algorithm[3]} -- cleans up the work area in $WORK_DIR"
  read -p "Enter your choice : " choice
fi

echo "ok. You chose $choice and we'll use ${algorithm[$choice-1]}"
alg=${algorithm[$choice-1]}

 

echo "creating work directory at ${WORK_DIR}"

mkdir -p ${WORK_DIR}
if [ ! -e ${WORK_DIR}/20news-bayesinput ]; then
  if [ ! -e ${WORK_DIR}/20news-bydate ]; then
    if [ ! -f ${WORK_DIR}/20news-bydate.tar.gz ]; then
      echo "Downloading 20news-bydate"
      curl http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz -o ${WORK_DIR}/20news-bydate.tar.gz
    fi
    mkdir -p ${WORK_DIR}/20news-bydate
    echo "Extracting..."
    cd ${WORK_DIR}/20news-bydate && tar xzf ../20news-bydate.tar.gz && cd .. && cd ..
  fi
fi
#echo $START_PATH
cd $START_PATH
cd ../..

set -e

if [ "x$alg" == "xnaivebayes" -o "x$alg" == "xcnaivebayes" ]; then
  c=""

  if [ "x$alg" == "xcnaivebayes" ]; then
    c=" -c"
  fi

  set -x
  echo "Preparing 20newsgroups data"
  rm -rf ${WORK_DIR}/20news-all
  mkdir ${WORK_DIR}/20news-all
  cp -R ${WORK_DIR}/20news-bydate/*/* ${WORK_DIR}/20news-all

  # Convert the 20newsgroups data into sequence files
  echo "Creating sequence files from 20newsgroups data"
  ./bin/mahout seqdirectory \
    -i ${WORK_DIR}/20news-all \
    -o ${WORK_DIR}/20news-seq -ow

  # Convert the sequence files of text into vectors
  echo "Converting sequence files to vectors"
  ./bin/mahout seq2sparse \
    -i ${WORK_DIR}/20news-seq \
    -o ${WORK_DIR}/20news-vectors  -lnorm -nv -wt tfidf

  # Randomly split the vectors into a training set and a test set.
  # The echo message says 80-20, but --randomSelectionPct 40 selects 40% of the vectors as test data.
  echo "Creating training and holdout set with a random 80-20 split of the generated vector dataset"
  ./bin/mahout split \
    -i ${WORK_DIR}/20news-vectors/tfidf-vectors \
    --trainingOutput ${WORK_DIR}/20news-train-vectors \
    --testOutput ${WORK_DIR}/20news-test-vectors  \
    --randomSelectionPct 40 --overwrite --sequenceFiles -xm sequential

  # Train the naive Bayes model
  echo "Training Naive Bayes model"
  ./bin/mahout trainnb \
    -i ${WORK_DIR}/20news-train-vectors -el \
    -o ${WORK_DIR}/model \
    -li ${WORK_DIR}/labelindex \
    -ow $c

  # Test on the training data; the resulting error is the training error
  echo "Self testing on training set"
  ./bin/mahout testnb \
    -i ${WORK_DIR}/20news-train-vectors \
    -m ${WORK_DIR}/model \
    -l ${WORK_DIR}/labelindex \
    -ow -o ${WORK_DIR}/20news-testing $c

  # Test on the holdout set; the resulting error is the test error
  echo "Testing on holdout set"
  ./bin/mahout testnb \
    -i ${WORK_DIR}/20news-test-vectors \
    -m ${WORK_DIR}/model \
    -l ${WORK_DIR}/labelindex \
    -ow -o ${WORK_DIR}/20news-testing $c

elif [ "x$alg" == "xsgd" ]; then
  if [ ! -e "/tmp/news-group.model" ]; then
    echo "Training on ${WORK_DIR}/20news-bydate/20news-bydate-train/"
    ./bin/mahout org.apache.mahout.classifier.sgd.TrainNewsGroups ${WORK_DIR}/20news-bydate/20news-bydate-train/
  fi
  echo "Testing on ${WORK_DIR}/20news-bydate/20news-bydate-test/ with model: /tmp/news-group.model"
  ./bin/mahout org.apache.mahout.classifier.sgd.TestNewsGroups --input ${WORK_DIR}/20news-bydate/20news-bydate-test/ --model /tmp/news-group.model
elif [ "x$alg" == "xclean" ]; then
  rm -rf ${WORK_DIR}
  rm -rf /tmp/news-group.model
fi
# Remove the work directory
#
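
The script maps a 1-based menu choice onto a 0-based bash array. The sketch below isolates that lookup; the function name pick_algorithm is hypothetical and introduced only for illustration.

```shell
#!/bin/bash
# Same task list as in the script; the menu is 1-based, the array is 0-based.
algorithm=(cnaivebayes naivebayes sgd clean)

# Hypothetical helper: print the task name for a 1-based menu choice.
pick_algorithm() {
  local choice=$1
  echo "${algorithm[$choice-1]}"
}

pick_algorithm 1   # prints cnaivebayes
pick_algorithm 2   # prints naivebayes
```

Since the script takes the choice from $1 when it is non-empty, running ./examples/bin/classify-20newsgroups.sh 2 should select naivebayes without showing the prompt.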

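2. Naive Bayes Classification Example

    The analysis of the shell script above (chiefly the bin/mahout commands it runs) yields the following way to run the Bayes classifier from the command line.

(1) Data preparation

        ① Extract 20news-bydate.tar.gz and copy the contents of all subfolders of 20news-bydate into 20news-all by running the following commands:

        bin/hadoop fs -mkdir 20news-all

        bin/hadoop fs -put $DATA_HOME/20news-bydate/*/* 20news-all

        ② Create sequence files from the 20newsgroups data

        bin/mahout seqdirectory -i 20news-all -o 20news-seq

        ③ Convert the sequence files to vectors

        bin/mahout seq2sparse -i 20news-seq -o 20news-vectors  -lnorm -nv  -wt tfidf

        ④ Randomly split the vector dataset into training data and test data; with --randomSelectionPct 40, 40% of the vectors are selected as test data and the remaining 60% are used for training

        bin/mahout split -i 20news-vectors/tfidf-vectors --trainingOutput 20news-train-vectors --testOutput 20news-test-vectors --randomSelectionPct 40 --overwrite --sequenceFiles -xm sequential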

(2) Train the naive Bayes model

        bin/mahout trainnb -i 20news-train-vectors -el -o model -li labelindex -ow -c

(3) Check the naive Bayes model on the training set

        bin/mahout testnb -i 20news-train-vectors -m model -l labelindex -ow -o 20news-testing -c

Result: (screenshot not preserved)

(4) Evaluate the model's classification performance on the test set

bin/mahout testnb -i 20news-test-vectors -m model -l labelindex -ow -o 20news-testing -c
Result: (screenshot not preserved)
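
Steps (3) and (4) both write to 20news-testing with -ow, so the second run overwrites the output of the first. A minimal sketch of keeping the two outputs apart is shown below; the function name evaluate, the MAHOUT variable, and the output directory names in the example calls are hypothetical, while the testnb flags are the ones used above.

```shell
#!/bin/bash
# Defaults to bin/mahout; set MAHOUT=echo for a dry run that only prints the arguments.
MAHOUT=${MAHOUT:-bin/mahout}

# Hypothetical helper: evaluate the model on one vector set and write the
# predictions to a separate output directory.
evaluate() {
  local vectors=$1 output=$2
  $MAHOUT testnb -i "$vectors" -m model -l labelindex -ow -o "$output" -c
}

# Example calls (the output directory names are assumptions):
# evaluate 20news-train-vectors 20news-testing-train
# evaluate 20news-test-vectors  20news-testing-holdout
```

With separate directories, the training-set predictions remain available after the holdout evaluation has run.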

(5) View the results by converting the sequence file to text

bin/mahout seqdumper -i 20news-testing/part-m-00000 -o 20news_testing.res
cat 20news_testing.res
Result: (screenshot not preserved)
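
The commands in steps (1) through (5) can be chained into one function, sketched below. It assumes 20news-all has already been uploaded as in step ①; the function name run_pipeline and the MAHOUT variable are hypothetical, and the training-set check from step (3) is left out to keep the sketch short.

```shell
#!/bin/bash
# Defaults to bin/mahout; set MAHOUT=echo to print each step instead of running it.
MAHOUT=${MAHOUT:-bin/mahout}

# Hypothetical wrapper: run the steps in order and stop at the first failure.
run_pipeline() {
  $MAHOUT seqdirectory -i 20news-all -o 20news-seq &&
  $MAHOUT seq2sparse -i 20news-seq -o 20news-vectors -lnorm -nv -wt tfidf &&
  $MAHOUT split -i 20news-vectors/tfidf-vectors \
      --trainingOutput 20news-train-vectors \
      --testOutput 20news-test-vectors \
      --randomSelectionPct 40 --overwrite --sequenceFiles -xm sequential &&
  $MAHOUT trainnb -i 20news-train-vectors -el -o model -li labelindex -ow -c &&
  $MAHOUT testnb -i 20news-test-vectors -m model -l labelindex -ow -o 20news-testing -c &&
  $MAHOUT seqdumper -i 20news-testing/part-m-00000 -o 20news_testing.res
}
```

Running MAHOUT=echo run_pipeline first is a cheap way to review the six commands before launching the actual jobs.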


