Getting Started with Mahout

Source: Internet  Editor: 程序博客网  Time: 2024/06/03 23:40


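1. Introduction

Mahout is an open-source project of the Apache Software Foundation (ASF). It provides scalable implementations of classic machine learning algorithms and aims to help developers build intelligent applications more quickly and easily. At the time of writing, the Apache Mahout project is in its third year and has had three public releases. Mahout includes implementations of clustering, classification, recommendation (collaborative filtering), and frequent itemset mining. In addition, by building on the Apache Hadoop libraries, Mahout can scale out effectively in the cloud.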

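2. Download and Preparation

Program downloads

Download Hadoop: pick a package of a suitable version from http://labs.renren.com/apache-mirror/hadoop/common/ (this article uses the stable release hadoop-0.20.203.0rc1.tar.gz).

Download Mahout: http://labs.renren.com/apache-mirror/mahout/ (this article uses mahout-distribution-0.5.tar.gz).

For more functionality you may also need to download Maven and mahout-collections.

Data download

Data source: http://kdd.ics.uci.edu/databases/ offers a large number of classic data sets for download.

(This article uses the synthetic_control data, synthetic_control.tar.gz.)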

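3. Installation and Deployment

To avoid polluting the Linux root environment, this article installs the programs under the user's home directory, in $HOME/local.

The archives have already been downloaded to $HOME/Downloads. From that directory, unpack them with tar: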

tar zxvf hadoop-0.20.203.0rc1.tar.gz -C ~/local/

cd ~/local

mv hadoop-0.20.203.0 hadoop

 

tar zxvf mahout-distribution-0.5.tar.gz -C ~/local/

cd ~/local

mv mahout-distribution-0.5 mahout

 

Edit .bash_profile or .bashrc:

export HADOOP_HOME=$HOME/local/hadoop

export HADOOP_CONF_DIR=$HADOOP_HOME/conf

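To make the commands easier to use, either add each program's bin directory to $PATH or define aliases directly. The launch scripts live in the bin directory of each package:

#Alias for apps

alias mahout='$HOME/local/mahout/bin/mahout'

alias hdp='$HOME/local/hadoop/bin/hadoop'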
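
As an alternative to aliases, the bin directories can be put on $PATH. The following is a minimal sketch, assuming the $HOME/local layout used above and that the launch scripts sit in each package's bin directory. MAHOUT_HOME is exported here because later commands in this article refer to it. With this approach the Hadoop launcher is invoked as hadoop rather than through the hdp alias.

```shell
# Assumed install locations, matching the $HOME/local layout used in this article.
export HADOOP_HOME="$HOME/local/hadoop"
export HADOOP_CONF_DIR="$HADOOP_HOME/conf"
export MAHOUT_HOME="$HOME/local/mahout"

# Prepend both bin directories so that `hadoop` and `mahout` resolve by name.
export PATH="$MAHOUT_HOME/bin:$HADOOP_HOME/bin:$PATH"
```

Unlike aliases, entries on $PATH are also visible to non-interactive shell scripts.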

 

Testing

Run the command: mahout

Expected output:

Running on hadoop, using HADOOP_HOME=/home/username/local/hadoop
HADOOP_CONF_DIR=/home/username/local/hadoop/conf
An example program must be given as the first argument.
Valid program names are:
  arff.vector: : Generate Vectors from an ARFF file or directory
  canopy: : Canopy clustering
  cat: : Print a file or resource as the logistic regression models would see it
  cleansvd: : Cleanup and verification of SVD output
  clusterdump: : Dump cluster output to text
  dirichlet: : Dirichlet Clustering
  eigencuts: : Eigencuts spectral clustering
  evaluateFactorization: : compute RMSE of a rating matrix factorization against probes in memory
  evaluateFactorizationParallel: : compute RMSE of a rating matrix factorization against probes
  fkmeans: : Fuzzy K-means clustering
  fpg: : Frequent Pattern Growth
  itemsimilarity: : Compute the item-item-similarities for item-based collaborative filtering
  kmeans: : K-means clustering
  lda: : Latent Dirchlet Allocation
  ldatopics: : LDA Print Topics
  lucene.vector: : Generate Vectors from a Lucene index
  matrixmult: : Take the product of two matrices
  meanshift: : Mean Shift clustering
  parallelALS: : ALS-WR factorization of a rating matrix
  predictFromFactorization: : predict preferences from a factorization of a rating matrix
  prepare20newsgroups: : Reformat 20 newsgroups data
  recommenditembased: : Compute recommendations using item-based collaborative filtering
  rowid: : Map SequenceFile<Text,VectorWritable> to {SequenceFile<IntWritable,VectorWritable>, SequenceFile<IntWritable,Text>}
  rowsimilarity: : Compute the pairwise similarities of the rows of a matrix
  runlogistic: : Run a logistic regression model against CSV data
  seq2sparse: : Sparse Vector generation from Text sequence files
  seqdirectory: : Generate sequence files (of Text) from a directory
  seqdumper: : Generic Sequence File dumper
  seqwiki: : Wikipedia xml dump to sequence file
  spectralkmeans: : Spectral k-means clustering
  splitDataset: : split a rating dataset into training and probe parts
  ssvd: : Stochastic SVD
  svd: : Lanczos Singular Value Decomposition
  testclassifier: : Test Bayes Classifier
  trainclassifier: : Train Bayes Classifier
  trainlogistic: : Train a logistic regression using stochastic gradient descent
  transpose: : Take the transpose of a matrix
  vectordump: : Dump vectors from a sequence file to text
  wikipediaDataSetCreator: : Splits data set of wikipedia wrt feature like country
  wikipediaXMLSplitter: : Reads wikipedia data and creates ch

Run the command: hdp

Expected output:

Usage: hadoop [--config confdir] COMMAND
where COMMAND is one of:
  namenode -format     format the DFS filesystem
  secondarynamenode    run the DFS secondary namenode
  namenode             run the DFS namenode
  datanode             run a DFS datanode
  dfsadmin             run a DFS admin client
  mradmin              run a Map-Reduce admin client
  fsck                 run a DFS filesystem checking utility
  fs                   run a generic filesystem user client
  balancer             run a cluster balancing utility
  fetchdt              fetch a delegation token from the NameNode
  jobtracker           run the MapReduce job Tracker node
  pipes                run a Pipes job
  tasktracker          run a MapReduce task Tracker node
  historyserver        run job history servers as a standalone daemon
  job                  manipulate MapReduce jobs
  queue                get information regarding JobQueues
  version              print the version
  jar <jar>            run a jar file
  distcp <srcurl> <desturl> copy file or directories recursively
  archive -archiveName NAME -p <parent path> <src>* <dest> create a hadoop archive
  classpath            prints the class path needed to get the
                       Hadoop jar and the required libraries
  daemonlog            get/set the log level for each daemon
 or
  CLASSNAME            run the class named CLASSNAME
Most commands print help when invoked w/o parameters.

 

4. Running

Step 1:

This command shows which algorithms Mahout provides and how to use them:

mahout --help

 

For example:

mahout kmeans --input /user/hive/warehouse/tmp_data/complex.seq --clusters 5 --output /home/hadoopuser/1.txt

 

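Files processed by Mahout must be in SequenceFile format, so plain text files have to be converted to SequenceFiles first. SequenceFile is a Hadoop class that lets us write binary key-value pairs to a file. For a detailed introduction, see the post by eyjian at http://www.hadoopor.com/viewthread.php?tid=144&highlight=sequencefile

 

Mahout provides a way to convert the files under a given directory into SequenceFile format.

(You may find Tika (http://lucene.apache.org/tika) helpful in converting binary documents to text.)

Usage: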

 

$MAHOUT_HOME/bin/mahout seqdirectory \

--input <PARENT DIR WHERE DOCS ARE LOCATED> --output <OUTPUT DIRECTORY> \

<-c <CHARSET NAME OF THE INPUT DOCUMENTS> {UTF-8|cp1252|ascii...}> \

<-chunk <MAX SIZE OF EACH CHUNK in Megabytes> 64> \

<-prefix <PREFIX TO ADD TO THE DOCUMENT ID>>

 

For example:

mahout seqdirectory --input /hive/hadoopuser/ --output /mahout/seq/ --charset UTF-8
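
The conversion can be tried end to end on a couple of throwaway text files. The sketch below is only illustrative: the directory names mahout-docs, docs and docs-seq are hypothetical, and it assumes that hadoop and mahout are reachable on $PATH (shell aliases are not visible inside scripts). It uses only the --input, --output and --charset options shown in the example above.

```shell
# Hypothetical local staging directory holding two small sample documents.
docs_dir="${TMPDIR:-/tmp}/mahout-docs"
mkdir -p "$docs_dir"
printf 'first sample document\n' > "$docs_dir/doc1.txt"
printf 'second sample document\n' > "$docs_dir/doc2.txt"

if command -v hadoop >/dev/null 2>&1 && command -v mahout >/dev/null 2>&1; then
  # Copy the documents into HDFS, then convert them to SequenceFile format.
  hadoop fs -put "$docs_dir" docs
  mahout seqdirectory --input docs --output docs-seq --charset UTF-8
else
  echo "hadoop/mahout not found on PATH; skipping the conversion"
fi
```

The result can then be inspected with the seqdumper program listed in the mahout output earlier.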

 

Step 2:

A simple example of running kmeans:

 

1: Put the sample data set into the designated HDFS folder, which should be testdata

$HADOOP_HOME/bin/hadoop fs -put <PATH TO DATA> testdata

For example:

hdp fs -put ~/datasetsynthetic_controltest/synthetic_control.data testdata

 

2: Run the kmeans algorithm

hdp jar $MAHOUT_HOME/examples/target/mahout-examples-$MAHOUT_VERSION.job org.apache.mahout.clustering.syntheticcontrol.kmeans.Job

For example:

hdp jar /home/hadoopuser/mahout-0.3/mahout-examples-0.1.job org.apache.mahout.clustering.syntheticcontrol.kmeans.Job

 

3: Run the canopy algorithm

hdp jar $MAHOUT_HOME/examples/target/mahout-examples-$MAHOUT_VERSION.job org.apache.mahout.clustering.syntheticcontrol.canopy.Job

For example:

hdp jar /home/hadoopuser/mahout-0.3/mahout-examples-0.1.job org.apache.mahout.clustering.syntheticcontrol.canopy.Job

 

4: Run the dirichlet algorithm

hdp jar $MAHOUT_HOME/examples/target/mahout-examples-$MAHOUT_VERSION.job org.apache.mahout.clustering.syntheticcontrol.dirichlet.Job

 

5: Run the meanshift algorithm

hdp jar $MAHOUT_HOME/examples/target/mahout-examples-$MAHOUT_VERSION.job org.apache.mahout.clustering.syntheticcontrol.meanshift.Job

 

6: Take a look at the results

mahout vectordump --seqFile /user/hadoopuser/output/data/part-00000

This prints the results directly to the console.
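
The steps above can be strung together in a small script. This is a sketch under several assumptions rather than a definitive recipe: hadoop must be on $PATH, DATA_FILE must point at the unpacked synthetic_control.data, and the job file location follows the $MAHOUT_HOME/examples/target pattern used in the steps (a binary distribution may place it elsewhere). The defaults for MAHOUT_HOME and MAHOUT_VERSION and the helper names job_jar, job_class and summarize_data are made up for illustration.

```shell
# Path to the unpacked data file; the default is an assumption, adjust as needed.
data_file="${DATA_FILE:-synthetic_control.data}"

# job_jar: location of the examples job file, following the pattern in the steps above.
job_jar() {
  printf '%s/examples/target/mahout-examples-%s.job\n' \
    "${MAHOUT_HOME:-$HOME/local/mahout}" "${MAHOUT_VERSION:-0.5}"
}

# job_class ALGO: driver class for one of kmeans, canopy, dirichlet, meanshift.
job_class() {
  printf 'org.apache.mahout.clustering.syntheticcontrol.%s.Job\n' "$1"
}

# summarize_data FILE: print the row count and the field count of the first row,
# as a quick sanity check before uploading.
summarize_data() {
  awk 'NR == 1 { cols = NF } END { printf "rows=%d cols=%d\n", NR, cols }' "$1"
}

if [ -f "$data_file" ] && command -v hadoop >/dev/null 2>&1; then
  summarize_data "$data_file"
  # Create the testdata folder in HDFS if it is missing, then upload the data.
  hadoop fs -test -d testdata || hadoop fs -mkdir testdata
  hadoop fs -put "$data_file" testdata
  # Run the k-means example; pass canopy, dirichlet or meanshift for the others.
  hadoop jar "$(job_jar)" "$(job_class kmeans)"
else
  echo "skipping: need $data_file and hadoop on PATH"
fi
```

Keeping the jar path and class name in small functions makes it easy to switch between the four example jobs without retyping the long command line.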

 

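You can also go into HDFS to see what the data looks like.

Most of the examples run above use testdata as the input folder and output as the output folder.

Use hdp fs -lsr to list all of the output.

 

KMeans writes its results to output/points

Canopy and MeanShift write their results to output/clustered-points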
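
Listing and dumping can be combined. The sketch below assumes that hadoop and mahout are on $PATH, that the vectors sit under output/data as in the vectordump example earlier, and that the path is the last field of each fs -lsr listing line. The helper name part_files is made up for illustration.

```shell
# part_files: read an `fs -lsr` style listing on stdin and print the paths of
# part files, assuming the path is the last whitespace-separated field.
part_files() {
  awk '$NF ~ "/part-[^/]*$" { print $NF }'
}

if command -v hadoop >/dev/null 2>&1 && command -v mahout >/dev/null 2>&1; then
  hadoop fs -lsr output/data | part_files | while read -r path; do
    # Redirect stdin so the dump cannot consume the remaining listing lines.
    mahout vectordump --seqFile "$path" </dev/null
  done
else
  echo "hadoop/mahout not found on PATH; nothing to dump"
fi
```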
