Mahout (1): Using k-Means



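Introduction to the k-Means command line

This article is a quick introduction to running the k-Means clustering algorithm on a Hadoop cluster.

Steps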

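Mahout's k-Means clustering is launched with the same command line whether you run in standalone mode or on a larger Hadoop cluster. The difference is determined by the $HADOOP_HOME and $HADOOP_CONF_DIR environment variables. If both are set and point to a running Hadoop cluster on the target machine, the invocation runs k-Means on that cluster. If either variable is missing, the standalone Hadoop configuration is used instead.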
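The rule above can be sketched as a small shell check. This is only an illustration of the rule as stated in this article: mahout_mode is a hypothetical helper, not part of Mahout, and the real bin/mahout script may apply additional checks.

```shell
#!/bin/sh
# Hypothetical helper (not part of Mahout): reports which mode the rule
# described above implies for the current environment.
mahout_mode() {
  # Cluster mode needs BOTH variables to be set and non-empty.
  if [ -n "${HADOOP_HOME:-}" ] && [ -n "${HADOOP_CONF_DIR:-}" ]; then
    echo "cluster"
  else
    echo "standalone"
  fi
}

echo "k-Means would run in $(mahout_mode) mode"
```

Running it before launching a job is a cheap way to confirm that the environment matches the mode you intend to use.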

In $MAHOUT_HOME/, build the jar containing the job (mvn install). The job is generated in $MAHOUT_HOME/core/target/ and its name contains the Mahout version number. For example, with the Mahout 0.3 release the job is mahout-core-0.3.job.

Testing it on a single machine without a cluster
     Put the data: cp <PATH TO DATA> testdata
     Run the job:
     ./bin/mahout kmeans -i testdata -o output -c clusters -dm org.apache.mahout.common.distance.CosineDistanceMeasure -x 5 -ow -cd 1 -k 25

Running it on the cluster
     (As needed) start up Hadoop: $HADOOP_HOME/bin/start-all.sh
     Put the data: $HADOOP_HOME/bin/hadoop fs -put <PATH TO DATA> testdata
     Run the job:
     export HADOOP_HOME=<Hadoop Home Directory>
     export HADOOP_CONF_DIR=$HADOOP_HOME/conf
     ./bin/mahout kmeans -i testdata -o output -c clusters -dm org.apache.mahout.common.distance.CosineDistanceMeasure -x 5 -ow -cd 1 -k 25
     Get the data out of HDFS and have a look. Use bin/hadoop fs -lsr output to view all outputs.
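The cluster steps above can be strung together in one script. The following is a minimal sketch, not an official Mahout tool: run and kmeans_on_cluster are hypothetical helpers, the DRY_RUN switch is an assumption added for safety, and the input data is assumed to already be a SequenceFile of VectorWritable. With DRY_RUN left at its default of 1, the commands are only printed.

```shell
#!/bin/sh
# Hypothetical helper: prints the command when DRY_RUN=1 (the default),
# executes it otherwise.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Hypothetical wrapper for the three cluster steps: upload, run, list.
#   $1 = local path to the input data (supplied by the caller)
kmeans_on_cluster() {
  data="$1"
  # Aborts with a message if HADOOP_HOME is unset or empty.
  hadoop="${HADOOP_HOME:?HADOOP_HOME must be set}/bin/hadoop"
  # Default the configuration directory as in the steps above.
  export HADOOP_CONF_DIR="${HADOOP_CONF_DIR:-$HADOOP_HOME/conf}"

  run "$hadoop" fs -put "$data" testdata
  run ./bin/mahout kmeans -i testdata -o output -c clusters \
    -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
    -x 5 -ow -cd 1 -k 25
  run "$hadoop" fs -lsr output
}
```

Call it from $MAHOUT_HOME as kmeans_on_cluster followed by the path to your data; set DRY_RUN=0 once the printed commands look right.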

Command line options

  --input (-i) input                           Path to job input directory.
                                               Must be a SequenceFile of
                                               VectorWritable
  --clusters (-c) clusters                     The input centroids, as Vectors.
                                               Must be a SequenceFile of
                                               Writable, Cluster/Canopy. If k
                                               is also specified, then a random
                                               set of vectors will be selected
                                               and written out to this path
                                               first
  --output (-o) output                         The directory pathname for
                                               output.
  --distanceMeasure (-dm) distanceMeasure      The classname of the
                                               DistanceMeasure. Default is
                                               SquaredEuclidean
  --convergenceDelta (-cd) convergenceDelta    The convergence delta value.
                                               Default is 0.5
  --maxIter (-x) maxIter                       The maximum number of
                                               iterations.
  --maxRed (-r) maxRed                         The number of reduce tasks.
                                               Defaults to 2
  --k (-k) k                                   The k in k-Means. If specified,
                                               then a random selection of k
                                               Vectors will be chosen as the
                                               Centroid and written to the
                                               clusters input path.
  --overwrite (-ow)                            If present, overwrite the output
                                               directory before running job
  --help (-h)                                  Print out help
  --clustering (-cl)                           If present, run clustering after
                                               the iterations have taken place
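The option list implies two ways of seeding the centroids: pass -k and let Mahout pick k random input vectors and write them to the -c path, or omit -k and supply existing centroids at the -c path. The sketch below only prints the corresponding command lines; kmeans_cmd is a hypothetical helper, and -dm and -cd are left out so the defaults listed above (SquaredEuclidean, 0.5) would apply.

```shell
#!/bin/sh
# Hypothetical helper: prints a kmeans command line for one of the two
# seeding modes. It only echoes the command and never runs Mahout.
#   $1 = k (an empty string means "use the centroids already stored in -c")
#   $2 = "cl" to append -cl (run clustering after the iterations)
kmeans_cmd() {
  k="$1"
  cl="${2:-}"
  cmd="./bin/mahout kmeans -i testdata -o output -c clusters -x 5 -ow"
  if [ -n "$k" ]; then
    # Random seeding: k vectors are chosen and written to the -c path.
    cmd="$cmd -k $k"
  fi
  if [ "$cl" = "cl" ]; then
    # Also assign the input points to clusters after the iterations.
    cmd="$cmd -cl"
  fi
  echo "$cmd"
}

kmeans_cmd 25 cl   # random seeds, then cluster the points
kmeans_cmd ""      # reuse centroids already present at the -c path
```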



Original source: http://mahout.apache.org/users/clustering/k-means-commandline.html