Running the canopy clustering algorithm with mahout-0.6

Source: Internet  Published by: 网络兼职写手  Editor: 程序博客网  Time: 2024/05/20 05:30

1. Convert the text file into vectors

mahout org.apache.mahout.clustering.conversion.InputDriver -i /mahout/input/p04-17.txt -o /mahout/output/vectorfiles -v org.apache.mahout.math.RandomAccessSparseVector

[root@masterclone ~]# hadoop fs -ls /mahout/output/vectorfiles
Warning: $HADOOP_HOME is deprecated.
Found 3 items
-rw-r--r--   1 root supergroup          0 2014-05-12 06:58 /mahout/output/vectorfiles/_SUCCESS
drwxr-xr-x   - root supergroup          0 2014-05-12 06:58 /mahout/output/vectorfiles/_logs
-rw-r--r--   1 root supergroup      56430 2014-05-12 06:58 /mahout/output/vectorfiles/part-m-00000

Detailed steps: http://blog.csdn.net/panguoyuan/article/details/25655763
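As a rough illustration of what the conversion step consumes: InputDriver is documented as turning text files of space-delimited floating point numbers into vector sequence files, one vector per line. The Python sketch below mimics only the parsing side. The function names `parse_line` and `parse_lines` are hypothetical, and skipping blank lines is an assumption; this is not Mahout's code and it does not write a sequence file.

```python
def parse_line(line):
    """Split one whitespace-delimited text line into a list of floats."""
    return [float(token) for token in line.split()]


def parse_lines(lines):
    """Parse every non-blank line into a vector (a plain list of floats).

    Skipping blank lines is an assumption made for this sketch.
    """
    return [parse_line(line) for line in lines if line.strip()]


if __name__ == "__main__":
    sample = ["1.0 2.5 3", "", "4 5 6"]
    for vector in parse_lines(sample):
        print(vector)
```

A quick check like this on a few lines of the input file can help confirm that every line parses as numbers before the file is uploaded to HDFS.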

2. Run the canopy clustering algorithm

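mahout canopy -i /mahout/output/vectorfiles -o /mahout/output/canopy-result -t1 1 -t2 2 -ow

Note: canopy clustering is normally described with T1 as the loose threshold and T2 as the tight one, so T1 > T2. This run passes -t1 1 -t2 2, the reverse order; the values are kept here because they match the recorded log.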
[root@masterclone ~]# mahout canopy -i /mahout/output/vectorfiles -o /mahout/output/canopy-result -t1 1 -t2 2 -ow
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using HADOOP_HOME=/usr/lib/hadoop
HADOOP_CONF_DIR=/usr/lib/hadoop/conf
MAHOUT-JOB: /root/mahout/mahout-distribution-0.6/mahout-examples-0.6-job.jar
Warning: $HADOOP_HOME is deprecated.
14/05/12 16:23:17 INFO common.AbstractJob: Command line arguments: {--distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure, --endPhase=2147483647, --input=/mahout/output/vectorfiles, --method=mapreduce, --output=/mahout/output/canopy-result, --overwrite=null, --startPhase=0, --t1=1, --t2=2, --tempDir=temp}
14/05/12 16:23:17 INFO canopy.CanopyDriver: Build Clusters Input: /mahout/output/vectorfiles Out: /mahout/output/canopy-result Measure: org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure@6d79953c t1: 1.0 t2: 2.0
14/05/12 16:23:19 INFO input.FileInputFormat: Total input paths to process : 1
14/05/12 16:23:19 INFO mapred.JobClient: Running job: job_201405121559_0005
14/05/12 16:23:20 INFO mapred.JobClient:  map 0% reduce 0%
14/05/12 16:23:31 INFO mapred.JobClient:  map 100% reduce 0%
14/05/12 16:23:39 INFO mapred.JobClient:  map 100% reduce 33%
14/05/12 16:23:41 INFO mapred.JobClient:  map 100% reduce 100%
14/05/12 16:23:43 INFO mapred.JobClient: Job complete: job_201405121559_0005
14/05/12 16:23:43 INFO mapred.JobClient: Counters: 29
14/05/12 16:23:43 INFO mapred.JobClient:   Job Counters
14/05/12 16:23:43 INFO mapred.JobClient:     Launched reduce tasks=1
14/05/12 16:23:43 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=10071
14/05/12 16:23:43 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
14/05/12 16:23:43 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
14/05/12 16:23:43 INFO mapred.JobClient:     Launched map tasks=1
14/05/12 16:23:43 INFO mapred.JobClient:     Data-local map tasks=1
14/05/12 16:23:43 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=10145
14/05/12 16:23:43 INFO mapred.JobClient:   File Output Format Counters
14/05/12 16:23:43 INFO mapred.JobClient:     Bytes Written=210
14/05/12 16:23:43 INFO mapred.JobClient:   FileSystemCounters
14/05/12 16:23:43 INFO mapred.JobClient:     FILE_BYTES_READ=38
14/05/12 16:23:43 INFO mapred.JobClient:     HDFS_BYTES_READ=56557
14/05/12 16:23:43 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=108662
14/05/12 16:23:43 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=210
14/05/12 16:23:43 INFO mapred.JobClient:   File Input Format Counters
14/05/12 16:23:43 INFO mapred.JobClient:     Bytes Read=56430
14/05/12 16:23:43 INFO mapred.JobClient:   Map-Reduce Framework
14/05/12 16:23:43 INFO mapred.JobClient:     Map output materialized bytes=38
14/05/12 16:23:43 INFO mapred.JobClient:     Map input records=1800
14/05/12 16:23:43 INFO mapred.JobClient:     Reduce shuffle bytes=38
14/05/12 16:23:43 INFO mapred.JobClient:     Spilled Records=2
14/05/12 16:23:43 INFO mapred.JobClient:     Map output bytes=30
14/05/12 16:23:43 INFO mapred.JobClient:     CPU time spent (ms)=1400
14/05/12 16:23:43 INFO mapred.JobClient:     Total committed heap usage (bytes)=176033792
14/05/12 16:23:43 INFO mapred.JobClient:     Combine input records=0
14/05/12 16:23:43 INFO mapred.JobClient:     SPLIT_RAW_BYTES=127
14/05/12 16:23:43 INFO mapred.JobClient:     Reduce input records=1
14/05/12 16:23:43 INFO mapred.JobClient:     Reduce input groups=1
14/05/12 16:23:43 INFO mapred.JobClient:     Combine output records=0
14/05/12 16:23:43 INFO mapred.JobClient:     Physical memory (bytes) snapshot=257114112
14/05/12 16:23:43 INFO mapred.JobClient:     Reduce output records=1
14/05/12 16:23:43 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=2100338688
14/05/12 16:23:43 INFO mapred.JobClient:     Map output records=1
14/05/12 16:23:43 INFO driver.MahoutDriver: Program took 26551 ms (Minutes: 0.44251666666666667)
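To make the roles of -t1 and -t2 more concrete, here is a minimal single-machine sketch of the textbook canopy procedure in Python. It is an illustration under stated assumptions, not Mahout's MapReduce implementation. The function names are hypothetical, the sketch assumes the usual convention t1 > t2, and it uses squared Euclidean distance only because the log above reports SquaredEuclideanDistanceMeasure.

```python
def squared_euclidean(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def canopy_sketch(points, t1, t2, distance=squared_euclidean):
    """Textbook canopy clustering; assumes t1 (loose) > t2 (tight).

    Returns a list of (center, members) pairs.
    """
    remaining = list(points)
    canopies = []
    while remaining:
        # Take the next unprocessed point as a new canopy center.
        center = remaining.pop(0)
        members = [center]
        kept = []
        for p in remaining:
            d = distance(center, p)
            if d < t1:
                # Within the loose threshold: the point joins this canopy.
                members.append(p)
            if d >= t2:
                # Outside the tight threshold: the point may still
                # become a center or join other canopies later.
                kept.append(p)
        remaining = kept
        canopies.append((center, members))
    return canopies


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.5, 0.0), (3.0, 0.0), (3.2, 0.0), (10.0, 10.0)]
    for center, members in canopy_sketch(data, t1=4.0, t2=1.0):
        print(center, members)
```

Because the thresholds are compared against squared distances in this sketch, t1=4.0 corresponds to an ordinary Euclidean radius of 2. Larger thresholds give fewer, broader canopies; smaller ones give more.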


3. Inspect the output directory

[root@masterclone ~]# hadoop fs -ls /mahout/output/canopy-result
Warning: $HADOOP_HOME is deprecated.
Found 1 items
drwxr-xr-x   - root supergroup          0 2014-05-12 16:23 /mahout/output/canopy-result/clusters-0-final

 


 
