Parsing the Output of Mahout K-Means

Source: Internet | Published by: 幽灵虎淘宝快10万了 | Editor: 程序博客网 | Date: 2024/05/18 02:15

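I will write a separate post on how to do clustering with Mahout when I have time; this post is mainly about the results Mahout produces.
The Mahout version is 0.9. The data was neither normalized nor standardized, since this is only a test.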

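The output directory contains several folders: clusteredPoints, clusters-x, clusters-(x+1)-final, and so on, where x is the iteration number. The result of each iteration is stored in clusters-x, and the result of the last iteration (x+1) is stored in clusters-(x+1)-final. clusteredPoints also holds the final clustering result, but the two do not store the same thing: clusters-(x+1)-final stores the clusters, while clusteredPoints stores the points. Details follow.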
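As a rough illustration of the layout just described, the sketch below picks the final clusters folder and orders the intermediate ones by iteration. It assumes the folder names follow the pattern seen in the commands in this post (for example clusters-2-final); the `listing` value and both helper names are hypothetical, not part of Mahout.

```python
import re

FINAL_DIR = re.compile(r"^clusters-(\d+)-final$")
ITER_DIR = re.compile(r"^clusters-(\d+)$")


def find_final_clusters_dir(names):
    """Return the first name shaped like clusters-<n>-final, or None."""
    for name in names:
        if FINAL_DIR.match(name):
            return name
    return None


def iteration_dirs(names):
    """Return the clusters-<n> names sorted by iteration number n."""
    found = []
    for name in names:
        match = ITER_DIR.match(name)
        if match:
            found.append((int(match.group(1)), name))
    return [name for _, name in sorted(found)]


# Hypothetical directory listing, assumed for illustration only.
listing = ["clusteredPoints", "clusters-0", "clusters-1", "clusters-2-final"]
final_dir = find_final_clusters_dir(listing)
intermediate = iteration_dirs(listing)
print(final_dir, intermediate)
```

Sorting by the parsed number rather than by the raw name keeps clusters-10 after clusters-2 when there are many iterations.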

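mahout clusterdump parses the ClusterWritable records and converts them into a readable file. The output format is selected with -of (TEXT, CSV, etc.); the full option list is pasted at the end of this post.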

# Final clustering result (cluster name VL-x, center c, radius r, number of points in the cluster n)
[root@drguo home]# mahout clusterdump -i file:///home/guo/Desktop/output/clusters-2-final -o /home/guo/Desktop/result
VL-0{n=7 c=[1.714, 2.286, 4.429, 0.857, 7.571] r=[2.185, 2.711, 6.884, 2.100, 5.233]}
VL-1{n=3 c=[0.667, 8.667, 11.333, 5.333, 0.667, 4.333, 1.667, 3.333, 21.667] r=[0.943, 5.437, 5.185, 7.542, 0.943, 6.128, 2.357, 4.714, 9.428]}

# Final clustering result (key: the cluster the point belongs to; value: weight wt, distance, and the vector. Note that this is a NamedVector, not a plain vector; I will write separately about how to generate one.)
[root@drguo clusteredPoints]# mahout seqdumper -i file:///home/guo/Desktop/output/clusteredPoints -o /home/guo/Desktop/points
Input Path: file:/home/guo/Desktop/output/clusteredPoints/part-m-0
Key class: class org.apache.hadoop.io.IntWritable Value Class: class org.apache.mahout.clustering.classify.WeightedPropertyVectorWritable
Key: 0: Value: wt: 0.7140480784137244 distance: 6.885358615591935  vec: 001461E4-86C64780-A0B495C4-D19BA86F__201601 = [5.000, 6.000, 6.000]
Key: 1: Value: wt: 0.6106543697821432 distance: 11.445523142259598  vec: 001461E4-86C64780-A0B495C4-D19BA86F__201602 = [12.000, 15.000, 15.000]
Key: 1: Value: wt: 0.6113140078611051 distance: 11.775681155103799  vec: 001461E4-86C64780-A0B495C4-D19BA86F__201603 = [13.000, 15.000, 15.000]
Key: 0: Value: wt: 0.7140480784137244 distance: 6.885358615591935  vec: 001461E4-86C64780-A0B495C4-D19BA86F__201604 = [5.000, 6.000, 6.000]
Key: 0: Value: wt: 0.7643111018595771 distance: 6.010195419417895  vec: 001461E4-86C64780-A0B495C4-D19BA86F__201605 = [2.000, 4.000, 4.000]
Key: 0: Value: wt: 0.7408819961153278 distance: 7.529533687488249  vec: 001641C0-75CC4BC2-9E31CF60-C15627D2__201603 = [6.000, 6.000]
Key: 0: Value: wt: 0.7511412095733683 distance: 7.989789402348321  vec: 001641C0-75CC4BC2-9E31CF60-C15627D2__201604 = [1.000, 1.000]
Key: 0: Value: wt: 0.6648742191066574 distance: 9.264811638337692  vec: 001641C0-75CC4BC2-9E31CF60-C15627D2__201605 = [12.000, 12.000]
Key: 0: Value: wt: 0.53656917576395 distance: 17.373449130609547  vec: 001641C0-75CC4BC2-9E31CF60-C15627D2__201606 = [18.000, 18.000]
Key: 1: Value: wt: 0.5948320024451352 distance: 23.202011407059803  vec: 001641C0-75CC4BC2-9E31CF60-C15627D2__201608 = [2.000, 1.000, 4.000, 16.000, 2.000, 13.000, 5.000, 10.000, 35.000]
Count: 10

# Output the clusters together with their points
[root@drguo home]# mahout clusterdump -i file:///home/guo/Desktop/output/clusters-2-final -p file:///home/guo/Desktop/output/clusteredPoints -o /home/guo/Desktop/cluster-point
VL-0{n=7 c=[1.714, 2.286, 4.429, 0.857, 7.571] r=[2.185, 2.711, 6.884, 2.100, 5.233]}
    Weight : [props - optional]:  Point:
    0.7140480784137244 : [distance=6.885358615591935]: 001461E4-86C64780-A0B495C4-D19BA86F__201601 = [5.000, 6.000, 6.000]
    0.7140480784137244 : [distance=6.885358615591935]: 001461E4-86C64780-A0B495C4-D19BA86F__201604 = [5.000, 6.000, 6.000]
    0.7643111018595771 : [distance=6.010195419417895]: 001461E4-86C64780-A0B495C4-D19BA86F__201605 = [2.000, 4.000, 4.000]
    0.7408819961153278 : [distance=7.529533687488249]: 001641C0-75CC4BC2-9E31CF60-C15627D2__201603 = [6.000, 6.000]
    0.7511412095733683 : [distance=7.989789402348321]: 001641C0-75CC4BC2-9E31CF60-C15627D2__201604 = [1.000, 1.000]
    0.6648742191066574 : [distance=9.264811638337692]: 001641C0-75CC4BC2-9E31CF60-C15627D2__201605 = [12.000, 12.000]
    0.53656917576395 : [distance=17.373449130609547]: 001641C0-75CC4BC2-9E31CF60-C15627D2__201606 = [18.000, 18.000]
VL-1{n=3 c=[0.667, 8.667, 11.333, 5.333, 0.667, 4.333, 1.667, 3.333, 21.667] r=[0.943, 5.437, 5.185, 7.542, 0.943, 6.128, 2.357, 4.714, 9.428]}
    Weight : [props - optional]:  Point:
    0.6106543697821432 : [distance=11.445523142259598]: 001461E4-86C64780-A0B495C4-D19BA86F__201602 = [12.000, 15.000, 15.000]
    0.6113140078611051 : [distance=11.775681155103799]: 001461E4-86C64780-A0B495C4-D19BA86F__201603 = [13.000, 15.000, 15.000]
    0.5948320024451352 : [distance=23.202011407059803]: 001641C0-75CC4BC2-9E31CF60-C15627D2__201608 = [2.000, 1.000, 4.000, 16.000, 2.000, 13.000, 5.000, 10.000, 35.000]
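To post-process the seqdumper text further, something like the following sketch could group the point records by cluster key. It is not part of Mahout: it assumes each record sits on its own line in the `Key: ... Value: wt: ... distance: ... vec: name = [...]` shape shown in this post, and the regular expression, the `parse_points` helper, and the shortened `sample` list are all illustrative assumptions.

```python
import re
from collections import defaultdict

# One seqdumper record per line, in the shape shown in this post (assumed).
POINT_LINE = re.compile(
    r"^Key: (?P<key>\d+): Value: wt: (?P<wt>[\d.Ee+-]+)\s+"
    r"distance: (?P<dist>[\d.Ee+-]+)\s+"
    r"vec: (?P<name>\S+) = \[(?P<vec>[^\]]*)\]$"
)


def parse_points(lines):
    """Group (name, weight, distance, values) tuples by cluster key."""
    clusters = defaultdict(list)
    for line in lines:
        match = POINT_LINE.match(line.strip())
        if not match:
            continue  # skip header/footer lines such as "Input Path:" or "Count:"
        values = [float(v) for v in match.group("vec").split(",") if v.strip()]
        clusters[int(match.group("key"))].append(
            (match.group("name"), float(match.group("wt")),
             float(match.group("dist")), values)
        )
    return dict(clusters)


# Three records copied from the output in this post, plus the footer line.
sample = [
    "Key: 0: Value: wt: 0.7140480784137244 distance: 6.885358615591935  vec: 001461E4-86C64780-A0B495C4-D19BA86F__201601 = [5.000, 6.000, 6.000]",
    "Key: 1: Value: wt: 0.6106543697821432 distance: 11.445523142259598  vec: 001461E4-86C64780-A0B495C4-D19BA86F__201602 = [12.000, 15.000, 15.000]",
    "Key: 0: Value: wt: 0.7643111018595771 distance: 6.010195419417895  vec: 001461E4-86C64780-A0B495C4-D19BA86F__201605 = [2.000, 4.000, 4.000]",
    "Count: 10",
]
points = parse_points(sample)
for key in sorted(points):
    print(key, len(points[key]))
```

Run over the full file, the per-key counts can be compared with the n values that clusterdump reports for VL-0 and VL-1 (7 and 3 in this run).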

Finally, here are the command-line options.

seqdumper

Job-Specific Options:
  --input (-i) input            Path to job input directory.
  --output (-o) output          The directory pathname for output.
  --substring (-b) substring    The number of chars to print out per value
  --count (-c)                  Report the count only
  --numItems (-n) numItems      Output at most <n> key value pairs
  --facets (-fa)                Output the counts per key.  Note, if there are
                                a lot of unique keys, this can take up a fair
                                amount of memory
  --quiet (-q)                  Print only file contents.
  --help (-h)                   Print out help
  --tempDir tempDir             Intermediate output directory
  --startPhase startPhase       First phase to run
  --endPhase endPhase           Last phase to run

clusterdump

Job-Specific Options:
  --input (-i) input                         Path to job input directory.
  --output (-o) output                       The directory pathname for output.
  --outputFormat (-of) outputFormat          The optional output format for the
                                             results.  Options: TEXT, CSV, JSON
                                             or GRAPH_ML
  --substring (-b) substring                 The number of chars of the
                                             asFormatString() to print
  --numWords (-n) numWords                   The number of top terms to print
  --pointsDir (-p) pointsDir                 The directory containing points
                                             sequence files mapping input
                                             vectors to their cluster.  If
                                             specified, then the program will
                                             output the points associated with
                                             a cluster
  --samplePoints (-sp) samplePoints          Specifies the maximum number of
                                             points to include _per_ cluster.
                                             The default is to include all
                                             points
  --dictionary (-d) dictionary               The dictionary file
  --dictionaryType (-dt) dictionaryType      The dictionary file type
                                             (text|sequencefile)
  --evaluate (-e)                            Run ClusterEvaluator and
                                             CDbwEvaluator over the input.  The
                                             output will be appended to the
                                             rest of the output at the end.
  --distanceMeasure (-dm) distanceMeasure    The classname of the
                                             DistanceMeasure. Default is
                                             SquaredEuclidean
  --help (-h)                                Print out help
  --tempDir tempDir                          Intermediate output directory
  --startPhase startPhase                    First phase to run
  --endPhase endPhase                        Last phase to run
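For scripting, the clusterdump options listed here can be assembled into an argument list. The sketch below is only an illustration: `clusterdump_args` is a hypothetical helper, it uses just the -i, -o, -p and -of flags from the help text, and the paths are the ones from the commands earlier in this post. Combining -p with CSV output is an assumption that was not tried here.

```python
def clusterdump_args(clusters_dir, output_file, points_dir=None, output_format=None):
    """Build the argv list for a mahout clusterdump call (hypothetical helper)."""
    args = ["mahout", "clusterdump", "-i", clusters_dir, "-o", output_file]
    if points_dir is not None:
        args += ["-p", points_dir]  # also print the points of each cluster
    if output_format is not None:
        args += ["-of", output_format]  # TEXT, CSV, JSON or GRAPH_ML
    return args


args = clusterdump_args(
    "file:///home/guo/Desktop/output/clusters-2-final",
    "/home/guo/Desktop/cluster-point",
    points_dir="file:///home/guo/Desktop/output/clusteredPoints",
    output_format="CSV",
)
print(" ".join(args))
# To execute it, pass the list to subprocess.run(args, check=True);
# this requires the mahout launcher to be on PATH.
```

Keeping the arguments as a list rather than a single string avoids shell quoting problems when a path contains spaces.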

0 0