Mahout Clustering Quality Metrics and Applications on Hadoop


Mahout Clustering Quality Metrics

Inspecting Clustering Output

ClusterDumper is Mahout's utility class for inspecting clustering results; it lives in the org.apache.mahout.utils.clustering package.

It prints the most important feature terms of each cluster, which gives a quick baseline for judging how well the documents were grouped.

ClusterDumper command-line options and a demo run

 

| Option | Flag | Description | Default Value |
| --- | --- | --- | --- |
| SequenceFile dir (String) | -s | The directory containing the SequenceFile of the clusters | N/A |
| Output (String) | -o | The output file; if not specified, the output is printed to the console | N/A |
| Points Directory (String) | -p | At the end of clustering, Mahout clustering algorithms produce two kinds of output: the set of <cluster-id, centroid> pairs and the set of <point-id, cluster-id> pairs. The latter is generated when clustering finishes and usually resides in the points folder under the output directory. When this parameter is set to the points folder, all the points in each cluster are written to the output | N/A |
| JSON output (bool) | -j | If set, the centroid is written in JSON format; otherwise the terms are substituted in for the vector cell entries. Unset by default | N/A |
| Dictionary (String) | -d | The path to the dictionary file, which holds the reverse mapping from integer id to word | N/A |
| Dictionary Type (String) | -dt | Format of the dictionary file. If text, the integer id and the term should be tab separated; if sequencefile, it should have an Integer key and a String value | text |
| Number of Words (int) | -n | The number of top terms to print | 10 |

 

Running the demo

bin/mahout clusterdump \
  -s kmeans-output/clusters-19/ \
  -o output.txt \
  -d reuters-vectors/dictionary.file-0 \
  -dt sequencefile -n 10
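The same dump can also be produced from Java. The sketch below is a minimal example under some assumptions: it targets the Mahout 0.x API (a two-Path ClusterDumper constructor and a printClusters(String[]) method, which later releases may have changed), and the clusteredPoints path is hypothetical since the command above does not pass -p.

```java
import org.apache.hadoop.fs.Path;
import org.apache.mahout.utils.clustering.ClusterDumper;

public class DumpClusters {
  public static void main(String[] args) throws Exception {
    // Same cluster directory as the command-line demo; the points directory is
    // only needed if you also want the members of each cluster listed.
    ClusterDumper dumper = new ClusterDumper(
        new Path("kmeans-output/clusters-19"),
        new Path("kmeans-output/clusteredPoints"));  // hypothetical -p directory
    dumper.printClusters(null);  // null dictionary: print raw term ids
  }
}
```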

 

Analyzing Clustering Output

1. Distance measure and feature selection: when analyzing text similarity, cosine distance generally works better than Euclidean distance (see the sketch after this list).

2. Inter-cluster and intra-cluster distance measures.
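A quick way to see why cosine distance suits text: two documents with the same term distribution but different lengths are nearly identical under cosine distance yet far apart under Euclidean distance. The sketch below is illustrative (the class name and the toy vectors are made up) and assumes Mahout's math and common modules are on the classpath.

```java
import org.apache.mahout.common.distance.CosineDistanceMeasure;
import org.apache.mahout.common.distance.EuclideanDistanceMeasure;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;

public class DistanceComparison {
  public static void main(String[] args) {
    // Two "documents" with the same term proportions, one twice as long.
    Vector doc1 = new DenseVector(new double[] {1.0, 2.0, 0.0, 4.0});
    Vector doc2 = new DenseVector(new double[] {2.0, 4.0, 0.0, 8.0});

    double cosine = new CosineDistanceMeasure().distance(doc1, doc2);
    double euclid = new EuclideanDistanceMeasure().distance(doc1, doc2);

    System.out.println("cosine distance    = " + cosine);  // ~0.0: same direction
    System.out.println("euclidean distance = " + euclid);  // ~4.58: penalizes length
  }
}
```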

 

Improving Clustering Quality

1. Improve text vector generation (Lucene Analyzer): extend the Analyzer class and override its tokenStream method (first sketch below).

2. Custom distance measure: implement the DistanceMeasure interface and provide your own distance function (second sketch below).
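A sketch of the first improvement, assuming Lucene 3.x (the analyzer API that Mahout 0.x builds against; Lucene 4+ replaces tokenStream with createComponents). The class name MyAnalyzer and this particular filter chain are illustrative; the point is that deriving from Analyzer lets you control tokenization, stop-word removal, and stemming before the vectors are built.

```java
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.PorterStemFilter;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

public class MyAnalyzer extends Analyzer {
  @Override
  public TokenStream tokenStream(String fieldName, Reader reader) {
    // Tokenize, lower-case, drop stop words, then Porter-stem, so that
    // "Clustering" and "clustered" map to the same term before TF-IDF weighting.
    TokenStream result = new StandardTokenizer(Version.LUCENE_36, reader);
    result = new LowerCaseFilter(Version.LUCENE_36, result);
    result = new StopFilter(Version.LUCENE_36, result, StandardAnalyzer.STOP_WORDS_SET);
    return new PorterStemFilter(result);
  }
}
```

The fully qualified class name can typically be passed to seq2sparse with the -a (--analyzerName) option when generating the document vectors.

And a sketch of the second improvement. Mahout already ships a CosineDistanceMeasure, so this class (the name MyCosineDistanceMeasure is made up) merely re-implements cosine distance to show the plumbing: the two distance() methods do the real work, while the Parametered callbacks can stay empty when the measure has no tunable parameters. It assumes a Mahout 0.x Vector API where iterateNonZero() is available.

```java
import java.util.Collection;
import java.util.Collections;
import java.util.Iterator;

import org.apache.hadoop.conf.Configuration;
import org.apache.mahout.common.distance.DistanceMeasure;
import org.apache.mahout.common.parameters.Parameter;
import org.apache.mahout.math.Vector;

public class MyCosineDistanceMeasure implements DistanceMeasure {

  @Override
  public double distance(Vector v1, Vector v2) {
    // Dot product over the non-zero cells of v1 only; no intermediate vectors
    // are allocated, following the efficiency tips later in this article.
    double dot = 0.0;
    Iterator<Vector.Element> it = v1.iterateNonZero();
    while (it.hasNext()) {
      Vector.Element e = it.next();
      dot += e.get() * v2.get(e.index());
    }
    double denom = v1.norm(2) * v2.norm(2);
    return denom == 0.0 ? 1.0 : 1.0 - dot / denom;  // cosine distance
  }

  @Override
  public double distance(double centroidLengthSquare, Vector centroid, Vector v) {
    // Optimization hook used by some algorithms; delegating is always correct.
    return distance(centroid, v);
  }

  // No tunable parameters, so the Parametered callbacks are no-ops.
  @Override
  public void configure(Configuration config) {
  }

  @Override
  public Collection<Parameter<?>> getParameters() {
    return Collections.emptyList();
  }

  @Override
  public void createParameters(String prefix, Configuration jobConf) {
  }
}
```

The class name can then be handed to the clustering drivers, for example via the -dm (--distanceMeasure) option of bin/mahout kmeans.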

 

Applying Mahout on Hadoop

1. Use sparse vectors rather than DenseVector (document-term matrices are usually sparse, and sparse vectors are much faster to process).

2. When implementing a DistanceMeasure: 1) avoid cloning or instantiating new vectors; 2) visit only the non-zero elements (use Vector.iterateNonZero() rather than Vector.iterator()); 3) watch the cost of vector element access.

3. Use SequentialAccessSparseVector rather than RandomAccessSparseVector (the sketch after this list illustrates tips 1-3).

4. Choose the appropriate vector type (sparse data stored in a DenseVector can cause unnecessary disk I/O).

5. Use HDFS (each file is stored with 3 replicas, which helps avoid network I/O bottlenecks).

6. Reduce the number of clusters (the quickest way to cut the amount of computation).
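A small sketch tying tips 1-3 together (the class name, the cardinality, and the sample values are illustrative, and it assumes a Mahout 0.x Vector API where iterateNonZero() exists): build the vector with random-access writes, convert it to a SequentialAccessSparseVector for the read-heavy clustering phase, and iterate only its non-zero cells.

```java
import java.util.Iterator;

import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.SequentialAccessSparseVector;
import org.apache.mahout.math.Vector;

public class SparseVectorTips {
  public static void main(String[] args) {
    // RandomAccessSparseVector is convenient for scattered writes,
    // e.g. while tokenizing a document.
    Vector building = new RandomAccessSparseVector(100000);
    building.set(42, 3.0);
    building.set(7, 1.5);
    building.set(9001, 0.5);

    // SequentialAccessSparseVector is faster for the repeated sequential reads
    // that distance measures perform during clustering iterations.
    Vector doc = new SequentialAccessSparseVector(building);

    // Visit only the 3 non-zero cells instead of all 100,000 dimensions.
    double l1Norm = 0.0;
    Iterator<Vector.Element> it = doc.iterateNonZero();
    while (it.hasNext()) {
      l1Norm += Math.abs(it.next().get());
    }
    System.out.println("L1 norm = " + l1Norm);  // 5.0
  }
}
```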
