Mahout Clustering Quality Metrics and Applications on Hadoop


Mahout Clustering Quality Metrics

Inspecting Clustering Output

ClusterDumper is Mahout's utility class for inspecting clustering results; it lives in the org.apache.mahout.utils.clustering package.

It prints the most important feature terms of each cluster, which gives a quick baseline for judging how well the documents were grouped.

ClusterDumper command-line options and a demo run

 

| Option | Flag | Description | Default Value |
| --- | --- | --- | --- |
| SequenceFile dir (String) | -s | The directory containing the SequenceFile of the clusters | N/A |
| Output (String) | -o | The output file; if not specified, the output is printed to the console | N/A |
| Points Directory (String) | -p | At the end of clustering, Mahout clustering algorithms produce two kinds of output: the set of <cluster-id, centroid> pairs and the set of <point-id, cluster-id> pairs. The latter is generated when clustering finishes and usually resides in the points folder under the output directory. When this parameter is set to the points folder, all the points in each cluster are written to the output | N/A |
| JSON output (bool) | -j | If set, the centroid is written in JSON format; otherwise the terms are substituted in for the vector cell entries. Unset by default | N/A |
| Dictionary (String) | -d | The path to the dictionary file, which holds the reverse mapping from integer id to word | N/A |
| Dictionary Type (String) | -dt | Format of the dictionary file. If text, the integer id and the term should be tab separated; if sequencefile, it should have an Integer key and a String value | text |
| Number of Words (int) | -n | The number of top terms to print | 10 |

 

Running the demo

bin/mahout clusterdump \
  -s kmeans-output/clusters-19/ \
  -o output.txt \
  -d reuters-vectors/dictionary.file-0 \
  -dt sequencefile -n 10
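The same dump can also be produced from Java. The sketch below is a minimal example under some assumptions: it targets the Mahout 0.x API (a two-Path ClusterDumper constructor and a printClusters(String[]) method, which later releases may have changed), and the clusteredPoints path is hypothetical since the command above does not pass -p.

```java
import org.apache.hadoop.fs.Path;
import org.apache.mahout.utils.clustering.ClusterDumper;

public class DumpClusters {
  public static void main(String[] args) throws Exception {
    // Same cluster directory as the command-line demo; the points directory is
    // only needed if you also want the members of each cluster listed.
    ClusterDumper dumper = new ClusterDumper(
        new Path("kmeans-output/clusters-19"),
        new Path("kmeans-output/clusteredPoints"));  // hypothetical -p directory
    dumper.printClusters(null);  // null dictionary: print raw term ids
  }
}
```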

 

Analyzing Clustering Output

1. Distance measure and feature selection: when analyzing text similarity, cosine distance generally works better than Euclidean distance (see the sketch after this list).

2. Inter-cluster and intra-cluster distance measures.
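A quick way to see why cosine distance suits text: two documents with the same term distribution but different lengths are nearly identical under cosine distance yet far apart under Euclidean distance. The sketch below is illustrative (the class name and the toy vectors are made up) and assumes Mahout's math and common modules are on the classpath.

```java
import org.apache.mahout.common.distance.CosineDistanceMeasure;
import org.apache.mahout.common.distance.EuclideanDistanceMeasure;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;

public class DistanceComparison {
  public static void main(String[] args) {
    // Two "documents" with the same term proportions, one twice as long.
    Vector doc1 = new DenseVector(new double[] {1.0, 2.0, 0.0, 4.0});
    Vector doc2 = new DenseVector(new double[] {2.0, 4.0, 0.0, 8.0});

    double cosine = new CosineDistanceMeasure().distance(doc1, doc2);
    double euclid = new EuclideanDistanceMeasure().distance(doc1, doc2);

    System.out.println("cosine distance    = " + cosine);  // ~0.0: same direction
    System.out.println("euclidean distance = " + euclid);  // ~4.58: penalizes length
  }
}
```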

 

Improving Clustering Quality

1. Improve text vector generation (Lucene Analyzer): extend the Analyzer class and override its tokenStream method (first sketch below).

2. Custom distance measure: implement the DistanceMeasure interface and provide your own distance function (second sketch below).
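A sketch of the first improvement, assuming Lucene 3.x (the analyzer API that Mahout 0.x builds against; Lucene 4+ replaces tokenStream with createComponents). The class name MyAnalyzer and this particular filter chain are illustrative; the point is that deriving from Analyzer lets you control tokenization, stop-word removal, and stemming before the vectors are built.

```java
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.PorterStemFilter;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

public class MyAnalyzer extends Analyzer {
  @Override
  public TokenStream tokenStream(String fieldName, Reader reader) {
    // Tokenize, lower-case, drop stop words, then Porter-stem, so that
    // "Clustering" and "clustered" map to the same term before TF-IDF weighting.
    TokenStream result = new StandardTokenizer(Version.LUCENE_36, reader);
    result = new LowerCaseFilter(Version.LUCENE_36, result);
    result = new StopFilter(Version.LUCENE_36, result, StandardAnalyzer.STOP_WORDS_SET);
    return new PorterStemFilter(result);
  }
}
```

The fully qualified class name can typically be passed to seq2sparse with the -a (--analyzerName) option when generating the document vectors.

And a sketch of the second improvement. Mahout already ships a CosineDistanceMeasure, so this class (the name MyCosineDistanceMeasure is made up) merely re-implements cosine distance to show the plumbing: the two distance() methods do the real work, while the Parametered callbacks can stay empty when the measure has no tunable parameters. It assumes a Mahout 0.x Vector API where iterateNonZero() is available.

```java
import java.util.Collection;
import java.util.Collections;
import java.util.Iterator;

import org.apache.hadoop.conf.Configuration;
import org.apache.mahout.common.distance.DistanceMeasure;
import org.apache.mahout.common.parameters.Parameter;
import org.apache.mahout.math.Vector;

public class MyCosineDistanceMeasure implements DistanceMeasure {

  @Override
  public double distance(Vector v1, Vector v2) {
    // Dot product over the non-zero cells of v1 only; no intermediate vectors
    // are allocated, following the efficiency tips later in this article.
    double dot = 0.0;
    Iterator<Vector.Element> it = v1.iterateNonZero();
    while (it.hasNext()) {
      Vector.Element e = it.next();
      dot += e.get() * v2.get(e.index());
    }
    double denom = v1.norm(2) * v2.norm(2);
    return denom == 0.0 ? 1.0 : 1.0 - dot / denom;  // cosine distance
  }

  @Override
  public double distance(double centroidLengthSquare, Vector centroid, Vector v) {
    // Optimization hook used by some algorithms; delegating is always correct.
    return distance(centroid, v);
  }

  // No tunable parameters, so the Parametered callbacks are no-ops.
  @Override
  public void configure(Configuration config) {
  }

  @Override
  public Collection<Parameter<?>> getParameters() {
    return Collections.emptyList();
  }

  @Override
  public void createParameters(String prefix, Configuration jobConf) {
  }
}
```

The class name can then be handed to the clustering drivers, for example via the -dm (--distanceMeasure) option of bin/mahout kmeans.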

 

Applying Mahout on Hadoop

1. Use sparse vectors rather than DenseVector (document-term matrices are usually sparse, and sparse vectors are much faster to process).

2. When implementing a DistanceMeasure: 1) avoid cloning or instantiating new vectors; 2) visit only the non-zero elements (use Vector.iterateNonZero() rather than Vector.iterator()); 3) watch the cost of vector element access.

3. Use SequentialAccessSparseVector rather than RandomAccessSparseVector (the sketch after this list illustrates tips 1-3).

4. Choose the appropriate vector type (sparse data stored in a DenseVector can cause unnecessary disk I/O).

5. Use HDFS (each file is stored with 3 replicas, which helps avoid network I/O bottlenecks).

6. Reduce the number of clusters (the quickest way to cut the amount of computation).
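A small sketch tying tips 1-3 together (the class name, the cardinality, and the sample values are illustrative, and it assumes a Mahout 0.x Vector API where iterateNonZero() exists): build the vector with random-access writes, convert it to a SequentialAccessSparseVector for the read-heavy clustering phase, and iterate only its non-zero cells.

```java
import java.util.Iterator;

import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.SequentialAccessSparseVector;
import org.apache.mahout.math.Vector;

public class SparseVectorTips {
  public static void main(String[] args) {
    // RandomAccessSparseVector is convenient for scattered writes,
    // e.g. while tokenizing a document.
    Vector building = new RandomAccessSparseVector(100000);
    building.set(42, 3.0);
    building.set(7, 1.5);
    building.set(9001, 0.5);

    // SequentialAccessSparseVector is faster for the repeated sequential reads
    // that distance measures perform during clustering iterations.
    Vector doc = new SequentialAccessSparseVector(building);

    // Visit only the 3 non-zero cells instead of all 100,000 dimensions.
    double l1Norm = 0.0;
    Iterator<Vector.Element> it = doc.iterateNonZero();
    while (it.hasNext()) {
      l1Norm += Math.abs(it.next().get());
    }
    System.out.println("L1 norm = " + l1Norm);  // 5.0
  }
}
```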
