Dirichlet and LDA Clustering

Source: Internet. Editor: 程序博客网. Time: 2024/06/12 04:38

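Dirichlet clustering is a clustering algorithm based on Bayesian classification. Rather than producing a single clustering result, it produces several candidate clusterings. It is usually run before other clustering algorithms: its results help us understand the data and choose a better clustering algorithm, thresholds, clustering style (overlapping or hierarchical), and distance measure.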

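How DirichletClusterer is run:

1. The input Vector data in the List<VectorWritable> format.

2. The NormalModelDistribution as the model distribution we are trying to fit our data on.

NormalModelDistribution: the data distribution model that the clustering tries to fit.

Alpha: a threshold-like value. With a higher alpha the data matches the model more closely, but the algorithm runs more slowly.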
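The iteration described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not Mahout's DirichletClusterer: the points are one-dimensional, every model is a plain normal distribution, and the names dirichlet_cluster, sample_dirichlet and normal_pdf are hypothetical helpers made up for this example. It only shows the general idea: each point is assigned to a model sampled in proportion to mixture weight times likelihood, the models are re-estimated, and the mixture weights are resampled from a Dirichlet distribution in which alpha0 / k acts as a smoothing pseudo-count.

```python
import math
import random


def normal_pdf(x, mean, sd):
    """Density of a 1-D normal distribution."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def sample_dirichlet(params, rng):
    """Draw mixture weights from Dirichlet(params) via normalized gamma draws."""
    draws = [rng.gammavariate(p, 1.0) for p in params]
    total = sum(draws)
    return [d / total for d in draws]


def dirichlet_cluster(points, k, alpha0, iterations, seed=0):
    """Toy Dirichlet-style clustering of 1-D points with k normal models (assumption)."""
    rng = random.Random(seed)
    models = [(rng.choice(points), 1.0) for _ in range(k)]  # (mean, sd) per model
    weights = [1.0 / k] * k
    assignments = [0] * len(points)
    for _ in range(iterations):
        members = [[] for _ in range(k)]
        for i, x in enumerate(points):
            # Score of each model: mixture weight times likelihood of the point.
            # The tiny constant keeps the total positive if everything underflows.
            scores = [w * normal_pdf(x, m, s) + 1e-300
                      for w, (m, s) in zip(weights, models)]
            j = rng.choices(range(k), weights=scores)[0]
            assignments[i] = j
            members[j].append(x)
        # Re-estimate every non-empty model from the points assigned to it.
        for j, xs in enumerate(members):
            if xs:
                mean = sum(xs) / len(xs)
                var = sum((x - mean) ** 2 for x in xs) / len(xs)
                models[j] = (mean, max(math.sqrt(var), 0.1))
        # Resample mixture weights; alpha0 / k keeps empty models selectable.
        weights = sample_dirichlet([len(xs) + alpha0 / k for xs in members], rng)
    return assignments, models, weights
```

Calling dirichlet_cluster(points, k=3, alpha0=1.0, iterations=20) returns one sampled assignment per point together with the fitted models and mixture weights. Because assignments are sampled rather than chosen greedily, different seeds can give different clusterings, which matches the idea that the algorithm yields several candidate results.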

Different model distributions that can be used with Dirichlet clustering:

1. AsymmetricSampledNormalDistribution

2. L1ModelDistribution

3. SampledNormalModelDistribution

Running Dirichlet clustering as a MapReduce job

bin/mahout dirichlet \
  -i examples/reuters-vectors/ \
  -o reuters-dirichlet-clusters -k 60 -x 10 -a0 1.0 \
  -md org.apache.mahout.clustering.dirichlet.models.NormalModelDistribution \
  -mp org.apache.mahout.math.SequentialAccessSparseVector

Notes:

The Reuters dataset in the Vector format, -i examples/reuters-vectors/

The model distribution class, -md; defaults to NormalModelDistribution

The model distribution prototype Vector class, -mp: the class that becomes the type for all vectors created in the job; defaults to SequentialAccessSparseVector

The alpha0 value for the distribution, -a0 1.0

The number of clusters to start the clustering with, -k 60

The number of iterations to run the algorithm, -x 10

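The LDA clustering algorithm is similar to Dirichlet clustering. It starts from an empty topic model. The mapper parses each document and computes its relevance to each topic (the probability that the document belongs to that topic). The reducer generalizes the topic model and computes the relevance of all documents to each topic model, looking for the model that fits the documents best. Unlike k-means, it stops once a threshold is reached.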
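The mapper/reducer split can be illustrated with a small single-process sketch. It is a simplified EM-style loop (closer to smoothed PLSA than to the full LDA inference that Mahout runs), and the names map_document, reduce_topics and fit_topics, as well as the alpha and smoothing values, are assumptions made for this example. The mapper estimates one document's topic proportions and emits expected (topic, word) counts; the reducer sums those counts into a new topic model; the loop stops when the log-likelihood gain drops below a threshold.

```python
import math
import random
from collections import defaultdict


def word_topic_shares(word, theta, topics):
    """Split one word occurrence across topics in proportion to theta[t] * P(word | t)."""
    scores = [theta[t] * topics[t][word] for t in range(len(topics))]
    total = sum(scores)
    return [s / total for s in scores], total


def map_document(doc, topics, inner_iterations=20, alpha=0.1):
    """Mapper: estimate the document's topic proportions, emit expected counts."""
    k = len(topics)
    theta = [1.0 / k] * k
    for _ in range(inner_iterations):
        expected = [alpha] * k  # alpha is a smoothing pseudo-count (assumption)
        for word in doc:
            shares, _ = word_topic_shares(word, theta, topics)
            for t in range(k):
                expected[t] += shares[t]
        norm = sum(expected)
        theta = [e / norm for e in expected]
    emitted = defaultdict(float)
    log_likelihood = 0.0
    for word in doc:
        shares, total = word_topic_shares(word, theta, topics)
        log_likelihood += math.log(total)
        for t in range(k):
            emitted[(t, word)] += shares[t]
    return theta, emitted, log_likelihood


def reduce_topics(all_emitted, vocab, num_topics, smoothing=0.01):
    """Reducer: sum expected counts over all documents and normalize per topic."""
    counts = [{w: smoothing for w in vocab} for _ in range(num_topics)]
    for emitted in all_emitted:
        for (t, word), c in emitted.items():
            counts[t][word] += c
    return [{w: v / sum(c.values()) for w, v in c.items()} for c in counts]


def fit_topics(docs, num_topics, max_iterations=50, threshold=1e-4, seed=0):
    """Alternate map and reduce passes until the log-likelihood gain is below threshold."""
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    topics = []
    for _ in range(num_topics):
        raw = {w: rng.random() + 0.1 for w in vocab}  # random, non-informative start
        norm = sum(raw.values())
        topics.append({w: v / norm for w, v in raw.items()})
    previous = -math.inf
    log_likelihood = previous
    doc_topics = []
    for _ in range(max_iterations):
        results = [map_document(d, topics) for d in docs]
        doc_topics = [theta for theta, _, _ in results]
        log_likelihood = sum(ll for _, _, ll in results)
        topics = reduce_topics([e for _, e, _ in results], vocab, num_topics)
        if log_likelihood - previous < threshold:
            break
        previous = log_likelihood
    # doc_topics and log_likelihood come from the last mapper pass.
    return topics, doc_topics, log_likelihood
```

With docs given as lists of tokens, fit_topics(docs, num_topics=2) returns the word distribution of each topic, the topic proportions of each document, and the final log-likelihood. In a real MapReduce job the list comprehension over docs would be the distributed map phase and reduce_topics the reduce phase.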

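The LDA algorithm is close to the similar-document lookup done by search engines: stop words are removed using maxDFPercent, the documents are converted into vectors, and the clustering is completed by computing the similarity between those vectors.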
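The preprocessing and similarity steps can be sketched as follows. This is a minimal illustration, not Mahout's vectorizer: prune_by_max_df mimics what a maxDFPercent setting is meant to achieve (dropping terms that appear in more than a given percentage of documents), vectors are plain term-count dictionaries, and cosine similarity is an assumed choice of similarity measure. All function names are hypothetical.

```python
import math
from collections import Counter


def prune_by_max_df(docs, max_df_percent):
    """Drop terms that occur in more than max_df_percent percent of the documents."""
    doc_freq = Counter(term for doc in docs for term in set(doc))
    limit = max_df_percent / 100.0 * len(docs)
    keep = {term for term, df in doc_freq.items() if df <= limit}
    return [[term for term in doc if term in keep] for doc in docs]


def to_vector(doc):
    """Turn a token list into a sparse term-count vector."""
    return Counter(doc)


def cosine_similarity(a, b):
    """Cosine similarity of two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

For three short documents that all contain "the", calling prune_by_max_df(docs, 70) removes "the" because it appears in 100 percent of them, so the similarity between documents is then driven only by the more informative terms.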