Clustering Implementation in Mahout
Source: Internet  Published by: low-level software development engineer  Editor: 程序博客网  Date: 2024/05/22 10:58
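As the saying goes, "birds of a feather flock together." Clustering is the process of partitioning a given collection of documents into clusters so that similar items are grouped together.
Designing a clustering process involves three things:
(1) A clustering algorithm (k-means, fuzzy k-means, canopy, etc.)
(2) A notion of similarity and dissimilarity
a. Euclidean distance
b. Squared Euclidean distance
c. Manhattan distance
d. Cosine distance measure
e. Tanimoto distance measure
f. Weighted distance measure (TF-IDF: term frequency-inverse document frequency)
(3) A termination condition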
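To make the distance measures in (2) concrete, here is a minimal plain-Java sketch of items a through e. It does not use Mahout's own classes; the class name DistanceMeasureSketch and its method names are made up for illustration, and the code assumes both vectors have the same length and are non-zero. The weighted measure (item f) is left out because it needs a weight vector that this post does not define.

```java
// Illustrative sketch only: hypothetical class and method names, no Mahout dependency.
// Assumption: both vectors have the same length and neither is the zero vector
// (cosine and Tanimoto divide by terms that would otherwise be zero).
final class DistanceMeasureSketch {

    // Dot product, shared by the cosine and Tanimoto measures.
    static double dot(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    // b. Squared Euclidean distance: sum of squared coordinate differences.
    static double squaredEuclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }

    // a. Euclidean distance: square root of the squared Euclidean distance.
    static double euclidean(double[] a, double[] b) {
        return Math.sqrt(squaredEuclidean(a, b));
    }

    // c. Manhattan distance: sum of absolute coordinate differences.
    static double manhattan(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += Math.abs(a[i] - b[i]);
        }
        return sum;
    }

    // d. Cosine distance: 1 minus the cosine of the angle between the vectors.
    static double cosine(double[] a, double[] b) {
        return 1.0 - dot(a, b) / Math.sqrt(dot(a, a) * dot(b, b));
    }

    // e. Tanimoto distance: 1 - (a.b) / (|a|^2 + |b|^2 - a.b).
    static double tanimoto(double[] a, double[] b) {
        double ab = dot(a, b);
        return 1.0 - ab / (dot(a, a) + dot(b, b) - ab);
    }

    public static void main(String[] args) {
        double[] p = {2, 1};
        double[] q = {8, 9};
        System.out.println("euclidean        = " + euclidean(p, q));
        System.out.println("squaredEuclidean = " + squaredEuclidean(p, q));
        System.out.println("manhattan        = " + manhattan(p, q));
        System.out.println("cosine           = " + cosine(p, q));
        System.out.println("tanimoto         = " + tanimoto(p, q));
    }
}
```

For the two sample points the coordinate differences are 6 and 8, so the Euclidean distance is 10, the squared Euclidean distance is 100, and the Manhattan distance is 14. The cosine measure depends only on the angle between the vectors, while the other measures also react to their magnitudes.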
A Java implementation of k-means clustering based on the Euclidean distance measure is shown below:
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.clustering.WeightedVectorWritable;
import org.apache.mahout.clustering.kmeans.Cluster;
import org.apache.mahout.clustering.kmeans.KMeansDriver;
import org.apache.mahout.common.distance.EuclideanDistanceMeasure;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

public class SimpleKMeansClustering {

  // Nine 2-D sample points forming two visible groups.
  public static final double[][] points = {
      {1, 1}, {2, 1}, {1, 2}, {2, 2}, {3, 3},
      {8, 8}, {9, 8}, {8, 9}, {9, 9}};

  // Writes the vectors to a SequenceFile keyed by record number.
  public static void writePointsToFile(List<Vector> points, String fileName,
      FileSystem fs, Configuration conf) throws IOException {
    Path path = new Path(fileName);
    SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf, path,
        LongWritable.class, VectorWritable.class);
    long recNum = 0;
    VectorWritable vec = new VectorWritable();
    for (Vector point : points) {
      vec.set(point);
      writer.append(new LongWritable(recNum++), vec);
    }
    writer.close();
  }

  // Converts the raw double arrays into Mahout vectors.
  public static List<Vector> getPoints(double[][] raw) {
    List<Vector> points = new ArrayList<Vector>();
    for (int i = 0; i < raw.length; i++) {
      double[] fr = raw[i];
      Vector vec = new RandomAccessSparseVector(fr.length);
      vec.assign(fr);
      points.add(vec);
    }
    return points;
  }

  public static void main(String[] args) throws Exception {
    int k = 2; // number of clusters
    List<Vector> vectors = getPoints(points);

    // Create the input directories if they do not exist yet.
    File testData = new File("testdata");
    if (!testData.exists()) {
      testData.mkdir();
    }
    testData = new File("testdata/points");
    if (!testData.exists()) {
      testData.mkdir();
    }

    // Write the input points.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    writePointsToFile(vectors, "testdata/points/file1", fs, conf);

    // Use the first k points as the initial cluster centers.
    Path path = new Path("testdata/clusters/part-00000");
    SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf, path,
        Text.class, Cluster.class);
    for (int i = 0; i < k; i++) {
      Vector vec = vectors.get(i);
      Cluster cluster = new Cluster(vec, i, new EuclideanDistanceMeasure());
      writer.append(new Text(cluster.getIdentifier()), cluster);
    }
    writer.close();

    // Run k-means with a convergence threshold of 0.001 and at most 10 iterations.
    KMeansDriver.run(conf, new Path("testdata/points"),
        new Path("testdata/clusters"), new Path("output"),
        new EuclideanDistanceMeasure(), 0.001, 10, true, false);

    // Read back the clustered points and print the cluster each one belongs to.
    SequenceFile.Reader reader = new SequenceFile.Reader(fs,
        new Path("output/" + Cluster.CLUSTERED_POINTS_DIR + "/part-m-00000"), conf);
    IntWritable key = new IntWritable();
    WeightedVectorWritable value = new WeightedVectorWritable();
    while (reader.next(key, value)) {
      System.out.println(value.toString() + " belongs to cluster " + key.toString());
    }
    reader.close();
  }
}
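To show what the k-means iteration itself does with these inputs, here is a minimal plain-Java sketch that runs without Hadoop or Mahout. It is not Mahout's implementation: the class PlainKMeansSketch and its methods fit, assign, nearest, and distance are hypothetical names. It reuses the same nine points, takes the first k = 2 points as initial centers, uses Euclidean distance, and stops when no center moves by more than 0.001 or after 10 iterations, which is how this sketch interprets the two numeric arguments passed to the driver above.

```java
import java.util.Arrays;

// Illustrative sketch only: hypothetical names, not Mahout's implementation.
final class PlainKMeansSketch {

    static final double[][] POINTS = {
        {1, 1}, {2, 1}, {1, 2}, {2, 2}, {3, 3},
        {8, 8}, {9, 8}, {8, 9}, {9, 9}};

    // Euclidean distance between two points of equal dimension.
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    // Index of the center closest to p (the lowest index wins a tie).
    static int nearest(double[] p, double[][] centers) {
        int best = 0;
        for (int c = 1; c < centers.length; c++) {
            if (distance(p, centers[c]) < distance(p, centers[best])) {
                best = c;
            }
        }
        return best;
    }

    // Assigns every point to its nearest center.
    static int[] assign(double[][] points, double[][] centers) {
        int[] assignment = new int[points.length];
        for (int i = 0; i < points.length; i++) {
            assignment[i] = nearest(points[i], centers);
        }
        return assignment;
    }

    // Repeats assign-and-update until every center moves by at most delta,
    // or until maxIterations is reached. Returns the final centers.
    static double[][] fit(double[][] points, double[][] initialCenters,
                          double delta, int maxIterations) {
        int k = initialCenters.length;
        int dim = points[0].length;
        double[][] centers = new double[k][];
        for (int c = 0; c < k; c++) {
            centers[c] = initialCenters[c].clone();
        }
        for (int iter = 0; iter < maxIterations; iter++) {
            int[] assignment = assign(points, centers);
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (int i = 0; i < points.length; i++) {
                counts[assignment[i]]++;
                for (int d = 0; d < dim; d++) {
                    sums[assignment[i]][d] += points[i][d];
                }
            }
            boolean converged = true;
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) {
                    continue; // an empty cluster keeps its previous center
                }
                double[] updated = new double[dim];
                for (int d = 0; d < dim; d++) {
                    updated[d] = sums[c][d] / counts[c];
                }
                if (distance(updated, centers[c]) > delta) {
                    converged = false;
                }
                centers[c] = updated;
            }
            if (converged) {
                break; // termination condition met
            }
        }
        return centers;
    }

    public static void main(String[] args) {
        double[][] initial = {POINTS[0], POINTS[1]};
        double[][] centers = fit(POINTS, initial, 0.001, 10);
        int[] assignment = assign(POINTS, centers);
        for (int i = 0; i < POINTS.length; i++) {
            System.out.println(Arrays.toString(POINTS[i])
                + " belongs to cluster " + assignment[i]);
        }
    }
}
```

On these nine points the sketch settles after three passes: the first five points end up in cluster 0 with center (1.8, 1.8) and the last four in cluster 1 with center (8.5, 8.5). The final call to assign after fit is a separate labeling pass over the converged centers.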