Source Code Analysis of Mahout's k-means Algorithm

Our starting point is org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.run(Configuration conf, Path input, Path output, DistanceMeasure measure, int k, double convergenceDelta, int maxIterations):


public static void run(Configuration conf, Path input, Path output,
    DistanceMeasure measure, int k, double convergenceDelta,
    int maxIterations) throws Exception {
  Path directoryContainingConvertedInput = new Path(output,
      DIRECTORY_CONTAINING_CONVERTED_INPUT);
  log.info("Preparing Input");
  InputDriver.runJob(input, directoryContainingConvertedInput,
      "org.apache.mahout.math.RandomAccessSparseVector");
  log.info("Running random seed to get initial clusters");
  Path clusters = new Path(output, Cluster.INITIAL_CLUSTERS_DIR);
  clusters = RandomSeedGenerator.buildRandom(conf,
      directoryContainingConvertedInput, clusters, k, measure);
  log.info("Running KMeans");
  KMeansDriver.run(conf, directoryContainingConvertedInput, clusters,
      output, measure, convergenceDelta, maxIterations, true, false);
  // run ClusterDumper
  ClusterDumper clusterDumper = new ClusterDumper(finalClusterPath(conf,
      output, maxIterations), new Path(output, "clusteredPoints"));
  clusterDumper.printClusters(null);
}

1. Convert the format of the data files under the input path testdata: each line of a file is turned into a VectorWritable object and written, as a SequenceFile, to the output/data directory; the corresponding key is the number of values on that line (the dimension of the vector). This is done by the following code:

log.info("Preparing Input");InputDriver.runJob(input, directoryContainingConvertedInput,"org.apache.mahout.math.RandomAccessSparseVector");

2. Randomly choose k VectorWritable objects from the files generated in step 1 as the k initial clusters, each containing exactly one chosen point. This is done by the following code:

log.info("Running random seed to get initial clusters");Path clusters = new Path(output, Cluster.INITIAL_CLUSTERS_DIR);clusters = RandomSeedGenerator.buildRandom(conf,directoryContainingConvertedInput, clusters, k, measure);

The random selection algorithm is simple; roughly it works as follows. For the files under the input path (i.e. output/data, which holds the format-converted data; there may be more than one file), take the first k points (VectorWritable objects) and build one cluster from each, yielding k clusters. Then, for every remaining point in the files, replace one of the k clusters (picked uniformly at random) with that point, with probability k/(k+1). Finally, the k randomly generated Clusters are saved to the part-randomSeed file under output/clusters-0.
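
This is essentially reservoir-style sampling. A minimal, self-contained sketch of the selection logic (my paraphrase, not the actual RandomSeedGenerator source; the helper name is illustrative):

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Sketch of the seed selection: keep the first k points, then give every
// later point a k/(k+1) chance of replacing a uniformly chosen seed.
public class RandomSeedSketch {
  public static <T> List<T> chooseK(Iterable<T> points, int k, Random random) {
    List<T> chosen = new ArrayList<T>(k);
    for (T point : points) {
      if (chosen.size() < k) {
        chosen.add(point);                     // the first k points seed the clusters
      } else if (random.nextInt(k + 1) != 0) { // true with probability k/(k+1)
        chosen.set(random.nextInt(k), point);  // replace a random existing seed
      }
    }
    return chosen;
  }
}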


3. Finally, let us analyze the core of the k-means algorithm:

log.info("Running KMeans");KMeansDriver.run(conf, directoryContainingConvertedInput, clusters,output, measure, convergenceDelta, maxIterations, true, false);

Inside KMeansDriver.run() we find:

Path clustersOut = buildClusters(conf, input, clustersIn, output,
    measure, maxIterations, delta, runSequential);
if (runClustering) {
  log.info("Clustering data");
  clusterData(conf, input, clustersOut, new Path(output,
      AbstractCluster.CLUSTERED_POINTS_DIR), measure, delta,
      runSequential);
}

In buildClusters(), input is the path of the converted data files, output/data, and clustersIn is the part-randomSeed file generated in step 2. Let us look at buildClusters() in detail:

public static Path buildClusters(Configuration conf, Path input,
    Path clustersIn, Path output, DistanceMeasure measure,
    int maxIterations, String delta, boolean runSequential)
    throws IOException, InterruptedException, ClassNotFoundException {
  if (runSequential) {
    return buildClustersSeq(conf, input, clustersIn, output, measure,
        maxIterations, delta);
  } else {
    return buildClustersMR(conf, input, clustersIn, output, measure,
        maxIterations, delta);
  }
}

private static Path buildClustersMR(Configuration conf, Path input,
    Path clustersIn, Path output, DistanceMeasure measure,
    int maxIterations, String delta) throws IOException,
    InterruptedException, ClassNotFoundException {
  boolean converged = false;
  int iteration = 1;
  while (!converged && iteration <= maxIterations) {
    log.info("K-Means Iteration {}", iteration);
    // point the output to a new directory per iteration
    Path clustersOut = new Path(output, AbstractCluster.CLUSTERS_DIR
        + iteration);
    converged = runIteration(conf, input, clustersIn, clustersOut,
        measure.getClass().getName(), delta);
    // now point the input to the old output directory
    clustersIn = clustersOut;
    iteration++;
  }
  Path finalClustersIn = new Path(output, AbstractCluster.CLUSTERS_DIR
      + (iteration - 1) + "-final");
  FileSystem.get(conf).rename(new Path(output, AbstractCluster.CLUSTERS_DIR
      + (iteration - 1)), finalClustersIn);
  return finalClustersIn;
}
The files produced by each iteration go into a separate directory (output/clusters-i). runIteration() launches a MapReduce job that assigns every point in the input files to a Cluster and updates each Cluster's statistics:

private static boolean runIteration(Configuration conf, Path input,
    Path clustersIn, Path clustersOut, String measureClass,
    String convergenceDelta) throws IOException, InterruptedException,
    ClassNotFoundException {
  conf.set(KMeansConfigKeys.CLUSTER_PATH_KEY, clustersIn.toString());
  conf.set(KMeansConfigKeys.DISTANCE_MEASURE_KEY, measureClass);
  conf.set(KMeansConfigKeys.CLUSTER_CONVERGENCE_KEY, convergenceDelta);
  Job job = new Job(conf,
      "KMeans Driver running runIteration over clustersIn: " + clustersIn);
  job.setMapOutputKeyClass(Text.class);
  job.setMapOutputValueClass(ClusterObservations.class);
  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(Cluster.class);
  job.setInputFormatClass(SequenceFileInputFormat.class);
  job.setOutputFormatClass(SequenceFileOutputFormat.class);
  job.setMapperClass(KMeansMapper.class);
  job.setCombinerClass(KMeansCombiner.class);
  job.setReducerClass(KMeansReducer.class);
  FileInputFormat.addInputPath(job, input);
  FileOutputFormat.setOutputPath(job, clustersOut);
  job.setJarByClass(KMeansDriver.class);
  HadoopUtil.delete(conf, clustersOut);
  if (!job.waitForCompletion(true)) {
    throw new InterruptedException(
        "K-Means Iteration failed processing " + clustersIn);
  }
  FileSystem fs = FileSystem.get(clustersOut.toUri(), conf);
  return isConverged(clustersOut, conf, fs);
}

In KMeansMapper, the setup() method reads the current clustersIn directory to obtain the k clusters produced at the end of the previous iteration and stores them in Collection<Cluster> clusters, so the mapper can access all k of them:
protected void setup(Context context) throws IOException,
    InterruptedException {
  super.setup(context);
  Configuration conf = context.getConfiguration();
  DistanceMeasure measure = ClassUtils.instantiateAs(
      conf.get(KMeansConfigKeys.DISTANCE_MEASURE_KEY),
      DistanceMeasure.class);
  measure.configure(conf);
  this.clusterer = new KMeansClusterer(measure);
  String clusterPath = conf.get(KMeansConfigKeys.CLUSTER_PATH_KEY);
  if (clusterPath != null && !clusterPath.isEmpty()) {
    KMeansUtil.configureWithClusterInfo(conf, new Path(clusterPath),
        clusters);
    if (clusters.isEmpty()) {
      throw new IllegalStateException(
          "No clusters found. Check your -c path.");
    }
  }
}

In the map() method, for each input point the mapper calls KMeansClusterer's emitPointToNearestCluster(point.get(), this.clusters, context) to find the Cluster nearest to that point, then writes that Cluster's id together with a ClusterObservations object to the map output:

public void emitPointToNearestCluster(Vector point,
    Iterable<Cluster> clusters,
    Mapper<?, ?, Text, ClusterObservations>.Context context)
    throws IOException, InterruptedException {
  Cluster nearestCluster = null;
  double nearestDistance = Double.MAX_VALUE;
  for (Cluster cluster : clusters) {
    Vector clusterCenter = cluster.getCenter();
    double distance = this.measure.distance(
        clusterCenter.getLengthSquared(), clusterCenter, point);
    if (log.isDebugEnabled()) {
      log.debug("{} Cluster: {}", distance, cluster.getId());
    }
    if (distance < nearestDistance || nearestCluster == null) {
      nearestCluster = cluster;
      nearestDistance = distance;
    }
  }
  context.write(new Text(nearestCluster.getIdentifier()),
      new ClusterObservations(1, point, point.times(point)));
}

KMeansCombiner merges the key-value pairs with identical keys within one map task's output, cutting down the volume of data shipped to the reducers:

protected void reduce(Text key, Iterable<ClusterObservations> values,
    Context context) throws IOException, InterruptedException {
  Cluster cluster = new Cluster();
  for (ClusterObservations value : values) {
    cluster.observe(value);
  }
  context.write(key, cluster.getObservations());
}

In KMeansReducer, setup() likewise reads the current clustersIn directory to obtain the k clusters from the previous iteration, storing them in a HashMap (the clusterMap field).

protected void reduce(Text key, Iterable<ClusterObservations> values,
    Context context) throws IOException, InterruptedException {
  Cluster cluster = clusterMap.get(key.toString());
  for (ClusterObservations delta : values) {
    cluster.observe(delta);
  }
  // force convergence calculation
  boolean converged = clusterer.computeConvergence(cluster, convergenceDelta);
  if (converged) {
    context.getCounter("Clustering", "Converged Clusters").increment(1);
  }
  cluster.computeParameters();
  context.write(new Text(cluster.getIdentifier()), cluster);
}

In reduce(), the Cluster corresponding to the current key is fetched from clusterMap. Since all key-value pairs with the same key correspond to the points of one Cluster, iterating over that key's values visits every point of that cluster. For each value (a ClusterObservations object), cluster.observe(value) is called; internally this adds the ClusterObservations' S0 into the cluster's S0, its S1 into S1, and its S2 into S2. Initially each cluster has S0 = 0, S1 = null, S2 = null, which can be seen from the computeParameters() method shown further below. My understanding is that S0 is the number of points in the cluster, S1 is the Vector formed by summing, component by component, the VectorWritable of every point, and S2 is the Vector formed by summing the squares of those components; a simplified sketch of this accumulation follows.
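
A minimal rendering of that accumulation, assuming simple field access (the real logic lives in AbstractCluster's observe(); this sketch only shows how each ClusterObservations triple (x0, x1, x2) = (1, point, point·point) folds into the running sums):

import org.apache.mahout.math.Vector;

// Simplified sketch of the running sums a Cluster maintains.
public class ObservationSums {
  private double s0;  // S0: number of points observed, starts at 0
  private Vector s1;  // S1: component-wise sum of the points, starts at null
  private Vector s2;  // S2: component-wise sum of the squared points, starts at null

  public void observe(double x0, Vector x1, Vector x2) {
    s0 += x0;
    s1 = (s1 == null) ? x1 : s1.plus(x1);  // plus() returns a new Vector
    s2 = (s2 == null) ? x2 : s2.plus(x2);
  }
}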

 

After iterating over all of a key's values, the cluster's S0 (initially 0), S1 (initially null), and S2 (initially null) hold fresh values, which can be used to check whether the cluster has converged: clusterer.computeConvergence(cluster, convergenceDelta). If it has, the converged-clusters counter is incremented by 1. Then computeParameters() computes the cluster's remaining properties, namely numPoints, center, and radius, while resetting S0 to 0 and S1, S2 to null:

public void computeParameters() {
  if (getS0() == 0) {
    return;
  }
  setNumPoints((int) getS0());
  setCenter(getS1().divide(getS0()));
  // compute the component stds
  if (getS0() > 1) {
    setRadius(getS2().times(getS0()).minus(getS1().times(getS1()))
        .assign(new SquareRootFunction()).divide(getS0()));
  }
  setS0(0);
  setS1(null);
  setS2(null);
}
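
Two details here are worth unpacking. First, the radius set above is just the per-component standard deviation of the cluster's points, by the usual variance identity:

radius_i = sqrt(S0 * S2_i - S1_i^2) / S0 = sqrt(S2_i / S0 - (S1_i / S0)^2)

Second, as for the convergence test invoked just before computeParameters(): my reading is that the cluster's old center is compared against the freshly accumulated centroid S1/S0 under the configured DistanceMeasure. A sketch of that idea (a paraphrase with illustrative names, not the actual computeConvergence() source):

import org.apache.mahout.common.distance.DistanceMeasure;
import org.apache.mahout.math.Vector;

// Sketch: a cluster counts as converged when its old center and the
// newly computed centroid S1/S0 lie within convergenceDelta of each other.
public class ConvergenceSketch {
  public static boolean isConverged(Vector oldCenter, Vector s1, double s0,
      DistanceMeasure measure, double convergenceDelta) {
    Vector newCentroid = s1.divide(s0);  // centroid from the fresh sums
    return measure.distance(newCentroid, oldCenter) <= convergenceDelta;
  }
}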

Finally, the cluster's id is written as the key and the cluster itself as the value to the output clustersOut (output/clusters-iteration). It is worth seeing how a Cluster's write() method serializes it to the stream: write() also delegates to its superclass's write(); combining the two, Cluster.write() essentially emits the following:

out.writeUTF(measure.getClass().getName());
out.writeInt(id);
out.writeLong(getNumPoints());
VectorWritable.writeVector(out, getCenter());
VectorWritable.writeVector(out, getRadius());
out.writeBoolean(converged);

As you can see, what gets written is the DistanceMeasure class name, the cluster's id, the number of points, the center, the radius, and the convergence flag.


Back in runIteration(), the last step is to check whether, after this iteration, all Clusters have converged. Once the MapReduce job finishes, the iteration's result (the k Clusters obtained) has been written to the output directory clustersOut (output/clusters-iteration). isConverged() reads the files in that directory (with multiple reducers there will be several files, the Clusters spread across them) and examines each Cluster in turn; as soon as one non-converged Cluster is found, it returns false, meaning global convergence has not been reached and another iteration must run.

private static boolean isConverged(Path filePath, Configuration conf,
    FileSystem fs) throws IOException {
  for (FileStatus part : fs.listStatus(filePath, PathFilters.partFilter())) {
    SequenceFileValueIterator<Cluster> iterator =
        new SequenceFileValueIterator<Cluster>(part.getPath(), true, conf);
    while (iterator.hasNext()) {
      Cluster value = iterator.next();
      if (!value.isConverged()) {
        Closeables.closeQuietly(iterator);
        return false;
      }
    }
  }
  return true;
}

At this point one iteration is fully done, returning whether all Clusters have converged after it. Control returns to buildClustersMR(), which assigns this iteration's output directory clustersOut (holding the latest k Clusters) to clustersIn, so that the next iteration reads the newest Clusters from there, and increments the iteration counter. If not all Clusters have converged and the maximum number of iterations has not been reached, the next iteration runs. When the while loop ends, the directory holding the final set of Clusters is renamed by appending the "-final" suffix:
Path finalClustersIn = new Path(output, AbstractCluster.CLUSTERS_DIR
    + (iteration - 1) + "-final");
FileSystem.get(conf).rename(new Path(output, AbstractCluster.CLUSTERS_DIR
    + (iteration - 1)), finalClustersIn);
return finalClustersIn;

With that, buildClusters() has finished and control returns to KMeansDriver.run(), which executes:

if (runClustering) {
  log.info("Clustering data");
  clusterData(conf, input, clustersOut, new Path(output,
      AbstractCluster.CLUSTERED_POINTS_DIR), measure, delta,
      runSequential);
}

clusterData() calls clusterDataMR() to launch one more MapReduce job which, based on the final clustersOut directory (holding each final Cluster's information) and the converted input data, determines which points belong to each Cluster; conceptually this is the nearest-center lookup sketched below. The results are written to the output/clusteredPoints directory, and with that KMeansDriver.run() finishes.
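
A minimal sketch of that per-point assignment (the actual job uses its own mapper; the helper below is illustrative and assumes the final centers have been loaded into a list):

import java.util.List;
import org.apache.mahout.common.distance.DistanceMeasure;
import org.apache.mahout.math.Vector;

// Sketch of the final assignment step: pair each point with the index of
// the nearest final cluster; the real job writes (clusterId, point) pairs
// under output/clusteredPoints.
public class AssignmentSketch {
  public static int nearestClusterIndex(Vector point, List<Vector> centers,
      DistanceMeasure measure) {
    int nearest = -1;
    double nearestDistance = Double.MAX_VALUE;
    for (int i = 0; i < centers.size(); i++) {
      double d = measure.distance(centers.get(i), point);
      if (d < nearestDistance) {
        nearestDistance = d;
        nearest = i;
      }
    }
    return nearest;
  }
}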


Finally, Job.run() uses a ClusterDumper to print the concrete clustering result (which points each Cluster contains, the point count, center, radius, and so on).

