Implementing the K-means Algorithm on the MapReduce Framework

Source: Internet. Editor: 程序博客网. Time: 2024/05/22 06:56

1. An informal description of the K-means algorithm

Given a set of N objects that must be grouped into K clusters, the k-means algorithm performs the following steps:

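1) Partition the N objects into K non-empty subsets.

2) Compute the centroid of each cluster in the current partition (the centroid is the center, or mean point, of the cluster).

3) Assign each object to the cluster whose centroid is nearest.

4) If no object was reassigned, stop. Otherwise return to step 2.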

The algorithm iterates until the centroids no longer change; at that point the K clusters we are looking for have been found.
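The steps above can be sketched on a single machine before moving to MapReduce. This is a minimal illustration, not the article's code: the class name `LocalKMeans`, the `eps` tolerance, the `maxIter` cap, and the choice to leave an empty cluster's centroid unchanged are all assumptions made for the example. It uses the sample points and initial centroids from section 4.

```java
import java.util.Arrays;

/** Single-machine k-means sketch; names here are illustrative, not from the article. */
final class LocalKMeans {

    /** Index of the centroid closest to the point (squared Euclidean distance). */
    static int nearest(double[] point, double[][] centroids) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < centroids.length; c++) {
            double dist = 0;
            for (int d = 0; d < point.length; d++) {
                double diff = point[d] - centroids[c][d];
                dist += diff * diff;
            }
            if (dist < bestDist) {
                bestDist = dist;
                best = c;
            }
        }
        return best;
    }

    /** Repeats assign/recompute until no coordinate moves more than eps, or maxIter is hit. */
    static double[][] cluster(double[][] points, double[][] initial, double eps, int maxIter) {
        int k = initial.length;
        int dim = initial[0].length;
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) {
            centroids[c] = initial[c].clone();
        }
        for (int iter = 0; iter < maxIter; iter++) {
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            // Step 3: assign every point to its nearest centroid.
            for (double[] p : points) {
                int c = nearest(p, centroids);
                counts[c]++;
                for (int d = 0; d < dim; d++) {
                    sums[c][d] += p[d];
                }
            }
            // Step 2: recompute each centroid as the mean of its members.
            boolean moved = false;
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) {
                    continue; // assumption: an empty cluster keeps its old centroid
                }
                for (int d = 0; d < dim; d++) {
                    double mean = sums[c][d] / counts[c];
                    if (Math.abs(mean - centroids[c][d]) > eps) {
                        moved = true;
                    }
                    centroids[c][d] = mean;
                }
            }
            // Step 4: stop once nothing moved.
            if (!moved) {
                break;
            }
        }
        return centroids;
    }

    public static void main(String[] args) {
        double[][] points = {{1, 1}, {2, 2}, {3, 3}, {-3, -3}, {-4, -4}, {-5, -5}};
        double[][] initial = {{1, 1}, {2, 2}};
        // prints [[-4.0, -4.0], [2.0, 2.0]]
        System.out.println(Arrays.deepToString(cluster(points, initial, 1e-7, 100)));
    }
}
```

With this data the loop settles after three passes: the first pass moves the centroids to (-2.75, -2.75) and (2.5, 2.5), the second to (-4, -4) and (2, 2), and the third confirms that nothing moves.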

2. The K-means distance function

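The Euclidean distance is used. For two points a(x1, y1) and b(x2, y2) in the two-dimensional plane, the Euclidean distance is:

$$d(a, b) = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$$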
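For reference, the same distance extends to points with n coordinates, which is the form the Java `distance` helper in section 3.5 computes over arrays. The symbol n for the number of dimensions is a notation chosen here, not taken from the article. The worked instance uses the first two sample points from section 4:

```latex
d(a, b) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2},
\qquad
d\big((1,1),\,(2,2)\big) = \sqrt{(1-2)^2 + (1-2)^2} = \sqrt{2} \approx 1.414
```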

3. The MapReduce solution

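1. The main function reads the centroid file.

2. The centroids are placed into the Configuration as a string.

3. The mapper class overrides the setup method, reads the centroid string from the Configuration, and parses it into a two-dimensional array that represents the centroids.

4. The map method of the mapper class reads the sample file, compares each sample with every centroid to find the nearest one, and emits <centroid, sample>.

5. The reducer class recomputes the centroids; if a recomputed centroid matches the centroid that came in, a custom counter is incremented by 1.

6. main reads the counter value and checks whether it equals the number of centroids. If it does not, iteration continues; otherwise the program exits.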
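Steps 2 and 3 rely on the centroids surviving a round trip through a single string. The sketch below shows one way that round trip could look without any Hadoop dependency; `CentroidCodec`, `encode`, and `decode` are hypothetical names for this example. The layout (coordinates joined by `,`, centroids joined by a tab) follows what the article's loader and mapper use.

```java
import java.util.Arrays;

/** Sketch of passing centroids as one string; class and method names are illustrative. */
final class CentroidCodec {

    /** Joins centroids as "x,y<TAB>x,y", the layout the mapper's setup method splits apart. */
    static String encode(double[][] centroids) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < centroids.length; i++) {
            if (i > 0) {
                sb.append('\t');
            }
            for (int j = 0; j < centroids[i].length; j++) {
                if (j > 0) {
                    sb.append(',');
                }
                sb.append(centroids[i][j]);
            }
        }
        return sb.toString();
    }

    /** Splits on tabs, then on commas, and parses every coordinate as a double. */
    static double[][] decode(String encoded) {
        String[] parts = encoded.split("\t");
        double[][] centroids = new double[parts.length][];
        for (int i = 0; i < parts.length; i++) {
            String[] coords = parts[i].split(",");
            centroids[i] = new double[coords.length];
            for (int j = 0; j < coords.length; j++) {
                centroids[i][j] = Double.parseDouble(coords[j]);
            }
        }
        return centroids;
    }

    public static void main(String[] args) {
        String encoded = encode(new double[][] {{1, 1}, {2, 2}});
        // prints [[1.0, 1.0], [2.0, 2.0]]
        System.out.println(Arrays.deepToString(decode(encoded)));
    }
}
```

In the job itself the encoded string would travel through `conf.set(FLAG, centerStr)` in main and `context.getConfiguration().get(FLAG)` in the mapper, as the code in the following sections does.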

3.1 Preparation stage (reading the centroid file)

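There are two kinds of read: the first reads the centroid file supplied by the user, and later reads pick up the centroid files written by the reducer.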

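import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.util.LineReader;

public class Center {
    protected static int k = 2; // number of centroids; two are output in every round

    // Load the initial centroids stored in a file on HDFS
    public String loadInitCenter(Path path) throws Exception {
        StringBuilder sb = new StringBuilder();
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(new URI("hdfs://example:9000"), conf, "hadoop");
        FSDataInputStream din = fs.open(path);     // open an input stream on the target file
        LineReader in = new LineReader(din, conf); // wrap it for line-by-line reading
        Text line = new Text();
        while (in.readLine(line) > 0) { // readLine returns the bytes consumed, 0 at end of file
            sb.append(line.toString().trim());
            sb.append("\t"); // centroids are separated by \t
        }
        in.close();
        return sb.toString().trim();
    }

    // Load the centroids regenerated by a later reduce round.
    // sb collects the k centroids from every output file, separated by \t
    public String loadCenter(Path path) throws Exception {
        StringBuilder sb = new StringBuilder();
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // list every file in the reduce output directory
        RemoteIterator<LocatedFileStatus> files = fs.listFiles(path, false);
        while (files.hasNext()) {
            LocatedFileStatus lfs = files.next();
            // skip files that do not hold centroids
            if (!lfs.getPath().getName().contains("part")) {
                continue;
            }
            FSDataInputStream din = fs.open(lfs.getPath());
            LineReader in = new LineReader(din);
            Text line = new Text();
            while (in.readLine(line) > 0) {
                sb.append(line.toString().trim());
                sb.append("\t");
            }
            in.close();
        }
        return sb.toString().trim();
    }
}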

3.2 Mapper stage

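1) In the preprocessing step (the setup method), obtain the given centroids.

2) For each "vector" line read from the file, compute its distance to every centroid.

3) Keep the centroid that has the smallest distance to the input point.

4) Emit the centroid nearest to the input point as the key and the vector as the value.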

static class K_meansMapper extends Mapper<LongWritable, Text, Text, Text> {
    String centerStrArray[] = null; // each element is one centroid's d coordinates joined by ","
    double centers[][] = new double[Center.k][]; // centers[i][j] is coordinate j of centroid i

    // Preprocessing: collect the centroids.
    // FLAG is the configuration key constant, assumed to be declared in the enclosing class
    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // centroids produced by the previous round of clustering
        String centerSource = context.getConfiguration().get(FLAG);
        System.out.println(centerSource);
        centerStrArray = centerSource.split("\t"); // one string per centroid
        centers = new double[centerStrArray.length][]; // size to the centroids actually loaded
        for (int i = 0; i < centerStrArray.length; i++) {
            String centerStr[] = centerStrArray[i].split(","); // coordinates of one centroid
            centers[i] = new double[centerStr.length];
            for (int j = 0; j < centerStr.length; j++) {
                centers[i][j] = Double.parseDouble(centerStr[j]);
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        String vector[] = line.split(",");
        double sample[] = new double[vector.length];
        for (int i = 0; i < vector.length; i++) {
            sample[i] = Double.parseDouble(vector[i]);
        }
        double min = Double.MAX_VALUE; // smallest distance seen so far
        int index = 0;                 // index of the centroid at that distance
        // compute the distance from the input point to every centroid and keep the nearest
        for (int i = 0; i < centers.length; i++) {
            double d = distance(sample, centers[i]);
            if (min > d) {
                min = d;
                index = i;
            }
        }
        // emit <centroid, vector>
        context.write(new Text(centerStrArray[index]), value);
    }
}
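To see what the map phase hands to the next stage, the following sketch reproduces the <centroid, sample> pairs in memory and groups them by key, roughly as the shuffle would. It is an illustration only: `MapPhaseDemo` and its methods are hypothetical names, no Hadoop classes are involved, and the data comes from section 4.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Simulates the mapper's <centroid, sample> output in memory; names are illustrative. */
final class MapPhaseDemo {

    static double[] parse(String csv) {
        String[] parts = csv.split(",");
        double[] v = new double[parts.length];
        for (int i = 0; i < parts.length; i++) {
            v[i] = Double.parseDouble(parts[i]);
        }
        return v;
    }

    static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += (a[i] - b[i]) * (a[i] - b[i]);
        }
        return Math.sqrt(sum);
    }

    /** Groups each sample under the text of its nearest centroid. */
    static Map<String, List<String>> assign(String[] samples, String[] centroids) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (String sample : samples) {
            double[] point = parse(sample);
            String bestKey = null;
            double best = Double.MAX_VALUE;
            for (String centroid : centroids) {
                double d = distance(point, parse(centroid));
                if (d < best) {
                    best = d;
                    bestKey = centroid;
                }
            }
            groups.computeIfAbsent(bestKey, key -> new ArrayList<>()).add(sample);
        }
        return groups;
    }

    public static void main(String[] args) {
        String[] samples = {"1,1", "2,2", "3,3", "-3,-3", "-4,-4", "-5,-5"};
        String[] centroids = {"1,1", "2,2"};
        assign(samples, centroids).forEach((key, members) ->
                System.out.println(key + " -> " + members));
    }
}
```

In the first round the key "1,1" collects the sample 1,1 together with the three negative points, while "2,2" collects 2,2 and 3,3. Those two groups are what the combiner and reducer average.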

3.3 Combiner stage

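After each map task, a combiner is applied to merge that task's intermediate data. The combiner accumulates the values of each dimension of the vector objects and computes the current mean. The combine function runs after the map phase has produced output, when the in-memory buffer spills, so it amounts to a local merge whose result is then sent over the network to the reducers. This can noticeably reduce network traffic and so improve the efficiency of the algorithm. Note that Hadoop treats the combiner as an optional optimization and may run it zero, one, or several times for a given key.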

static class K_meansCombiner extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        int len = key.toString().split(",").length;
        double center[] = new double[len];
        int size = 0;
        Iterator<Text> iterator = values.iterator();
        while (iterator.hasNext()) {
            String centerStr[] = iterator.next().toString().split(",");
            for (int i = 0; i < len; i++) {
                center[i] += Double.parseDouble(centerStr[i]);
            }
            size++;
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < center.length; i++) {
            center[i] /= size;
            sb.append(center[i]); // append the coordinate, not the array reference
            sb.append(",");
        }
        sb.deleteCharAt(sb.length() - 1);
        // Note: only the local mean is emitted, without the number of points behind it,
        // so the reducer's mean of these means is exact only when the groups are equal in size
        context.write(key, new Text(sb.toString()));
    }
}
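One caveat is worth illustrating. A combiner that emits only a local mean drops the number of points behind that mean, so averaging those means again gives the true mean only when every local group has the same size. A common alternative is to carry per-dimension sums together with a count. The sketch below shows the arithmetic in plain Java; `PartialMean` and `Partial` are hypothetical names, the numbers are invented for the example, and adopting this in the job would mean changing both the combiner and the reducer to agree on a value format that includes the count.

```java
/** Sketch of count-carrying partial means; all names here are illustrative. */
final class PartialMean {

    /** Running per-dimension sums plus the number of points they cover. */
    static final class Partial {
        final double[] sum;
        long count;

        Partial(int dim) {
            sum = new double[dim];
        }

        /** What a combiner would do for each point it sees. */
        void add(double[] point) {
            for (int i = 0; i < sum.length; i++) {
                sum[i] += point[i];
            }
            count++;
        }

        /** What a reducer would do with the partials from several combiners. */
        void merge(Partial other) {
            for (int i = 0; i < sum.length; i++) {
                sum[i] += other.sum[i];
            }
            count += other.count;
        }

        double[] mean() {
            double[] m = new double[sum.length];
            for (int i = 0; i < sum.length; i++) {
                m[i] = sum[i] / count;
            }
            return m;
        }
    }

    static Partial of(double[][] points) {
        Partial p = new Partial(points[0].length);
        for (double[] point : points) {
            p.add(point);
        }
        return p;
    }

    public static void main(String[] args) {
        Partial a = of(new double[][] {{1, 1}, {2, 2}, {3, 3}}); // one map task's points
        Partial b = of(new double[][] {{9, 9}});                  // another map task's points
        double meanOfMeans = (a.mean()[0] + b.mean()[0]) / 2;     // 5.5, ignores group sizes
        a.merge(b);
        System.out.println(meanOfMeans + " vs " + a.mean()[0]);   // prints "5.5 vs 3.75"
    }
}
```

The true mean of the four points is 15 / 4 = 3.75 in each dimension, which the merged sums and counts reproduce, while the unweighted mean of the two local means gives 5.5.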

3.4 Reducer stage

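1) Recompute the cluster centers.

2) Each reducer iterates over its value vectors and computes their mean. The computed mean is emitted as the next cluster center.

3) Compare the new centroid with the old one; if the difference is below the threshold, increment the custom counter by 1.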

static class K_meansReducer extends Reducer<Text, Text, Text, NullWritable> {
    Counter counter = null;

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        int len = key.toString().split(",").length;
        double newCenter[] = new double[len]; // the newly computed cluster center
        int size = 0;                         // number of vectors received for this cluster
        for (Text value : values) {
            String centerStr[] = value.toString().split(","); // one point in d-dimensional space
            for (int i = 0; i < centerStr.length; i++) {
                // accumulate each coordinate so the mean can be taken afterwards
                newCenter[i] += Double.parseDouble(centerStr[i]);
            }
            size++;
        }
        // build the new centroid's coordinates as a comma-separated string
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < newCenter.length; i++) {
            newCenter[i] /= size; // take the mean
            sb.append(newCenter[i]);
            sb.append(",");
        }
        sb.deleteCharAt(sb.length() - 1);
        // the key passed on by map is the centroid from the previous round
        String oldCenterStr[] = key.toString().split(",");
        double oldCenter[] = new double[oldCenterStr.length];
        for (int i = 0; i < oldCenterStr.length; i++) {
            oldCenter[i] = Double.parseDouble(oldCenterStr[i]);
        }
        // compare the new centroid with the old one
        boolean flag = changed(oldCenter, newCenter);
        // changed() returns true when the centroid has NOT moved beyond the threshold,
        // so the counter counts the centroids that have reached their final position
        if (flag) {
            // first argument: counter group name; second: counter name
            counter = context.getCounter("myCounter", "kmenasCounter");
            counter.increment(1L);
        }
        context.write(new Text(sb.toString().trim()), NullWritable.get());
    }
}

3.5 Helper methods

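There are two: one checks whether the change between the old and new centroids has converged, and the other is the Euclidean distance function.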

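// Returns true when the old and new centroids differ by at most the threshold in every
// dimension, i.e. the centroid has converged (despite the method name)
public static boolean changed(double oldCenter[], double newCenter[]) {
    for (int i = 0; i < oldCenter.length; i++) {
        // use the absolute difference so that movement in either direction is detected
        if (Math.abs(oldCenter[i] - newCenter[i]) > 0.0000001) {
            return false;
        }
    }
    return true;
}

// Euclidean distance
public static double distance(double center[], double data[]) {
    double sum = 0;
    for (int i = 0; i < center.length; i++) {
        sum += Math.pow(center[i] - data[i], 2);
    }
    return Math.sqrt(sum);
}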
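The tolerance check is easy to get subtly wrong, because a signed difference such as old minus new is negative whenever a coordinate grows and then never exceeds a positive threshold. The following standalone sketch compares absolute differences instead; `Convergence`, `converged`, and the `eps` parameter are illustrative names rather than part of the article's code.

```java
/** Sketch of a tolerance-based convergence test; names are illustrative. */
final class Convergence {

    /** True when every coordinate moved by at most eps, in either direction. */
    static boolean converged(double[] oldCenter, double[] newCenter, double eps) {
        for (int i = 0; i < oldCenter.length; i++) {
            if (Math.abs(oldCenter[i] - newCenter[i]) > eps) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        double eps = 1e-7;
        // the centroid grew from (1,1) to (2,2): old - new is -1, but |old - new| is 1
        System.out.println(converged(new double[] {1, 1}, new double[] {2, 2}, eps)); // false
        System.out.println(converged(new double[] {2, 2}, new double[] {2, 2}, eps)); // true
    }
}
```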

3.6 The main method

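1) Decide the input and output directories.

2) Read the centroids from the file, convert them to a string, and put the string into the Configuration.

3) Control the number of iterations through the custom counter.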

public static void main(String[] args) throws Exception {
    Path inputPath = new Path("/kmeans/input");
    Path centerPath = new Path("/kmeans/output/center.txt");
    Center center = new Center();
    String centerStr = center.loadInitCenter(centerPath); // load the initial centroids
    int index = 0;
    while (true) {
        Configuration conf = new Configuration();
        // FLAG is the configuration key constant, assumed to be declared in this class
        conf.set(FLAG, centerStr); // hand the centroids to the mappers
        // switch from the initial centroid file to this round's reduce output directory,
        // which is also where the next round's centroids are read from
        centerPath = new Path("/kmeans/output" + index);
        Job job = Job.getInstance(conf, "kmeans" + index);
        job.setJarByClass(K_means.class);
        job.setMapperClass(K_meansMapper.class);
        job.setReducerClass(K_meansReducer.class);
        job.setCombinerClass(K_meansCombiner.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        FileInputFormat.setInputPaths(job, inputPath);
        FileOutputFormat.setOutputPath(job, centerPath);
        // submit the job and wait for it to finish
        job.waitForCompletion(true);
        // read the custom counter ("myCounter", "kmenasCounter");
        // if it equals k, every centroid has settled and this is the final result
        Counter counter = job.getCounters().getGroup("myCounter").findCounter("kmenasCounter");
        long countValue = counter.getValue();
        if (countValue == Center.k) {
            System.exit(0);
        } else {
            // not finished: load the new centroids written by reduce instead of the initial file
            counter.setValue(0L);
            centerStr = center.loadCenter(centerPath);
            index++;
        }
    }
}

4. Input and output

Sample data:

1,1  2,2  3,3  -3,-3  -4,-4  -5,-5

Initial centroids:

1,1  2,2

Clustering result:

-4.0,-4.0  2.0,2.0

Reference blog: http://blog.csdn.NET/nwpuwyk/article/details/29564249?utm_source=tuicool&utm_medium=referral
