Implementing KMeans on Hadoop (Part 1)


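A while ago, getting from configuring Hadoop to actually running a KMeans MapReduce program cost me several frustrating days. Yesterday I finally sorted out both the configuration problems and the runtime problems I had run into. The KMeans algorithm looks simple, but for someone writing a MapReduce program for the first time it is still a bit of a challenge. Fortunately I now understand most of it. The KMeans code was downloaded from the Internet; here I walk through how it works.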

KMeans.java

import org.apache.hadoop.conf.Configuration;  
import org.apache.hadoop.fs.FileSystem;  
import org.apache.hadoop.fs.Path;  
import org.apache.hadoop.io.Text;  
import org.apache.hadoop.mapreduce.Job;  
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;  
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;  
  
public class KMeans {  
      
    public static void main(String[] args) throws Exception  
    {  
        CenterInitial centerInitial = new CenterInitial();
        centerInitial.run(args); // initialize the cluster centers
        int times = 0;
        double s = 0, shold = 0.1; // shold is the preset convergence threshold
        do {  
            Configuration conf = new Configuration();
            // On Hadoop 2.x and later this key is deprecated in favor of "fs.defaultFS".
            conf.set("fs.default.name", "hdfs://localhost:9000");
            Job job = new Job(conf, "KMeans"); // create the KMeans MapReduce job
            job.setJarByClass(KMeans.class); // class used to locate the job jar
            job.setOutputKeyClass(Text.class); // output key type: Text
            job.setOutputValueClass(Text.class); // output value type: Text
            job.setMapperClass(KMapper.class); // set the Mapper class
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(Text.class);
            job.setReducerClass(KReducer.class); // set the Reducer class
            FileSystem fs = FileSystem.get(conf);
            fs.delete(new Path(args[2]), true); // args[2] is the output directory; delete it if it already exists
            // use the program arguments as the job's input and output paths
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[2]));
            // run the job once and check whether it completed successfully
            if (job.waitForCompletion(true)) // this MapReduce pass has finished
            {  
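                // Compare the previous centers with the newly computed ones. If the distance
                // between them is below the threshold, stop; otherwise the newest centers
                // become the current centers for the next pass.
                NewCenter newCenter = new NewCenter();
                s = newCenter.run(args);
                times++;
            } else {
                break; // the job failed: stop instead of resubmitting it indefinitely
            }
        } while (s > shold); // stop once the change is no larger than the threshold
        System.out.println("Iterator: " + times); // number of iterations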
    }  
  
}  
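
The driver above only shows the outer loop; KMapper, KReducer, CenterInitial and NewCenter are not listed in this post. To make the loop easier to follow, here is a minimal in-memory sketch of what one pass plausibly does: assign each point to its nearest center (the map side), average the points of each cluster (the reduce side), then measure how far the centers moved. Everything in it (the class KMeansLoopSketch, its method names, and the choice of total squared shift as the convergence measure) is my own assumption for illustration, not the downloaded code.

```java
import java.util.Arrays;

/** Hypothetical in-memory sketch; not the KMapper/KReducer/NewCenter used above. */
class KMeansLoopSketch {

    /** Squared Euclidean distance between two points of equal dimension. */
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /** "Map" step: index of the center closest to the point. */
    static int nearestCenter(double[] point, double[][] centers) {
        int best = 0;
        for (int c = 1; c < centers.length; c++) {
            if (squaredDistance(point, centers[c]) < squaredDistance(point, centers[best])) {
                best = c;
            }
        }
        return best;
    }

    /** "Reduce" step: each new center is the mean of the points assigned to it. */
    static double[][] recomputeCenters(double[][] points, double[][] centers) {
        int k = centers.length;
        int dim = centers[0].length;
        double[][] sums = new double[k][dim];
        int[] counts = new int[k];
        for (double[] p : points) {
            int c = nearestCenter(p, centers);
            counts[c]++;
            for (int i = 0; i < dim; i++) {
                sums[c][i] += p[i];
            }
        }
        double[][] next = new double[k][];
        for (int c = 0; c < k; c++) {
            if (counts[c] == 0) {
                next[c] = centers[c].clone(); // an empty cluster keeps its old center
            } else {
                next[c] = new double[dim];
                for (int i = 0; i < dim; i++) {
                    next[c][i] = sums[c][i] / counts[c];
                }
            }
        }
        return next;
    }

    /** Assumed convergence measure: total squared shift between old and new centers. */
    static double totalShift(double[][] oldCenters, double[][] newCenters) {
        double s = 0;
        for (int c = 0; c < oldCenters.length; c++) {
            s += squaredDistance(oldCenters[c], newCenters[c]);
        }
        return s;
    }

    /** Same shape as the do/while in KMeans.main: repeat until the shift is within the threshold. */
    static double[][] run(double[][] points, double[][] initialCenters, double threshold) {
        double[][] centers = initialCenters;
        double s;
        do {
            double[][] next = recomputeCenters(points, centers);
            s = totalShift(centers, next);
            centers = next;
        } while (s > threshold);
        return centers;
    }

    public static void main(String[] args) {
        double[][] points = {{1, 1}, {1, 2}, {2, 1}, {8, 8}, {9, 8}, {8, 9}};
        double[][] initial = {{0, 0}, {10, 10}};
        System.out.println(Arrays.deepToString(run(points, initial, 0.1)));
    }
}
```

In the Hadoop version the points live in HDFS and each pass is a full MapReduce job, which is why the driver has to delete the output directory and build a new Job object on every iteration.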

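Question: what is args[]? This one confused me for several days before I found the answer. args[] is simply the array of arguments passed to the program when it starts. They are set in Run Configurations, as follows: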

hdfs://localhost:9000/home/administrator/hadoop/kmeans/input hdfs://localhost:9000/home/administrator/hadoop/kmeans hdfs://localhost:9000/home/administrator/hadoop/kmeans/output

What each part of the code does is explained in the comments inside the program.
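
To make the role of the three arguments concrete, here is a small sketch that gives them names and rejects a wrong argument count. From KMeans.main we only know that args[0] is the input path and args[2] is the output path; args[1] is not used in main itself and is just passed along to CenterInitial and NewCenter. The class KMeansArgs and its field names are hypothetical and exist only for illustration.

```java
/** Hypothetical helper, not part of the original program: names the three arguments. */
final class KMeansArgs {
    final String inputDir;  // args[0]: passed to FileInputFormat.addInputPath in KMeans.main
    final String baseDir;   // args[1]: unused in KMeans.main; handed on to CenterInitial / NewCenter
    final String outputDir; // args[2]: deleted and rewritten on every iteration

    KMeansArgs(String inputDir, String baseDir, String outputDir) {
        this.inputDir = inputDir;
        this.baseDir = baseDir;
        this.outputDir = outputDir;
    }

    /** Fails fast with a usage message instead of an ArrayIndexOutOfBoundsException later on. */
    static KMeansArgs parse(String[] args) {
        if (args == null || args.length != 3) {
            throw new IllegalArgumentException(
                    "Usage: KMeans <input dir> <base dir> <output dir>");
        }
        return new KMeansArgs(args[0], args[1], args[2]);
    }

    public static void main(String[] args) {
        KMeansArgs parsed = parse(args);
        System.out.println("input  = " + parsed.inputDir);
        System.out.println("base   = " + parsed.baseDir);
        System.out.println("output = " + parsed.outputDir);
    }
}
```

Run with the three HDFS URLs shown above as program arguments, this prints each path next to its role, which is a quick way to confirm that Run Configurations is passing what you expect.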

0 0