Building a Hadoop Cluster from Scratch, Part 3

Source: Internet. Editor: 程序博客网. Time: 2024/04/28 07:17

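Continuing from the previous post: after reading a few articles, I tried writing two small Hadoop programs that process files. It's a bit embarrassing, really. People were playing with this four or five years ago, and I'm only now getting around to learning it, stumbling all the way. >_<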

Computing an Average

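We have two files holding some people's Chinese (语文) and math (数学) scores. Each line is one record in the form "name subject score":
testAvg.txt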

    张三 语文 88
    李四 语文 77
    王五 语文 66
    张三 数学 90
    李四 数学 79
    王五 数学 68

testAvg2.txt

    赵六 语文 88
    赵六 数学 90

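First create a directory named scoreAvg in HDFS, then use the -put command to copy the two data files into it.
Next, add a new class file to the testProject created at the end of the previous post.
The file's content is quite simple: it just overrides Hadoop's map and reduce functions.
The Map function, as I understand it, processes the input one line at a time, arranges the data into Key and Value pairs, and hands them to Hadoop.
The Reduce function receives all the data that belongs to one Key and processes it together. Inside this function you don't need to think about any other Key.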
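To make that division of labor concrete before bringing in the Hadoop classes, here is a minimal plain-Java sketch of the same flow. It is not Hadoop code: the class name AvgFlowSketch and its methods are made up for illustration, the grouping step that Hadoop performs between map and reduce is simulated with an in-memory map, and romanized names stand in for the sample data.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java stand-in for the map -> group by key -> reduce flow (no Hadoop needed).
class AvgFlowSketch {
    // "map": turn one input line "name subject score" into a (name, score) pair
    static Map.Entry<String, Integer> mapLine(String line) {
        String[] fields = line.trim().split("\\s+");
        return new AbstractMap.SimpleEntry<>(fields[0], Integer.parseInt(fields[2]));
    }

    // "shuffle": gather all values that share a key; Hadoop does this step for us
    static Map<String, List<Integer>> groupByKey(List<String> lines) {
        Map<String, List<Integer>> grouped = new LinkedHashMap<>();
        for (String line : lines) {
            Map.Entry<String, Integer> pair = mapLine(line);
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // "reduce": sees the values of a single key and nothing else
    static int reduceAverage(List<Integer> scores) {
        int sum = 0;
        for (int s : scores) {
            sum += s;
        }
        return sum / scores.size(); // integer division, fractional part dropped
    }

    // Runs the three steps in order and returns name -> average
    static Map<String, Integer> run(List<String> lines) {
        Map<String, Integer> result = new LinkedHashMap<>();
        for (Map.Entry<String, List<Integer>> e : groupByKey(lines).entrySet()) {
            result.put(e.getKey(), reduceAverage(e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "ZhangSan Chinese 88", "LiSi Chinese 77",
                "ZhangSan Math 90", "LiSi Math 79");
        System.out.println(run(lines)); // {ZhangSan=89, LiSi=78}
    }
}
```

The point of the sketch is only the shape: map looks at one line, reduce looks at one key, and everything in between is the framework's job.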

    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        public void map(LongWritable key, Text value,
                OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line, "\n");
            while (tokenizer.hasMoreElements()) {
                // Split each line into its fields
                StringTokenizer tokenizerForLine = new StringTokenizer(tokenizer.nextToken());
                // Student name
                String studentName = tokenizerForLine.nextToken();
                // Subject name; not needed in this example
                String subjectName = tokenizerForLine.nextToken();
                // Score
                String score = tokenizerForLine.nextToken();
                Text name = new Text(studentName);
                int scoreInt = Integer.parseInt(score);
                // Emit each score to Reduce with studentName as the Key, so that
                // all of one person's scores reach the same Reduce call
                output.collect(name, new IntWritable(scoreInt));
            }
        }
    }

Now the Reduce:

    public static class Reduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        // As you can see, the input to Reduce is one Key plus an iterator,
        // that is, one Key and its group of values
        public void reduce(Text key, Iterator<IntWritable> values,
                OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            int scoreSum = 0;
            int subjectCounter = 0;
            while (values.hasNext()) {
                // Add up all of this person's scores and count the subjects
                scoreSum += values.next().get();
                subjectCounter++;
            }
            // Take the average (integer division, so the fractional part is dropped)
            int scoreAvg = scoreSum / subjectCounter;
            // Emit the result
            output.collect(key, new IntWritable(scoreAvg));
        }
    }

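Finally, implement a main function that sets up the Hadoop job configuration. The setter names mostly say what they configure. Note that the call to BasicConfigurator.configure() (from log4j) must be there; otherwise many error messages will never be displayed: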

    public static void main(String[] args) throws Exception {
        // configure() matters! It matters! It matters!
        // Important things get said three times: without this line many Hadoop
        // error messages are invisible and debugging is a shot in the dark
        BasicConfigurator.configure();
        JobClient client = new JobClient();
        JobConf job = new JobConf(AvgScore.class);
        job.setJobName("AvgScore");
        // HDFS address
        job.set("fs.default.name", "hdfs://192.168.245.128:9000");
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setMapperClass(Map.class);
        // Caution: with Reduce as the combiner, the reducer averages partial
        // averages. That happens to be harmless for this sample data, where each
        // student's scores all sit in a single small file
        job.setCombinerClass(Reduce.class);
        job.setReducerClass(Reduce.class);
        job.setInputFormat(TextInputFormat.class);
        job.setOutputFormat(TextOutputFormat.class);
        // Input and output paths. The job fails if the output directory already
        // exists, so delete it before rerunning
        FileInputFormat.setInputPaths(job, new Path("hdfs://192.168.245.128:9000//scoreAvg"));
        FileOutputFormat.setOutputPath(job, new Path("hdfs://192.168.245.128:9000//scoreAvgOutput"));
        client.setConf(job);
        JobClient.runJob(job);
    }
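One caveat about using an averaging reducer as the combiner: a combiner only sees part of a key's values, so the final reducer ends up averaging partial averages, which in general is not the overall average. The plain-Java sketch below (no Hadoop, all names hypothetical) shows the difference and one common way around it, which is to pass a sum and a count through the combine step and divide only at the very end. In a real job that pair would have to travel as a custom Writable or an encoded Text value, which is not shown here.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration (no Hadoop): why an averaging reducer is a risky combiner.
class CombinerAvgSketch {
    // Each partial group is averaged first (combiner = reducer), then the
    // partial averages are averaged again by the final reduce step.
    static int averageOfAverages(List<List<Integer>> partials) {
        int sumOfAverages = 0;
        for (List<Integer> part : partials) {
            int partSum = 0;
            for (int v : part) {
                partSum += v;
            }
            sumOfAverages += partSum / part.size();
        }
        return sumOfAverages / partials.size();
    }

    // Combiner-safe alternative: a partial group is reduced to {sum, count} ...
    static long[] combine(List<Integer> part) {
        long sum = 0;
        for (int v : part) {
            sum += v;
        }
        return new long[] { sum, part.size() };
    }

    // ... and the final step adds up sums and counts, dividing only once.
    static int averageFromSumCount(List<List<Integer>> partials) {
        long sum = 0;
        long count = 0;
        for (List<Integer> part : partials) {
            long[] sumAndCount = combine(part);
            sum += sumAndCount[0];
            count += sumAndCount[1];
        }
        return (int) (sum / count);
    }

    public static void main(String[] args) {
        // One key whose three scores were split across two map outputs
        List<List<Integer>> partials =
                Arrays.asList(Arrays.asList(90, 80), Arrays.asList(30));
        System.out.println(averageOfAverages(partials));   // 57
        System.out.println(averageFromSumCount(partials)); // 66
    }
}
```

With scores 90, 80 and 30 the true integer average is 66, while averaging the partial averages 85 and 30 gives 57.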

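OK, code done. Run on Hadoop:
[screenshot: run]

If nothing goes wrong, you can check the result in the output directory:
[screenshot: result1]

Looking at the result, it matches expectations ^_^
[screenshot: result2]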

Table Join

In this example the two files have different formats. The first file maps each person to an area code:

    张三 1
    李四 1
    周五 2
    赵六 3

The second file decodes the area codes into names (火星 is Mars, 水星 is Mercury, 地球 is Earth):

    1 火星
    2 水星
    3 地球

Our job is to process both tables and output each user together with the matching area, as in the screenshot below:
[screenshot: result3]

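The code is basically the same as in the previous example; only the map and reduce functions are rewritten.
Map first. It looks at the first character of each line to decide whether the record comes from the location file or the user file, and handles the two cases differently: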

    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable key, Text value,
                OutputCollector<Text, Text> output, Reporter reporter)
                throws IOException {
            try {
                String line = value.toString();
                StringTokenizer tokenizer = new StringTokenizer(line, "\n");
                while (tokenizer.hasMoreElements()) {
                    StringTokenizer tokenizerForLine = new StringTokenizer(tokenizer.nextToken());
                    String value0 = tokenizerForLine.nextToken();
                    String value1 = tokenizerForLine.nextToken();
                    // Decide which file the line came from by checking whether
                    // its first character is a digit 0-9
                    if (value0.charAt(0) >= '0' && value0.charAt(0) <= '9') {
                        // Location file: the area code is the Key, and the
                        // location string is prefixed with "Location" as a tag
                        output.collect(new Text(value0), new Text("Location" + " " + value1));
                    } else {
                        // User file: the area code is the Key, and the
                        // user string is prefixed with "User" as a tag
                        output.collect(new Text(value1), new Text("User" + " " + value0));
                    }
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
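The tagging rule in this Map is easy to try out in isolation. Below is a small plain-Java sketch of just that rule, with a hypothetical class TagSketch and method tag, and romanized stand-ins for the sample data. It returns the key and the tagged value for one line of either file.

```java
// Plain-Java sketch of the map-side tagging rule (no Hadoop classes involved).
class TagSketch {
    // Returns {key, taggedValue} for one line of either input file.
    static String[] tag(String line) {
        String[] fields = line.trim().split("\\s+");
        char first = fields[0].charAt(0);
        if (first >= '0' && first <= '9') {
            // Location line "code name": key is the code, value is tagged "Location"
            return new String[] { fields[0], "Location " + fields[1] };
        }
        // User line "name code": key is the code, value is tagged "User"
        return new String[] { fields[1], "User " + fields[0] };
    }

    public static void main(String[] args) {
        String[] location = tag("1 Mars");
        String[] user = tag("ZhangSan 1");
        System.out.println(location[0] + " -> " + location[1]); // 1 -> Location Mars
        System.out.println(user[0] + " -> " + user[1]);         // 1 -> User ZhangSan
    }
}
```

Both lines end up under the same key "1", which is what lets the reducer pair them. The rule is fragile, though: a user name that starts with a digit would be treated as a location line, so a sturdier job would probably tell the two inputs apart by file rather than by content.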

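Now the Reduce function. Because Map uses the area code as the Key, all records with the same Key naturally land in the same Reduce call, and those records include both location data and user data. In Reduce we store them in two arrays, keep a count for each, and emit the output at the end: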

    public static class Reduce extends MapReduceBase
            implements Reducer<Text, Text, Text, Text> {
        public void reduce(Text key, Iterator<Text> values,
                OutputCollector<Text, Text> output, Reporter reporter)
                throws IOException {
            try {
                // Kept simple: two fixed-length arrays hold the data.
                // Real code obviously should not do this
                String[] users = new String[5];
                String[] locations = new String[5];
                int userCounter = 0;
                int locationCounter = 0;
                while (values.hasNext()) {
                    String line = values.next().toString();
                    StringTokenizer tokenizer = new StringTokenizer(line);
                    String userOrLocation = tokenizer.nextToken();
                    String currValue = tokenizer.nextToken();
                    if (userOrLocation.equals("Location")) {
                        // Location record: store it in the locations array
                        locations[locationCounter] = currValue;
                        locationCounter++;
                    } else {
                        // User record: store it in the users array
                        users[userCounter] = currValue;
                        userCounter++;
                    }
                }
                // Records with the same area code arrive in the same Reduce call.
                // Output is needed only when this Key, that is, this area code,
                // has at least one decoded location and at least one user.
                // In practice locationCounter, the number of decoded locations,
                // is always 1 here  >_<
                if (userCounter > 0 && locationCounter > 0) {
                    for (int i = 0; i < userCounter; i++) {
                        for (int j = 0; j < locationCounter; j++) {
                            output.collect(new Text(users[i]), new Text(locations[j]));
                        }
                    }
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
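Since the fixed-length arrays are the part that would break first on real data, here is a plain-Java sketch of the same per-key join using growable lists instead. It is an illustration under assumptions, not the code from this post: JoinReduceSketch and joinOneKey are hypothetical names, the input is the list of tagged values that would arrive for one area code, and each output row is rendered as "user<TAB>location".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java sketch of the reduce-side join for a single key (no Hadoop).
class JoinReduceSketch {
    // Takes the tagged values of one area code and returns "user<TAB>location" rows.
    static List<String> joinOneKey(List<String> taggedValues) {
        List<String> users = new ArrayList<>();
        List<String> locations = new ArrayList<>();
        for (String tagged : taggedValues) {
            // Split into the tag and the payload that follows it
            String[] parts = tagged.split(" ", 2);
            if (parts[0].equals("Location")) {
                locations.add(parts[1]);
            } else {
                users.add(parts[1]);
            }
        }
        // Pair every user with every location; if either side is empty,
        // the loops simply produce no rows
        List<String> rows = new ArrayList<>();
        for (String user : users) {
            for (String location : locations) {
                rows.add(user + "\t" + location);
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        List<String> valuesForCode1 =
                Arrays.asList("User ZhangSan", "Location Mars", "User LiSi");
        for (String row : joinOneKey(valuesForCode1)) {
            System.out.println(row);
        }
    }
}
```

The lists remove the limit of five entries per key, and the explicit "both counters greater than zero" check becomes unnecessary because an empty list yields no pairs.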

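In main, note that the combiner class can no longer be set to Reduce. In short, a combiner runs on partial map output and its output is fed to the reducer, so it has to emit the same kind of key/value pairs as the map does; this Reduce emits (user, location) pairs instead of (area code, tagged value) pairs. I'll go into the details once I've studied Hadoop's basic operations more carefully >_<.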

    public static void main(String[] args) throws Exception {
        BasicConfigurator.configure();
        JobClient client = new JobClient();
        JobConf job = new JobConf(FindLocation.class);
        job.setJobName("FindLocation");
        job.set("fs.default.name", "hdfs://192.168.245.128:9000");
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setMapperClass(Map.class);
        // Using Reduce as the combiner class here would give incorrect results
        //job.setCombinerClass(Reduce.class);
        job.setReducerClass(Reduce.class);
        job.setInputFormat(TextInputFormat.class);
        job.setOutputFormat(TextOutputFormat.class);
        FileInputFormat.setInputPaths(job, new Path("hdfs://192.168.245.128:9000//findLocation"));
        FileOutputFormat.setOutputPath(job, new Path("hdfs://192.168.245.128:9000//findLocationOutput"));
        client.setConf(job);
        JobClient.runJob(job);
    }

For the complete code, see my GitHub. -_-!!!
