MapReduce Programming (4): Computing Averages


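I. Problem Description

Three files store the students' Chinese, math, and English scores respectively. The task is to output each student's average score.

The data format is as follows: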
Chinese.txt

张三    78
李四    89
王五    96
赵六    67

Math.txt

张三    88
李四    99
王五    66
赵六    77

English.txt

张三    80
李四    82
王五    84
赵六    86

II. MapReduce Implementation

package com.javacore.hadoop;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;
import java.util.StringTokenizer;

/**
 * Created by bee on 3/29/17.
 */
public class StudentAvgDouble {

    public static class MyMapper extends Mapper<Object, Text, Text, DoubleWritable> {
        @Override
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            String eachline = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(eachline, "\n");
            while (tokenizer.hasMoreElements()) {
                // Split one "name score" record on whitespace
                StringTokenizer tokenizerLine = new StringTokenizer(tokenizer.nextToken());
                String strName = tokenizerLine.nextToken();
                String strScore = tokenizerLine.nextToken();
                Text name = new Text(strName);
                // The value must be a DoubleWritable to match the mapper's declared output type
                DoubleWritable score = new DoubleWritable(Double.parseDouble(strScore));
                context.write(name, score);
            }
        }
    }

    public static class MyReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
        @Override
        public void reduce(Text key, Iterable<DoubleWritable> values, Context context)
                throws IOException, InterruptedException {
            // Sum all scores for one student and count them
            double sum = 0.0;
            int count = 0;
            for (DoubleWritable val : values) {
                sum += val.get();
                count++;
            }
            DoubleWritable avgScore = new DoubleWritable(sum / count);
            context.write(key, avgScore);
        }
    }

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        String[] otherArgs = new String[]{"input/studentAvg", "output"};
        if (otherArgs.length != 2) {
            System.out.println("Invalid arguments");
            System.exit(2);
        }

        // Delete the output folder so the job can be rerun. The original post called a
        // FileUtil.deleteDir("output") helper that is not shown; FileSystem.delete is used here instead.
        FileSystem.get(conf).delete(new Path(otherArgs[1]), true);

        Job job = Job.getInstance(conf);
        job.setJarByClass(StudentAvgDouble.class);
        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(DoubleWritable.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
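
To check the expected averages without running a cluster, the same map, group, and reduce steps can be mimicked in plain Java. This is only a sketch: the class name LocalAverageSketch and its methods are hypothetical and not part of Hadoop, and the TreeMap merely stands in for the shuffle phase that groups values by key.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

class LocalAverageSketch {

    // "Map" step: parse one "name score" line and append the score to that
    // name's group. Grouping by name stands in for the shuffle phase.
    static void map(String line, Map<String, List<Double>> groups) {
        StringTokenizer tokens = new StringTokenizer(line);
        if (tokens.countTokens() < 2) {
            return; // skip blank or malformed lines
        }
        String name = tokens.nextToken();
        double score = Double.parseDouble(tokens.nextToken());
        groups.computeIfAbsent(name, k -> new ArrayList<>()).add(score);
    }

    // "Reduce" step: average all scores collected for one name.
    static double reduce(List<Double> scores) {
        double sum = 0.0;
        for (double s : scores) {
            sum += s;
        }
        return sum / scores.size();
    }

    // Runs the map step over every line, then the reduce step over every group.
    static Map<String, Double> averages(List<String> lines) {
        Map<String, List<Double>> groups = new TreeMap<>();
        for (String line : lines) {
            map(line, groups);
        }
        Map<String, Double> result = new TreeMap<>();
        for (Map.Entry<String, List<Double>> e : groups.entrySet()) {
            result.put(e.getKey(), reduce(e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // Lines from the three input files, concatenated
        List<String> lines = Arrays.asList(
                "张三    78", "李四    89", "王五    96", "赵六    67",
                "张三    88", "李四    99", "王五    66", "赵六    77",
                "张三    80", "李四    82", "王五    84", "赵六    86");
        averages(lines).forEach((name, avg) -> System.out.println(name + "\t" + avg));
    }
}
```

Because each student has exactly three scores, every reduce call here divides by 3; in the real job the count comes from however many values the framework delivers for that key.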

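III. StringTokenizer vs. split()

The map function reads the input line by line and splits each line on whitespace. I previously did the splitting with split(), as shown below.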

String[] lines = value.toString().split("\n");
for (String eachline : lines) {
    System.out.println("eachline:\t" + eachline);
    // Split one "name score" record on runs of whitespace
    String[] words = eachline.split("\\s+");
    Text name = new Text(words[0]);
    DoubleWritable score = new DoubleWritable(Double.parseDouble(words[1]));
    context.write(name, score);
}

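This approach is simple and clear, but it has a weakness: the split sometimes fails on whitespace with an unusual encoding. For example, by default \s in a Java regular expression matches only ASCII whitespace, so a full-width space (U+3000) or a no-break space (U+00A0) is not treated as a separator.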
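
The failure mode can be reproduced in a few lines. The sketch below is illustrative only: the class and method names are hypothetical, and the widened pattern is one possible workaround rather than the approach used in this post.

```java
import java.util.Arrays;

class WhitespaceSplitSketch {

    // Splits on ASCII whitespace only, like split("\\s+") in a map function.
    static String[] splitAscii(String line) {
        return line.split("\\s+");
    }

    // Also treats the no-break space (U+00A0) and the ideographic
    // full-width space (U+3000) as separators.
    static String[] splitWide(String line) {
        return line.split("[\\s\\u00A0\\u3000]+");
    }

    public static void main(String[] args) {
        // A record whose separator is a full-width space instead of a tab
        String line = "Zhang" + "\u3000" + "78";
        System.out.println(Arrays.toString(splitAscii(line)));
        System.out.println(Arrays.toString(splitWide(line)));
    }
}
```

With the ASCII-only pattern the record stays in one piece, so reading words[1] would throw an ArrayIndexOutOfBoundsException in the map function.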
StringTokenizer is a string-splitting class in the java.util package. It has three constructors:

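  1. StringTokenizer(String str): uses Java's default delimiters, which are the space, tab ('\t'), newline ('\n'), carriage return ('\r'), and form feed ('\f') characters.
  2. StringTokenizer(String str, String delim): constructs a StringTokenizer that parses str using the specified delimiter characters.
  3. StringTokenizer(String str, String delim, boolean returnDelims): constructs a StringTokenizer that parses str using the specified delimiter characters and also specifies whether the delimiters themselves are returned as tokens.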
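
The three constructors listed above can be compared side by side. In this sketch the class TokenizerConstructorsSketch and its helper methods are hypothetical names introduced only for illustration; the StringTokenizer calls themselves are the standard java.util API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class TokenizerConstructorsSketch {

    // Collects every remaining token of a tokenizer into a list.
    static List<String> drain(StringTokenizer st) {
        List<String> out = new ArrayList<>();
        while (st.hasMoreTokens()) {
            out.add(st.nextToken());
        }
        return out;
    }

    // Constructor 1: default delimiters (space, tab, newline, carriage return, form feed)
    static List<String> defaultDelims(String s) {
        return drain(new StringTokenizer(s));
    }

    // Constructor 2: every character of delim is a delimiter
    static List<String> customDelims(String s, String delim) {
        return drain(new StringTokenizer(s, delim));
    }

    // Constructor 3: as above, but delimiters are returned as one-character tokens
    static List<String> withDelims(String s, String delim) {
        return drain(new StringTokenizer(s, delim, true));
    }

    public static void main(String[] args) {
        System.out.println(defaultDelims("a \t b\nc"));   // [a, b, c]
        System.out.println(customDelims("a,b;c", ",;"));  // [a, b, c]
        System.out.println(withDelims("a,b", ","));       // [a, ,, b]
    }
}
```

Note that the delim argument is a set of single characters, not a multi-character separator string.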

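Both StringTokenizer and split() can split strings. StringTokenizer tends to perform somewhat better, and it is easier to use when the delimiter contains special characters: split() interprets its argument as a regular expression, so characters such as "." or "|" must be escaped, whereas StringTokenizer treats each delimiter character literally.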
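
A small comparison makes the difference with special characters concrete. The sketch below uses hypothetical helper names (DelimiterComparisonSketch, tokenize, splitRegex) and also shows one behavioral difference worth knowing when parsing data: StringTokenizer skips empty fields, while split() keeps them unless they are trailing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;

class DelimiterComparisonSketch {

    // Splits with StringTokenizer; delimiter characters are taken literally.
    static List<String> tokenize(String s, String delims) {
        List<String> out = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(s, delims);
        while (st.hasMoreTokens()) {
            out.add(st.nextToken());
        }
        return out;
    }

    // Splits with String.split; the argument is a regular expression.
    static List<String> splitRegex(String s, String regex) {
        return Arrays.asList(s.split(regex));
    }

    public static void main(String[] args) {
        // "." as a regex matches any character, so nothing useful is left
        System.out.println(splitRegex("1.2.3", "."));    // []
        // Escaping the dot gives the intended result
        System.out.println(splitRegex("1.2.3", "\\."));  // [1, 2, 3]
        // StringTokenizer needs no escaping
        System.out.println(tokenize("1.2.3", "."));      // [1, 2, 3]
        // Empty fields: kept by split(), skipped by StringTokenizer
        System.out.println(splitRegex("a,,b", ","));     // [a, , b]
        System.out.println(tokenize("a,,b", ","));       // [a, b]
    }
}
```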

IV. Results

张三  82.0
李四  90.0
王五  82.0
赵六  76.66666666666667