Big Data - Hadoop Learning Notes 09


30. MapReduce

    A MapReduce job runs in two processing stages: a map stage and a reduce stage. Each stage takes key-value pairs as input and output, and the developer chooses their types. The input to the map stage is the raw NCDC data; we pick text format as the input format, so each line of the dataset becomes one input record.

1. Writing the MR program

【Create the Mapper】
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MyMaxTempMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    //sentinel for a missing temperature reading
    private static final int MISSING = 9999;

    /**
     * mapper
     */
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        //value as a String
        String line = value.toString();
        //extract the year
        String year = line.substring(15, 19);
        //extract the air temperature
        int airTemperature;
        if (line.charAt(87) == '+') {
            airTemperature = Integer.parseInt(line.substring(88, 92));
        } else {
            airTemperature = Integer.parseInt(line.substring(87, 92));
        }
        //quality code
        String quality = line.substring(92, 93);
        //emit only valid temperature readings
        if (airTemperature != MISSING && quality.matches("[01459]")) {
            context.write(new Text(year), new IntWritable(airTemperature));
        }
    }
}
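    The substring offsets above follow the fixed-width NCDC record layout: the year sits at character positions 15-19, the signed temperature at 87-92, and the quality code at 92. Below is a minimal sketch for sanity-checking that parsing logic outside Hadoop; the synthetic record is entirely made up for illustration:

public class ParseSketch {
    public static void main(String[] args) {
        //hypothetical fixed-width record: only the year, temperature
        //and quality-code columns matter for the mapper
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 100; i++) sb.append('0');
        sb.replace(15, 19, "1950");   //year
        sb.replace(87, 92, "+0022");  //temperature in tenths of a degree
        sb.replace(92, 93, "1");      //quality code
        String line = sb.toString();

        String year = line.substring(15, 19);
        int airTemperature = (line.charAt(87) == '+')
                ? Integer.parseInt(line.substring(88, 92))
                : Integer.parseInt(line.substring(87, 92));
        String quality = line.substring(92, 93);
        //prints: 1950 -> 22 (quality 1)
        System.out.println(year + " -> " + airTemperature + " (quality " + quality + ")");
    }
}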

【Create the Reducer】

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MyMaxTempReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        //start from the smallest possible value
        int maxValue = Integer.MIN_VALUE;
        //find the highest temperature for the year
        for (IntWritable value : values) {
            maxValue = Math.max(maxValue, value.get());
        }
        //write the result
        context.write(key, new IntWritable(maxValue));
    }
}
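    Because the reduce function only takes a maximum, which is commutative and associative, the same class can double as a combiner that pre-aggregates map output before the shuffle. An optional one-line addition to the driver below (not in the original notes):

//optional: also run MyMaxTempReducer as a map-side combiner;
//valid because max() is commutative and associative
job.setCombinerClass(MyMaxTempReducer.class);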

【Create the App to run the job】

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MyMaxTempApp {
    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.out.println("Usage: MyMaxTempApp <input path> <output path>");
            System.exit(1);
        }
        Job job = Job.getInstance();
        job.setJarByClass(MyMaxTempApp.class);
        //set the job name
        job.setJobName("Max temp");
        //input path
        FileInputFormat.addInputPath(job, new Path(args[0]));
        //output path
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        //set the mapper class
        job.setMapperClass(MyMaxTempMapper.class);
        //set the reducer class
        job.setReducerClass(MyMaxTempReducer.class);
        //set the output key type
        job.setOutputKeyClass(Text.class);
        //set the output value type
        job.setOutputValueClass(IntWritable.class);
        //run the job; print 0 on success, 1 on failure
        System.out.println(job.waitForCompletion(true) ? 0 : 1);
    }
}
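    For debugging, the driver can be forced to run in-process through LocalJobRunner, which is the execution path analyzed in section 31 below. A minimal sketch, replacing the Job.getInstance() call above (assumes Hadoop 2.x and needs import org.apache.hadoop.conf.Configuration;):

//"local" runs all tasks in a single JVM via LocalJobRunner;
//"yarn" would submit the job to the cluster instead
Configuration conf = new Configuration();
conf.set("mapreduce.framework.name", "local");
Job job = Job.getInstance(conf);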

31. Job Submission Flow Analysis

【Programming model】
    map (mapping) + reduce (reduction)

【Call flow】
    1. job.waitForCompletion()
    2. submit() submits the job to the cluster and waits for it to complete
       a) ensureState(JobState.DEFINE) verifies the job state
       b) setUseNewAPI() switches to the new-style API
       c) connect() creates the Cluster object
       d) creates the JobSubmitter
    3. submitter.submitJobInternal(Job.this, cluster)
       a) checkSpecs(job) checks the output directory; throws if it already exists
       b) JobSubmissionFiles.getStagingDir() creates the staging directory on HDFS
       c) InetAddress.getLocalHost() gets the local IP
       d) submitClient.getNewJobID() creates the job ID
       e) copyAndConfigureFiles() applies the command-line arguments to the conf
       f) writeSplits(job, submitJobDir) generates the split files under the staging directory
       g) conf.setInt(MRJobConfig.NUM_MAPS, maps) sets the number of map tasks
       h) writeConf(conf, submitJobFile) writes job.xml to the submit directory
       i) submitClient.submitJob() submits the job through the runner
    4. submitClient.submitJob() submits the job through the runner
       a) Job job = new Job() creates a LocalJobRunner.Job inner-class object
    5. Job job = new Job()
       a) creates a JobConf from job.xml in the staging directory
       b) this.start() starts the thread, i.e. invokes run()
    6. this.start()
       a) TaskSplitMetaInfo[] fetches the task split info
       b) getMapTaskRunnables() gets the Runnables for the mappers
       c) runTasks(mapRunnables, mapService, "map")
       d) getReduceTaskRunnables() gets the Runnables for the reducers
       e) runTasks(reduceRunnables, reduceService, "reduce")
    7. runTasks()
       for (Runnable r : runnables) {
           service.submit(r);
       }
    8. LocalJobRunner$Job$MapTaskRunnable
       a) creates the map task attempt ID
       b) creates the MapTask
       c) creates the MapOutputFile
       d) map.setXXX()
       e) map.run()
    9. org.apache.hadoop.mapred.MapTask$run()
       a) runNewMapper()
    10. runNewMapper
       a) creates the taskContext
       b) taskContext.getMapperClass() obtains the Mapper object via reflection
       c) creates the InputFormat
       d) creates the split
       e) creates the NewOutputCollector, i.e. the context object
    11. mapper.run(mapperContext)
    12. MyMaxTempMapper$run() (expanded in the sketch below)
       setup(context);
       try {
           while (context.nextKeyValue()) {
               map(...)
           }
       } finally {
           cleanup()
       }

【Submitting a job to the cluster】
    After compiling and packaging, submit the job on the cluster with:
    hadoop jar jarFile classname arg1 arg2 ..
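    The loop in step 12 is the template method Mapper.run() from the new API, which MyMaxTempMapper inherits unchanged. Filling in the arguments elided above, its shape is roughly the following (paraphrased from the Hadoop 2.x source; check your version for the exact code):

public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    try {
        //pull each key-value pair from the split and hand it to map()
        while (context.nextKeyValue()) {
            map(context.getCurrentKey(), context.getCurrentValue(), context);
        }
    } finally {
        cleanup(context);
    }
}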