Hadoop MapReduce: Partitioner Programming


I. Problem Description

Building on the Hadoop serialization example (http://blog.csdn.net/gaijianwei/article/details/46004025), partition the output data according to the mobile carrier that each phone number belongs to.
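In Hadoop MapReduce this kind of routing is done with a custom Partitioner: the integer returned by getPartition decides which reduce task, and therefore which part-r-xxxxx output file, a key/value pair goes to. Below is only a minimal sketch of the mechanism (the class name CarrierPartitioner and the Text value type are placeholders for illustration); the actual DataCount code follows in section II.

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Placeholder sketch: route records to reducers by the first three digits
// of the phone-number key.
public class CarrierPartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
        // Must return a value in the range [0, numPartitions).
        String prefix = key.toString().substring(0, 3);
        return "139".equals(prefix) ? 1 : 0;
    }
}

// Registered on the Job, together with a matching number of reduce tasks:
//   job.setPartitionerClass(CarrierPartitioner.class);
//   job.setNumReduceTasks(4);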

II. Implementation

The DataCount code (only slightly modified from the DataCount code in the Hadoop serialization example):

package edu.jianwei.hadoop.mr;

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DataCount {

    static class DCMapper extends Mapper<LongWritable, Text, Text, DataBean> {
        private Text k = new Text();
        private DataBean v = new DataBean();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Each input line is tab-separated; field 1 is the phone number,
            // fields 8 and 9 are the upload and download traffic.
            String line = value.toString();
            String[] words = line.split("\t");
            String telNum = words[1];
            double upLoad = Double.parseDouble(words[8]);
            double downLoad = Double.parseDouble(words[9]);
            k.set(telNum);
            v.Set(telNum, upLoad, downLoad);
            context.write(k, v);
        }
    }

    static class DCReduce extends Reducer<Text, DataBean, Text, DataBean> {
        private DataBean v = new DataBean();

        @Override
        protected void reduce(Text key, Iterable<DataBean> v2s, Context context)
                throws IOException, InterruptedException {
            // Sum the upload/download traffic of all records for one phone number.
            double upTotal = 0;
            double downTotal = 0;
            for (DataBean d : v2s) {
                upTotal += d.getUpLoad();
                downTotal += d.getDownload();
            }
            v.Set("", upTotal, downTotal);
            context.write(key, v);
        }
    }

    public static class DCPartitioner extends Partitioner<Text, DataBean> {
        // Map phone-number prefixes to carrier partitions; anything not
        // listed here falls into partition 0.
        static Map<String, Integer> provider = new HashMap<String, Integer>();
        static {
            provider.put("139", 1);
            provider.put("138", 1);
            provider.put("152", 2);
            provider.put("153", 2);
            provider.put("182", 3);
            provider.put("183", 3);
        }

        @Override
        public int getPartition(Text k, DataBean value, int numPartitions) {
            String tel_sub = k.toString().substring(0, 3);
            Integer counter = provider.get(tel_sub);
            if (counter == null) {
                counter = 0;
            }
            return counter;
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        job.setJarByClass(DataCount.class);

        job.setMapperClass(DCMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(DataBean.class);
        FileInputFormat.setInputPaths(job, new Path(args[0]));

        job.setReducerClass(DCReduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(DataBean.class);
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Plug in the custom partitioner; the number of reduce tasks
        // is taken from the command line.
        job.setPartitionerClass(DCPartitioner.class);
        job.setNumReduceTasks(Integer.parseInt(args[2]));

        job.waitForCompletion(true);
    }
}
The DataBean class is the same as the DataBean in the Hadoop serialization example.
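DataBean is not reproduced in this post. The following is only a minimal sketch of what it looks like, inferred from how DataCount uses it; the exact field names, serialization order, and toString layout are assumptions, not the original code.

package edu.jianwei.hadoop.mr;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

// Sketch of DataBean: a Writable carrying a phone number and its traffic.
public class DataBean implements Writable {
    private String telNum;
    private double upLoad;
    private double downLoad;
    private double total;

    // Matches the v.Set(telNum, upLoad, downLoad) calls in DataCount.
    public void Set(String telNum, double upLoad, double downLoad) {
        this.telNum = telNum;
        this.upLoad = upLoad;
        this.downLoad = downLoad;
        this.total = upLoad + downLoad;
    }

    public double getUpLoad() { return upLoad; }
    public double getDownload() { return downLoad; }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(telNum);
        out.writeDouble(upLoad);
        out.writeDouble(downLoad);
        out.writeDouble(total);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        telNum = in.readUTF();
        upLoad = in.readDouble();
        downLoad = in.readDouble();
        total = in.readDouble();
    }

    @Override
    public String toString() {
        // Produces the "upload  download  total" columns seen in the output.
        return upLoad + "\t" + downLoad + "\t" + total;
    }
}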

III. Testing the Code

1. Running the code (with 4 reduce tasks)

          hadoop jar /root/dc.jar edu.jianwei.hadoop.mr.DataCount  /dc   /dc/res   4

2. Results

With 4 reduce tasks the job writes four output files, part-r-00000 through part-r-00003, one per partition (the screenshot of the output directory is not reproduced here). Instead of listing every file, here is the data in part-r-00001, i.e. the phone numbers whose 138/139 prefixes map to partition 1:

        13826544101     264.0   0.0     264.0
        13922314466     3008.0  3720.0  6728.0
        13925057413     11058.0 48243.0 59301.0
        13926251106     240.0   0.0     240.0
        13926435656     132.0   1512.0  1644.0

Note:

1'. Running the code (with 3 reduce tasks)

          hadoop jar /root/dc.jar edu.jianwei.hadoop.mr.DataCount  /dc/HTTP_20130313143750.dat  /dc/res_3  3

2'. Results

(The original result screenshot is not available. Note that DCPartitioner returns partition 3 for the 182/183 prefixes, but with only 3 reduce tasks the valid partitions are 0 to 2, so the map tasks can be expected to fail with an "Illegal partition" error.)

1''. Running the code (with 5 reduce tasks)

           hadoop jar /root/dc.jar edu.jianwei.hadoop.mr.DataCount  /dc/HTTP_20130313143750.dat  /dc/res_5  5

2''. Results

(The original result screenshot is not available. With 5 reduce tasks, five output files part-r-00000 through part-r-00004 are produced; since getPartition never returns 4, part-r-00004 should be empty. A defensive variant of getPartition is sketched after this note.)
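As a side note, if you want the job to run no matter how many reduce tasks are configured, the carrier index can be folded into the valid range. This is only a sketch of an alternative, not part of the original post; it would replace the getPartition method inside DCPartitioner above.

@Override
public int getPartition(Text k, DataBean value, int numPartitions) {
    String prefix = k.toString().substring(0, 3);
    Integer p = provider.get(prefix);      // carrier index 1-3, or null
    int partition = (p == null) ? 0 : p;
    // Fold into [0, numPartitions) so an out-of-range index can never
    // trigger an "Illegal partition" error, at the cost of merging
    // carriers when there are fewer reduce tasks than carriers + 1.
    return partition % numPartitions;
}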
