Hadoop Components: Partitioner



Shuffle is a stage of the MapReduce processing flow. Its individual steps run scattered across the map task and reduce task nodes, but viewed as a whole it consists of three operations (the sketch after this list shows where each hook attaches to a Job):
1. Partition: split the map output into partitions
2. Sort: order the records by key
3. Combine: locally merge values on the map side
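
To make the three operations concrete, here is a minimal sketch of how they attach to a Job in a driver. WcMapper, WcCombiner and WcReducer are hypothetical placeholder classes, not part of this article:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

public class ShuffleHooksSketch {

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "shuffle-hooks");
        job.setMapperClass(WcMapper.class);              // hypothetical Mapper
        // 1. Partition: decides which reduce task receives each map output key
        job.setPartitionerClass(HashPartitioner.class);
        // 2. Sort: keys are ordered by their WritableComparable compareTo() (or a registered
        //    RawComparator); nothing extra has to be configured for the default key ordering
        // 3. Combine: optional map-side local merge of values before the shuffle
        job.setCombinerClass(WcCombiner.class);          // hypothetical Reducer used as combiner
        job.setReducerClass(WcReducer.class);            // hypothetical Reducer
    }
}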


The Partitioner groups the keys emitted by the map; different groups can be routed to different reduce tasks.

The partition function is provided by a concrete subclass of Partitioner.
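
For reference, the contract such a subclass fills in is a single abstract method; simplified from org.apache.hadoop.mapreduce.Partitioner (annotations omitted):

package org.apache.hadoop.mapreduce;

public abstract class Partitioner<KEY, VALUE> {

  /**
   * Return the partition number, in the range [0, numPartitions), for the given
   * map output record. Records with the same partition number go to the same reduce task.
   */
  public abstract int getPartition(KEY key, VALUE value, int numPartitions);
}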

Every piece of code you write deepens the understanding; the program below records my own.

1. The Partitioner component lets the map stage partition its keys, so records with different keys can be dispatched to different reducers.
2. A default implementation, HashPartitioner, is provided
in org.apache.hadoop.mapreduce.lib.partition.HashPartitioner.java:

/** Partition keys by their {@link Object#hashCode()}. */
@InterfaceAudience.Public
@InterfaceStability.Stable
public class HashPartitioner<K, V> extends Partitioner<K, V> {

  /** Use {@link Object#hashCode()} to partition. */
  public int getPartition(K key, V value,
                          int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }

}
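
Since getPartition() is an ordinary method, the default behavior can also be checked outside of a job. A small illustrative harness (the keys below are made-up examples):

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

// Calls the default partitioner directly: the same key always maps to the same partition,
// and every result lies in the range [0, numReduceTasks).
public class HashPartitionerDemo {

    public static void main(String[] args) {
        HashPartitioner<Text, NullWritable> partitioner = new HashPartitioner<Text, NullWritable>();
        int numReduceTasks = 2;
        for (String key : new String[] { "13726230503", "13826544101", "some-hostname" }) {
            int partition = partitioner.getPartition(new Text(key), NullWritable.get(), numReduceTasks);
            System.out.println(key + " -> partition " + partition);
        }
    }
}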

A custom Partitioner can be defined here:

1) Extend the abstract class Partitioner and implement your own getPartition() method.
2) Register it on the job with job.setPartitionerClass().


FlowBean.class

package com.bigdata.flowcount.hashpartioner;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

public class FlowBean implements WritableComparable<FlowBean> {

    private long upFlow;
    private long downFlow;
    private long sumFlow;

    public FlowBean() {
    }

    public FlowBean(long upFlow, long downFlow, long sumFlow) {
        this.upFlow = upFlow;
        this.downFlow = downFlow;
        this.sumFlow = sumFlow;
    }

    public long getUpFlow() {
        return upFlow;
    }

    public void setUpFlow(long upFlow) {
        this.upFlow = upFlow;
    }

    public long getDownFlow() {
        return downFlow;
    }

    public void setDownFlow(long downFlow) {
        this.downFlow = downFlow;
    }

    public long getSumFlow() {
        return sumFlow;
    }

    public void setSumFlow(long sumFlow) {
        this.sumFlow = sumFlow;
    }

    /**
     * Serialization method
     */
    public void write(DataOutput out) throws IOException {
        out.writeLong(upFlow);
        out.writeLong(downFlow);
        out.writeLong(sumFlow);
    }

    /**
     * Deserialization method.
     * Note: the fields must be read in exactly the same order they were written.
     */
    public void readFields(DataInput in) throws IOException {
        this.upFlow = in.readLong();
        this.downFlow = in.readLong();
        this.sumFlow = in.readLong();
    }

    public int compareTo(FlowBean o) {
        // Comparison is not used in this example
        return 0;
    }

    @Override
    public String toString() {
        return upFlow + "\t" + downFlow + "\t" + sumFlow;
    }
}

FlowCountMapper.class

package com.bigdata.flowcount.hashpartioner;

import java.io.IOException;

import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class FlowCountMapper extends Mapper<LongWritable, Text, Text, FlowBean> {

    /**
     * Read a line and split its fields; extract the phone number, up flow and down flow,
     * then emit context.write(phoneNumber, bean).
     */
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        String[] words = StringUtils.split(line, '\t');
        String phNum = words[1];
        if ("200".equals(words[words.length - 1])) {
            String downFlow = words[words.length - 2];
            String upFlow = words[words.length - 3];
            context.write(new Text(phNum),
                    new FlowBean(Long.valueOf(upFlow), Long.valueOf(downFlow), 0L));
        }
    }
}

FlowCountReducer.class

package com.bigdata.flowcount.hashpartioner;

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class FlowCountReducer extends Reducer<Text, FlowBean, Text, FlowBean> {

    @Override
    protected void reduce(Text key, Iterable<FlowBean> values, Context context)
            throws IOException, InterruptedException {
        FlowBean flowBean = new FlowBean();
        for (FlowBean value : values) {
            flowBean.setUpFlow(value.getUpFlow() + flowBean.getUpFlow());
            flowBean.setDownFlow(value.getDownFlow() + flowBean.getDownFlow());
        }
        flowBean.setSumFlow(flowBean.getUpFlow() + flowBean.getDownFlow());
        context.write(key, flowBean);
    }
}

FlowPartionner.class

package com.bigdata.flowcount.hashpartioner;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

/**
 * The type parameters correspond to the map output key/value types.
 *
 * @author hadoop
 */
public class FlowPartionner extends Partitioner<Text, FlowBean> {

    @Override
    public int getPartition(Text key, FlowBean value, int numPartitions) {
        // Keys that look like phone numbers (11 characters) go to partition 0,
        // everything else goes to partition 1
        if (key.toString().length() == 11) {
            return 0;
        }
        return 1;
    }
}

FlowCountRunner.class

package com.bigdata.flowcount.hashpartioner;

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class FlowCountRunner {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("mapreduce.framework.name", "yarn");
        conf.set("yarn.resourcemanager.hostname", "zh");

        Job job = Job.getInstance(conf, "flowCount");
        job.setJar("/home/hadoop/workspace/fcp.jar");
        job.setJarByClass(FlowCountRunner.class);

        job.setMapperClass(FlowCountMapper.class);
        job.setReducerClass(FlowCountReducer.class);

        // Two reduce tasks, one per partition produced by FlowPartionner
        job.setNumReduceTasks(2);
        job.setPartitionerClass(FlowPartionner.class);

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(FlowBean.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(FlowBean.class);

        FileInputFormat.setInputPaths(job, new Path("hdfs://zh:9000/flow/input/"));

        // Delete the output directory if it already exists
        FileSystem fs = FileSystem.get(new URI("hdfs://zh:9000/"), conf);
        Path path = new Path("hdfs://zh:9000/flow/output/");
        if (fs.exists(path)) {
            fs.delete(path, true);
        }
        fs.close();
        FileOutputFormat.setOutputPath(job, path);

        // Submit the parameters configured on the job, together with the jar containing
        // the job's classes, to YARN to run
        int res = job.waitForCompletion(true) ? 0 : 1;
        System.out.println(res);
        System.exit(res);
    }
}

The records whose key is not a phone number are separated out into a different part file. Test data:


After running the program:

The files part-r-00000 and part-r-00001 are produced, and their contents show that phone-number keys and non-phone-number keys have been separated.

 

Contents of part-r-00000:

Contents of part-r-00001:


Note:

If the number of reduce tasks is >= the number of partitions getPartition() can return, the surplus reducers simply produce empty output files (part-r-000xx).

If the number of reduce tasks is < the number of partitions, some of the partitioned data has no reducer to go to and the job fails with an exception.

If the number of reduce tasks is 1, then no matter how many partitions the map side produces, everything is handed to that single reduce task and only one result file, part-r-00000, is generated.
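
Putting the three notes together for this example, where FlowPartionner only ever returns 0 or 1, the driver's setNumReduceTasks() call plays out roughly like this (a sketch, assuming `job` is the Job built in FlowCountRunner):

// Case 1: equal to the partition count -> part-r-00000 and part-r-00001, as shown above
job.setNumReduceTasks(2);
// Case 2: more reducers than partitions -> part-r-00002 is also created, but stays empty
job.setNumReduceTasks(3);
// Case 3: a single reducer -> the custom partitioner is effectively bypassed and
// everything ends up in one part-r-00000
job.setNumReduceTasks(1);
// If getPartition() returned a value outside [0, numReduceTasks), e.g. 2 while only two
// reduce tasks are configured, the map tasks would fail with an illegal-partition error,
// which is the exception case described above.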