The Hadoop Partitioner Component


1. The Partitioner component lets the map stage partition its output by key, so that records with different keys are routed to different reducers for processing.
2. You can define your own distribution rule for keys. For example, if the data files contain records for different provinces and the requirement is one output file per province, a custom Partitioner can send each province to its own reducer (a minimal sketch of such a partitioner follows the HashPartitioner listing below).
3. Hadoop ships with HashPartitioner as the default implementation,
defined in org.apache.hadoop.mapreduce.lib.partition.HashPartitioner.java:

package org.apache.hadoop.mapreduce.lib.partition;

import org.apache.hadoop.mapreduce.Partitioner;

/** Partition keys by their {@link Object#hashCode()}. */
public class HashPartitioner<K, V> extends Partitioner<K, V> {

  /** Use {@link Object#hashCode()} to partition. */
  public int getPartition(K key, V value,
                          int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }

}
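As an illustration of point 2, here is a minimal sketch of a province-based Partitioner. The class name ProvincePartitioner and the province strings are hypothetical placeholders, not part of the original requirement; the structure is the same as the MyPartitioner class shown later.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical sketch: send each province's records to a dedicated reducer,
// so every province ends up in its own output file.
public class ProvincePartitioner extends Partitioner<Text, IntWritable> {

    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        String province = key.toString();
        if (province.equals("beijing")) {
            return 0;
        }
        if (province.equals("shanghai")) {
            return 1;
        }
        // Any other province goes to the last reducer.
        return 2;
    }
}

The driver would then call job.setPartitionerClass(ProvincePartitioner.class) and job.setNumReduceTasks(3) so that every value returned above is a valid partition number.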

4. Custom Partitioner
1) Extend the abstract class Partitioner and implement your own getPartition() method.
2) Register the custom class on the job with job.setPartitionerClass().
The abstract class is defined in org.apache.hadoop.mapreduce.Partitioner.java:

package org.apache.hadoop.mapreduce;

/**
 * Partitions the key space.
 *
 * <p><code>Partitioner</code> controls the partitioning of the keys of the
 * intermediate map-outputs. The key (or a subset of the key) is used to derive
 * the partition, typically by a hash function. The total number of partitions
 * is the same as the number of reduce tasks for the job. Hence this controls
 * which of the <code>m</code> reduce tasks the intermediate key (and hence the
 * record) is sent for reduction.</p>
 *
 * @see Reducer
 */
public abstract class Partitioner<KEY, VALUE> {

  /**
   * Get the partition number for a given key (hence record) given the total
   * number of partitions i.e. number of reduce-tasks for the job.
   *
   * <p>Typically a hash function on all or a subset of the key.</p>
   *
   * @param key the key to be partitioned.
   * @param value the entry value.
   * @param numPartitions the total number of partitions.
   * @return the partition number for the <code>key</code>.
   */
  public abstract int getPartition(KEY key, VALUE value, int numPartitions);

}
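Note the contract this interface implies: getPartition() must return a value in the range [0, numPartitions), and numPartitions equals the number of reduce tasks configured on the job via job.setNumReduceTasks(). Each partition corresponds to exactly one reduce task and therefore to one output file (part-r-00000, part-r-00001, and so on).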

Partitioner Example
Application scenario for Partitioner:
Requirement: compute the weekly sales total of each product separately.
Weekly sales list from site1:
shoes 20
hat 10
stockings 30
clothes 40

Weekly sales list from site2:
shoes 15
hat 1
stockings 90
clothes 80

Aggregated result:
shoes 35
hat 11
stockings 120
clothes 120

The code below implements this. A custom Partitioner together with four reduce tasks sends each product to its own reducer, so each product's total also ends up in its own output file:
MyMapper.java

package com.partitioner;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Each input line looks like "shoes 20": product name, then quantity.
        String[] s = value.toString().split("\\s+");
        context.write(new Text(s[0]), new IntWritable(Integer.parseInt(s[1])));
    }
}

MyPartitioner.java

package com.partitioner;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class MyPartitioner extends Partitioner<Text, IntWritable> {

    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // Route each product to a fixed reducer.
        if (key.toString().equals("shoes")) {
            return 0;
        }
        if (key.toString().equals("hat")) {
            return 1;
        }
        if (key.toString().equals("stockings")) {
            return 2;
        }
        // Every other product (here: clothes) goes to the last reducer.
        return 3;
    }
}
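The mapping above is hard-coded: shoes, hat and stockings go to partitions 0, 1 and 2, and every other key (clothes in this data set) falls through to partition 3. Since the largest partition number returned is 3, the driver must configure at least four reduce tasks, as TestPartitioner.java does with job.setNumReduceTasks(4); with fewer reducers the map tasks would fail with an illegal-partition error.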

MyReducer.java

package com.partitioner;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> value, Context context)
            throws IOException, InterruptedException {
        // Sum the quantities reported by all sites for this product.
        int sum = 0;
        for (IntWritable val : value) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
    }
}

TestPartitioner.java

package com.partitioner;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class TestPartitioner {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: TestPartitioner <in> <out>");
            System.exit(2);
        }
        Job job = new Job(conf, "product weekly sales");
        job.setJarByClass(TestPartitioner.class);
        job.setMapperClass(MyMapper.class);
//      job.setCombinerClass(MyCombiner.class);
        job.setReducerClass(MyReducer.class);
        // Wire in the custom Partitioner; four reduce tasks match the
        // partition numbers 0-3 returned by MyPartitioner.getPartition().
        job.setPartitionerClass(MyPartitioner.class);
        job.setNumReduceTasks(4);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
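One possible way to run the job, assuming the two sales lists are saved as site1.txt and site2.txt and the classes are packaged into a jar; the file names, HDFS paths and jar name here are illustrative, not from the original article:

hadoop fs -mkdir -p /input/sales
hadoop fs -put site1.txt site2.txt /input/sales
hadoop jar partitioner-example.jar com.partitioner.TestPartitioner /input/sales /output/sales

Because the job runs with four reduce tasks, the output directory contains one part file per product:

part-r-00000: shoes 35
part-r-00001: hat 11
part-r-00002: stockings 120
part-r-00003: clothes 120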