MapReduce Study Notes: Secondary Sort (Custom Data Type, Custom Partitioning and Grouping)
Secondary sort means that records are first ordered by one field, and when that field is equal they are further ordered by a second field.
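The idea can be illustrated outside MapReduce with plain Java: compare on the primary field, and fall back to the secondary field on ties. This is a minimal sketch with a hypothetical demo class (`SecondarySortDemo`) and made-up records; it is not part of the job itself.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Minimal secondary-sort illustration: primary order on field 0 (year),
// ties broken by field 1 (date).
public class SecondarySortDemo {

    public static List<String[]> sortRecords(List<String[]> records) {
        records.sort(Comparator
                .comparing((String[] r) -> r[0])   // primary: year
                .thenComparing(r -> r[1]));        // secondary: date
        return records;
    }

    public static void main(String[] args) {
        List<String[]> records = new ArrayList<>();
        records.add(new String[]{"2013", "10-03"});
        records.add(new String[]{"2011", "12-12"});
        records.add(new String[]{"2013", "01-02"});
        records.add(new String[]{"2011", "03-05"});
        for (String[] r : sortRecords(records)) {
            System.out.println(r[0] + " " + r[1]);
        }
        // prints 2011 03-05, 2011 12-12, 2013 01-02, 2013 10-03
    }
}
```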
Suppose we have the following file:

2013 23.6 01-02
2011 18.5 12-12
2016 24.2 08-22
2011 19.4 03-05
2014 20.4 11-02
2013 18.6 10-03
2012 22.3 09-18
2015 17.4 05-30

The file records a company's product prices on various dates: the first column is the year, the second the price, and the third the date (month-day).
Requirements:
1. Output the price for each day, sorted by date.
2. Place records from the same year into the same partition.
Solution approach:
1. Define a custom key type that implements the WritableComparable interface.
2. Define a custom partitioner class that extends Partitioner and register it on the Job with setPartitionerClass; it decides, based on the first field, which reducer each key is sent to.
3. Define a custom grouping class that extends WritableComparator and register it with setGroupingComparatorClass. In the reduce phase, when the value iterator for a key is built, all keys with the same first field are treated as one group, so their values end up in a single iterator.
1. Create a SecondarySort class in the wordCount project
As before, it extends Configured and implements Tool. Also add the custom key type, the SecondarySortWritable class:
package com.demo.hadoop;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

public class SecondarySort extends Configured implements Tool {

    private static Configuration configuration;

    static {
        // configure hadoop
        configuration = new Configuration();
        configuration.set("fs.defaultFS", "hdfs://yangyi:8020");
    }

    @Override
    public int run(String[] args) throws Exception {
        return 0;
    }

    public static class SecondarySortWritable implements WritableComparable<SecondarySortWritable> {

        // year
        private int year;
        // price
        private double price;
        // date
        private String date;

        public SecondarySortWritable() {
        }

        public SecondarySortWritable(int year, double price, String date) {
            this.year = year;
            this.price = price;
            this.date = date;
        }

        public int getYear() {
            return year;
        }

        public double getPrice() {
            return price;
        }

        public String getDate() {
            return date;
        }

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeInt(year);
            out.writeDouble(price);
            out.writeUTF(date);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            // fields must be read back in exactly the order write() produced them
            this.year = in.readInt();
            this.price = in.readDouble();
            this.date = in.readUTF();
        }

        @Override
        public int compareTo(SecondarySortWritable o) {
            // primary order: year; secondary order: date
            if (this.year == o.year) {
                return this.date.compareTo(o.date);
            }
            return this.year > o.year ? 1 : -1;
        }

        @Override
        public String toString() {
            return year + " " + date + " " + price;
        }
    }
}
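A point worth stressing about the Writable contract: readFields must consume fields in exactly the order write produced them, or deserialization silently corrupts the key. The round trip can be sketched with plain java.io and no Hadoop dependency (the `WritableOrderDemo` class below is a hypothetical illustration, not part of the job):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// The Writable contract in miniature: the read side must mirror
// the write side's field order exactly.
public class WritableOrderDemo {

    static byte[] write(int year, double price, String date) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.writeInt(year);      // 1st: year
        out.writeDouble(price);  // 2nd: price
        out.writeUTF(date);      // 3rd: date
        return buf.toByteArray();
    }

    static String read(byte[] bytes) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes));
        int year = in.readInt();        // must mirror write(): year first,
        double price = in.readDouble(); // then price,
        String date = in.readUTF();     // then date
        return year + " " + date + " " + price;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(read(write(2013, 23.6, "01-02"))); // 2013 01-02 23.6
    }
}
```

Reading the double before the int here would not throw; it would just produce garbage values, which is why the field order in readFields deserves a comment in the real class.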
2. Add the mapper and reducer classes
The mapper class:
public static class SecondarySortMapper extends Mapper<LongWritable, Text, SecondarySortWritable, Text> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        // split the line into fields
        String[] fields = line.split(" ");
        int year = Integer.valueOf(fields[0]);
        double price = Double.valueOf(fields[1]);
        String date = fields[2];
        SecondarySortWritable outPutKey = new SecondarySortWritable(year, price, date);
        context.write(outPutKey, new Text(""));
    }
}
The reducer class:
public static class SecondarySortReducer extends Reducer<SecondarySortWritable, Text, SecondarySortWritable, Text> {

    @Override
    protected void reduce(SecondarySortWritable key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        context.write(key, new Text(""));
    }
}
3. Define the partitioning rule (to place records from the same year in the same partition, we must override the default rule)
public static class SecondarySortPartitioner extends Partitioner<SecondarySortWritable, Text> {

    @Override
    public int getPartition(SecondarySortWritable secondarySortWritable, Text text, int numPartitions) {
        // all records with the same year map to the same partition
        return secondarySortWritable.getYear() % numPartitions;
    }
}
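With the `year % numPartitions` rule and the 5 reduce tasks configured later, we can work out by hand which partition each year of the sample data lands in. A quick sketch (the `PartitionDemo` class is a hypothetical stand-in that only reproduces the arithmetic):

```java
// Reproduces the partitioning arithmetic of SecondarySortPartitioner
// for the sample years with 5 reducers.
public class PartitionDemo {

    static int getPartition(int year, int numPartitions) {
        return year % numPartitions;
    }

    public static void main(String[] args) {
        int[] years = {2011, 2012, 2013, 2014, 2015, 2016};
        for (int y : years) {
            System.out.println(y + " -> partition " + getPartition(y, 5));
        }
        // 2015 -> 0; 2011 and 2016 -> 1; 2012 -> 2; 2013 -> 3; 2014 -> 4
    }
}
```

Note that 2011 and 2016 collide on partition 1, which is visible in the job output later: part-r-00001 holds both years' records.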
4. Define the grouping comparator (within one partition, records whose keys compare equal belong to the same group)
public static class SecondarySortGroup extends WritableComparator {

    public SecondarySortGroup() {
        super(SecondarySortWritable.class, true);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        // group only by year: keys with the same year share one reduce() call
        return Integer.compare(((SecondarySortWritable) a).getYear(),
                               ((SecondarySortWritable) b).getYear());
    }
}
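The reduce-side effect of this comparator can be simulated without Hadoop: keys arrive at the reducer already sorted, and consecutive keys that the grouping comparator deems equal (same year here) are fed to a single reduce() call as one group. A minimal sketch, using a hypothetical `GroupingDemo` class and keys represented as `{year, date}` string pairs:

```java
import java.util.ArrayList;
import java.util.List;

// Simulates reduce-side grouping: consecutive sorted keys with the same
// year are collected into one group, mirroring SecondarySortGroup.
public class GroupingDemo {

    // sortedKeys must already be in sorted order, as after the shuffle.
    static List<List<String>> groupByYear(List<String[]> sortedKeys) {
        List<List<String>> groups = new ArrayList<>();
        String prevYear = null;
        for (String[] key : sortedKeys) {
            if (!key[0].equals(prevYear)) {
                // comparator reports a new year: start a new group
                groups.add(new ArrayList<>());
                prevYear = key[0];
            }
            groups.get(groups.size() - 1).add(key[0] + " " + key[1]);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String[]> keys = new ArrayList<>();
        keys.add(new String[]{"2011", "03-05"});
        keys.add(new String[]{"2011", "12-12"});
        keys.add(new String[]{"2013", "01-02"});
        GroupingDemo.groupByYear(keys).forEach(System.out::println);
        // prints [2011 03-05, 2011 12-12] then [2013 01-02]
    }
}
```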
5. Implement the run method
@Override
public int run(String[] args) throws Exception {
    // create the job
    Job job = Job.getInstance(configuration, "secondary-sort");
    job.setJarByClass(SecondarySort.class);

    // input
    Path inputPath = new Path(args[0]);
    FileInputFormat.addInputPath(job, inputPath);

    // output
    Path outputPath = new Path(args[1]);
    FileOutputFormat.setOutputPath(job, outputPath);

    // map settings
    job.setMapperClass(SecondarySortMapper.class);
    job.setMapOutputKeyClass(SecondarySortWritable.class);
    job.setMapOutputValueClass(Text.class);

    // partitioner
    job.setPartitionerClass(SecondarySortPartitioner.class);
    // number of reduce tasks (and therefore output partitions)
    job.setNumReduceTasks(5);

    // reduce settings
    job.setReducerClass(SecondarySortReducer.class);
    job.setOutputKeyClass(SecondarySortWritable.class);
    job.setOutputValueClass(Text.class);

    // combiner
    job.setCombinerClass(SecondarySortReducer.class);
    // grouping comparator
    job.setGroupingComparatorClass(SecondarySortGroup.class);

    // Tool convention: return 0 on success, non-zero on failure
    return job.waitForCompletion(true) ? 0 : 1;
}
6. Add the main method and run the SecondarySort job (remember to upload the data file to HDFS first)
public static void main(String[] args) throws Exception {
    args = new String[]{
            "/user/yang/secondarySort/input/data.txt",
            "/user/yang/secondarySort/output"
    };
    // delete any previous output directory so the job can run again
    FileSystem fileSystem = FileSystem.get(configuration);
    if (fileSystem.exists(new Path(args[1]))) {
        fileSystem.delete(new Path(args[1]), true);
    }
    SecondarySort secondarySort = new SecondarySort();
    secondarySort.run(args);
}
Console output:
File System Counters
	FILE: Number of bytes read=5848
	FILE: Number of bytes written=1410100
	FILE: Number of read operations=0
	FILE: Number of large read operations=0
	FILE: Number of write operations=0
	HDFS: Number of bytes read=816
	HDFS: Number of bytes written=323
	HDFS: Number of read operations=81
	HDFS: Number of large read operations=0
	HDFS: Number of write operations=42
Map-Reduce Framework
	Map input records=8
	Map output records=8
	Map output bytes=160
	Map output materialized bytes=206
	Input split bytes=122
	Combine input records=8
	Combine output records=8
	Reduce input groups=6
	Reduce shuffle bytes=206
	Reduce input records=8
	Reduce output records=6
	Spilled Records=16
	Shuffled Maps =5
	Failed Shuffles=0
	Merged Map outputs=5
	GC time elapsed (ms)=69
	CPU time spent (ms)=0
	Physical memory (bytes) snapshot=0
	Virtual memory (bytes) snapshot=0
	Total committed heap usage (bytes)=3680501760
Shuffle Errors
	BAD_ID=0
	CONNECTION=0
	IO_ERROR=0
	WRONG_LENGTH=0
	WRONG_MAP=0
	WRONG_REDUCE=0
File Input Format Counters
	Bytes Read=136
File Output Format Counters
	Bytes Written=102
Checking HDFS
Because the run method sets the number of reduce tasks to 5, there are 5 output partitions (part-r-00000 through part-r-00004):
yang@hadoop:/opt/modules/hadoop-2.5.0$ bin/hdfs dfs -text /user/yang/secondarySort/output/part-r-0000*
Java HotSpot(TM) 64-Bit Server VM warning: You have loaded library /opt/modules/hadoop-2.5.0/lib/native/libhadoop.so which might have disabled stack guard. The VM will try to fix the stack guard now.
It's highly recommended that you fix the library with 'execstack -c <libfile>', or link it with '-z noexecstack'.
17/10/17 05:08:27 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2015 05-30 17.4
2011 03-05 19.4
2011 12-12 18.5
2016 08-22 24.2
2012 09-18 22.3
2013 01-02 23.6
2013 10-03 18.6
2014 11-02 20.4
yang@hadoop:/opt/modules/hadoop-2.5.0$ bin/hdfs dfs -text /user/yang/secondarySort/output/part-r-00000
17/10/17 05:08:52 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2015 05-30 17.4
yang@hadoop:/opt/modules/hadoop-2.5.0$