MapReduce Study Notes: Secondary Sort (Custom Data Type, Custom Partitioning and Grouping)
Secondary sort means that records are first ordered by one field, and when that field is equal they are further ordered by a second field.
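The idea can be illustrated outside MapReduce with plain Java: compare on the primary field, and fall back to the secondary field on ties. This is a minimal sketch with a hypothetical demo class (`SecondarySortDemo`) and made-up records; it is not part of the job itself.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Minimal secondary-sort illustration: primary order on field 0 (year),
// ties broken by field 1 (date).
public class SecondarySortDemo {

    public static List<String[]> sortRecords(List<String[]> records) {
        records.sort(Comparator
                .comparing((String[] r) -> r[0])   // primary: year
                .thenComparing(r -> r[1]));        // secondary: date
        return records;
    }

    public static void main(String[] args) {
        List<String[]> records = new ArrayList<>();
        records.add(new String[]{"2013", "10-03"});
        records.add(new String[]{"2011", "12-12"});
        records.add(new String[]{"2013", "01-02"});
        records.add(new String[]{"2011", "03-05"});
        for (String[] r : sortRecords(records)) {
            System.out.println(r[0] + " " + r[1]);
        }
        // prints 2011 03-05, 2011 12-12, 2013 01-02, 2013 10-03
    }
}
```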
Suppose we have the following file:

2013 23.6 01-02
2011 18.5 12-12
2016 24.2 08-22
2011 19.4 03-05
2014 20.4 11-02
2013 18.6 10-03
2012 22.3 09-18
2015 17.4 05-30

The file records a company's product prices on various dates: the first column is the year, the second the price, and the third the date (month-day).
Requirements:
1. Output the price for each day, sorted by date.
2. Place records from the same year into the same partition.
Solution approach:
1. Define a custom key type that implements the WritableComparable interface.
2. Define a custom partitioner class that extends Partitioner and register it on the Job with setPartitionerClass; it decides, based on the first field, which reducer each key is sent to.
3. Define a custom grouping class that extends WritableComparator and register it with setGroupingComparatorClass. In the reduce phase, when the value iterator for a key is built, all keys with the same first field are treated as one group, so their values end up in a single iterator.
1. Create a SecondarySort class in the wordCount project
As before, it extends Configured and implements Tool. Also add the custom key type, the SecondarySortWritable class:
package com.demo.hadoop;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

public class SecondarySort extends Configured implements Tool {

    private static Configuration configuration;

    static {
        // configure hadoop
        configuration = new Configuration();
        configuration.set("fs.defaultFS", "hdfs://yangyi:8020");
    }

    @Override
    public int run(String[] args) throws Exception {
        return 0;
    }

    public static class SecondarySortWritable implements WritableComparable<SecondarySortWritable> {

        // year
        private int year;
        // price
        private double price;
        // date
        private String date;

        public SecondarySortWritable() {
        }

        public SecondarySortWritable(int year, double price, String date) {
            this.year = year;
            this.price = price;
            this.date = date;
        }

        public int getYear() {
            return year;
        }

        public double getPrice() {
            return price;
        }

        public String getDate() {
            return date;
        }

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeInt(year);
            out.writeDouble(price);
            out.writeUTF(date);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            // fields must be read back in exactly the order write() produced them
            this.year = in.readInt();
            this.price = in.readDouble();
            this.date = in.readUTF();
        }

        @Override
        public int compareTo(SecondarySortWritable o) {
            // primary order: year; secondary order: date
            if (this.year == o.year) {
                return this.date.compareTo(o.date);
            }
            return this.year > o.year ? 1 : -1;
        }

        @Override
        public String toString() {
            return year + " " + date + " " + price;
        }
    }
}
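A point worth stressing about the Writable contract: readFields must consume fields in exactly the order write produced them, or deserialization silently corrupts the key. The round trip can be sketched with plain java.io and no Hadoop dependency (the `WritableOrderDemo` class below is a hypothetical illustration, not part of the job):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// The Writable contract in miniature: the read side must mirror
// the write side's field order exactly.
public class WritableOrderDemo {

    static byte[] write(int year, double price, String date) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.writeInt(year);      // 1st: year
        out.writeDouble(price);  // 2nd: price
        out.writeUTF(date);      // 3rd: date
        return buf.toByteArray();
    }

    static String read(byte[] bytes) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes));
        int year = in.readInt();        // must mirror write(): year first,
        double price = in.readDouble(); // then price,
        String date = in.readUTF();     // then date
        return year + " " + date + " " + price;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(read(write(2013, 23.6, "01-02"))); // 2013 01-02 23.6
    }
}
```

Reading the double before the int here would not throw; it would just produce garbage values, which is why the field order in readFields deserves a comment in the real class.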
2. Add the mapper and reducer classes
The mapper class:
public static class SecondarySortMapper extends Mapper<LongWritable, Text, SecondarySortWritable, Text> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        // split the line into fields
        String[] fields = line.split(" ");
        int year = Integer.valueOf(fields[0]);
        double price = Double.valueOf(fields[1]);
        String date = fields[2];
        SecondarySortWritable outPutKey = new SecondarySortWritable(year, price, date);
        context.write(outPutKey, new Text(""));
    }
}
The reducer class:
public static class SecondarySortReducer extends Reducer<SecondarySortWritable, Text, SecondarySortWritable, Text> {

    @Override
    protected void reduce(SecondarySortWritable key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        context.write(key, new Text(""));
    }
}
3. Define the partitioning rule (to place records from the same year in the same partition, we must override the default rule)
public static class SecondarySortPartitioner extends Partitioner<SecondarySortWritable, Text> {

    @Override
    public int getPartition(SecondarySortWritable secondarySortWritable, Text text, int numPartitions) {
        // all records with the same year map to the same partition
        return secondarySortWritable.getYear() % numPartitions;
    }
}
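With the `year % numPartitions` rule and the 5 reduce tasks configured later, we can work out by hand which partition each year of the sample data lands in. A quick sketch (the `PartitionDemo` class is a hypothetical stand-in that only reproduces the arithmetic):

```java
// Reproduces the partitioning arithmetic of SecondarySortPartitioner
// for the sample years with 5 reducers.
public class PartitionDemo {

    static int getPartition(int year, int numPartitions) {
        return year % numPartitions;
    }

    public static void main(String[] args) {
        int[] years = {2011, 2012, 2013, 2014, 2015, 2016};
        for (int y : years) {
            System.out.println(y + " -> partition " + getPartition(y, 5));
        }
        // 2015 -> 0; 2011 and 2016 -> 1; 2012 -> 2; 2013 -> 3; 2014 -> 4
    }
}
```

Note that 2011 and 2016 collide on partition 1, which is visible in the job output later: part-r-00001 holds both years' records.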
4. Define the grouping comparator (within one partition, records whose keys compare equal belong to the same group)
public static class SecondarySortGroup extends WritableComparator {

    public SecondarySortGroup() {
        super(SecondarySortWritable.class, true);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        // group only by year: keys with the same year share one reduce() call
        return Integer.compare(((SecondarySortWritable) a).getYear(),
                               ((SecondarySortWritable) b).getYear());
    }
}
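The reduce-side effect of this comparator can be simulated without Hadoop: keys arrive at the reducer already sorted, and consecutive keys that the grouping comparator deems equal (same year here) are fed to a single reduce() call as one group. A minimal sketch, using a hypothetical `GroupingDemo` class and keys represented as `{year, date}` string pairs:

```java
import java.util.ArrayList;
import java.util.List;

// Simulates reduce-side grouping: consecutive sorted keys with the same
// year are collected into one group, mirroring SecondarySortGroup.
public class GroupingDemo {

    // sortedKeys must already be in sorted order, as after the shuffle.
    static List<List<String>> groupByYear(List<String[]> sortedKeys) {
        List<List<String>> groups = new ArrayList<>();
        String prevYear = null;
        for (String[] key : sortedKeys) {
            if (!key[0].equals(prevYear)) {
                // comparator reports a new year: start a new group
                groups.add(new ArrayList<>());
                prevYear = key[0];
            }
            groups.get(groups.size() - 1).add(key[0] + " " + key[1]);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String[]> keys = new ArrayList<>();
        keys.add(new String[]{"2011", "03-05"});
        keys.add(new String[]{"2011", "12-12"});
        keys.add(new String[]{"2013", "01-02"});
        GroupingDemo.groupByYear(keys).forEach(System.out::println);
        // prints [2011 03-05, 2011 12-12] then [2013 01-02]
    }
}
```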
5. Implement the run method
@Override
public int run(String[] args) throws Exception {
    // create the job
    Job job = Job.getInstance(configuration, "secondary-sort");
    job.setJarByClass(SecondarySort.class);

    // input
    Path inputPath = new Path(args[0]);
    FileInputFormat.addInputPath(job, inputPath);

    // output
    Path outputPath = new Path(args[1]);
    FileOutputFormat.setOutputPath(job, outputPath);

    // map settings
    job.setMapperClass(SecondarySortMapper.class);
    job.setMapOutputKeyClass(SecondarySortWritable.class);
    job.setMapOutputValueClass(Text.class);

    // partitioner
    job.setPartitionerClass(SecondarySortPartitioner.class);
    // number of reduce tasks (and therefore output partitions)
    job.setNumReduceTasks(5);

    // reduce settings
    job.setReducerClass(SecondarySortReducer.class);
    job.setOutputKeyClass(SecondarySortWritable.class);
    job.setOutputValueClass(Text.class);

    // combiner
    job.setCombinerClass(SecondarySortReducer.class);
    // grouping comparator
    job.setGroupingComparatorClass(SecondarySortGroup.class);

    // Tool convention: return 0 on success, non-zero on failure
    return job.waitForCompletion(true) ? 0 : 1;
}
6. Add the main method and run the SecondarySort job (remember to upload the data file to HDFS first)
public static void main(String[] args) throws Exception {
    args = new String[]{
            "/user/yang/secondarySort/input/data.txt",
            "/user/yang/secondarySort/output"
    };
    // delete any previous output directory so the job can run again
    FileSystem fileSystem = FileSystem.get(configuration);
    if (fileSystem.exists(new Path(args[1]))) {
        fileSystem.delete(new Path(args[1]), true);
    }
    SecondarySort secondarySort = new SecondarySort();
    secondarySort.run(args);
}
Console output:
File System Counters
	FILE: Number of bytes read=5848
	FILE: Number of bytes written=1410100
	FILE: Number of read operations=0
	FILE: Number of large read operations=0
	FILE: Number of write operations=0
	HDFS: Number of bytes read=816
	HDFS: Number of bytes written=323
	HDFS: Number of read operations=81
	HDFS: Number of large read operations=0
	HDFS: Number of write operations=42
Map-Reduce Framework
	Map input records=8
	Map output records=8
	Map output bytes=160
	Map output materialized bytes=206
	Input split bytes=122
	Combine input records=8
	Combine output records=8
	Reduce input groups=6
	Reduce shuffle bytes=206
	Reduce input records=8
	Reduce output records=6
	Spilled Records=16
	Shuffled Maps =5
	Failed Shuffles=0
	Merged Map outputs=5
	GC time elapsed (ms)=69
	CPU time spent (ms)=0
	Physical memory (bytes) snapshot=0
	Virtual memory (bytes) snapshot=0
	Total committed heap usage (bytes)=3680501760
Shuffle Errors
	BAD_ID=0
	CONNECTION=0
	IO_ERROR=0
	WRONG_LENGTH=0
	WRONG_MAP=0
	WRONG_REDUCE=0
File Input Format Counters
	Bytes Read=136
File Output Format Counters
	Bytes Written=102
Checking HDFS
Because the run method sets the number of reduce tasks to 5, there are 5 output partitions (part-r-00000 through part-r-00004):
yang@hadoop:/opt/modules/hadoop-2.5.0$ bin/hdfs dfs -text /user/yang/secondarySort/output/part-r-0000*
Java HotSpot(TM) 64-Bit Server VM warning: You have loaded library /opt/modules/hadoop-2.5.0/lib/native/libhadoop.so which might have disabled stack guard. The VM will try to fix the stack guard now.
It's highly recommended that you fix the library with 'execstack -c <libfile>', or link it with '-z noexecstack'.
17/10/17 05:08:27 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2015 05-30 17.4
2011 03-05 19.4
2011 12-12 18.5
2016 08-22 24.2
2012 09-18 22.3
2013 01-02 23.6
2013 10-03 18.6
2014 11-02 20.4
yang@hadoop:/opt/modules/hadoop-2.5.0$ bin/hdfs dfs -text /user/yang/secondarySort/output/part-r-00000
17/10/17 05:08:52 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2015 05-30 17.4
yang@hadoop:/opt/modules/hadoop-2.5.0$