Hadoop Learning, MapReduce (2.4.1): Writing Your Own Writable Class for MapReduce Programs


Preface:

When writing MapReduce programs, the basic Writable data types that Hadoop provides (such as IntWritable, DoubleWritable, Text, and so on, all of which implement the Writable interface) often cannot meet our needs.

For example, we call context.write(key, value) in the map or reduce phase to emit data. In the official WordCount example, the key is a word and the value is a count, both single basic types. In practice, however, the key or value we emit often bundles several fields (of the same or different types). For example, suppose a student's score sheet looks like this:

1:小明:98:78:89
2:小花:87:79:98

The fields are: student ID : name : Chinese score : math score : English score.
Now we want to compute each student's total score. In the MapReduce program, the map phase emits the per-subject scores, and the reduce phase sums them up.
Hadoop's built-in types alone cannot carry such a record, but we can combine several of them into our own Writable type.

1. Here is how to implement your own Writable class.

First, look at the source code of Hadoop's Writable interface:

public interface Writable {
  /**
   * Serialize the fields of this object to <code>out</code>.
   *
   * @param out <code>DataOuput</code> to serialize this object into.
   * @throws IOException
   */
  void write(DataOutput out) throws IOException;

  /**
   * Deserialize the fields of this object from <code>in</code>.
   *
   * <p>For efficiency, implementations should attempt to re-use storage in the
   * existing object where possible.</p>
   *
   * @param in <code>DataInput</code> to deseriablize this object from.
   * @throws IOException
   */
  void readFields(DataInput in) throws IOException;
}

The interface declares two methods, and the Javadoc explains them well: write() serializes the object's fields so they can be shipped between nodes, and readFields() deserializes the incoming bytes back into those fields.
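For intuition, Hadoop's own IntWritable follows exactly this pattern. The sketch below is simplified from the real class (which actually implements WritableComparable, an extension of Writable, and has more methods), keeping only the two serialization methods:

public class IntWritable implements Writable {
    private int value;

    public IntWritable() {}
    public IntWritable(int value) { this.value = value; }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(value);   // serialize the single int field
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        value = in.readInt();  // read it back in the same order
    }
}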

In the same way, our own MyWritable class must implement the Writable interface and provide both of its methods.

Here is the code for the MyWritable class:

package com.jimmy.scoreCount;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

public class MyWritable implements Writable {
    // The per-subject scores to sum, plus the total.
    // These fields are serialized and deserialized together.
    private int yuwen;   // Chinese score
    private int shuxue;  // math score
    private int yingyu;  // English score
    private int score;   // total score

    // Hadoop instantiates Writable objects via reflection,
    // so a no-argument constructor is required.
    public MyWritable() {}

    public MyWritable(int yuwen, int shuxue, int yingyu) {
        this.yuwen = yuwen;
        this.shuxue = shuxue;
        this.yingyu = yingyu;
        this.score = yuwen + shuxue + yingyu;
    }

    // Getters and setters for the fields; Eclipse can generate these quickly.
    // They are not required, but they make it convenient to read
    // individual fields in the map or reduce phase.
    public int getYuwen() { return yuwen; }
    public void setYuwen(int yuwen) { this.yuwen = yuwen; }
    public int getShuxue() { return shuxue; }
    public void setShuxue(int shuxue) { this.shuxue = shuxue; }
    public int getYingyu() { return yingyu; }
    public void setYingyu(int yingyu) { this.yingyu = yingyu; }
    public int getScore() { return score; }
    public void setScore(int score) { this.score = score; }

    /**
     * Serialization method; every custom Writable class must implement it.
     */
    @Override
    public void write(DataOutput out) throws IOException {
        // Fields are written in a fixed order, which must match
        // the order they are read back in readFields().
        out.writeInt(yuwen);
        out.writeInt(shuxue);
        out.writeInt(yingyu);
        out.writeInt(score);
    }

    /**
     * Deserialization method; every custom Writable class must implement it.
     */
    @Override
    public void readFields(DataInput in) throws IOException {
        // Likewise, fields are read in the same order they were written.
        yuwen = in.readInt();
        shuxue = in.readInt();
        yingyu = in.readInt();
        score = in.readInt();
    }
}
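As a quick sanity check outside the job, you can round-trip a MyWritable through a byte array using the standard java.io stream classes. The test class name and sample scores below are just illustrative:

package com.jimmy.scoreCount;

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;

public class MyWritableTest {
    public static void main(String[] args) throws Exception {
        // Serialize: write() sends the four ints to a byte array.
        MyWritable original = new MyWritable(98, 78, 89);
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        original.write(new DataOutputStream(bytes));

        // Deserialize: readFields() restores them in the same order.
        MyWritable copy = new MyWritable();
        copy.readFields(new DataInputStream(
                new ByteArrayInputStream(bytes.toByteArray())));

        System.out.println(copy.getScore()); // prints 265
    }
}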

2. With the custom Writable class in place, we can write the MapReduce program that computes the total scores.

package com.jimmy.scoreCount;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Mapper class
class MyMapper extends Mapper<LongWritable, Text, Text, MyWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();        // one line of input
        String[] ss = line.split(":");         // split on ':' to match the input format
        int yuwen  = Integer.parseInt(ss[2]);  // pick out the three subject scores
        int shuxue = Integer.parseInt(ss[3]);
        int yingyu = Integer.parseInt(ss[4]);
        // Emit the student's name as the key and a MyWritable as the value;
        // the three scores travel between nodes as a single unit.
        context.write(new Text(ss[1]), new MyWritable(yuwen, shuxue, yingyu));
    }
}

// Reducer class
class MyReducer extends Reducer<Text, MyWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<MyWritable> values, Context context)
            throws IOException, InterruptedException {
        int score = 0;
        for (MyWritable mm : values) {
            score += mm.getScore();
        }
        context.write(key, new IntWritable(score));
    }
}

public class ScoreCount {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "scorecount");
        job.setJarByClass(ScoreCount.class);
        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReducer.class);
        // The map output types differ from the final output types,
        // so both pairs must be declared explicitly.
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(MyWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job,
                new Path("hdfs://node1:9000/user/root/input_score"));
        FileOutputFormat.setOutputPath(job,
                new Path("hdfs://node1:9000/user/root/output_scorecount"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Run the job and the output directory will contain each student's total score.
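For the two-line sample input shown earlier, the output file (part-r-00000) would contain each name with its total, since 98 + 78 + 89 = 265 and 87 + 79 + 98 = 264:

小明	265
小花	264

One closing note, going slightly beyond this example: MyWritable is used here only as a map output value. A custom type used as a key must also be sortable, so the convention is to implement WritableComparable, which extends Writable with compareTo(). A minimal hypothetical sketch, with the ordering by total score chosen purely for illustration:

package com.jimmy.scoreCount;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

// Hypothetical key type: compareTo() lets the framework
// sort and group records by this key during the shuffle.
public class ScoreKey implements WritableComparable<ScoreKey> {
    private int score;

    public ScoreKey() {}  // no-arg constructor, again required for reflection
    public ScoreKey(int score) { this.score = score; }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(score);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        score = in.readInt();
    }

    @Override
    public int compareTo(ScoreKey other) {
        return Integer.compare(this.score, other.score);  // sort by total score
    }

    // In practice also override hashCode(), since the default
    // HashPartitioner uses it to route keys to reducers.
    @Override
    public int hashCode() {
        return score;
    }
}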
