MapReduce on Avro Data Files
In this post we are going to write a MapReduce program that consumes Avro input data and also produces its output in Avro format.
We will write a program to calculate the average marks for each student.
Data Preparation
The schema for the records is:
student.avsc
{
  "type" : "record",
  "name" : "student_marks",
  "namespace" : "com.rishav.avro",
  "fields" : [
    { "name" : "student_id", "type" : "int" },
    { "name" : "subject_id", "type" : "int" },
    { "name" : "marks", "type" : "int" }
  ]
}
And some sample records are:
student.json
{"student_id":1,"subject_id":63,"marks":19}
{"student_id":2,"subject_id":64,"marks":74}
{"student_id":3,"subject_id":10,"marks":94}
{"student_id":4,"subject_id":79,"marks":27}
{"student_id":1,"subject_id":52,"marks":95}
{"student_id":2,"subject_id":34,"marks":16}
{"student_id":3,"subject_id":81,"marks":17}
{"student_id":4,"subject_id":60,"marks":52}
{"student_id":1,"subject_id":11,"marks":66}
{"student_id":2,"subject_id":84,"marks":39}
{"student_id":3,"subject_id":24,"marks":39}
{"student_id":4,"subject_id":16,"marks":0}
{"student_id":1,"subject_id":65,"marks":75}
{"student_id":2,"subject_id":5,"marks":52}
{"student_id":3,"subject_id":86,"marks":50}
{"student_id":4,"subject_id":55,"marks":42}
{"student_id":1,"subject_id":30,"marks":21}
Now we will convert the above sample records to Avro format and upload the Avro data file to HDFS:
java -jar avro-tools-1.7.5.jar fromjson student.json --schema-file student.avsc > student.avro
hadoop fs -put student.avro student.avro
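It is worth sanity-checking the conversion before running the job. avro-tools can print the embedded schema and dump the file back to JSON (run here against the local copy):

java -jar avro-tools-1.7.5.jar getschema student.avro
java -jar avro-tools-1.7.5.jar tojson student.avro

The tojson output should match the records in student.json.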
Avro MapReduce Program
In my program I have used the Avro-generated Java class for the student_marks schema. To generate the Java class from the schema file, use the command below:
java -jar avro-tools-1.7.5.jar compile schema student.avsc .
Then add the generated Java class to your IDE or build path.
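The generated class follows Avro's SpecificRecord conventions. As a quick illustration (the builder and accessor names below are the ones Avro derives from the schema fields), you can construct records like this:

// Constructing a student_marks record via the Avro-generated builder
student_marks record = student_marks.newBuilder()
    .setStudentId(1)
    .setSubjectId(63)
    .setMarks(19)
    .build();
int marks = record.getMarks(); // accessors mirror the schema fields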
I have written a MapReduce program which reads the Avro data file student.avro (passed as an argument), calculates the average marks for each student, and stores the output in Avro format as well. The program is given below:
package com.rishav.avro.mapreduce;

import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.mapred.AvroKey;
import org.apache.avro.mapred.AvroValue;
import org.apache.avro.mapreduce.AvroJob;
import org.apache.avro.mapreduce.AvroKeyInputFormat;
import org.apache.avro.mapreduce.AvroKeyValueOutputFormat;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

import com.rishav.avro.IntPair;
import com.rishav.avro.student_marks;

public class AvroAverageDriver extends Configured implements Tool {

  // Mapper: for each Avro record, emit (student_id, (marks, 1))
  public static class AvroAverageMapper extends
      Mapper<AvroKey<student_marks>, NullWritable, IntWritable, IntPair> {
    protected void map(AvroKey<student_marks> key, NullWritable value,
        Context context) throws IOException, InterruptedException {
      IntWritable s_id = new IntWritable(key.datum().getStudentId());
      IntPair marks_one = new IntPair(key.datum().getMarks(), 1);
      context.write(s_id, marks_one);
    }
  } // end of mapper class

  // Combiner: pre-aggregate partial (sum, count) pairs per student_id
  public static class AvroAverageCombiner extends
      Reducer<IntWritable, IntPair, IntWritable, IntPair> {
    IntPair p_sum_count = new IntPair();
    Integer p_sum = 0;
    Integer p_count = 0;

    protected void reduce(IntWritable key, Iterable<IntPair> values,
        Context context) throws IOException, InterruptedException {
      p_sum = 0;
      p_count = 0;
      for (IntPair value : values) {
        p_sum += value.getFirstInt();
        p_count += value.getSecondInt();
      }
      p_sum_count.set(p_sum, p_count);
      context.write(key, p_sum_count);
    }
  } // end of combiner class

  // Reducer: total the (sum, count) pairs and emit the average in Avro format
  public static class AvroAverageReducer extends
      Reducer<IntWritable, IntPair, AvroKey<Integer>, AvroValue<Float>> {
    Integer f_sum = 0;
    Integer f_count = 0;

    protected void reduce(IntWritable key, Iterable<IntPair> values,
        Context context) throws IOException, InterruptedException {
      f_sum = 0;
      f_count = 0;
      for (IntPair value : values) {
        f_sum += value.getFirstInt();
        f_count += value.getSecondInt();
      }
      Float average = (float) f_sum / f_count;
      Integer s_id = new Integer(key.toString());
      context.write(new AvroKey<Integer>(s_id), new AvroValue<Float>(average));
    }
  } // end of reducer class

  @Override
  public int run(String[] rawArgs) throws Exception {
    if (rawArgs.length != 2) {
      System.err.printf("Usage: %s [generic options] <input> <output>\n",
          getClass().getName());
      ToolRunner.printGenericCommandUsage(System.err);
      return -1;
    }

    Job job = new Job(super.getConf());
    job.setJarByClass(AvroAverageDriver.class);
    job.setJobName("Avro Average");

    String[] args = new GenericOptionsParser(rawArgs).getRemainingArgs();
    Path inPath = new Path(args[0]);
    Path outPath = new Path(args[1]);
    FileInputFormat.setInputPaths(job, inPath);
    FileOutputFormat.setOutputPath(job, outPath);
    outPath.getFileSystem(super.getConf()).delete(outPath, true);

    // Avro input: AvroKeyInputFormat with the student_marks schema
    job.setInputFormatClass(AvroKeyInputFormat.class);
    job.setMapperClass(AvroAverageMapper.class);
    AvroJob.setInputKeySchema(job, student_marks.getClassSchema());
    job.setMapOutputKeyClass(IntWritable.class);
    job.setMapOutputValueClass(IntPair.class);

    job.setCombinerClass(AvroAverageCombiner.class);

    // Avro output: key/value pairs with an INT key and a FLOAT value
    job.setOutputFormatClass(AvroKeyValueOutputFormat.class);
    job.setReducerClass(AvroAverageReducer.class);
    AvroJob.setOutputKeySchema(job, Schema.create(Schema.Type.INT));
    AvroJob.setOutputValueSchema(job, Schema.create(Schema.Type.FLOAT));

    return (job.waitForCompletion(true) ? 0 : 1);
  }

  public static void main(String[] args) throws Exception {
    int result = ToolRunner.run(new AvroAverageDriver(), args);
    System.exit(result);
  }
}
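Note that the program imports a custom com.rishav.avro.IntPair Writable whose source is not shown in this post. A minimal sketch of such a class, assuming it only needs the set, getFirstInt, and getSecondInt methods used above:

package com.rishav.avro;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

// Minimal Writable holding two ints (a marks sum and a count), matching the
// accessors used by the job above. Illustrative sketch, not the author's code.
public class IntPair implements Writable {
  private int first;
  private int second;

  public IntPair() {}

  public IntPair(int first, int second) {
    set(first, second);
  }

  public void set(int first, int second) {
    this.first = first;
    this.second = second;
  }

  public int getFirstInt() { return first; }
  public int getSecondInt() { return second; }

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeInt(first);
    out.writeInt(second);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    first = in.readInt();
    second = in.readInt();
  }
}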
- In the program the input key to the mapper is AvroKey<student_marks> and the input value is NullWritable. The map method's output key is the student_id and its output value is an IntPair holding the marks and a count of 1.
- There is also a combiner, which aggregates partial sums and counts for each student_id.
- Finally, the reducer takes each student_id with its partial sums and counts and uses them to calculate the average, writing the output in Avro format; see the worked example below.
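For example, student_id 1 appears five times in the sample data with marks 19, 95, 66, 75, and 21, so the reducer computes 276/5 and emits an average of 55.2 for that student.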
For the Avro job setup, we have added these properties:
// set InputFormatClass to AvroKeyInputFormat and define the input schema
job.setInputFormatClass(AvroKeyInputFormat.class);
AvroJob.setInputKeySchema(job, student_marks.getClassSchema());

// set OutputFormatClass to AvroKeyValueOutputFormat with an INT key and a FLOAT value
job.setOutputFormatClass(AvroKeyValueOutputFormat.class);
AvroJob.setOutputKeySchema(job, Schema.create(Schema.Type.INT));
AvroJob.setOutputValueSchema(job, Schema.create(Schema.Type.FLOAT));
Job Execution
We package our Java program as avro_mr.jar and add the Avro jars to libjars and the Hadoop classpath using the commands below:
export LIBJARS=avro-1.7.5.jar,avro-mapred-1.7.5-hadoop1.jar,paranamer-2.6.jar
export HADOOP_CLASSPATH=avro-1.7.5.jar:avro-mapred-1.7.5-hadoop1.jar:paranamer-2.6.jar
hadoop jar avro_mr.jar com.rishav.avro.mapreduce.AvroAverageDriver -libjars ${LIBJARS} student.avro output
You can verify the output using the avro-tools command.
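For example (the part-file name below follows the usual MapReduce naming convention and may differ on your cluster):

hadoop fs -get output/part-r-00000.avro .
java -jar avro-tools-1.7.5.jar tojson part-r-00000.avro

With the sample data above, the record for student 1 should look like {"key": 1, "value": 55.2}.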
To enable Snappy compression for the output, add the lines below to the run method, and add the snappy-java jar to libjars and the Hadoop classpath:
// requires: import org.apache.hadoop.io.compress.SnappyCodec;
FileOutputFormat.setCompressOutput(job, true);
FileOutputFormat.setOutputCompressorClass(job, SnappyCodec.class);
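For example, extending the earlier exports (the snappy-java version here is illustrative; use the one bundled with your Hadoop distribution):

export LIBJARS=avro-1.7.5.jar,avro-mapred-1.7.5-hadoop1.jar,paranamer-2.6.jar,snappy-java-1.0.5.jar
export HADOOP_CLASSPATH=avro-1.7.5.jar:avro-mapred-1.7.5-hadoop1.jar:paranamer-2.6.jar:snappy-java-1.0.5.jar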