Hadoop 2.X MR Job Flow



Scenario overview: As a layer built on top of HDFS, MR is designed for offline data computation over a large distributed file system. MR jobs are best suited to scenarios where timeliness requirements are loose. MR can play different roles at different stages of an ETL flow, and in some scenarios a chain of MR jobs can even implement the entire ETL pipeline.

MR overview: Hadoop MR (MapReduce) is a software framework for processing large batches of offline data, running reliably and fault-tolerantly on large clusters. The core idea of MR is to turn a job's input into many parallel Map tasks and then run a Reduce step that produces the final result. A job's input and output are both stored in the file system, and the MR framework takes care of scheduling the job, monitoring it, and retrying failed tasks.

|  Following the design principle that moving computation is cheaper than moving data, MR compute nodes and data nodes are usually the same nodes.

|  The MR framework has one ResourceManager master node, one NodeManager on each cluster node, and one MRAppMaster per application.

|  An MR program should have at least an input/output specification plus Map and Reduce functions, usually obtained by implementing the appropriate interfaces or abstract classes. These, together with other job parameters, make up the job configuration.

|  A Hadoop job is submitted by a job client: the job client hands the job and its configuration to the ResourceManager, which distributes them to the cluster nodes, allocates resources, and then schedules and monitors the tasks, providing status and diagnostic information back to the job client.

 

[Figure: MR job flow]

Input and output:

|  An MR job works on <key, value> pairs: the framework takes a set of <key, value> pairs as the job's input and produces a set of <key, value> pairs as its output; the data types may change along the way.

|  Because jobs are distributed, MR requires keys and values to be serializable by implementing the Writable interface; in addition, keys must implement the WritableComparable interface so the MR framework can sort records by key (see the sketch after this list).

|  The input/output of an MR job goes through the following evolution:

(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)
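
For illustration, a minimal sketch of what such a key type can look like: a hypothetical composite key TextPair (not part of the original post) that satisfies the Writable/WritableComparable contract.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Hypothetical composite key; the field names are illustrative.
public class TextPair implements WritableComparable<TextPair> {
    private String first = "";
    private String second = "";

    // The framework instantiates keys reflectively, so a no-arg constructor is required.
    public TextPair() {}

    public TextPair(String first, String second) {
        this.first = first;
        this.second = second;
    }

    public String getFirst() { return first; }
    public String getSecond() { return second; }

    // Writable: serialization used when records are spilled to disk and shuffled.
    public void write(DataOutput out) throws IOException {
        out.writeUTF(first);
        out.writeUTF(second);
    }

    public void readFields(DataInput in) throws IOException {
        first = in.readUTF();
        second = in.readUTF();
    }

    // WritableComparable: the ordering the framework uses when sorting by key.
    public int compareTo(TextPair o) {
        int cmp = first.compareTo(o.first);
        return cmp != 0 ? cmp : second.compareTo(o.second);
    }

    // hashCode() also drives HashPartitioner's key-to-reducer routing.
    @Override
    public int hashCode() {
        return first.hashCode() * 163 + second.hashCode();
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof TextPair && compareTo((TextPair) o) == 0;
    }
}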

 

MR Interfaces

|  The Mapper interface maps the job's input to key/value sets: a Mapper operates on input key/value pairs and generates intermediate key/value records, and a given input pair may map to zero or more output pairs. The Mapper interface is defined as follows (a concrete example follows the definition):

public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
    ...
    public void run(Context context) throws IOException, InterruptedException { ... }
}
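
A concrete Mapper for contrast: this hypothetical TokenMapper (not from the original post) splits each line of text input into tokens, showing how one input record can yield zero or more intermediate pairs.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical example: an empty line emits nothing; a line of n tokens emits n pairs.
public class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (token.isEmpty())
                continue; // zero outputs for empty tokens
            word.set(token);
            context.write(word, ONE); // one intermediate <token, 1> pair
        }
    }
}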

|  Shuffle: the output of the Mapper stage is shuffled, and the data is sorted along the way.

|  Sort: the MR framework groups the Reducer's input by key by default; what a Reducer processes is the Mapper-stage output after it has been grouped by key.

|  Secondary Sort: a custom key ordering can be supplied by implementing a Comparator and registering it with Job.setSortComparatorClass(Class), and the rule for grouping keys can be set with Job.setGroupingComparatorClass(Class), as sketched below.
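
A hedged sketch of those two hooks, reusing the hypothetical TextPair key sketched earlier: the sort comparator orders by both fields, while the grouping comparator groups reduce() calls by the first field only.

import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Full ordering used by the sort phase: first field, then second field.
class FullKeyComparator extends WritableComparator {
    protected FullKeyComparator() {
        super(TextPair.class, true);
    }

    @SuppressWarnings("rawtypes")
    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        return ((TextPair) a).compareTo((TextPair) b);
    }
}

// Grouping rule: keys sharing the first field reach one reduce() call together,
// so that call sees their values ordered by the second field.
class FirstFieldGrouping extends WritableComparator {
    protected FirstFieldGrouping() {
        super(TextPair.class, true);
    }

    @SuppressWarnings("rawtypes")
    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        return ((TextPair) a).getFirst().compareTo(((TextPair) b).getFirst());
    }
}

// Wiring:
// job.setSortComparatorClass(FullKeyComparator.class);
// job.setGroupingComparatorClass(FirstFieldGrouping.class);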

|  The Reducer interface merges the intermediate results of multiple Mapper operations; an MR job may run multiple Reducers concurrently. Reduce usually involves three phases: shuffle, sort, and reduce. The Reducer interface is as follows:

public class Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
    ...
    protected void reduce(KEYIN key, Iterable<VALUEIN> values, Context context)
            throws IOException, InterruptedException {
    }
}


|  Reducer NONE: the MR framework allows an MR job to have only a Mapper operation and no Reduce; the Mapper output is then written directly to the file system, and no sorting happens before it is written (see the sketch below).
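
A minimal map-only job sketch, assuming the hypothetical TokenMapper shown earlier: setting the reduce count to zero sends the Mapper output straight to the OutputFormat.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapOnlyJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "map-only");
        job.setJarByClass(MapOnlyJob.class);
        job.setMapperClass(TokenMapper.class); // hypothetical mapper from the earlier sketch
        job.setNumReduceTasks(0);              // no shuffle, no sort, no reduce
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}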

|  Partitioner: the Partitioner splits the intermediate results of the Mapper operation by key. The number of partitions equals the number of Reducers, and the Partitioner controls which of the Reducers each intermediate key is sent to for the Reduce operation (a sketch follows).
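
For illustration, a hypothetical custom Partitioner (the routing rule is invented for this example): keys with a numeric prefix all land in partition 0, and everything else is hashed across the remaining partitions.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class PrefixPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        String s = key.toString();
        if (!s.isEmpty() && Character.isDigit(s.charAt(0))) {
            return 0; // numeric-prefixed keys all go to reducer 0
        }
        if (numPartitions == 1) {
            return 0;
        }
        // Remaining keys are spread over partitions 1 .. numPartitions-1.
        return 1 + (key.hashCode() & Integer.MAX_VALUE) % (numPartitions - 1);
    }
}

// Wiring: job.setPartitionerClass(PrefixPartitioner.class);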

|  Counter: a counting facility provided by MR for collecting statistics on the MR job's state, as in the sketch below.
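
A minimal counter sketch (the enum and record layout are hypothetical): a user-defined counter incremented inside a Mapper, which the framework aggregates across all tasks and reports with the job status.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CountingMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    // Hypothetical counter, defined as an enum.
    enum Quality { MALFORMED_RECORDS }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split(",");
        if (fields.length < 2) {
            // Count records we skip instead of failing the task.
            context.getCounter(Quality.MALFORMED_RECORDS).increment(1);
            return;
        }
        context.write(new Text(fields[0]), new IntWritable(1));
    }
}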

Appendix:

|  A default MR program with the default settings written out explicitly; this example is complete and can be run directly.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

public class T {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "TEST");
        // Default input configuration
        {
            // key: LongWritable (byte offset); value: Text (the line)
            job.setInputFormatClass(TextInputFormat.class);
        }
        // Default Mapper configuration: the identity Mapper passes the
        // TextInputFormat pairs straight through, so the map output types
        // are LongWritable/Text.
        {
            job.setMapperClass(Mapper.class);
            job.setMapOutputKeyClass(LongWritable.class);
            job.setMapOutputValueClass(Text.class);
        }
        // Default map routing algorithm
        {
            // Each record's key is hashed to decide which partition the record
            // belongs to; each partition feeds one Reduce task
            // ==> the number of partitions equals the job's number of Reducers.
            job.setPartitionerClass(HashPartitioner.class);
        }
        // Default Reduce configuration
        {
            // The data has been sorted by key before it reaches Reduce.
            job.setReducerClass(Reducer.class);
            job.setNumReduceTasks(1);
        }
        // Default output configuration
        {
            // Keys/values are converted to strings, separated by a tab,
            // and written out record by record.
            job.setOutputFormatClass(TextOutputFormat.class);
            job.setOutputKeyClass(LongWritable.class);
            job.setOutputValueClass(Text.class);
        }
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        if (!job.waitForCompletion(true))
            return;
    }
}


|  A WordCount implementation; this example is complete.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "WordCount");
        job.setInputFormatClass(CombineTextInputFormat.class);
        job.setJarByClass(WordCount.class);
        job.setMapperClass(WordMapper.class);
        job.setReducerClass(WordReduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        CombineTextInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        if (!job.waitForCompletion(true))
            return;
    }
}

class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    public void map(LongWritable ikey, Text ivalue, Context context)
            throws IOException, InterruptedException {
        final IntWritable one = new IntWritable(1);
        StringTokenizer st = new StringTokenizer(ivalue.toString());
        while (st.hasMoreTokens()) {
            // Emit one <word, 1> pair per token.
            context.write(new Text(st.nextToken()), one);
        }
    }
}

class WordReduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable intw : values) {
            sum += intw.get();
        }
        context.write(key, new IntWritable(sum));
    }
}


|  A slightly more complex example: a Top-N implementation.

import java.io.IOException;
import java.util.TreeMap;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CarTopN {
    public static final String TOPN = "TOPN";

    public static void main(String[] args) throws Exception {
        Path input = new Path(args[0]);
        Path output = new Path(args[1]);
        Integer N = Integer.parseInt(args[2]);
        Configuration conf = new Configuration();
        // Define the N.
        conf.setInt(CarTopN.TOPN, N);
        Job job = Job.getInstance(conf, "CAR_Top10_BY_TGSID");
        job.setJarByClass(CarTopN.class);
        job.setInputFormatClass(CombineTextInputFormat.class);
        job.setMapperClass(CarTopNMapper.class);
        // Not used.
        // job.setCombinerClass(CarTopNCombine.class);
        job.setReducerClass(CarTopNReduce.class);
        job.setNumReduceTasks(1);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setPartitionerClass(ITGSPartition.class);
        FileSystem fs = FileSystem.get(conf);
        // Preprocess the input: only read files that have finished writing
        // (marked by a ".writed" suffix) and whose size is greater than 0.
        {
            FileStatus childs[] = fs.globStatus(input, new PathFilter() {
                public boolean accept(Path path) {
                    return path.toString().endsWith(".writed");
                }
            });
            Path temp = null;
            for (FileStatus file : childs) {
                temp = new Path(file.getPath().toString().replaceAll(".writed", ""));
                if (fs.listStatus(temp)[0].getLen() > 0) {
                    FileInputFormat.addInputPath(job, temp);
                }
            }
        }
        CombineTextInputFormat.setMaxInputSplitSize(job, 67108864);
        // Forcibly clear the output directory.
        if (fs.exists(output)) {
            fs.delete(output, true);
        }
        FileOutputFormat.setOutputPath(job, output);
        if (!job.waitForCompletion(true))
            return;
    }
}

class ITGSPartition extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        return (Math.abs(key.hashCode())) % numPartitions;
    }
}

class CarTopNMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value,
            Mapper<LongWritable, Text, Text, IntWritable>.Context context)
            throws IOException, InterruptedException {
        String temp = value.toString();
        if (temp.length() > 13) {
            temp = temp.substring(12);
            String[] items = temp.split(",");
            if (items.length > 10) {
                // TGSID as key.
                try {
                    String tgsid = items[14].substring(6);
                    Integer.parseInt(tgsid);
                    context.write(new Text(tgsid), new IntWritable(1));
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        }
    }
}

class CarTopNCombine extends Reducer<Text, IntWritable, Text, IntWritable> {
    // An ordered map that always holds at most the top N entries.
    private final TreeMap<Integer, String> tm = new TreeMap<Integer, String>();
    private int N;

    @Override
    protected void setup(Reducer<Text, IntWritable, Text, IntWritable>.Context context)
            throws IOException, InterruptedException {
        Configuration conf = context.getConfiguration();
        N = conf.getInt(CarTopN.TOPN, 10);
    }

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values,
            Reducer<Text, IntWritable, Text, IntWritable>.Context context)
            throws IOException, InterruptedException {
        Integer weight = 0;
        for (IntWritable iw : values) {
            weight += iw.get();
        }
        tm.put(weight, key.toString());
        // Keep only the top N.
        if (tm.size() > N) {
            tm.remove(tm.firstKey());
        }
    }

    @Override
    protected void cleanup(Reducer<Text, IntWritable, Text, IntWritable>.Context context)
            throws IOException, InterruptedException {
        // Emit the final data to the next stage.
        for (Integer key : tm.keySet()) {
            context.write(new Text("byonet:" + tm.get(key)), new IntWritable(key));
        }
    }
}

// Core Top-N computation.
// Avoid storing Hadoop data types in Java collections; reused instances can
// cause surprising results (hence the String/int copies kept in the TreeMap).
class CarTopNReduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final TreeMap<Integer, String> tm = new TreeMap<Integer, String>();
    private int N;

    @Override
    protected void setup(Reducer<Text, IntWritable, Text, IntWritable>.Context context)
            throws IOException, InterruptedException {
        Configuration conf = context.getConfiguration();
        N = conf.getInt(CarTopN.TOPN, 10);
    }

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values,
            Reducer<Text, IntWritable, Text, IntWritable>.Context context)
            throws IOException, InterruptedException {
        Integer weight = 0;
        for (IntWritable iw : values) {
            weight += iw.get();
        }
        tm.put(weight, key.toString());
        if (tm.size() > N) {
            tm.remove(tm.firstKey());
        }
    }

    @Override
    protected void cleanup(Reducer<Text, IntWritable, Text, IntWritable>.Context context)
            throws IOException, InterruptedException {
        for (Integer key : tm.keySet()) {
            context.write(new Text("byonet:" + tm.get(key)), new IntWritable(key));
        }
    }
}

