Running Multiple Jobs in Hadoop
Source: Internet | Editor: 程序博客网 | Date: 2024/04/29 06:55
The book's explanation is neither clear nor thorough. The following answer from StackOverflow is, I think, much better:

(1) Cascading jobs

Create the JobConf object "job1" for the first job and set all the parameters, with "input" as the input directory and "temp" as the output directory. Execute this job:

```java
JobClient.runJob(job1);
```

Immediately below it, create the JobConf object "job2" for the second job and set all the parameters, with "temp" as the input directory and "output" as the output directory. Execute this job:

```java
JobClient.runJob(job2);
```

(2) Create two JobConf objects and set all the parameters in them just as in (1), except that you don't call JobClient.runJob. Then create two Job objects with the JobConfs as parameters:

```java
Job job1 = new Job(jobconf1);
Job job2 = new Job(jobconf2);
```

Using a JobControl object, you specify the job dependencies and then run the jobs:

```java
JobControl jbcntrl = new JobControl("jbcntrl");
jbcntrl.addJob(job1);
jbcntrl.addJob(job2);
job2.addDependingJob(job1);
jbcntrl.run();
```

(3) If you need a structure somewhat like Map+ | Reduce | Map*, you can use the ChainMapper and ChainReducer classes that ship with Hadoop 0.19 and onwards. Note that in this case you can use only one reducer, but any number of mappers before or after it.

Below I expand on this answer a little. (Note: the quote marking below is only used for emphasis in this post and does not indicate verbatim quotation; my apologies to the original authors and readers, and please point out any copyright issues.)
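Approach (3) above comes with no code in the answer. As a rough driver-configuration sketch of the old-API ChainMapper/ChainReducer (the classes Map1Class, Map2Class, ReduceClass, Map3Class and MyDriver are hypothetical placeholders for your own old-API Mapper/Reducer implementations, and the key/value types are only an example; this fragment belongs inside a driver method and needs a Hadoop cluster to actually run):

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.ChainMapper;
import org.apache.hadoop.mapred.lib.ChainReducer;

// One job that runs Map1 -> Map2 -> Reduce -> Map3, i.e. mappers
// chained before and after a single reducer.
JobConf conf = new JobConf(MyDriver.class);
FileInputFormat.setInputPaths(conf, new Path("input"));
FileOutputFormat.setOutputPath(conf, new Path("output"));

// mappers before the reducer
ChainMapper.addMapper(conf, Map1Class.class,
        LongWritable.class, Text.class, Text.class, Text.class,
        true, new JobConf(false));
ChainMapper.addMapper(conf, Map2Class.class,
        Text.class, Text.class, Text.class, Text.class,
        true, new JobConf(false));

// the single reducer
ChainReducer.setReducer(conf, ReduceClass.class,
        Text.class, Text.class, Text.class, Text.class,
        true, new JobConf(false));

// a mapper after the reducer
ChainReducer.addMapper(conf, Map3Class.class,
        Text.class, Text.class, Text.class, Text.class,
        true, new JobConf(false));

JobClient.runJob(conf);
```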
There are several ways in which the Map and Reduce stages of MapReduce jobs can be linked together. Following the book, this post mainly considers:

A. Sequential cascading, similar to a Unix/Linux pipeline: Map1/Reduce1 -> Map2/Reduce2 -> ...
B. More complex dependencies, e.g. Map1/Reduce1 && Map2/Reduce2 -> Map3/Reduce3, an inverted tree structure;
C. Chaining mappers for pre- and post-processing: Map1 -> Map2 -> Map3 -> Reduce -> Map4 -> Map5.

Case A is simple, because Hadoop already provides the JobClient class; we just build the jobs one after another. Be careful to use JobClient's runJob(JobConf) method, and don't accidentally use submitJob(JobConf). The documentation says:

runJob(JobConf): submits the job and returns only after the job has completed.
submitJob(JobConf): only submits the job; you then poll the returned handle to the RunningJob to query status and make scheduling decisions.

That is, runJob does not return until the job has finished, so we can write code as concise as the following:
```java
/* some configuration code for jobconf1 and jobconf2 goes here */
JobClient.runJob(jobconf1); // returns only after the first job completes
// ...
JobClient.runJob(jobconf2); // the second job consumes the first job's output
```
For case B, a picture first, to show what it actually looks like:

[figure: a dependency graph of five jobs; the image is from http://www.cnblogs.com/xuqiang/archive/2011/06/05/2073155.html — many thanks]

This tangled kind of dependency is exactly where JobControl comes in; see the code below:
```java
Job job1 = new Job(jobconf1);
Job job2 = new Job(jobconf2);
Job job3 = new Job(jobconf3);
Job job4 = new Job(jobconf4);
Job job5 = new Job(jobconf5);

JobControl jbcntrl = new JobControl("MyJobCtrl"); // the String argument names the controller

jbcntrl.addJob(job1);
jbcntrl.addJob(job2);
jbcntrl.addJob(job3);
jbcntrl.addJob(job4);
jbcntrl.addJob(job5);

job2.addDependingJob(job1); // job2 depends on job1: it will not start until job1 has completed
job4.addDependingJob(job3);
job5.addDependingJob(job2);
job5.addDependingJob(job4);

jbcntrl.run();
```
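Under the hood, JobControl keeps launching every job whose dependencies have all completed. The dependency logic can be sketched in plain Java with no Hadoop at all (MiniJobControl and its method names are my own illustrative choices, not Hadoop API):

```java
import java.util.*;

// Toy stand-in for JobControl's dependency bookkeeping: each "job" is a
// name plus the names of jobs it depends on; run() releases a job only
// once everything it depends on has already finished.
class MiniJobControl {
    private final Map<String, List<String>> deps = new LinkedHashMap<>();

    void addJob(String name) {
        deps.putIfAbsent(name, new ArrayList<>());
    }

    // 'name' will not start until 'dependsOn' has completed
    void addDependingJob(String name, String dependsOn) {
        deps.get(name).add(dependsOn);
    }

    // Returns the order in which jobs become runnable (a topological order).
    List<String> run() {
        List<String> finished = new ArrayList<>();
        while (finished.size() < deps.size()) {
            boolean progress = false;
            for (Map.Entry<String, List<String>> e : deps.entrySet()) {
                if (finished.contains(e.getKey())) continue;
                if (finished.containsAll(e.getValue())) {
                    finished.add(e.getKey()); // "execute" the job
                    progress = true;
                }
            }
            if (!progress) throw new IllegalStateException("dependency cycle");
        }
        return finished;
    }

    public static void main(String[] args) {
        MiniJobControl ctl = new MiniJobControl();
        for (String j : new String[]{"job1", "job2", "job3", "job4", "job5"}) ctl.addJob(j);
        // same dependency graph as the JobControl snippet above
        ctl.addDependingJob("job2", "job1");
        ctl.addDependingJob("job4", "job3");
        ctl.addDependingJob("job5", "job2");
        ctl.addDependingJob("job5", "job4");
        System.out.println(ctl.run());
    }
}
```

With the dependency graph above, job1 and job3 can start immediately, job2 and job4 follow their parents, and job5 runs last; the real JobControl additionally runs independent jobs in parallel rather than one by one.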