HDPCD-Java Review Notes (20)
Defining Workflow
Overview of Workflow
Workflow in Hadoop refers to the scheduling and coordination of MapReduce jobs.
Orchestration of MapReduce jobs can be accomplished in several ways, including:
Linear Chain of MapReduce Jobs
Use the return value of waitForCompletion to determine the success of each job in the chain
JobControl
A class in the MapReduce API that coordinates multiple jobs
Apache Oozie
A workflow scheduler system for managing jobs and also for scheduling recurring jobs
Linear Chain
The linear chain workflow is straightforward. Start a job in a run method of a Tool instance, wait for the job to complete, and then start subsequent jobs in the same run method. Continue until all jobs are complete.
For example:

Job job1 = Job.getInstance(getConf(), "CreateBloomFilter");
boolean job1success = job1.waitForCompletion(true);
if (!job1success) {
    System.out.println("The CreateBloomFilter job failed!");
    return -1;
}
Job job2 = Job.getInstance(getConf(), "FilterStocksJob");
boolean job2success = job2.waitForCompletion(true);
if (!job2success) {
    System.out.println("The FilterStocksJob failed!");
    return -1;
}
return 0;
The JobControl Class
The org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl class provides the ability to coordinate MapReduce jobs and define dependencies.
Instead of invoking run on a Job instance, you add a Job instance to a JobControl instance, along with defining any dependencies for the Job.
For example:
JobControl control = new JobControl("bloom");
ControlledJob cjob1 = new ControlledJob(job1, null);
List<ControlledJob> dependentJobs = new ArrayList<ControlledJob>();
dependentJobs.add(cjob1);
ControlledJob cjob2 = new ControlledJob(job2, dependentJobs);
control.addJob(cjob1);
control.addJob(cjob2);
new Thread(control).start();
while (!control.allFinished()) {
    Thread.sleep(1000);
}
control.stop();
• The addJob method is used to add a ControlledJob to a JobControl.
• A ControlledJob has two main components: a Job, and a List of that Job’s dependencies.
• A Job will not be started until all its dependencies are in the SUCCESS state.
• The JobControl class implements Runnable. To run the jobs in the JobControl instance, wrap it in a Thread and then start the Thread.
ControlledJob has a static inner enum named State for the various states that a ControlledJob can be in. The enum values are:
• WAITING
• READY
• RUNNING
• SUCCESS
• FAILED
• DEPENDENT_FAILED
Use the getJobState method to retrieve the state of a ControlledJob.
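The transition rules implied by these states can be sketched in plain Java with no Hadoop dependency. This is an illustration of the logic described above (a job becomes runnable only when every dependency succeeded, and is abandoned if any dependency failed), not the actual JobControl implementation:

```java
import java.util.List;

public class StateSketch {
    // Mirrors the values of ControlledJob.State in the MapReduce API
    enum State { WAITING, READY, RUNNING, SUCCESS, FAILED, DEPENDENT_FAILED }

    // Decide what a WAITING job should become, given its dependencies' states:
    // any failed dependency dooms the job; any unfinished one keeps it waiting.
    static State check(List<State> deps) {
        for (State s : deps) {
            if (s == State.FAILED || s == State.DEPENDENT_FAILED) {
                return State.DEPENDENT_FAILED;
            }
        }
        for (State s : deps) {
            if (s != State.SUCCESS) {
                return State.WAITING;  // still waiting on a dependency
            }
        }
        return State.READY;  // all dependencies succeeded
    }

    public static void main(String[] args) {
        System.out.println(check(List.of(State.SUCCESS, State.SUCCESS)));  // READY
        System.out.println(check(List.of(State.SUCCESS, State.RUNNING)));  // WAITING
        System.out.println(check(List.of(State.FAILED)));                  // DEPENDENT_FAILED
    }
}
```

In the example from the previous section, cjob2 stays WAITING until cjob1 reaches SUCCESS, and would move to DEPENDENT_FAILED if cjob1 failed.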
Overview of Oozie
Oozie provides a framework for coordinating and scheduling Hadoop jobs. Oozie is not restricted to just MapReduce jobs; Oozie can schedule Pig, Hive, Sqoop, streaming jobs, and even Java programs.
Oozie has two main capabilities:
Oozie Workflow -- A collection of actions, defined in a workflow.xml file
Oozie Coordinator -- A recurring workflow, defined in a coordinator.xml file
Oozie is a Java web application that runs in a Tomcat instance. Oozie is run as a service, and workflows are started using the oozie command.
Defining an Oozie Workflow
An Oozie workflow consists of a workflow.xml file and the files required by the workflow. The workflow is put into HDFS with the following directory structure:
- /appdir/workflow.xml
- /appdir/config-default.xml
- /appdir/lib/files.jar
Each workflow can also have a job.properties file (not put into HDFS) for job-specific properties.
A workflow definition consists of two main entries:
Control flow nodes --- A workflow can contain fork and join nodes. A fork node splits one path into multiple paths. A join node waits until every path of a previous fork node arrives at it.
Action nodes --- For executing a job or task.
Consider this example:
<?xml version="1.0" encoding="utf-8"?>
<workflow-app xmlns="uri:oozie:workflow:0.2" name="payroll-workflow">
  <start to="payroll-job"/>
  <action name="payroll-job">
    <map-reduce>
      <job-tracker>${resourceManager}</job-tracker>
      <name-node>${nameNode}</name-node>
      <prepare>
        <delete path="${nameNode}/user/${wf:user()}/payroll/result"/>
      </prepare>
      <configuration>
        <property>
          <name>mapreduce.job.queuename</name>
          <value>${queueName}</value>
        </property>
        <property>
          <name>mapred.mapper.new-api</name>
          <value>true</value>
        </property>
        <property>
          <name>mapred.reducer.new-api</name>
          <value>true</value>
        </property>
        <property>
          <name>mapreduce.job.map.class</name>
          <value>payroll.PayrollMapper</value>
        </property>
        <property>
          <name>mapreduce.job.reduce.class</name>
          <value>payroll.PayrollReducer</value>
        </property>
        <property>
          <name>mapreduce.job.inputformat.class</name>
          <value>org.apache.hadoop.mapreduce.lib.input.TextInputFormat</value>
        </property>
        <property>
          <name>mapreduce.job.outputformat.class</name>
          <value>org.apache.hadoop.mapreduce.lib.output.NullOutputFormat</value>
        </property>
        <property>
          <name>mapreduce.job.output.key.class</name>
          <value>payroll.EmployeeKey</value>
        </property>
        <property>
          <name>mapreduce.job.output.value.class</name>
          <value>payroll.Employee</value>
        </property>
        <property>
          <name>mapreduce.job.reduces</name>
          <value>20</value>
        </property>
        <property>
          <name>mapreduce.input.fileinputformat.inputdir</name>
          <value>${nameNode}/user/${wf:user()}/payroll/input</value>
        </property>
        <property>
          <name>mapreduce.output.fileoutputformat.outputdir</name>
          <value>${nameNode}/user/${wf:user()}/payroll/result</value>
        </property>
        <property>
          <name>taxCode</name>
          <value>${taxCode}</value>
        </property>
      </configuration>
    </map-reduce>
    <ok to="compute-tax"/>
    <error to="fail"/>
  </action>
  <action name="compute-tax">
    <map-reduce>
      <!-- define compute-tax job -->
    </map-reduce>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>Job failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
  </kill>
  <end name="end"/>
</workflow-app>
• Every workflow must define a <start> and an <end>.
• This workflow has two action nodes: payroll-job and compute-tax.
• Notice that a workflow looks almost identical to the run method of a MapReduce job, except that the job properties are specified in XML.
• The <delete> element is a convenient way to delete an existing output directory before the job runs.
Parameters use the ${} syntax and represent values defined outside of workflow.xml. For example, ${resourceManager} is the server name and port number where the ResourceManager is running. Instead of hard-coding this value, you define it in an external properties file (the job.properties file).
Oozie also provides workflow EL functions, such as wf:user(), which returns the name of the user running the job, and wf:lastErrorNode(), which returns the name of the workflow node where the most recent error occurred.
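Conceptually, ${} substitution just replaces each token with the matching entry from the properties file. Here is a simplified, hypothetical resolver in plain Java (not Oozie's actual implementation; unresolved tokens are left as-is here, whereas Oozie would fail the submission):

```java
import java.util.Properties;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ParamResolver {
    // Replace each ${name} token with the matching job.properties entry.
    // Tokens with no matching property are left untouched.
    static String resolve(String template, Properties props) {
        Matcher m = Pattern.compile("\\$\\{([^}]+)\\}").matcher(template);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = props.getProperty(m.group(1), m.group(0));
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("resourceManager", "node:8050");
        // prints <job-tracker>node:8050</job-tracker>
        System.out.println(resolve("<job-tracker>${resourceManager}</job-tracker>", props));
    }
}
```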
Using the New MapReduce API in an Oozie Workflow
To use the new MapReduce API (org.apache.hadoop.mapreduce) in a <map-reduce> action, set both of these properties to true:
<property>
  <name>mapred.mapper.new-api</name>
  <value>true</value>
</property>
<property>
  <name>mapred.reducer.new-api</name>
  <value>true</value>
</property>
Pig Actions
Here is an example of a simple workflow that contains a single Pig script as its only action:
<?xml version="1.0" encoding="utf-8"?>
<workflow-app xmlns="uri:oozie:workflow:0.2" name="whitehouse-workflow">
  <start to="transform_whitehouse_visitors"/>
  <action name="transform_whitehouse_visitors">
    <pig>
      <job-tracker>${resourceManager}</job-tracker>
      <name-node>${nameNode}</name-node>
      <prepare>
        <delete path="wh_visits"/>
      </prepare>
      <script>whitehouse.pig</script>
    </pig>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>Job failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
  </kill>
  <end name="end"/>
</workflow-app>
Hive Actions
A Hive action runs a Hive script; it can also reference a job-xml file and define configuration properties. For example:
<action name="find_congress_visits">
  <hive xmlns="uri:oozie:hive-action:0.5">
    <job-tracker>${resourceManager}</job-tracker>
    <name-node>${nameNode}</name-node>
    <prepare>
      <delete path="congress_visits"/>
    </prepare>
    <job-xml>hive-site.xml</job-xml>
    <configuration>
      <property>
        <name>mapreduce.map.output.compress</name>
        <value>true</value>
      </property>
    </configuration>
    <script>congress.hive</script>
  </hive>
  <ok to="end"/>
  <error to="fail"/>
</action>
Oozie has a command-line tool named oozie for submitting and executing workflows. The command looks like:
• # oozie job -config job.properties -run
Note that the workflow is typically deployed in HDFS, while job.properties, which contains the properties passed into the workflow, is typically kept on the local filesystem.
The command above does not specify which Oozie Workflow to execute. This is specified by the oozie.wf.application.path property:
Here is an example of a job.properties file:
- oozie.wf.application.path=hdfs://node:8020/path/to/app
- resourceManager=node:8050
- #Hadoop fs.defaultFS
- nameNode=hdfs://node:8020/
- #Hadoop mapreduce.job.queuename
- queueName=default
- #Application-specific properties
- taxCode=2012
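job.properties uses the standard Java properties syntax (key=value pairs, with # marking comment lines), so it can be parsed with java.util.Properties. A small sketch, with the file contents inlined for illustration:

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

public class LoadProps {
    // Parse job.properties-style text with the standard Properties loader
    static Properties parse(String contents) throws IOException {
        Properties props = new Properties();
        props.load(new StringReader(contents));
        return props;
    }

    public static void main(String[] args) throws IOException {
        String contents = String.join("\n",
            "oozie.wf.application.path=hdfs://node:8020/path/to/app",
            "resourceManager=node:8050",
            "#Hadoop fs.defaultFS",          // '#' lines are comments
            "nameNode=hdfs://node:8020/",
            "queueName=default",
            "taxCode=2012");
        Properties props = parse(contents);
        System.out.println(props.getProperty("queueName"));  // default
        System.out.println(props.getProperty("taxCode"));    // 2012
    }
}
```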
Fork and Join Nodes
Oozie has fork and join nodes for controlling workflow. For example:
<fork name="forking">
  <path start="firstparalleljob"/>
  <path start="secondparalleljob"/>
</fork>
<action name="firstparalleljob">
  <map-reduce>
    ...
  </map-reduce>
  <ok to="joining"/>
  <error to="kill"/>
</action>
<action name="secondparalleljob">
  <map-reduce>
    ...
  </map-reduce>
  <ok to="joining"/>
  <error to="kill"/>
</action>
<join name="joining" to="nextaction"/>
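The join node's behavior, blocking the next action until every forked path has finished, can be sketched with plain Java threads. This illustrates the fork/join semantics only, not how Oozie actually executes actions:

```java
import java.util.concurrent.CountDownLatch;

public class ForkJoinSketch {
    // Run two "forked" paths concurrently; return only after both complete,
    // the way a join node gates the transition to the next action.
    static String runFork() throws InterruptedException {
        CountDownLatch join = new CountDownLatch(2);  // one count per forked path

        Runnable firstParallelJob = () -> {
            System.out.println("firstparalleljob done");
            join.countDown();
        };
        Runnable secondParallelJob = () -> {
            System.out.println("secondparalleljob done");
            join.countDown();
        };

        new Thread(firstParallelJob).start();
        new Thread(secondParallelJob).start();

        join.await();  // the "joining" node: blocks until both paths finish
        return "nextaction";
    }

    public static void main(String[] args) throws InterruptedException {
        // The two "done" lines may print in either order, but "nextaction"
        // is always reached last.
        System.out.println(runFork());
    }
}
```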
Defining an Oozie Coordinator Job
The Oozie Coordinator component enables recurring Oozie Workflows to be defined. These recurring jobs can be triggered by two types of events:
Time
Similar to a cron job
Data Availability
The job triggers when a specified directory is created
An Oozie Coordinator job consists of two files:
coordinator.xml
The definition of the Coordinator application
coordinator.properties
For defining the job’s properties
Schedule a Job Based on Time
Consider this example of a coordinator.xml file.
<?xml version="1.0" encoding="utf-8"?>
<coordinator-app xmlns="uri:oozie:coordinator:0.1"
                 name="tf-idf"
                 frequency="1440"
                 start="2013-01-01T00:00Z"
                 end="2013-12-31T00:00Z"
                 timezone="UTC">
  <action>
    <workflow>
      <app-path>hdfs://node:8020/root/java/tfidf/workflow</app-path>
    </workflow>
  </action>
</coordinator-app>
The frequency is in minutes, so this job runs once a day.
You submit an Oozie Coordinator job similar to submitting a Workflow job:
• # oozie job -config coordinator.properties -run
The coordinator.properties file contains the path to the coordinator app:
- oozie.coord.application.path=hdfs://node:8020/path/to/app
Schedule a Job Based on Data Availability
The following Coordinator application triggers a Workflow job when a directory named hdfs://node:8020/job/result/ gets created:
<?xml version="1.0" encoding="utf-8"?>
<coordinator-app xmlns="uri:oozie:coordinator:0.1"
                 name="file_check"
                 frequency="1440"
                 start="2012-01-01T00:00Z"
                 end="2015-12-31T00:00Z"
                 timezone="UTC">
  <datasets>
    <dataset name="input1">
      <uri-template>hdfs://node:8020/job/result/</uri-template>
    </dataset>
  </datasets>
  <action>
    <workflow>
      <app-path>hdfs://node:8020/myapp/</app-path>
    </workflow>
  </action>
</coordinator-app>
• If the folder hdfs://node:8020/job/result/ exists, the <action> executes, which in this example is an Oozie Workflow deployed in the hdfs://node:8020/myapp folder.
• The use-case scenario for this example posits that a separate MapReduce job executes once a day at an unspecified time. When that job runs, it deletes the hdfs://node:8020/job/result directory, and then creates a new one, which triggers the Coordinator to run.
• This Coordinator runs once a day, and if /job/result exists, then the /myapp workflow will execute.
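Conceptually, the data-availability trigger is a periodic existence check against the dataset URI. A simplified local-filesystem analogue (a sketch only; Oozie checks the HDFS URI, not a local path) looks like:

```java
import java.nio.file.Files;
import java.nio.file.Path;

public class AvailabilityCheck {
    // Hypothetical stand-in for the coordinator's dataset check:
    // trigger the workflow only if the result directory exists.
    static boolean shouldTrigger(Path resultDir) {
        return Files.isDirectory(resultDir);
    }

    public static void main(String[] args) throws Exception {
        Path result = Files.createTempDirectory("result");
        System.out.println(shouldTrigger(result));                    // true: directory exists
        System.out.println(shouldTrigger(result.resolve("missing"))); // false: not yet created
    }
}
```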