Apache Oozie - the Workflow Scheduler for Hadoop
One of Oozie’s strengths is that it was custom built from the ground up for Hadoop. This not only means that Oozie works well on Hadoop, but that the authors of Oozie had an opportunity to build a new system incorporating much of their knowledge about other legacy workflow systems. Although some users view Oozie as just a workflow system, it has evolved into something more than that. The ability to use data availability and time-based triggers to schedule workflows via the Oozie coordinator is as important to today’s users as the workflow. The higher-level concept of bundles, which enable users to package multiple coordinators into complex data pipelines, is also gaining a lot of traction as applications and pipelines moving to Hadoop are getting more complicated.
Recurrent Problem
Hadoop, Pig, Hive, and many other projects provide the foundation for storing and processing large amounts of data in an efficient way. Most of the time, it is not possible to perform all required processing with a single MapReduce, Pig, or Hive job. Multiple MapReduce, Pig, or Hive jobs often need to be chained together, producing and consuming intermediate data and coordinating their flow of execution.
As developers started doing more complex processing using Hadoop, multistage Hadoop jobs became common. This led to several ad hoc solutions to manage the execution and interdependency of these multiple Hadoop jobs. Some developers wrote simple shell scripts to start one Hadoop job after the other. Others used Hadoop’s JobControl class, which executes multiple MapReduce jobs using topological sorting. One development team resorted to Ant with a custom Ant task to specify their MapReduce and Pig jobs as dependencies of each other—also a topological sorting mechanism. Another team implemented a server-based solution that ran multiple Hadoop jobs using one thread to execute each job.
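Hadoop's JobControl class and the Ant-based approach both boil down to the same idea: run jobs in a topologically sorted order of their dependencies. The following is a minimal, hypothetical sketch of that idea in Python; plain job names stand in for JobControl's ControlledJob objects, and the three-stage pipeline at the bottom is an invented example:

```python
from collections import deque

def topological_order(jobs):
    """Return a run order in which every job comes after its dependencies.

    `jobs` maps a job name to the set of job names it depends on --
    a simplified stand-in for the dependency graph JobControl manages.
    """
    # Count unmet dependencies for each job.
    pending = {name: len(deps) for name, deps in jobs.items()}
    # Invert the graph: which jobs are waiting on each job?
    dependents = {name: [] for name in jobs}
    for name, deps in jobs.items():
        for dep in deps:
            dependents[dep].append(name)

    ready = deque(sorted(n for n, c in pending.items() if c == 0))
    order = []
    while ready:
        job = ready.popleft()
        order.append(job)  # this is where the job would actually run
        for nxt in dependents[job]:
            pending[nxt] -= 1
            if pending[nxt] == 0:
                ready.append(nxt)
    if len(order) != len(jobs):
        raise ValueError("dependency cycle: the jobs can never all run")
    return order

# A hypothetical three-stage pipeline: cleanse -> aggregate -> report
pipeline = {
    "cleanse": set(),
    "aggregate": {"cleanse"},
    "report": {"aggregate"},
}
print(topological_order(pipeline))  # ['cleanse', 'aggregate', 'report']
```

JobControl additionally polls job state and only releases a job once all of its dependencies have succeeded; this sketch only derives the execution order.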
As these solutions started to be widely used, several issues emerged. It was hard to track errors and it was difficult to recover from failures. It was not easy to monitor progress. It complicated the life of administrators, who not only had to monitor the health of the cluster but also of different systems running multistage jobs from client machines. Developers moved from one project to another and they had to learn the specifics of the custom framework used by the project they were joining. Different organizations were using significant resources to develop and support multiple frameworks for accomplishing basically the same task.
A Common Solution: Oozie
It was clear that there was a need for a general-purpose system to run multistage Hadoop jobs with the following requirements:
- It should use an adequate and well-understood programming model to facilitate its adoption and to reduce developer ramp-up time.
- It should be easy to troubleshoot and recover jobs when something goes wrong.
- It should be extensible to support new types of jobs.
- It should scale to support several thousand concurrent jobs.
- Jobs should run in a server to increase reliability.
- It should be a multitenant service to reduce the cost of operation.
A Simple Oozie Job
We’ll create an Oozie workflow application named identity-WF that runs an identity MapReduce job: one that simply echoes its input as output and does nothing else.
$ git clone https://github.com/oozie-book/examples.git
$ cd examples/chapter-01/identity-wf
$ mvn clean assembly:single
...
[INFO] BUILD SUCCESS
...
$ tree target/example
The workflow.xml file contains the workflow definition of application identity-WF.
Why XML?
By using XML, Oozie application developers can use any XML editor tool to author their Oozie application. The Oozie server uses XML libraries to parse and validate the correctness of an Oozie application before attempting to use it, significantly simplifying the logic that processes the Oozie application definition. The same holds true for systems creating Oozie applications on the fly.
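As an illustration of the kind of check XML tooling makes easy, the sketch below verifies one structural rule before a job is ever submitted: the workflow's <start> node must point at a node that actually exists. This is a hypothetical example (the Oozie server validates against a full XML schema, not this ad hoc check), and the demo-WF workflow string is made up:

```python
import xml.etree.ElementTree as ET

NS = "uri:oozie:workflow:0.4"  # namespace used by the workflow example in this chapter

def start_target_exists(workflow_xml):
    """Return True if the <start> node's 'to' attribute names a node
    that is defined somewhere in the workflow."""
    root = ET.fromstring(workflow_xml)
    start = root.find(f"{{{NS}}}start")
    target = start.get("to")
    # Collect the 'name' attribute of every top-level node (actions, kill, end, ...).
    names = {el.get("name") for el in root if el.get("name")}
    return target in names

wf = """<workflow-app xmlns="uri:oozie:workflow:0.4" name="demo-WF">
  <start to="demo-action"/>
  <action name="demo-action">
    <ok to="done"/><error to="done"/>
  </action>
  <end name="done"/>
</workflow-app>"""
print(start_target_exists(wf))  # True
```

The same parse-then-check pattern is what lets both the Oozie server and third-party systems that generate workflows on the fly validate a definition without running it.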
identity-WF Oozie workflow XML (workflow.xml)
<workflow-app xmlns="uri:oozie:workflow:0.4" name="identity-WF">
  <parameters>
    <property>
      <name>jobTracker</name>
    </property>
    <property>
      <name>nameNode</name>
    </property>
    <property>
      <name>exampleDir</name>
    </property>
  </parameters>
  <start to="identity-MR"/>
  <action name="identity-MR">
    <map-reduce>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <prepare>
        <delete path="${exampleDir}/data/output"/>
      </prepare>
      <configuration>
        <property>
          <name>mapred.mapper.class</name>
          <value>org.apache.hadoop.mapred.lib.IdentityMapper</value>
        </property>
        <property>
          <name>mapred.reducer.class</name>
          <value>org.apache.hadoop.mapred.lib.IdentityReducer</value>
        </property>
        <property>
          <name>mapred.input.dir</name>
          <value>${exampleDir}/data/input</value>
        </property>
        <property>
          <name>mapred.output.dir</name>
          <value>${exampleDir}/data/output</value>
        </property>
      </configuration>
    </map-reduce>
    <ok to="success"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>The Identity Map-Reduce job failed!</message>
  </kill>
  <end name="success"/>
</workflow-app>
The example application consists of a single file, workflow.xml. We need to package and deploy the application on HDFS before we can run a job.
hanying@master$ hdfs dfs -mkdir -p /user/hanying
hanying@master$ hdfs dfs -put target/example/ch01-identity ch01-identity    # stored under /user/hanying
hanying@master$ hdfs dfs -ls -R /user/hanying/ch01-identity                 # same as: hdfs dfs -ls -R ch01-identity
Before we can run the Oozie job, we need a job.properties file in our local filesystem that specifies the required parameters for the job and the location of the application package in HDFS:
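The book's repository contains the actual file; as an illustration, a job.properties for this workflow would need to supply the three parameters declared in workflow.xml (jobTracker, nameNode, exampleDir) plus the standard oozie.wf.application.path property. The hostnames, ports, and paths below are placeholders to substitute with your cluster's values:

```properties
# Hypothetical endpoints -- replace with your cluster's NameNode and JobTracker/ResourceManager.
nameNode=hdfs://localhost:8020
jobTracker=localhost:8032
# HDFS directory holding the deployed application package
exampleDir=${nameNode}/user/${user.name}/ch01-identity
# Where Oozie looks for the workflow.xml
oozie.wf.application.path=${exampleDir}/app
```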
$ export OOZIE_URL=http://localhost:11000/oozie
$ oozie job -run -config target/example/job.properties
Error:
SLF4J: Class path contains multiple SLF4J bindings.
...
Error: HTTP error code: 500 : Internal Server Error
Solution for the SLF4J warning (see http://www.slf4j.org/codes.html#multiple_bindings): remove the conflicting JAR.
Solution for "Error: HTTP error code: 500 : Internal Server Error": check oozie-error.log.
See http://oozie.apache.org/docs/4.2.0/AG_Install.html ("To use a Self-Signed Certificate").
$ sudo keytool -genkeypair -alias tomcat2 -keyalg RSA -dname "CN=localhost" -storepass password -keypass password
$ sudo keytool -exportcert -alias tomcat2 -file ./certificate.cert    # generates the certificate.cert file
$ sudo keytool -import -alias tomcat -file certificate.cert           # the certificate is added to the keystore
Error:
2016-05-20 09:55:49,729 ERROR ShareLibService:517 - SERVER[master] org.apache.oozie.service.ServiceException: E0104: Could not fully initialize service [org.apache.oozie.service.ShareLibService], Not able to cache sharelib. An Admin needs to install the sharelib with oozie-setup.sh and issue the 'oozie admin' CLI command to update the sharelib
Fix it:
# stop oozie
# edit oozie-site.xml
<property>
  <name>oozie.service.HadoopAccessorService.hadoop.configurations</name>
  <value>*=/home/hanying/hadoop/etc/hadoop</value>
</property>

hanying@master:/usr/local/src/oozie-4.2.0$ sudo -u oozie bin/oozie-setup.sh sharelib create -fs hdfs://master:8020 -locallib share/
Error:
hanying@master:/usr/local/src/oozie-4.2.0$ sudo -u oozie bin/oozie-setup.sh sharelib create -fs hdfs://master:8020 -locallib share/
  setting CATALINA_OPTS="$CATALINA_OPTS -Xmx1024m"
the destination path for sharelib is: /user/oozie/share/lib/lib_20160520173926
Error: User: oozie is not allowed to impersonate oozie
Stack trace for the error was (for debug purposes):
Fix it:
# 1. Add the following content to core-site.xml:
<property>
  <name>hadoop.proxyuser.oozie.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.oozie.groups</name>
  <value>*</value>
</property>
# 2. Restart all services.
Make sure your Hadoop services are started.
Hadoop exposes a set of default HTTP ports, each serving a web UI. Internally, Hadoop mostly uses Hadoop IPC (Inter-Process Communication) to communicate among its servers, over a separate set of ports and protocols from the HTTP ones.