Oozie 4.3.0 installation guide (with Python Spark action support)




My work environment needs the Oozie Spark action, and Oozie 4.2.0 does not automatically load the jars under the lib subdirectory of a workflow directory, so I built from the oozie master branch source (version oozie 4.3.0-SNAPSHOT).
To make the Oozie Spark action also support Python files, I modified a few source files; the changes are described below.


1. Environment

centos: 6.6
jdk:    1.8.0_25
maven:  3.3.9
hadoop: 2.6.0
spark:  1.6.0

For convenience, the root account is used throughout the installation.

2. Building the package
2.1) Installing and configuring Maven
Download Maven 3.3.9:

mkdir ~/download
cd ~/download
wget http://apache.opencas.org/maven/maven-3/3.3.9/binaries/apache-maven-3.3.9-bin.tar.gz
tar -zxvf apache-maven-3.3.9-bin.tar.gz -C /opt/
mv /opt/apache-maven-3.3.9 /opt/maven

Add Maven's bin directory to the PATH variable.
Append the following two lines to /etc/profile:
export MAVEN_HOME=/opt/maven
export PATH=$PATH:$MAVEN_HOME/bin

After saving and exiting, run:
source /etc/profile
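A quick check that Maven is now on the PATH:

mvn -version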


Edit Maven's settings.xml to use the OSChina mirror for faster downloads (this mirror is not complete; if some artifacts turn out to be missing, adjust the mirrorOf entry, but it is not a concern for this build):

<mirror>
    <id>nexus-osc</id>
    <name>OSChina Central</name>
    <url>http://maven.oschina.net/content/groups/public/</url>
    <mirrorOf>*</mirrorOf>
</mirror>


2.2) Downloading and installing Pig
Download Pig:

cd ~/download
wget http://archive.apache.org/dist/pig/pig-0.13.0/pig-0.13.0.tar.gz
tar -zxvf pig-0.13.0.tar.gz -C /opt/
mv /opt/pig-0.13.0 /opt/pig

Add Pig's bin directory to the PATH variable.
Append the following two lines to /etc/profile, then run source /etc/profile:

export PIG_HOME=/opt/pig
export PATH=$PATH:$PIG_HOME/bin
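Pig can be checked the same way:

pig -version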


2.3) Downloading the Oozie master branch (the 4.3.0-SNAPSHOT source)

cd ~/download
git clone https://github.com/apache/oozie.git
cd oozie

2.4) Edit pom.xml in the top-level directory; the following properties need to be changed:

<targetJavaVersion>1.8</targetJavaVersion>
<hadoop.version>2.6.0</hadoop.version>
<hadoop.majorversion>2</hadoop.majorversion>
<pig.version>0.13.0</pig.version>
<maven.javadoc.opts>-Xdoclint:none</maven.javadoc.opts>
<spark.version>1.6.0</spark.version>

2.5) Modifying the source (so that the Oozie Spark action supports Python)
File 1: /root/download/oozie-master/oozie/core/src/main/java/org/apache/oozie/action/hadoop/JavaActionExecutor.java
Line 568, change
else if (fileName.endsWith(".jar")) { // .jar files
to
else if (fileName.endsWith(".jar") || fileName.endsWith(".py")) { // .jar files or .py files


File 2: /root/download/oozie-master/oozie/sharelib/spark/src/main/java/org/apache/oozie/action/hadoop/SparkMain.java
Line 221, change
if (!path.startsWith("job.jar") && path.endsWith(".jar")) {
to
if (!path.startsWith("job.jar") && (path.endsWith(".jar") || path.endsWith(".py"))) {

2.6) Build and package with the following command:

bin/mkdistro.sh -DskipTests -Phadoop-2 -Dhadoop.version=2.6.0

2.7) When the build finishes, the packaged distribution is under distro/target; in my case the file is oozie-4.3.0-SNAPSHOT-distro.tar.gz.


3. Installing the Oozie server

3.1) Extract oozie-4.3.0-SNAPSHOT-distro.tar.gz into /usr/local/ and rename the directory to oozie:

tar -zxvf distro/target/oozie-4.3.0-SNAPSHOT-distro.tar.gz -C /usr/local/
mv /usr/local/oozie-4.3.0-SNAPSHOT-distro /usr/local/oozie

3.2) In /usr/local/oozie, extract the sharelib, examples, and client tarballs:

cd /usr/local/oozie
tar -zxvf oozie-client-4.3.0-SNAPSHOT.tar.gz
tar -zxvf oozie-examples.tar.gz
tar -zxvf oozie-sharelib-4.3.0-SNAPSHOT.tar.gz

3.3) Create a /user/oozie directory on HDFS and upload the share directory to it:

hadoop fs -mkdir -p /user/oozie
hadoop fs -copyFromLocal /usr/local/oozie/share /user/oozie
hadoop fs -ls /user/oozie

3.4) Create a libext directory under /usr/local/oozie and copy the Hadoop jars into /usr/local/oozie/libext:

mkdir /usr/local/oozie/libext
cd /usr/local/oozie
cp ${HADOOP_HOME}/share/hadoop/*/*.jar libext/
cp ${HADOOP_HOME}/share/hadoop/*/lib/*.jar libext/

If prompted about duplicate files, press Enter to skip them.
To avoid classpath conflicts, delete the following jars from libext:
servlet-api-2.5.jar
jasper-compiler-5.5.23.jar
jasper-runtime-5.5.23.jar
jsp-api-2.1.jar
Also search /usr/local/oozie for jetty-all jars and delete any that turn up:
find /usr/local/oozie -name "jetty-all*.jar"

3.5) Copy mysql-connector-java-5.1.38.jar (the version should match your MySQL installation) and ext-2.2.zip into /usr/local/oozie/libext.
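For example, assuming both files were downloaded to ~/download (adjust the source paths to your setup):

cp ~/download/mysql-connector-java-5.1.38.jar /usr/local/oozie/libext/
cp ~/download/ext-2.2.zip /usr/local/oozie/libext/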

3.6) Build the war file. In /usr/local/oozie/bin, run:

./oozie-setup.sh prepare-war

The war file is placed in /usr/local/oozie/oozie-server/webapps.
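A quick check that the war was produced (prepare-war generates oozie.war):

ls -lh /usr/local/oozie/oozie-server/webapps/oozie.war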

4. Configuring Oozie
4.1) Setting environment variables
Edit /etc/profile and add the following:

export OOZIE_HOME=/usr/local/oozie
export CATALINA_HOME=/usr/local/oozie/oozie-server
export PATH=${CATALINA_HOME}/bin:${OOZIE_HOME}/bin:$PATH
export OOZIE_URL=http://localhost:11000/oozie
export OOZIE_CONFIG=/usr/local/oozie/conf
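After saving, reload the profile so the variables take effect in the current shell:

source /etc/profile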

4.2) Edit /usr/local/oozie/conf/oozie-site.xml as follows:

<configuration>
    <!-- Proxyuser Configuration -->
    <property>
        <name>oozie.service.ProxyUserService.proxyuser.hue.hosts</name>
        <value>*</value>
        <description>
            List of hosts the '#USER#' user is allowed to perform 'doAs'
            operations.
            The '#USER#' must be replaced with the username of the user who is
            allowed to perform 'doAs' operations.
            The value can be the '*' wildcard or a list of hostnames.
            For multiple users copy this property and replace the user name
            in the property name.
        </description>
    </property>
    <property>
        <name>oozie.service.ProxyUserService.proxyuser.hue.groups</name>
        <value>*</value>
        <description>
            List of groups the '#USER#' user is allowed to impersonate users
            from to perform 'doAs' operations.
            The '#USER#' must be replaced with the username of the user who is
            allowed to perform 'doAs' operations.
            The value can be the '*' wildcard or a list of groups.
            For multiple users copy this property and replace the user name
            in the property name.
        </description>
    </property>
    <property>
        <name>oozie.db.schema.name</name>
        <value>oozie</value>
        <description>
            Oozie database name.
        </description>
    </property>
    <property>
        <name>oozie.service.JPAService.create.db.schema</name>
        <value>false</value>
        <description>
        </description>
    </property>
    <property>
        <name>oozie.service.JPAService.jdbc.driver</name>
        <value>com.mysql.jdbc.Driver</value>
        <description>
            JDBC driver class.
        </description>
    </property>
    <property>
        <name>oozie.service.JPAService.jdbc.url</name>
        <value>jdbc:mysql://localhost:3306/${oozie.db.schema.name}</value>
        <description>
            JDBC URL.
        </description>
    </property>
    <property>
        <name>oozie.service.JPAService.jdbc.username</name>
        <value>oozie</value>
        <description>
            DB user name.
        </description>
    </property>
    <property>
        <name>oozie.service.JPAService.jdbc.password</name>
        <value>oozie</value>
        <description>
            DB user password.
            IMPORTANT: if the password is empty, leave a one-space string; the service trims the value,
            and if it is empty Configuration assumes it is NULL.
        </description>
    </property>
    <property>
        <name>oozie.service.HadoopAccessorService.hadoop.configurations</name>
        <value>*=/usr/local/hadoop/etc/hadoop</value>
    </property>
    <property>
        <name>oozie.service.HadoopAccessorService.action.configurations</name>
        <value>*=/usr/local/hadoop/etc/hadoop</value>
    </property>
    <property>
        <name>oozie.service.SparkConfigurationService.spark.configurations</name>
        <value>*=/usr/local/spark/conf</value>
    </property>
    <property>
        <name>oozie.service.WorkflowAppService.system.libpath</name>
        <value>/user/oozie/share/lib</value>
    </property>
    <property>
        <name>oozie.use.system.libpath</name>
        <value>true</value>
        <description>
            Default value of oozie.use.system.libpath. If the user has not specified oozie.use.system.libpath
            in job.properties and this value is true, Oozie will include the sharelib jars for the workflow.
        </description>
    </property>
    <property>
        <name>oozie.subworkflow.classpath.inheritance</name>
        <value>true</value>
    </property>
</configuration>

4.3) Set up the MySQL database and generate the Oozie database script (an oozie.sql file will be created in /usr/local/oozie/bin).

mysql -u root -proot      (log in to MySQL; adjust the user name and password as needed)
create database oozie;    (create a database named oozie)
grant all privileges on oozie.* to 'oozie'@'localhost' identified by 'oozie';    (grant access to the oozie database and create user oozie with password oozie)
grant all privileges on oozie.* to 'oozie'@'%' identified by 'oozie';            (grant access to the oozie database from any host)
FLUSH PRIVILEGES;
In /usr/local/oozie/bin, run the following command:

./ooziedb.sh create -sqlfile oozie.sql
Then run the following command to execute the database script, which creates the Oozie tables in the oozie database:
./oozie-setup.sh db create -run  -sqlfile /usr/local/oozie/bin/oozie.sql
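To confirm that the tables were created, a quick check using the oozie/oozie credentials configured above:

mysql -u oozie -poozie oozie -e "show tables;"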


4.4) On the NameNode of the Hadoop cluster, edit core-site.xml and add:

<property>
    <name>hadoop.proxyuser.oozie.hosts</name>
    <value>*</value>
</property>
<property>
    <name>hadoop.proxyuser.oozie.groups</name>
    <value>*</value>
</property>

After the change, there is no need to restart the Hadoop cluster; just run:

hdfs dfsadmin -refreshSuperUserGroupsConfiguration
yarn rmadmin -refreshSuperUserGroupsConfiguration

4.5) Start Oozie with the following commands:
cd /usr/local/oozie
bin/oozied.sh start
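If the start succeeds, the server status can be verified with the Oozie CLI; it should report a NORMAL system mode:

oozie admin -oozie http://localhost:11000/oozie -status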

5. Follow-up notes
5.1) Why the Oozie source was modified so the Spark action supports Python
Symptom: when a Python Spark action is submitted, the system cannot find the Python file.
Cause: Oozie launches the Spark driver from an executor node of a MapReduce2 (YARN) launcher job. When that MapReduce job starts, the Python files under the workflow's lib directory are not added to YARN's distributed file list, so the Spark job fails because it cannot find the Python file. The source changes in section 2.5 add the lib directory's Python files to YARN's distributed file list so that they are accessible when the Spark job starts.
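For reference, a minimal sketch of a workflow.xml that submits a Python script through the Spark action; the names and parameters here (pi.py, ${jobTracker}, ${nameNode}, the executor options) are placeholders, and pi.py is assumed to sit in the workflow application's lib/ directory on HDFS:

<workflow-app name="pyspark-wf" xmlns="uri:oozie:workflow:0.5">
    <start to="spark-node"/>
    <action name="spark-node">
        <spark xmlns="uri:oozie:spark-action:0.1">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <master>yarn-cluster</master>
            <name>pyspark-example</name>
            <!-- For a Python application, <jar> holds the .py file and <class> is omitted -->
            <jar>pi.py</jar>
            <spark-opts>--num-executors 2 --executor-memory 1G</spark-opts>
        </spark>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Spark action failed</message>
    </kill>
    <end name="end"/>
</workflow-app>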

5.2) The .py file is found, but the pyspark module still cannot be found
Symptom: analysis shows that the SPARK_HOME environment variable is not being resolved, even though SPARK_HOME is set in /etc/profile and in /usr/local/spark/conf/spark-env.sh on every YARN node; at execution time it is still not visible.
Cause: /etc/profile is not read for this kind of remote, non-interactive execution, and /usr/local/spark/conf/spark-env.sh is only read when Spark itself is launched. An Oozie Spark action first starts a MapReduce2 (YARN) job and launches the Spark driver on one of that job's executor nodes; at that point the Python packages need to be loaded, but because spark-env.sh has not been read, SPARK_HOME is not available.
Fix: add the variable in yarn-env.sh:

export SPARK_HOME=/usr/local/spark
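yarn-env.sh is only read when the YARN daemons start, so the NodeManagers have to be restarted on each node for the new variable to take effect; for example, assuming a standard Hadoop 2.6 layout with the sbin scripts:

$HADOOP_HOME/sbin/yarn-daemon.sh stop nodemanager
$HADOOP_HOME/sbin/yarn-daemon.sh start nodemanager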
