Using Oozie to schedule Sqoop for incremental imports from Oracle into a Hive table


Overview:
In big-data development, data volumes are often so large that tables in a relational database (Oracle, PostgreSQL, MySQL, etc.) need to be imported into HDFS for distributed storage. This article walks through the issues involved in having Oozie periodically invoke Sqoop to incrementally import data from an Oracle database into a Hive table, for reference.
1/ Defining the sqoop job
To get incremental updates, first create the corresponding sqoop job. Each time the job's action completes, Sqoop automatically saves the latest last_value into its metastore, and that saved value replaces --last-value on the next invocation of the job, which is what makes the import incremental.

sqoop job \
  --meta-connect jdbc:hsqldb:hsql://10.45.6.34:16000/sqoop \
  --create cust_add_job \
  -- import \
  --connect jdbc:oracle:thin:@10.45.6.34:1521:wsmp \
  --username 123 --password 123 \
  -m 1 \
  --table CUST \
  --hive-import --hive-table cust \
  --incremental append --check-column CUST_ID --last-value 0
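To confirm that the job was created and to watch the stored last_value advance after each run, the shared metastore can be queried with Sqoop's standard job sub-commands (only the metastore URL and the job name from the definition above are used here):

# List the jobs stored in the shared metastore
sqoop job --meta-connect jdbc:hsqldb:hsql://10.45.6.34:16000/sqoop --list

# Show the saved definition of the incremental job, including incremental.last.value
sqoop job --meta-connect jdbc:hsqldb:hsql://10.45.6.34:16000/sqoop --show cust_add_job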

2/ Defining workflow.xml
Oozie uses its own bundled Sqoop/Hive libraries and cannot read the configuration files under the local Sqoop and Hive installation directories, so the relevant settings have to be specified in workflow.xml.

<workflow-app xmlns="uri:oozie:workflow:0.2" name="sqoop-to-hive">
    <start to="sqoop2hive"/>
    <action name="sqoop2hive">
        <sqoop xmlns="uri:oozie:sqoop-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>sqoop.metastore.client.record.password</name>
                    <value>true</value>
                </property>
                <property>
                    <name>hive.metastore.local</name>
                    <value>true</value>
                </property>
                <property>
                    <name>hive.metastore.uris</name>
                    <value>thrift://10.45.6.34:9083</value>
                </property>
                <property>
                    <name>javax.jdo.option.ConnectionURL</name>
                    <value>jdbc:postgresql://10.45.40.174:5432/dvsp_hive?createDatabaseIfNotExist=true</value>
                </property>
                <property>
                    <name>javax.jdo.option.ConnectionDriverName</name>
                    <value>org.postgresql.Driver</value>
                </property>
                <property>
                    <name>javax.jdo.option.ConnectionUserName</name>
                    <value>hive</value>
                </property>
                <property>
                    <name>javax.jdo.option.ConnectionPassword</name>
                    <value>hive</value>
                </property>
                <property>
                    <name>hive.metastore.warehouse.dir</name>
                    <value>/user/hive/warehouse</value>
                </property>
            </configuration>
            <command>job --meta-connect jdbc:hsqldb:hsql://10.45.6.34:16000/sqoop --exec cust_add_job</command>
        </sqoop>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Sqoop failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
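Before submitting, the workflow definition can be checked against the Oozie schema and the application directory pushed to HDFS. A minimal sketch, assuming the files sit in a local apps/sqoop2hive_cor/ directory and that the job is submitted as root (matching the log excerpts later in this article); both paths are assumptions, not part of the original setup:

# Validate workflow.xml against the Oozie XML schemas
# (depending on the Oozie version, -oozie http://<oozie-server>:11000/oozie may also be required)
oozie validate apps/sqoop2hive_cor/workflow.xml

# Upload the application directory to the location referenced by workflowAppUri
hdfs dfs -rm -r -f /user/root/apps/sqoop2hive_cor
hdfs dfs -put apps/sqoop2hive_cor /user/root/apps/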

3/ Defining coordinator.xml
The coordinator below runs the sqoop2hive incremental update once every 6 minutes.

<coordinator-app name="sqoop2hive-coord" frequency="${coord:minutes(6)}" start="${start}" end="${end}"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.2">
    <action>
        <workflow>
            <app-path>${workflowAppUri}</app-path>
            <configuration>
                <property>
                    <name>jobTracker</name>
                    <value>${jobTracker}</value>
                </property>
                <property>
                    <name>nameNode</name>
                    <value>${nameNode}</value>
                </property>
            </configuration>
        </workflow>
    </action>
</coordinator-app>

4/ Defining job.properties
Oozie uses UTC, which is 8 hours behind Beijing time, so the start/end values below must be given in UTC. Setting oozie.use.system.libpath to true is required.

nameNode=hdfs://10.45.6.34:9000
jobTracker=10.45.6.34:8032
oozie.libpath=${nameNode}/user/${user.name}/share/lib/lib_20161121154957/sqoop
oozie.use.system.libpath=true
start=2016-12-06T07:30Z
end=2016-12-06T08:00Z
oozie.coord.application.path=${nameNode}/user/${user.name}/apps/sqoop2hive_cor
workflowAppUri=${nameNode}/user/${user.name}/apps/sqoop2hive_cor
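With workflow.xml, coordinator.xml and job.properties in place, the coordinator is submitted from the command line. A sketch, assuming the Oozie server runs on 10.45.6.34 at the default port 11000 (the port is an assumption, it is not stated in this article):

# Submit and start the coordinator
oozie job -oozie http://10.45.6.34:11000/oozie -config job.properties -run

# Check the coordinator and the workflow runs it materializes (use the id returned by the previous command)
oozie job -oozie http://10.45.6.34:11000/oozie -info <coordinator-job-id>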

5/ Problems encountered
(1) After an Oozie run fails you want to look at the detailed error message in the web UI, but the
Console URL http://centos34:8088/proxy/application_1479784816688_0014/ cannot be opened.
Possible cause: the yarn.resourcemanager.webapp.address property (normally set in yarn-site.xml) has not been configured.
It can also happen when the IP address and hostname are not mapped to each other: edit the hosts file and add the corresponding IP/hostname entry (a sketch follows), or simply write the host as an IP address.
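A minimal sketch of the hosts mapping, assuming centos34 resolves to 10.45.6.34 as the rest of the configuration suggests; run it as root and repeat it on every machine that needs to open the Console URL:

# Map the ResourceManager hostname to its IP address
echo "10.45.6.34  centos34" >> /etc/hosts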

(2) Clocks out of sync
Error Code: JA018
Error Message: begin > end in range (begin, end): (1480312665887, 1480312663350)
This is a time-synchronization problem caused by the clocks of the Hadoop cluster nodes drifting apart. Synchronize them with an NTP service; a minimal example follows.
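A minimal one-off sync with ntpdate; 10.45.6.34 is used here purely as an example time source and must itself run an NTP server (an assumption). For a permanent fix, point ntpd or chronyd at that source on every node:

# On each out-of-sync node: stop ntpd if it is running, force a one-time sync, then restart it
service ntpd stop
ntpdate 10.45.6.34
service ntpd start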

(3) Importing the Oracle data into Hive with Sqoop on its own succeeds, but running the same step through Oozie fails:
Error Code: 1
Error Message: Main class [org.apache.oozie.action.hadoop.SqoopMain], exit code [1]
The HDFS web UI shows that the data was imported into HDFS successfully, but it never reached Hive and the action failed. The relevant oozie.log:

2016-12-06 16:22:23,088  INFO ActionStartXCommand:541 - SERVER[centos34] USER[root] GROUP[-] TOKEN[] APP[sqoop-to-hive] JOB[0000001-161206155113885-oozie-root-W] ACTION[0000001-161206155113885-oozie-root-W@:start:] Start action [0000001-161206155113885-oozie-root-W@:start:] with user-retry state : userRetryCount [0], userRetryMax [0], userRetryInterval [10]
2016-12-06 16:22:23,089  INFO ActionStartXCommand:541 - SERVER[centos34] USER[root] GROUP[-] TOKEN[] APP[sqoop-to-hive] JOB[0000001-161206155113885-oozie-root-W] ACTION[0000001-161206155113885-oozie-root-W@:start:] [***0000001-161206155113885-oozie-root-W@:start:***]Action status=DONE
2016-12-06 16:22:23,089  INFO ActionStartXCommand:541 - SERVER[centos34] USER[root] GROUP[-] TOKEN[] APP[sqoop-to-hive] JOB[0000001-161206155113885-oozie-root-W] ACTION[0000001-161206155113885-oozie-root-W@:start:] [***0000001-161206155113885-oozie-root-W@:start:***]Action updated in DB!
2016-12-06 16:22:23,274  INFO ActionStartXCommand:541 - SERVER[centos34] USER[root] GROUP[-] TOKEN[] APP[sqoop-to-hive] JOB[0000001-161206155113885-oozie-root-W] ACTION[0000001-161206155113885-oozie-root-W@sqoop2hive] Start action [0000001-161206155113885-oozie-root-W@sqoop2hive] with user-retry state : userRetryCount [0], userRetryMax [0], userRetryInterval [10]
2016-12-06 16:22:23,611  WARN JobSubmitter:151 - SERVER[centos34] Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
2016-12-06 16:22:23,629  WARN JobSubmitter:259 - SERVER[centos34] No job jar file set.  User classes may not be found. See Job or Job#setJar(String).
2016-12-06 16:22:24,697  INFO SqoopActionExecutor:541 - SERVER[centos34] USER[root] GROUP[-] TOKEN[] APP[sqoop-to-hive] JOB[0000001-161206155113885-oozie-root-W] ACTION[0000001-161206155113885-oozie-root-W@sqoop2hive] checking action, hadoop job ID [job_1480907021185_0025] status [RUNNING]
2016-12-06 16:22:24,700  INFO ActionStartXCommand:541 - SERVER[centos34] USER[root] GROUP[-] TOKEN[] APP[sqoop-to-hive] JOB[0000001-161206155113885-oozie-root-W] ACTION[0000001-161206155113885-oozie-root-W@sqoop2hive] [***0000001-161206155113885-oozie-root-W@sqoop2hive***]Action status=RUNNING
2016-12-06 16:22:24,700  INFO ActionStartXCommand:541 - SERVER[centos34] USER[root] GROUP[-] TOKEN[] APP[sqoop-to-hive] JOB[0000001-161206155113885-oozie-root-W] ACTION[0000001-161206155113885-oozie-root-W@sqoop2hive] [***0000001-161206155113885-oozie-root-W@sqoop2hive***]Action updated in DB!
2016-12-06 16:22:31,198  INFO StatusTransitService$StatusTransitRunnable:541 - SERVER[centos34] USER[-] GROUP[-] Acquired lock for [org.apache.oozie.service.StatusTransitService]
2016-12-06 16:22:31,200  INFO StatusTransitService$StatusTransitRunnable:541 - SERVER[centos34] USER[-] GROUP[-] Running coordinator status service from last instance time =  2016-12-06T08:21Z
2016-12-06 16:22:31,207  INFO StatusTransitService$StatusTransitRunnable:541 - SERVER[centos34] USER[-] GROUP[-] Running bundle status service from last instance time =  2016-12-06T08:21Z
2016-12-06 16:22:31,210  INFO StatusTransitService$StatusTransitRunnable:541 - SERVER[centos34] USER[-] GROUP[-] Released lock for [org.apache.oozie.service.StatusTransitService]
2016-12-06 16:22:31,342  INFO PauseTransitService:541 - SERVER[centos34] USER[-] GROUP[-] Acquired lock for [org.apache.oozie.service.PauseTransitService]
2016-12-06 16:22:31,357  INFO PauseTransitService:541 - SERVER[centos34] USER[-] GROUP[-] Released lock for [org.apache.oozie.service.PauseTransitService]
2016-12-06 16:23:09,629  INFO CallbackServlet:541 - SERVER[centos34] USER[-] GROUP[-] TOKEN[-] APP[-] JOB[0000001-161206155113885-oozie-root-W] ACTION[0000001-161206155113885-oozie-root-W@sqoop2hive] callback for action [0000001-161206155113885-oozie-root-W@sqoop2hive]
2016-12-06 16:23:09,920  INFO SqoopActionExecutor:541 - SERVER[centos34] USER[root] GROUP[-] TOKEN[] APP[sqoop-to-hive] JOB[0000001-161206155113885-oozie-root-W] ACTION[0000001-161206155113885-oozie-root-W@sqoop2hive] checking action, hadoop job ID [job_1480907021185_0025] status [RUNNING]
2016-12-06 16:23:13,860  INFO CallbackServlet:541 - SERVER[centos34] USER[-] GROUP[-] TOKEN[-] APP[-] JOB[0000001-161206155113885-oozie-root-W] ACTION[0000001-161206155113885-oozie-root-W@sqoop2hive] callback for action [0000001-161206155113885-oozie-root-W@sqoop2hive]
2016-12-06 16:23:14,494  INFO SqoopActionExecutor:541 - SERVER[centos34] USER[root] GROUP[-] TOKEN[] APP[sqoop-to-hive] JOB[0000001-161206155113885-oozie-root-W] ACTION[0000001-161206155113885-oozie-root-W@sqoop2hive] action completed, external ID [job_1480907021185_0025]
2016-12-06 16:23:14,506  WARN SqoopActionExecutor:544 - SERVER[centos34] USER[root] GROUP[-] TOKEN[] APP[sqoop-to-hive] JOB[0000001-161206155113885-oozie-root-W] ACTION[0000001-161206155113885-oozie-root-W@sqoop2hive] Launcher ERROR, reason: Main class [org.apache.oozie.action.hadoop.SqoopMain], exit code [1]
2016-12-06 16:23:14,603  INFO ActionEndXCommand:541 - SERVER[centos34] USER[root] GROUP[-] TOKEN[] APP[sqoop-to-hive] JOB[0000001-161206155113885-oozie-root-W] ACTION[0000001-161206155113885-oozie-root-W@sqoop2hive] ERROR is considered as FAILED for SLA
2016-12-06 16:23:14,687  INFO ActionStartXCommand:541 - SERVER[centos34] USER[root] GROUP[-] TOKEN[] APP[sqoop-to-hive] JOB[0000001-161206155113885-oozie-root-W] ACTION[0000001-161206155113885-oozie-root-W@fail] Start action [0000001-161206155113885-oozie-root-W@fail] with user-retry state : userRetryCount [0], userRetryMax [0], userRetryInterval [10]
2016-12-06 16:23:14,688  INFO ActionStartXCommand:541 - SERVER[centos34] USER[root] GROUP[-] TOKEN[] APP[sqoop-to-hive] JOB[0000001-161206155113885-oozie-root-W] ACTION[0000001-161206155113885-oozie-root-W@fail] [***0000001-161206155113885-oozie-root-W@fail***]Action status=DONE
2016-12-06 16:23:14,688  INFO ActionStartXCommand:541 - SERVER[centos34] USER[root] GROUP[-] TOKEN[] APP[sqoop-to-hive] JOB[0000001-161206155113885-oozie-root-W] ACTION[0000001-161206155113885-oozie-root-W@fail] [***0000001-161206155113885-oozie-root-W@fail***]Action updated in DB!

Resolution:
This problem is awkward because the only useful log line is: Error Message: Main class [org.apache.oozie.action.hadoop.SqoopMain], exit code [1].
I first read the source code and then went through every related question and answer I could find online, but none of it helped; the only thing that could be established is that this kind of error happens while the hive command is being executed.
The exact cause is probably different for everyone who hits this error, so the only way forward is to vary each of the parameters involved until the faulty one turns up.
In the end, the cause in this article's setup was the hostnames: use IP addresses (e.g. 127.0.0.1) rather than hostnames as the host in every URL, otherwise the error above can occur.
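When oozie.log only carries the one-line exit-code message, the full stdout/stderr of the Sqoop launcher can usually be pulled from YARN, assuming log aggregation is enabled; the application id below is the one from the log excerpt above, with the job_ prefix replaced by application_:

# Dump the launcher container logs, which contain the underlying Sqoop/Hive stack trace
yarn logs -applicationId application_1480907021185_0025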

6/ Additional notes:
(1) The child elements of each node in workflow.xml must follow the order given in the official documentation; do not rearrange them, or submission may fail with a schema error such as: Error: E0701: XML schema error, cvc-complex-type.2.4.a: Invalid content was found starting with element 'prepare'. One of '{"uri:oozie:sqoop-action:0.2":configuration, "uri:oozie:sqoop-action:0.2":command, "uri:oozie:sqoop-action:0.2":arg}' is expected.

(2) The Hive metastore service must be running, otherwise the action fails with: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient

(3) The Sqoop metastore service must be running, otherwise the action fails with: ERROR tool.JobTool: I/O error performing job operation: java.io.IOException: Exception creating SQL connection. A sketch of starting both services follows.
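If the two metastore services are not already running as daemons, they can be started from the command line. A minimal sketch, assuming the hive and sqoop binaries are on the PATH and the default ports are used (9083 for the Hive metastore, as in workflow.xml above, and 16000 for the Sqoop metastore, as in the sqoop job definition):

# Start the Hive metastore service (thrift://<host>:9083 by default)
nohup hive --service metastore > /tmp/hive-metastore.log 2>&1 &

# Start the shared Sqoop metastore (port 16000 by default)
nohup sqoop metastore > /tmp/sqoop-metastore.log 2>&1 &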

(4) Hive configuration can be supplied either from a file (a job-xml element pointing at a hive-site.xml stored on HDFS) or as property child elements inside the configuration element (the approach taken in this article). If both are used, the settings inside configuration appear to override those in the external file.

(5) Because executing the sqoop job requires the database password, sqoop.metastore.client.record.password must be set to true in sqoop-site.xml and also in the configuration element, otherwise execution may fail.


Reference:
http://ingest.tips/2014/11/27/how-to-oozie-sqoop-hive/
