Importing data between HDFS and Oracle with Sqoop2
The previous post covered installing Sqoop2. This post shows how to use Sqoop2 to import data from Oracle into HDFS, and from HDFS back into Oracle.
Using Sqoop2 breaks down into the following steps:
- Connect to the server
- List the available connectors
- Create links
- Create jobs
- Run a job
- Check the job's run status
Before using Sqoop2, one more change to the Hadoop configuration is needed, and the JobHistory service must be started; otherwise a job can only be run once, and the second run fails.
1) Configure mapred-site.xml (on the NameNode is sufficient):
<property>
<name>mapreduce.jobhistory.address</name>
<value>192.168.101.11:10020</value>
</property>
2) Start the JobHistory service:
sbin/mr-jobhistory-daemon.sh start historyserver
1. Connect to the Sqoop2 server
Sqoop2 uses a client/server model. What was installed and started previously is the Sqoop2 server; to operate Sqoop2 you connect a client to it. As long as the server is running, the client can be launched from any machine: a host with nothing but the unpacked Sqoop2 distribution will do, with no Hadoop and no Sqoop server of its own.
Start the client: bin/sqoop2-shell
[hadoop@hadoop01 sqoop-1.99.7-bin-hadoop200]$ bin/sqoop2-shell
Setting conf dir: /home/hadoop/sqoop-1.99.7-bin-hadoop200/bin/../conf
Sqoop home directory: /home/hadoop/sqoop-1.99.7-bin-hadoop200
Sqoop Shell: Type 'help' or '\h' for help.
sqoop:000> \h
For information about Sqoop, visit: http://sqoop.apache.org/
Available commands:
  :exit    (:x  ) Exit the shell
  :history (:H  ) Display, manage and recall edit-line history
  help     (\h  ) Display this help message
  set      (\st ) Configure various client options and settings
  show     (\sh ) Display various objects and configuration options
  create   (\cr ) Create new object in Sqoop repository
  delete   (\d  ) Delete existing object in Sqoop repository
  update   (\up ) Update objects in Sqoop repository
  clone    (\cl ) Create new object based on existing one
  start    (\sta) Start job
  stop     (\stp) Stop job
  status   (\stu) Display status of a job
  enable   (\en ) Enable object in Sqoop repository
  disable  (\di ) Disable object in Sqoop repository
  grant    (\g  ) Grant access to roles and assign privileges
  revoke   (\r  ) Revoke access from roles and remove privileges
For help on a specific command type: help command
sqoop:000>
At this point only the client has started; it is not yet connected to a server. If you try to create a link or a job now, it connects to localhost by default. So first point the client at the server:
sqoop:000> set server --host 192.168.101.12 --port 12000 --webapp sqoop
Server is set successfully
sqoop:000>
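Before creating any objects, it is worth confirming the client really reached the server. The show command (listed in the help output above) can print the current server settings; the snippet below is a sketch of what the 1.99.7 shell displays, not a captured session:

```
sqoop:000> show server --all
Server host: 192.168.101.12
Server port: 12000
Server webapp: sqoop
```

If show connector in the next step fails with a connection error, re-check these values first.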
2. List connectors
A connector is the template from which links are created. For example, an HDFS link is created from hdfs-connector, and an Oracle link from oracle-jdbc-connector or generic-jdbc-connector.
If the server connection set above is working, show connector prints the following:
sqoop:000> show connector
+------------------------+---------+------------------------------------------------------------+----------------------+
| Name                   | Version | Class                                                      | Supported Directions |
+------------------------+---------+------------------------------------------------------------+----------------------+
| generic-jdbc-connector | 1.99.7  | org.apache.sqoop.connector.jdbc.GenericJdbcConnector       | FROM/TO              |
| kite-connector         | 1.99.7  | org.apache.sqoop.connector.kite.KiteConnector              | FROM/TO              |
| oracle-jdbc-connector  | 1.99.7  | org.apache.sqoop.connector.jdbc.oracle.OracleJdbcConnector | FROM/TO              |
| ftp-connector          | 1.99.7  | org.apache.sqoop.connector.ftp.FtpConnector                | TO                   |
| hdfs-connector         | 1.99.7  | org.apache.sqoop.connector.hdfs.HdfsConnector              | FROM/TO              |
| kafka-connector        | 1.99.7  | org.apache.sqoop.connector.kafka.KafkaConnector            | TO                   |
| sftp-connector         | 1.99.7  | org.apache.sqoop.connector.sftp.SftpConnector              | TO                   |
+------------------------+---------+------------------------------------------------------------+----------------------+
Otherwise, the connection is refused:
sqoop:000> show connector
Exception has occurred during processing command
Exception: java.net.ConnectException Message: Connection refused
3. Create links
As the name suggests, a link is a connection to a source or target store. The link types that can be created were listed by show connector in the previous step.
1) Create the HDFS link
sqoop:000> create link --connector hdfs-connector
Creating link for connector with name hdfs-connector
Please fill following values to create new link object
Name: hdfs-link                                        ## link name; anything you like
HDFS cluster
URI: hdfs://192.168.101.11:9000                        ## HDFS address; see core-site.xml in the Hadoop configuration
Conf directory: /home/hadoop/hadoop-2.6.5/etc/hadoop   ## Hadoop configuration directory
Additional configs::
There are currently 0 values in the map:
entry#                                                 ## just press Enter
New link was successfully created with validation status OK and name hdfs-link
sqoop:000> show link
+-----------+----------------+---------+
| Name      | Connector Name | Enabled |
+-----------+----------------+---------+
| hdfs-link | hdfs-connector | true    |
+-----------+----------------+---------+
2) Create the Oracle link
sqoop:000> create link --connector generic-jdbc-connector
Creating link for connector with name generic-jdbc-connector
Please fill following values to create new link object
Name: oracle-link                                              ## link name; anything you like
Database connection
Driver class: oracle.jdbc.driver.OracleDriver                  ## Oracle JDBC driver
Connection String: jdbc:oracle:thin:@192.168.101.9:1521:orcl   ## Oracle connection string
Username: scott                                                ## Oracle username
Password: *****                                                ## password
Fetch Size:                                                    ## from here on, press Enter to accept the defaults
Connection Properties:
There are currently 0 values in the map:
entry#
SQL Dialect
Identifier enclose:                                            ## SQL identifier delimiter; enter a single space here to avoid errors
New link was successfully created with validation status OK and name oracle-link
sqoop:000> show link
+-------------+------------------------+---------+
| Name        | Connector Name         | Enabled |
+-------------+------------------------+---------+
| oracle-link | generic-jdbc-connector | true    |
| hdfs-link   | hdfs-connector         | true    |
+-------------+------------------------+---------+
4. Create the Oracle-to-HDFS import job
sqoop:000> create job -f oracle-link -t hdfs-link
Creating job for links with from name oracle-link and to name hdfs-link
Please fill following values to create new job object
Name: oracle2hdfs          ## job name
Database source
Schema name: scott         ## the Oracle schema, i.e. the username
Table name: emp            ## the table to import
SQL statement:             ## custom SQL; leave empty to import the whole table
Column names:
There are currently 0 values in the list:
element#
Partition column: empno    ## any suitable column, typically the primary key or a time column, used to split the mapper tasks
Partition column nullable:
Boundary query:
Incremental read
Check column:
Last value:
Target configuration
Override null value:
Null value:
File format:
  0 : TEXT_FILE
  1 : SEQUENCE_FILE
  2 : PARQUET_FILE
Choose: 0                  ## output format on HDFS; TEXT_FILE is fine
Compression codec:
  0 : NONE
  1 : DEFAULT
  2 : DEFLATE
  3 : GZIP
  4 : BZIP2
  5 : LZO
  6 : LZ4
  7 : SNAPPY
  8 : CUSTOM
Choose: 0                  ## no compression by default
Custom codec:
Output directory: /data    ## target HDFS directory; note that for a full import it must be empty
Append mode:               ## append mode; the default is a full import
Throttling resources
Extractors:
Loaders:
Classpath configuration
Extra mapper jars:
There are currently 0 values in the list:
element#
New job was successfully created with validation status OK and name oracle2hdfs
sqoop:000>
5. Create the HDFS-to-Oracle import job
sqoop:000> create job -f hdfs-link -t oracle-link
Creating job for links with from name hdfs-link and to name oracle-link
Please fill following values to create new job object
Name: hdfs2oracle
Input configuration
Input directory: /data     ## the HDFS directory to import from
Override null value:
Null value:
Incremental import
Incremental type:
  0 : NONE
  1 : NEW_FILES
Choose: 0                  ## the default 0 is fine; presumably this works with the Incremental import setting to select the incremental mode
Last imported date:
Database target
Schema name: scott         ## Oracle user
Table name: emp2           ## Oracle table; its structure must be created in advance
Column names:
There are currently 0 values in the list:
element#
Staging table:
Clear stage table:
Throttling resources
Extractors:
Loaders:
Classpath configuration
Extra mapper jars:
There are currently 0 values in the list:
element#
New job was successfully created with validation status OK and name hdfs2oracle
sqoop:000>
sqoop:000> show job
+----+-------------+--------------------------------------+--------------------------------------+---------+
| Id | Name        | From Connector                       | To Connector                         | Enabled |
+----+-------------+--------------------------------------+--------------------------------------+---------+
| 8  | oracle2hdfs | oracle-link (generic-jdbc-connector) | hdfs-link (hdfs-connector)           | true    |
| 9  | hdfs2oracle | hdfs-link (hdfs-connector)           | oracle-link (generic-jdbc-connector) | true    |
+----+-------------+--------------------------------------+--------------------------------------+---------+
sqoop:000>
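The hdfs2oracle job expects the target table emp2 to exist already. One way to prepare it with the same structure as emp but no rows is an empty CTAS; this is a sketch in SQL*Plus against the scott demo schema, so adjust the names to your environment:

```sql
-- clone emp's structure into emp2 without copying any rows
CREATE TABLE emp2 AS SELECT * FROM emp WHERE 1 = 0;
```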
6. Run the jobs
start job -name oracle2hdfs
sqoop:000> start job -name oracle2hdfs
Submission details
Job Name: oracle2hdfs
Server URL: http://192.168.101.12:12000/sqoop/
Created by: hadoop
Creation date: 2017-10-17 17:24:23 CST
Lastly updated by: hadoop
External ID: job_1508227894698_0001
        http://hadoop01:8088/proxy/application_1508227894698_0001/
2017-10-17 17:24:23 CST: BOOTING  - Progress is not available
sqoop:000>
sqoop:000> start job -name hdfs2oracle
Submission details
Job Name: hdfs2oracle
Server URL: http://192.168.101.12:12000/sqoop/
Created by: hadoop
Creation date: 2017-10-17 17:25:38 CST
Lastly updated by: hadoop
External ID: job_1508227894698_0002
        http://hadoop01:8088/proxy/application_1508227894698_0002/
2017-10-17 17:25:38 CST: BOOTING  - Progress is not available
Because a submitted Sqoop2 job runs as a MapReduce job on the Hadoop cluster, it can be watched through the Hadoop web UI at http://namenode:8088/cluster; the application to look for is the External ID shown above.
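The shell itself can also poll or abort a running job, using the status and stop commands listed in the help output earlier; a sketch:

```
sqoop:000> status job -name oracle2hdfs    ## show current progress of the job
sqoop:000> stop job -name oracle2hdfs     ## abort it if something went wrong
```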
7. Verify the results
On HDFS:
[hadoop@hadoop02 bin]$ ./hadoop fs -ls /data/
Found 13 items
-rw-r--r--   2 hadoop supergroup        172 2017-10-17 17:40 /data/129d2e07-1a0f-44ec-bbcd-299985b6972d.txt
-rw-r--r--   2 hadoop supergroup        119 2017-10-17 17:40 /data/1fffd203-a20f-4925-a328-8fe05994b80a.txt
-rw-r--r--   2 hadoop supergroup        119 2017-10-17 17:36 /data/2b8c5cd1-9ec2-4ae7-8683-a4587def6b74.txt
-rw-r--r--   2 hadoop supergroup          0 2017-10-17 17:40 /data/31b026a3-697c-4b11-a49d-4c02adfaf847.txt
-rw-r--r--   2 hadoop supergroup          0 2017-10-17 17:35 /data/3f6b78a3-a4ee-43ea-b4f3-fb46ae1a4ce0.txt
-rw-r--r--   2 hadoop supergroup          0 2017-10-17 17:37 /data/7ed1cc68-a1c7-4c98-ac25-c2d9d2ce1a54.txt
-rw-r--r--   2 hadoop supergroup          0 2017-10-17 17:33 /data/8ca051dd-6d80-4663-b062-931683d01026.txt
-rw-r--r--   2 hadoop supergroup          0 2017-10-17 17:40 /data/95cedc62-7355-4408-8361-cbf875645e94.txt
-rw-r--r--   2 hadoop supergroup         54 2017-10-17 17:37 /data/9e59a5b9-90d3-4b04-aa59-d20302e4ee2b.txt
-rw-r--r--   2 hadoop supergroup          0 2017-10-17 17:34 /data/b3229660-0409-4d33-9b2f-df9cdb2549ec.txt
-rw-r--r--   2 hadoop supergroup        166 2017-10-17 17:40 /data/b59fa170-7c5e-4b1e-8ca3-083bb09a9544.txt
-rw-r--r--   2 hadoop supergroup         57 2017-10-17 17:40 /data/cf92a71a-8528-4842-8a88-c4b4ea4b96e2.txt
-rw-r--r--   2 hadoop supergroup        114 2017-10-17 17:36 /data/dda07b7d-01a8-420d-964b-d2a5bee35168.txt
[hadoop@hadoop02 bin]$ ./hadoop fs -cat /data/dda07b7d-01a8-420d-964b-d2a5bee35168.txt
7782,'CLARK','MANAGER',7839,'1981-06-09',2450.00,NULL,10
7788,'SCOTT','ANALYST',7566,'1987-04-19',3000.00,NULL,20
[hadoop@hadoop02 bin]$
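Each mapper writes its own part file, which is why /data holds many small .txt files, several of them empty (mappers whose empno range matched no rows). The rows themselves are plain comma-separated text with strings in single quotes, so ordinary shell tools can sanity-check an export. The snippet below works on a local copy of the two rows shown above; the path /tmp/emp_sample.txt is made up for the example, and on the cluster you would pipe hadoop fs -cat /data/*.txt instead:

```shell
# two rows copied from the export above, saved locally
cat > /tmp/emp_sample.txt <<'EOF'
7782,'CLARK','MANAGER',7839,'1981-06-09',2450.00,NULL,10
7788,'SCOTT','ANALYST',7566,'1987-04-19',3000.00,NULL,20
EOF

# row count (on the cluster: hadoop fs -cat /data/*.txt | wc -l)
wc -l < /tmp/emp_sample.txt

# pull the ENAME column (field 2) with the single quotes stripped
cut -d',' -f2 /tmp/emp_sample.txt | tr -d "'"   # prints CLARK then SCOTT
```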
In Oracle, query the emp2 table to confirm that the rows arrived.
Postscript:
A number of errors came up along the way; they are recorded below.
-- Error 1
[hadoop@hadoop01 sqoop-1.99.7-bin-hadoop200]$ ./bin/sqoop.sh server start
Setting conf dir: ./bin/../conf
Sqoop home directory: /home/hadoop/sqoop-1.99.7-bin-hadoop200
Starting the Sqoop2 server...
1 [main] INFO org.apache.sqoop.core.SqoopServer - Initializing Sqoop server.
20 [main] INFO org.apache.sqoop.core.PropertiesConfigurationProvider - Starting config file poller thread
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/conf/Configuration
at org.apache.sqoop.security.authentication.SimpleAuthenticationHandler.secureLogin(SimpleAuthenticationHandler.java:36)
at org.apache.sqoop.security.AuthenticationManager.initialize(AuthenticationManager.java:98)
at org.apache.sqoop.core.SqoopServer.initialize(SqoopServer.java:57)
at org.apache.sqoop.server.SqoopJettyServer.<init>(SqoopJettyServer.java:67)
at org.apache.sqoop.server.SqoopJettyServer.main(SqoopJettyServer.java:177)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.conf.Configuration
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 5 more
Cause: Sqoop did not find the Hadoop jars correctly, because the four environment variables HADOOP_COMMON_HOME, HADOOP_HDFS_HOME, HADOOP_MAPRED_HOME and HADOOP_YARN_HOME had already been set when Hadoop was installed.
Fix:
Edit bin/sqoop.sh.
Comment out the following four lines:
HADOOP_COMMON_HOME=${HADOOP_COMMON_HOME:-${HADOOP_HOME}/share/hadoop/common}
HADOOP_HDFS_HOME=${HADOOP_HDFS_HOME:-${HADOOP_HOME}/share/hadoop/hdfs}
HADOOP_MAPRED_HOME=${HADOOP_MAPRED_HOME:-${HADOOP_HOME}/share/hadoop/mapreduce}
HADOOP_YARN_HOME=${HADOOP_YARN_HOME:-${HADOOP_HOME}/share/hadoop/yarn}
and add these four lines instead (${HADOOP_HOME} can also be replaced with an absolute path):
HADOOP_COMMON_HOME=${HADOOP_HOME}/share/hadoop/common
HADOOP_HDFS_HOME=${HADOOP_HOME}/share/hadoop/hdfs
HADOOP_MAPRED_HOME=${HADOOP_HOME}/share/hadoop/mapreduce
HADOOP_YARN_HOME=${HADOOP_HOME}/share/hadoop/yarn
-- Error 2
GENERIC_JDBC_CONNECTOR_0025:No primary key - Please specify partition column.
Cause: the Partition column parameter was left empty when creating a relational-database-to-HDFS job.
Fix: specify a column; the primary key or a time column is a common choice, so that the mapper tasks can be split.
Note: Sqoop divides the mapper workload on this column, so if its values are skewed, a few mappers end up overloaded while the others have no data to process. Try to pick a column whose values are evenly distributed as the Partition column.
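A quick way to judge a candidate partition column is to look at its spread in the source database before creating the job; a sketch against the scott.emp table used above:

```sql
-- Sqoop splits the [min, max] range of the partition column evenly between
-- mappers, so a wide range with clustered values means skewed mappers
SELECT MIN(empno), MAX(empno), COUNT(*), COUNT(DISTINCT empno) FROM emp;
```

If the distinct count is far below the row count, or the values bunch up in a small part of the range, the column is a poor choice.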
-- Error 3
GENERIC_HDFS_CONNECTOR_0007:Invalid input/output directory - Unexpected exception
GENERIC_HDFS_CONNECTOR_0007:Invalid input/output directory - Output directory is not empty
Cause: the HDFS output directory must be empty; when importing data from a relational database into HDFS, the specified HDFS directory has to be empty.
Fix: create a new HDFS directory, or empty the old one.
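A sketch of the second option, clearing the old output from a Hadoop client before rerunning the job (destructive, so only use it on a directory that contains nothing but a previous export):

```
# remove the previous export, then recreate the empty target directory
bin/hadoop fs -rm -r /data
bin/hadoop fs -mkdir /data
```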
-- Error 4
Exception: org.apache.sqoop.common.SqoopException Message: DRIVER_0002:Given job is already running - Job with name hdfs2orcl_3
Cause:
Hadoop's JobHistory service was not started.
Fix:
Configure mapred-site.xml (on the NameNode is sufficient):
<property>
<name>mapreduce.jobhistory.address</name>
<value>192.168.101.11:10020</value>
</property>
then start the JobHistory service:
sbin/mr-jobhistory-daemon.sh start historyserver
-- Error 5 (when creating a generic-jdbc-connector link)
Exception: org.apache.sqoop.common.SqoopException Message: GENERIC_JDBC_CONNECTOR_0016:Can't fetch schema -
Identifier enclose specifies the delimiter around SQL identifiers. Some databases quote their identifiers, and this property defaults to a double quote. With Oracle, if you entered the schema and table name in lowercase as scott and emp, the generated SQL becomes SELECT * FROM "scott"."emp", which fails. To avoid this, enter a single space at the Identifier enclose prompt.