Sqoop Summary
Source: Internet | Editor: 程序博客网 | Date: 2024/06/08 02:05
An Unexpected Encounter
Recently, for a data analysis task, I needed to import data from PostgreSQL into Hive, process it with Hadoop, and then write the results back to PostgreSQL. That is how I ran into Sqoop.
What Is It
The Sqoop project started in 2009, originally as a third-party module for Hadoop. To let users deploy it quickly and developers iterate faster, Sqoop later became an independent Apache project. Since virtually every enterprise keeps its important data in relational databases, a bridge between those databases and Hadoop is essential.
What It Solves
Sqoop is an open-source tool for moving data between Hadoop (Hive, HBase) and traditional databases (MySQL, PostgreSQL, and some NoSQL stores). It can extract rows from a relational database (e.g. MySQL, Oracle, PostgreSQL) via MapReduce jobs and load them into HDFS, and it can likewise export data from HDFS back into a relational database.
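To make this concrete, here is a minimal sketch of an import. Every name in it (host, database, credentials file path, table, target directory) is invented for illustration. The command is assembled into a variable and echoed so it can be inspected; in real use you would run it directly against your own cluster.

```shell
#!/bin/sh
# Hypothetical minimal import: pull the "orders" table from a
# PostgreSQL database into HDFS with 4 parallel map tasks.
# All connection details below are made-up placeholders.
# The command is only echoed here; remove the variable/echo
# wrapper to actually execute it.
SQOOP_IMPORT="sqoop import \
  --connect jdbc:postgresql://dbhost:5432/shop \
  --username etl_user \
  --password-file /user/etl/.db_password \
  --table orders \
  --target-dir /user/etl/orders \
  -m 4"
echo "$SQOOP_IMPORT"
```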
Use Cases
How to Obtain It
Installation directory:
[root@dev 000000 appcom]$ ls -l
total 1016928
lrwxrwxrwx 1 hadoop hadoop 39 Mar 31 2015 anaconda -> /appcom/AnacondaInstall/anaconda-2.1.0/
drwxr-x--- 3 hadoop hadoop 4096 Mar 31 2015 AnacondaInstall
lrwxrwxrwx 1 root root 33 Sep 29 2014 apache -> /appcom/ApacheInstall/httpd-2.4.4
drwxr-x--- 4 hadoop hadoop 4096 Oct 9 2014 apache-config
drwxr-x--- 3 hadoop hadoop 4096 Sep 23 2013 ApacheInstall
drwxr-x--- 3 hadoop hadoop 4096 Apr 13 2016 applog
drwxrwxr-x 138 hadoop hadoop 4096 Dec 30 16:57 apps
drwxr-x--- 3 hadoop hadoop 4096 May 25 2016 apps-data
drwxr-x--- 2 hadoop hadoop 4096 Sep 26 2014 BakHadoop
drwxr-x--- 3 hadoop hadoop 4096 Apr 18 2014 DrbdInstall
lrwxrwxrwx 1 root root 36 Sep 29 2014 ganglia -> /appcom/GangliaInstall/ganglia-3.1.9
drwxr-x--- 2 hadoop hadoop 4096 Dec 23 2015 ganglia-config
drwxr-x--- 5 hadoop hadoop 4096 Sep 13 2013 GangliaInstall
lrwxrwxrwx 1 hadoop hadoop 34 Sep 26 2014 hadoop -> /appcom/HadoopInstall/hadoop-1.2.1
drwxr-x--- 4 hadoop hadoop 4096 Dec 8 19:04 hadoop-config
drwxr-x--- 3 hadoop hadoop 4096 Jan 21 2015 HadoopInstall
lrwxrwxrwx 1 hadoop hadoop 34 Sep 28 2014 hbase -> /appcom/HbaseInstall/hbase-0.94.11
drwxr-x--- 2 hadoop hadoop 4096 Nov 18 2015 hbase-config
drwxr-x--- 3 hadoop hadoop 4096 Nov 5 2013 HbaseInstall
drwxr-x--- 2 hadoop hadoop 4096 Oct 13 2014 hcat
lrwxrwxrwx 1 hadoop hadoop 31 Oct 16 2014 hive -> /appcom/HiveInstall/hive-0.13.1
drwxr-x--- 2 hadoop hadoop 4096 Dec 8 19:00 hive-config
-rw-r----- 1 hadoop hadoop 15001404 Aug 23 15:09 hive-exec-0.13.1.jar
-rw-r--r-- 1 hadoop hadoop 14996846 Aug 22 18:08 hive-exec-0.13.1-zhangguancheng-20160823.jar
drwxr-x--- 3 hadoop hadoop 4096 Dec 6 08:44 HiveInstall
drwxr-x--- 3 hadoop hadoop 4096 Nov 26 2015 JavaInstall
lrwxrwxrwx 1 hadoop hadoop 31 Nov 26 2015 jdk -> /appcom/JavaInstall/jdk1.7.0_71
lrwxrwxrwx 1 hadoop hadoop 31 Dec 9 2015 jdk7 -> /appcom/JavaInstall/jdk1.7.0_71
drwxrwxrwx 4 hadoop hadoop 4096 Dec 16 16:07 logs
drwx------ 2 hadoop hadoop 16384 Sep 24 2014 lost+found
lrwxrwxrwx 1 root root 27 Dec 13 15:11 mysql -> /appcom/MysqlInstall2/mysql
lrwxrwxrwx 1 mysql mysql 59 Oct 10 2014 mysql_bak_djm_20161213 -> /appcom/MysqlInstall/mysql-5.6.10-linux-glibc2.5-x86_64-new
drwxr-x--- 2 mysql mysql 4096 Sep 18 15:15 mysql-config
drwxr-x--- 5 mysql mysql 4096 Jul 25 11:01 MysqlInstall
drwxr-x--- 3 root root 4096 Dec 13 15:10 MysqlInstall2
-rw-r----- 1 spark spark 1010747548 Jul 26 17:48 mysql.tar.gz
lrwxrwxrwx 1 hadoop hadoop 25 Oct 16 2014 odpp -> /appcom/OptimInstall/odpp
lrwxrwxrwx 1 oozie oozie 32 Oct 9 2014 oozie -> /appcom/OozieInstall/oozie-3.3.2
drwxr-xr-x 19 oozie oozie 4096 Apr 12 2016 oozie-4.2.0
drwxr-x--- 4 oozie oozie 4096 Dec 19 13:54 oozie-config
drwxr-x--- 3 oozie oozie 4096 Jul 19 17:11 OozieInstall
lrwxrwxrwx 1 hadoop hadoop 25 Oct 16 2014 optim -> /appcom/OptimInstall/odpp
drwxr-x--- 3 hadoop hadoop 4096 May 6 2014 OptimInstall
drwxr-xr-x 7 hadoop hadoop 4096 Jan 14 2014 orahdfs-2.3.0
lrwxrwxrwx 1 storm storm 46 Oct 10 2014 rocketmq -> /appcom/RocketMQInstall/alibaba-rocketmq_3.0.8
drwxr-x--- 2 storm storm 4096 Mar 6 2014 rocketmq-config
drwxr-x--- 3 storm storm 4096 Sep 26 2014 RocketMQInstall
drwxr-x--- 15 hadoop hadoop 4096 Oct 24 09:44 softwares
lrwxrwxrwx 1 hadoop hadoop 34 Nov 26 2015 sonar -> /appcom/SonarInstall/sonarqube-5.2
drwxr-x--- 4 hadoop hadoop 4096 Nov 26 2015 SonarInstall
lrwxrwxrwx 1 hadoop hadoop 37 Nov 26 2015 sonar-runner -> /appcom/SonarInstall/sonar-runner-2.4
drwxr-x--- 2 hadoop hadoop 4096 Jun 2 2016 spark-on-oozie
lrwxrwxrwx 1 sqoop sqoop 50 Oct 13 2014 sqoop -> /appcom/SqoopInstall/sqoop-1.4.4.bin__hadoop-1.0.0
drwxr-x--- 2 sqoop sqoop 4096 Jun 17 2014 sqoop-config
drwxr-x--- 3 sqoop sqoop 4096 Sep 13 2013 SqoopInstall
lrwxrwxrwx 1 storm storm 50 Oct 10 2014 storm -> /appcom/StormInstall/apache-storm-0.9.1-incubating
drwxr-x--- 2 storm storm 4096 Oct 13 2014 storm-config
drwxr-x--- 4 storm storm 4096 Oct 10 2014 StormInstall
-rw-r----- 1 hadoop hadoop 362790 Jul 26 19:27 test-file-sink.out
lrwxrwxrwx 1 oozie oozie 36 Oct 9 2014 tomcat -> /appcom/TomcatInstall/tomcat-7.0.37/
lrwxrwxrwx 1 hadoop hadoop 36 Oct 10 2014 tomcat2 -> /appcom/TomcatInstall2/tomcat-7.0.37
lrwxrwxrwx 1 hadoop hadoop 36 May 28 2015 tomcat3 -> /appcom/TomcatInstall3/tomcat-7.0.37
lrwxrwxrwx 1 hadoop hadoop 43 Dec 9 2015 tomcat4 -> /appcom/TomcatInstall4/apache-tomcat-7.0.47
lrwxrwxrwx 1 hadoop hadoop 43 Jan 7 2016 tomcat5 -> /appcom/TomcatInstall5/apache-tomcat-7.0.47
lrwxrwxrwx 1 hadoop hadoop 44 Jan 29 2016 tomcat6 -> /appcom/TomcatInstall6/apache-tomcat-7.0.47/
drwxr-x--- 2 root root 4096 Dec 13 15:04 TomcatInstal_djm
drwxr-x--- 4 oozie oozie 4096 Jul 19 15:39 TomcatInstall
drwxr-x--- 3 hadoop hadoop 4096 Nov 14 2014 TomcatInstall2
drwxr-x--- 4 hadoop hadoop 4096 Dec 9 2015 TomcatInstall3
drwxr-x--- 3 hadoop hadoop 4096 Jan 14 2016 TomcatInstall4
drwxr-x--- 3 hadoop hadoop 4096 Jan 7 2016 TomcatInstall5
drwxr-x--- 3 hadoop hadoop 4096 Jan 22 2016 TomcatInstall6
drwx------ 3 root root 4096 Oct 12 10:26 yuqing_company_link_score2
drwxr-x--- 2 hadoop hadoop 4096 Mar 28 2016 zabbix-scripts
lrwxrwxrwx 1 hadoop hadoop 40 Sep 28 2014 zookeeper -> /appcom/ZookeeperInstall/zookeeper-3.4.5
drwxr-x--- 2 hadoop hadoop 4096 Feb 4 2016 zookeeper-config
drwxr-x--- 3 hadoop hadoop 4096 Sep 26 2014 ZookeeperInstall
How to Deploy
Basic Usage
sqoop help:
[root@dev-00000 ~]$ sqoop help
usage: sqoop COMMAND [ARGS]
Available commands:
codegen Generate code to interact with database records
create-hive-table Import a table definition into Hive
eval Evaluate a SQL statement and display the results
export Export an HDFS directory to a database table
help List available commands
import Import a table from a database to HDFS
import-all-tables Import tables from a database to HDFS
job Work with saved jobs
list-databases List available databases on a server
list-tables List available tables in a database
merge Merge results of incremental imports
metastore Run a standalone Sqoop metastore
version Display version information
See 'sqoop help COMMAND' for information on a specific command.
1. Creating a Hive table from an existing PostgreSQL table: the create-hive-table command.
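A hedged sketch of this step, with all connection details and table names hypothetical. Note that create-hive-table only copies the table definition into the Hive metastore; the rows still need a separate import afterwards.

```shell
#!/bin/sh
# Hypothetical: mirror the PostgreSQL table "orders" as a Hive
# table definition (schema only, no data). Placeholders throughout;
# the command is echoed for inspection rather than executed.
CREATE_HIVE_TABLE="sqoop create-hive-table \
  --connect jdbc:postgresql://dbhost:5432/shop \
  --username etl_user -P \
  --table orders \
  --hive-table shop_orders \
  --fields-terminated-by '\t'"
echo "$CREATE_HIVE_TABLE"
```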
2. Import:
[root@dev-l0000 bin]$ sqoop help import
usage: sqoop import [GENERIC-ARGS] [TOOL-ARGS]
Common arguments:
--connect <jdbc-uri> Specify JDBC connect
string
--connection-manager <class-name> Specify connection manager
class name
--connection-param-file <properties-file> Specify connection
parameters file
--driver <class-name> Manually specify JDBC
driver class to use
--hadoop-home <hdir> Override
$HADOOP_MAPRED_HOME_ARG
--hadoop-mapred-home <dir> Override
$HADOOP_MAPRED_HOME_ARG
--help Print usage instructions
-P Read password from console
--password <password> Set authentication
password
--password-file <password-file> Set authentication
password file path
--username <username> Set authentication
username
--verbose Print more information
while working
Import control arguments:
--append Imports data
in append
mode
--as-avrodatafile Imports data
to Avro data
files
--as-sequencefile Imports data
to
SequenceFile
s
--as-textfile Imports data
as plain
text
(default)
--boundary-query <statement> Set boundary
query for
retrieving
max and min
value of the
primary key
--columns <col,col,col...> Columns to
import from
table
--compression-codec <codec> Compression
codec to use
for import
--delete-target-dir Imports data
in delete
mode
--direct Use direct
import fast
path
--direct-split-size <n> Split the
input stream
every 'n'
bytes when
importing in
direct mode
-e,--query <statement> Import
results of
SQL
'statement'
--fetch-size <n> Set number
'n' of rows
to fetch
from the
database
when more
rows are
needed
--inline-lob-limit <n> Set the
maximum size
for an
inline LOB
-m,--num-mappers <n> Use 'n' map
tasks to
import in
parallel
--mapreduce-job-name <name> Set name for
generated
mapreduce
job
--split-by <column-name> Column of
the table
used to
split work
units
--table <table-name> Table to
read
--target-dir <dir> HDFS plain
table
destination
--validate Validate the
copy using
the
configured
validator
--validation-failurehandler <validation-failurehandler> Fully
qualified
class name
for
ValidationFa
ilureHandler
--validation-threshold <validation-threshold> Fully
qualified
class name
for
ValidationTh
reshold
--validator <validator> Fully
qualified
class name
for the
Validator
--warehouse-dir <dir> HDFS parent
for table
destination
--where <where clause> WHERE clause
to use
during
import
-z,--compress Enable
compression
Incremental import arguments:
--check-column <column> Source column to check for incremental
change
--incremental <import-type> Define an incremental import of type
'append' or 'lastmodified'
--last-value <value> Last imported value in the incremental
check column
Output line formatting arguments:
--enclosed-by <char> Sets a required field enclosing
character
--escaped-by <char> Sets the escape character
--fields-terminated-by <char> Sets the field separator character
--lines-terminated-by <char> Sets the end-of-line character
--mysql-delimiters Uses MySQL's default delimiter set:
fields: , lines: \n escaped-by: \
optionally-enclosed-by: '
--optionally-enclosed-by <char> Sets a field enclosing character
Input parsing arguments:
--input-enclosed-by <char> Sets a required field encloser
--input-escaped-by <char> Sets the input escape
character
--input-fields-terminated-by <char> Sets the input field separator
--input-lines-terminated-by <char> Sets the input end-of-line
char
--input-optionally-enclosed-by <char> Sets a field enclosing
character
Hive arguments:
--create-hive-table Fail if the target hive
table exists
--hive-database <database-name> Sets the database name to
use when importing to hive
--hive-delims-replacement <arg> Replace Hive record \0x01
and row delimiters (\n\r)
from imported string fields
with user-defined string
--hive-drop-import-delims Drop Hive record \0x01 and
row delimiters (\n\r) from
imported string fields
--hive-home <dir> Override $HIVE_HOME
--hive-import Import tables into Hive
(Uses Hive's default
delimiters if none are
set.)
--hive-overwrite Overwrite existing data in
the Hive table
--hive-partition-key <partition-key> Sets the partition key to
use when importing to hive
--hive-partition-value <partition-value> Sets the partition value to
use when importing to hive
--hive-table <table-name> Sets the table name to use
when importing to hive
--map-column-hive <arg> Override mapping for
specific column to hive
types.
HBase arguments:
--column-family <family> Sets the target column family for the
import
--hbase-create-table If specified, create missing HBase tables
--hbase-row-key <col> Specifies which input column to use as the
row key
--hbase-table <table> Import to <table> in HBase
HCatalog arguments:
--hcatalog-database <arg> HCatalog database name
--hcatalog-home <hdir> Override $HCAT_HOME
--hcatalog-table <arg> HCatalog table name
--hive-home <dir> Override $HIVE_HOME
--hive-partition-key <partition-key> Sets the partition key to
use when importing to hive
--hive-partition-value <partition-value> Sets the partition value to
use when importing to hive
--map-column-hive <arg> Override mapping for
specific column to hive
types.
HCatalog import specific options:
--create-hcatalog-table Create HCatalog before import
--hcatalog-storage-stanza <arg> HCatalog storage stanza for table
creation
Code generation arguments:
--bindir <dir> Output directory for compiled
objects
--class-name <name> Sets the generated class name.
This overrides --package-name.
When combined with --jar-file,
sets the input class.
--input-null-non-string <null-str> Input null non-string
representation
--input-null-string <null-str> Input null string representation
--jar-file <file> Disable code generation; use
specified jar
--map-column-java <arg> Override mapping for specific
columns to java types
--null-non-string <null-str> Null non-string representation
--null-string <null-str> Null string representation
--outdir <dir> Output directory for generated
code
--package-name <name> Put auto-generated classes in
this package
Generic Hadoop command-line arguments:
(must preceed any tool-specific arguments)
Generic options supported are
-conf <configuration file> specify an application configuration file
-D <property=value> use value for given property
-fs <local|namenode:port> specify a namenode
-jt <local|jobtracker:port> specify a job tracker
-files <comma separated list of files> specify comma separated files to be copied to the map reduce cluster
-libjars <comma separated list of jars> specify comma separated jar files to include in the classpath.
-archives <comma separated list of archives> specify comma separated archives to be unarchived on the compute machines.
The general command line syntax is
bin/hadoop command [genericOptions] [commandOptions]
At minimum, you must specify --connect and --table
Arguments to mysqldump and other subprograms may be supplied
after a '--' on the command line.
Import process:
Before the import, Sqoop uses JDBC to examine the source table and retrieve its columns and their SQL types. Each SQL type is mapped to a Java type that holds the field's value inside the MapReduce job; Sqoop's code generator uses this information to create a table-specific class for holding the records extracted from the table.
The MapReduce job that Sqoop launches uses an InputFormat that can read sections of a table from a database via JDBC. Hadoop's DataDrivenDBInputFormat divides the query results among several map tasks using a splitting column: based on the table's metadata, Sqoop chooses a suitable column (usually the primary key), reads its minimum and maximum values, and partitions the rows among the map tasks so each task knows which range to query. After generating the deserialization code and configuring the InputFormat, Sqoop submits the job to the MapReduce cluster. The map tasks run their queries and deserialize the rows of the ResultSet into instances of the generated class; these records are either stored directly in SequenceFiles or converted to delimited text before being written to HDFS.
To cope with large objects, Sqoop stores imported LOB data in separate files in the LobFile format (a LobFile uses a 64-bit address space, so a single record can be extremely large). A LobFile lets a client hold a reference to a record without reading its contents; access happens through an InputStream or Reader. Within a map task, the BlobRef and ClobRef classes cache references to the underlying LobFile, and the data is fetched via blobRef.getDataStream().
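The mechanics above can be steered from the command line. In this hypothetical sketch (all names invented), --split-by picks the splitting column whose min/max values bound the per-mapper queries, and -m sets how many ranges the query is divided into:

```shell
#!/bin/sh
# Hypothetical parallel import: each of the 8 map tasks receives a
# contiguous "id" range derived from MIN(id)/MAX(id), as described
# above. --as-sequencefile stores records via the generated class.
# Echoed for inspection; all names are placeholders.
SQOOP_PARALLEL_IMPORT="sqoop import \
  --connect jdbc:postgresql://dbhost:5432/shop \
  --username etl_user -P \
  --table orders \
  --split-by id \
  -m 8 \
  --as-sequencefile \
  --target-dir /user/etl/orders_seq"
echo "$SQOOP_PARALLEL_IMPORT"
```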
3. Export:
[root@dev-00000 bin]$ sqoop help export
usage: sqoop export [GENERIC-ARGS] [TOOL-ARGS]
Common arguments:
--connect <jdbc-uri> Specify JDBC connect
string
--connection-manager <class-name> Specify connection manager
class name
--connection-param-file <properties-file> Specify connection
parameters file
--driver <class-name> Manually specify JDBC
driver class to use
--hadoop-home <hdir> Override
$HADOOP_MAPRED_HOME_ARG
--hadoop-mapred-home <dir> Override
$HADOOP_MAPRED_HOME_ARG
--help Print usage instructions
-P Read password from console
--password <password> Set authentication
password
--password-file <password-file> Set authentication
password file path
--username <username> Set authentication
username
--verbose Print more information
while working
Export control arguments:
--batch Indicates
underlying
statements
to be
executed in
batch mode
--call <arg> Populate the
table using
this stored
procedure
(one call
per row)
--clear-staging-table Indicates
that any
data in
staging
table can be
deleted
--columns <col,col,col...> Columns to
export to
table
--direct Use direct
export fast
path
--export-dir <dir> HDFS source
path for the
export
-m,--num-mappers <n> Use 'n' map
tasks to
export in
parallel
--mapreduce-job-name <name> Set name for
generated
mapreduce
job
--staging-table <table-name> Intermediate
staging
table
--table <table-name> Table to
populate
--update-key <key> Update
records by
specified
key column
--update-mode <mode> Specifies
how updates
are
performed
when new
rows are
found with
non-matching
keys in
database
--validate Validate the
copy using
the
configured
validator
--validation-failurehandler <validation-failurehandler> Fully
qualified
class name
for
ValidationFa
ilureHandler
--validation-threshold <validation-threshold> Fully
qualified
class name
for
ValidationTh
reshold
--validator <validator> Fully
qualified
class name
for the
Validator
Input parsing arguments:
--input-enclosed-by <char> Sets a required field encloser
--input-escaped-by <char> Sets the input escape
character
--input-fields-terminated-by <char> Sets the input field separator
--input-lines-terminated-by <char> Sets the input end-of-line
char
--input-optionally-enclosed-by <char> Sets a field enclosing
character
Output line formatting arguments:
--enclosed-by <char> Sets a required field enclosing
character
--escaped-by <char> Sets the escape character
--fields-terminated-by <char> Sets the field separator character
--lines-terminated-by <char> Sets the end-of-line character
--mysql-delimiters Uses MySQL's default delimiter set:
fields: , lines: \n escaped-by: \
optionally-enclosed-by: '
--optionally-enclosed-by <char> Sets a field enclosing character
Code generation arguments:
--bindir <dir> Output directory for compiled
objects
--class-name <name> Sets the generated class name.
This overrides --package-name.
When combined with --jar-file,
sets the input class.
--input-null-non-string <null-str> Input null non-string
representation
--input-null-string <null-str> Input null string representation
--jar-file <file> Disable code generation; use
specified jar
--map-column-java <arg> Override mapping for specific
columns to java types
--null-non-string <null-str> Null non-string representation
--null-string <null-str> Null string representation
--outdir <dir> Output directory for generated
code
--package-name <name> Put auto-generated classes in
this package
HCatalog arguments:
--hcatalog-database <arg> HCatalog database name
--hcatalog-home <hdir> Override $HCAT_HOME
--hcatalog-table <arg> HCatalog table name
--hive-home <dir> Override $HIVE_HOME
--hive-partition-key <partition-key> Sets the partition key to
use when importing to hive
--hive-partition-value <partition-value> Sets the partition value to
use when importing to hive
--map-column-hive <arg> Override mapping for
specific column to hive
types.
Generic Hadoop command-line arguments:
(must preceed any tool-specific arguments)
Generic options supported are
-conf <configuration file> specify an application configuration file
-D <property=value> use value for given property
-fs <local|namenode:port> specify a namenode
-jt <local|jobtracker:port> specify a job tracker
-files <comma separated list of files> specify comma separated files to be copied to the map reduce cluster
-libjars <comma separated list of jars> specify comma separated jar files to include in the classpath.
-archives <comma separated list of archives> specify comma separated archives to be unarchived on the compute machines.
The general command line syntax is
bin/hadoop command [genericOptions] [commandOptions]
At minimum, you must specify --connect, --export-dir, and --table
Export process:
Before running an export, Sqoop chooses an export strategy based on the connection string (most often JDBC, generating batched INSERT statements in which each statement inserts multiple records into the target table rather than just one). Sqoop generates a Java class from the target table's definition that can parse records out of the text files and insert values of the appropriate types. It then launches a MapReduce job that reads the source data files from HDFS, parses the records with the generated class, and executes the chosen strategy.
For MySQL, a direct-mode export via mysqlimport is available: each map task spawns a mysqlimport process and communicates with it over a named FIFO on the local filesystem; data flows through the FIFO into mysqlimport and from there into the database.
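A hedged export sketch matching the process above (every name is a placeholder). The optional staging table keeps a partially failed export from leaving the target table half-written: rows land in the staging table first and are moved to the target only when the job succeeds.

```shell
#!/bin/sh
# Hypothetical export of Hive-produced text files back into a
# relational database. --staging-table/--clear-staging-table guard
# against partial failures. Echoed for inspection only.
SQOOP_EXPORT="sqoop export \
  --connect jdbc:postgresql://dbhost:5432/shop \
  --username etl_user -P \
  --table order_stats \
  --staging-table order_stats_stage \
  --clear-staging-table \
  --export-dir /user/hive/warehouse/shop.db/order_stats \
  --input-fields-terminated-by '\t' \
  -m 4"
echo "$SQOOP_EXPORT"
```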
Pitfalls
1. Be careful with the default parallelism: by default, Sqoop assumes the data is evenly distributed across the range of the split key. That works well when the source system generates primary keys from a sequence; with a 10-node cluster, the workload spreads evenly across the 10 servers. But if the split key is alphanumeric and, say, keys starting with "A" are 20 times more common than keys starting with "M", the workload skews heavily toward some servers.
2. The default field delimiter is ",". If field values themselves contain commas, specify a different delimiter.
3. The file format can be chosen. The default text format cannot store columns of type varbinary, so use a binary format instead: SequenceFile (Java only; serialization and deserialization use the auto-generated Java class) or Avro (readable from many languages, e.g. Python). Text format can also confuse NULL values with the literal string "null"; control the NULL representation with --null-string '\\N' --null-non-string '\\N'.
4. A WHERE clause can restrict which records are imported instead of re-importing the full table; a date predicate captures each day's increment.
5. The source table must not be modified while an import into HDFS is running, or the imported data will be inconsistent.
6. Direct mode (--direct) is fast and high-throughput but less general than JDBC mode, and needs database-specific handling. For example, with MySQL as the source, mysqldump does not support large-object data (CLOB, BLOB); such columns must be handled separately through the JDBC path.
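Pitfall 4's daily-increment idea can also use Sqoop's built-in incremental mode instead of a hand-written WHERE clause. A hypothetical sketch (the column name and timestamp are invented); --check-column and --last-value drive the "lastmodified" mode:

```shell
#!/bin/sh
# Hypothetical daily incremental import: only rows whose updated_at
# is newer than the last recorded value are fetched. On success,
# Sqoop reports the new --last-value to use for the next run.
# Echoed for inspection; all names are placeholders.
SQOOP_INCR_IMPORT="sqoop import \
  --connect jdbc:postgresql://dbhost:5432/shop \
  --username etl_user -P \
  --table orders \
  --incremental lastmodified \
  --check-column updated_at \
  --last-value '2016-12-01 00:00:00' \
  --target-dir /user/etl/orders_incr \
  -m 4"
echo "$SQOOP_INCR_IMPORT"
```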
Summary
1. The --hive-import option loads data from the source database straight into Hive.
2. Common arguments:
--connect <jdbc-uri>: JDBC connection string
--connection-manager <class-name>: connection manager class to use
--driver <class-name>: JDBC driver class to use
--hadoop-mapred-home <dir>: override $HADOOP_MAPRED_HOME
--help: print usage help
--password-file: path of a file holding the authentication password
-P: read the password from the console
--password <password>: authentication password
--username <username>: authentication username
--verbose: print verbose runtime information
--connection-param-file <filename>: optional properties file with extra connection parameters
3. Import arguments:
--append: append data to an existing dataset in HDFS
--as-avrodatafile: import data as Avro data files
--as-sequencefile: import data as SequenceFiles
--as-textfile: import data as plain text (the default)
--boundary-query <statement>: boundary query used to create splits (InputSplit)
--columns <col,col,col…>: the set of columns to import from the table
--delete-target-dir: delete the target directory first if it already exists
--direct: use direct import mode (faster)
--direct-split-size <n>: in direct mode, split the input stream every n bytes
--fetch-size <n>: number of rows to fetch from the database at a time
--inline-lob-limit <n>: maximum size of an inline LOB
-m, --num-mappers <n>: import in parallel with n map tasks
-e, --query <statement>: import the results of this query
--split-by <column-name>: column used to split the work units
--table <table-name>: source table to import
--target-dir <dir>: target path in HDFS
--warehouse-dir <dir>: HDFS parent directory for table storage
--where <where clause>: WHERE clause to apply during the import
-z, --compress: enable compression
--compression-codec <c>: Hadoop codec to use (gzip by default)
--null-string <null-string>: replacement string for NULL values in string columns
--null-non-string <null-string>: replacement string for NULL values in non-string columns
4. Export arguments:
--validate <class-name>: enable validation of the copied data (single-table copies only); the validator implementation class can be specified
--validation-threshold <class-name>: class implementing the validation threshold
--direct: use direct export mode (faster)
--export-dir <dir>: HDFS source path for the export
-m, --num-mappers <n>: export in parallel with n map tasks
--table <table-name>: target table to populate
--call <stored-proc-name>: stored procedure to call for each record
--update-key <col-name>: column(s) used to identify rows to update; separate multiple columns with commas
--update-mode <mode>: update strategy, either updateonly (the default) or allowinsert
--input-null-string <null-string>: string to interpret as NULL in string columns
--input-null-non-string <null-string>: string to interpret as NULL in non-string columns
--staging-table <staging-table-name>: intermediate table where data is staged before being moved into the target table
--clear-staging-table: clear any data left over in the staging table
--batch: execute the underlying statements in batch mode
5. Escape sequences can specify non-printing characters as Sqoop's field and record delimiters, e.g. --input-fields-terminated-by '\0001'.
6. A staging table can hold the intermediate results of an export; clear it with --clear-staging-table.
7. Sqoop demo
dbExport.sh
#!/bin/sh
queueName=$1
mysql_connection=$2
mysql_username=$3
mysql_password=$4
sqoop export -D mapred.job.queue.name=${queueName} \
  --connect ${mysql_connection} \
  --username ${mysql_username} \
  --password ${mysql_password} \
  --call <stored_procedure_name> \
  --input-fields-terminated-by '\t' \
  -m 1 \
  --export-dir /user/hive/warehouse/<hive_db>/<table>
dbImport.sh
#!/bin/sh
queueName=$1
mysql_connection=$2
mysql_username=$3
mysql_password=$4
sqoop import -D mapred.job.queue.name=${queueName} \
  --connect ${mysql_connection} \
  --username ${mysql_username} \
  --password ${mysql_password} \
  --table <source_table> \
  -m 1 \
  --hive-table <hive_table> \
  --hive-drop-import-delims \
  --fetch-size 200 \
  --hive-overwrite \
  --null-string '\\N' \
  --null-non-string '\\N' \
  --fields-terminated-by '\t' \
  --hive-import
doJob.sh
#!/bin/sh
hadoop fs -get /dependency/anaconda anaconda
tar -xzf anaconda/anaconda.tar.gz -C anaconda
hadoop fs -get /apps/demo/shell/demo.zip demo.zip
unzip -q demo.zip
cd demo
hive -e "select * from <hive_table>;" > import/demo_1.txt
sed -i '1d' import/demo_cor.txt
hive -e "select * from <hive_table>;" > import/spy_2.txt
sed -i '1d' import/spy_news.txt
../anaconda/bin/python demo.py > logs/demo_runtime_info.txt
if [ -f "export/news_meta_info.txt" ]; then
  hive -e "LOAD DATA LOCAL INPATH 'export/news_meta_info.txt' OVERWRITE INTO TABLE <hive_table>"
else
  hive -e "use <hive_db>; truncate table <table>;"
fi
date_key=`date "+%Y%m%d%H%M%S"`
zip -rq logs_${date_key}.zip logs
hadoop fs -put logs_${date_key}.zip /logs/demo/
coordinator.xml
<coordinator-app name="demo" frequency="${coord:days(1)}" start="${job_start}" end="${job_end}"
    timezone="GMT+08:00" xmlns="uri:oozie:coordinator:0.2">
  <controls>
    <timeout>2</timeout>
    <concurrency>1</concurrency>
  </controls>
  <datasets>
    <dataset name="bankLogDataset" frequency="${coord:days(1)}"
        initial-instance="${datesets_initial}" timezone="GMT+08:00">
      <uri-template>${hdfs_sqoop_tmp}/${YEAR}${MONTH}${DAY}</uri-template>
    </dataset>
  </datasets>
  <output-events>
    <data-out name="logData" dataset="bankLogDataset">
      <instance>${coord:current(-1)}</instance>
    </data-out>
  </output-events>
  <action>
    <workflow>
      <app-path>${application_path}</app-path>
      <configuration>
        <property>
          <name>sqoop_import_path</name>
          <value>${coord:dataOut('logData')}</value>
        </property>
        <property>
          <name>nominalformateDate</name>
          <value>${coord:formatTime(coord:dateOffset(coord:nominalTime(), -1, 'DAY'), "yyyyMMdd")}</value>
        </property>
      </configuration>
    </workflow>
  </action>
</coordinator-app>
demo.properties
################### ods import properties file ###################
# sqoop import config
hdfs_sqoop_tmp=${namenode_address}/demo
hdfs_address_prefix=/apps/user/***
# coordinator config
job_start=2015-12-18T03:00+0800
job_end=2099-12-31T03:00+0800
datesets_initial=2000-01-01T03:00+0800
# shell
dbImport=dbImport.sh
dbExport=dbExport.sh
doJob=doJob.sh
application_path=${hdfs_address_prefix}/apps/demo
oozie.coord.application.path=${application_path}
runJob.sh
#!/bin/sh
j_username=`grep j_username /server.env|awk 'BEGIN{FS=":::::"}{print $NF}'`
j_password=`grep j_password /server.env|awk 'BEGIN{FS=":::::"}{print $NF}'`
oozie_url=`grep oozie_url /server.env|awk 'BEGIN{FS=":::::"}{print $NF}'`
doas_user=`grep doas_user /server.env|awk 'BEGIN{FS=":::::"}{print $NF}'`
namenode_address=`grep namenode_address /server.env|awk 'BEGIN{FS=":::::"}{print $NF}'`
jobtracker_address=`grep jobtracker_address /server.env|awk 'BEGIN{FS=":::::"}{print $NF}'`
mapred_job_queue_name=`grep mapred_job_queue_name /server.env|awk 'BEGIN{FS=":::::"}{print $NF}'`
mysql_connection=`grep mysql_connection /server.env|awk 'BEGIN{FS=":::::"}{print $NF}'`
mysql_username=`grep mysql_username /server.env|awk 'BEGIN{FS=":::::"}{print $NF}'`
mysql_password=`grep mysql_password /server.env|awk 'BEGIN{FS=":::::"}{print $NF}'`
hadoop fs -rmr /demo;
hadoop fs -rmr /apps/demo;
hadoop fs -put /appcom/apps/demo/ /apps/apps/demo/;
oozie -Dheader:j_username=$j_username -Dheader:j_password=$j_password job -Dnamenode_address=$namenode_address -Djobtracker_address=$jobtracker_address -Dmapred_job_queue_name=$mapred_job_queue_name -Dmysql_connection=$mysql_connection -Dmysql_username=$mysql_username -Dmysql_password=$mysql_password --oozie $oozie_url -config /appcom/apps/hduser0532/gbd_dm_qydb_safe/apps/demo/demo.properties -auth ldap -run -doas $doas_user
workflow.xml
<workflow-app xmlns="uri:oozie:workflow:0.2" name="demo">
<start to="pre_clean_path_node"/>
<action name="pre_clean_path_node">
<fs>
<delete path="${sqoop_import_path}"/>
</fs>
<ok to="dbImport"/>
<error to="fail"/>
</action>
<action name="dbImport">
<shell xmlns="uri:oozie:shell-action:0.2">
<job-tracker>${jobtracker_address}</job-tracker>
<name-node>${namenode_address}</name-node>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>${mapred_job_queue_name}</value>
</property>
</configuration>
<exec>${dbImport}</exec>
<argument>${mapred_job_queue_name}</argument>
<argument>${mysql_connection}</argument>
<argument>${mysql_username}</argument>
<argument>${mysql_password}</argument>
<file>shell/${dbImport}</file>
</shell>
<ok to="doJob"/>
<error to="fail"/>
</action>
<action name="doJob">
<shell xmlns="uri:oozie:shell-action:0.2">
<job-tracker>${jobtracker_address}</job-tracker>
<name-node>${namenode_address}</name-node>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>${mapred_job_queue_name}</value>
</property>
</configuration>
<exec>${doJob}</exec>
<file>shell/${doJob}</file>
</shell>
<ok to="dbExport"/>
<error to="fail"/>
</action>
<action name="dbExport">
<shell xmlns="uri:oozie:shell-action:0.2">
<job-tracker>${jobtracker_address}</job-tracker>
<name-node>${namenode_address}</name-node>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>${mapred_job_queue_name}</value>
</property>
</configuration>
<exec>${dbExport}</exec>
<argument>${mapred_job_queue_name}</argument>
<argument>${mysql_connection}</argument>
<argument>${mysql_username}</argument>
<argument>${mysql_password}</argument>
<file>shell/${dbExport}</file>
</shell>
<ok to="clean_path_node"/>
<error to="fail"/>
</action>
<action name="clean_path_node">
<fs>
<delete path="${sqoop_import_path}"/>
</fs>
<ok to="end"/>
<error to="fail"/>
</action>
<kill name="fail">
<message>ERI demo app failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name="end"/>
</workflow-app>