Sqoop 1.4.4: Importing Data from MySQL into Hive Tables
Guiding questions:
1. What do --hive-import and --hive-overwrite do?
2. How do you handle string fields in a relational database that contain delimiter characters?
3. With --hive-import, what is the default field delimiter? What is the default record delimiter?
4. How are NULL values handled? What do --null-string and --null-non-string do?
5. What does --hive-partition-key do? What type must the partition value be?
6. What does --compression-codec do?
7. When Sqoop imports relational data into Hive, which database does it import into by default, and what is the default path? How do you target a specific database?
I. How It Works and Key Parameters
1. The main job of sqoop import is to load data into files on HDFS. But if you have a Hive metastore associated with your Hadoop cluster, Sqoop can also import the data into Hive, generating a CREATE TABLE <table name> statement to define the data's layout in Hive. Importing data into Hive is simple: add the --hive-import option to your Sqoop command line.
2. If the Hive table already exists, you can use --hive-overwrite to indicate that the existing table in Hive must be replaced. After your data is imported into HDFS (or if that step is omitted), Sqoop generates a Hive script containing a CREATE TABLE operation that defines the columns using Hive data types, and a LOAD DATA INPATH statement that moves the data files into Hive's warehouse directory.
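To make that two-statement script concrete, here is an illustrative Python sketch of the kind of Hive script Sqoop generates. This is not Sqoop's actual code generator; the table name, column list, and HDFS path below are hypothetical examples.

```python
# Sketch of the Hive script Sqoop generates for --hive-import:
# a CREATE TABLE using Hive types, followed by LOAD DATA INPATH.
# Table/column names and the HDFS path are hypothetical.
def build_hive_script(table, columns, hdfs_dir, overwrite=False):
    cols = ",\n  ".join(f"`{name}` {htype}" for name, htype in columns)
    ddl = (f"CREATE TABLE IF NOT EXISTS `{table}` (\n  {cols})\n"
           "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\001' "
           "LINES TERMINATED BY '\\012' STORED AS TEXTFILE;")
    load = (f"LOAD DATA INPATH '{hdfs_dir}' "
            f"{'OVERWRITE ' if overwrite else ''}INTO TABLE `{table}`;")
    return ddl + "\n" + load

print(build_hive_script("hiveusers",
                        [("id", "INT"), ("username", "STRING")],
                        "hdfs://namenode/user/hadoopUser/users",
                        overwrite=True))
```

With --hive-overwrite, the LOAD DATA statement carries the OVERWRITE keyword, which is what replaces the existing table contents.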
3. Although Hive supports escape characters, it does not handle escaping of newline characters. Importing with Sqoop will run into trouble when your rows contain string fields that embed Hive's default row delimiters (\n or \r) or column delimiter (\01). You can use --hive-drop-import-delims to drop those characters on import, producing Hive-compatible text, or --hive-delims-replacement to replace them with a user-defined string. These options should only be used when you rely on Hive's default delimiters; they should not be used if different delimiters have been specified.
If you use --hive-import without specifying delimiters, the defaults are ^A (\01) as the field delimiter and \n as the record delimiter.
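The effect of these two options on a string field can be sketched in a few lines of Python. This mimics the assumed behavior for illustration; it is not Sqoop's implementation.

```python
# Sketch of what --hive-drop-import-delims / --hive-delims-replacement
# do to a string field (assumed behavior, for illustration only).
HIVE_DELIMS = "\n\r\x01"   # row delimiters \n, \r and field delimiter \x01 (^A)

def drop_delims(value: str) -> str:
    """Mimic --hive-drop-import-delims: remove Hive's default delimiters."""
    return "".join(ch for ch in value if ch not in HIVE_DELIMS)

def replace_delims(value: str, replacement: str) -> str:
    """Mimic --hive-delims-replacement: swap them for a user string."""
    return "".join(replacement if ch in HIVE_DELIMS else ch for ch in value)

dirty = "line one\nline two\x01tail"
print(drop_delims(dirty))          # -> line oneline twotail
print(replace_delims(dirty, " "))  # -> line one line two tail
```

Without one of these options, the embedded \n would be read by Hive as a record boundary and the embedded \01 as a field boundary, silently corrupting the row.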
4. By default, Sqoop imports NULL values as the string "null", while Hive uses \N to represent NULL. As a result, NULL-related predicates in Hive queries (such as IS NULL) will not work. Use --null-string and --null-non-string during import (or --input-null-string and --input-null-non-string during export) to preserve NULL values. Because Sqoop embeds these arguments in generated code, the \N must be escaped as \\N, for example: $ sqoop import ... --null-string '\\N' --null-non-string '\\N'.
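The distinction is easy to see if you simulate how Hive reads a text row. In the Python sketch below (an illustration, not Hive's reader), the two-character sequence \N maps to SQL NULL, while the literal word "null" is just an ordinary string that IS NULL can never match.

```python
# Sketch: why --null-string '\\N' matters. Hive treats the two-character
# sequence \N in a text file as SQL NULL; anything else (e.g. the literal
# word "null") is an ordinary string, so IS NULL never matches it.
FIELD_SEP = "\x01"  # Hive's default field delimiter (^A)

def parse_row(line: str):
    """Parse one ^A-delimited line, mapping \\N to Python None (SQL NULL)."""
    return [None if f == r"\N" else f for f in line.split(FIELD_SEP)]

good = "65\x01test1\x01\\N"    # imported with --null-string '\\N'
bad  = "65\x01test1\x01null"   # NULL imported as the literal string "null"

print(parse_row(good))  # third field is None -> IS NULL matches
print(parse_row(bad))   # third field is 'null' -> IS NULL fails
```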
5. Use --hive-table to change the name of the target table.
6. Hive can partition imported data to improve query performance. Use the --hive-partition-key and --hive-partition-value arguments to tell the Sqoop job to import into a specific Hive partition. The partition value must be a string.
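A Hive partition is physically a key=value subdirectory under the table's warehouse directory, which is why the partition value is a string. The following Python sketch (with a hypothetical warehouse path) illustrates the layout implied by --hive-partition-key/--hive-partition-value:

```python
# Sketch: a Hive partition maps to a key=value subdirectory under the
# table directory. The warehouse path below is a hypothetical example.
def partition_dir(warehouse: str, table: str, key: str, value) -> str:
    if not isinstance(value, str):
        # mirrors Sqoop's requirement that the partition value be a string
        raise TypeError("partition value must be a string")
    return f"{warehouse}/{table}/{key}={value}"

print(partition_dir("/user/hive/warehouse", "hiveusers", "dt", "2015-01-18"))
# -> /user/hive/warehouse/hiveusers/dt=2015-01-18
```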
7. Use --compression-codec, e.g. followed by "com.hadoop.compression.lzo.LzopCodec", to compress the imported data.
II. Starting the Hive Metastore Service
Hive installation is not covered here; I am using Hive 0.13.1 and Hadoop 2.2.0.
[hadoopUser@secondmgt ~]$ hive --service metastore
Starting Hive Metastore Server
15/01/18 20:13:52 INFO Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
15/01/18 20:13:52 INFO Configuration.deprecation: mapred.min.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize
15/01/18 20:13:52 INFO Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative
15/01/18 20:13:52 INFO Configuration.deprecation: mapred.min.split.size.per.node is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize.per.node
15/01/18 20:13:52 INFO Configuration.deprecation: mapred.input.dir.recursive is deprecated. Instead, use mapreduce.input.fileinputformat.input.dir.recursive
15/01/18 20:13:52 INFO Configuration.deprecation: mapred.min.split.size.per.rack is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize.per.rack
15/01/18 20:13:52 INFO Configuration.deprecation: mapred.max.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.maxsize
15/01/18 20:13:52 INFO Configuration.deprecation: mapred.committer.job.setup.cleanup.needed is deprecated. Instead, use mapreduce.job.committer.setup.cleanup.needed
III. The Source Database
View the contents of the source table:
mysql> select * from users;
+----+-------------+----------+-----+-----------+------------+-------+------+
| id | username    | password | sex | content   | datetime   | vm_id | isad |
+----+-------------+----------+-----+-----------+------------+-------+------+
| 56 | hua         | sqoop    | 男  | 开通      | 2013-12-02 |     0 |    1 |
| 58 | feng        | 123456   | 男  | 开通      | 2013-11-22 |     0 |    0 |
| 59 | test        | 123456   | 男  | 开通      | 2014-03-05 |    58 |    0 |
| 60 | user1       | 123456   | 男  | 开通      | 2014-06-26 |    66 |    0 |
| 61 | user2       | 123      | 男  | 开通      | 2013-12-13 |    56 |    0 |
| 62 | user3       | 123456   | 男  | 开通      | 2013-12-14 |     0 |    0 |
| 64 | kai.zhou    | 123456   | ?   | ??        | 2014-03-05 |    65 |    0 |
| 65 | test1       | 111      | 男  | 未开通    | NULL       |     0 |    0 |
| 66 | test2       | 111      | 男  | 未开通    | NULL       |     0 |    0 |
| 67 | test3       | 113      | 男  | 未开通    | NULL       |     0 |    0 |
| 68 | sqoopincr01 | 113      | 男  | 未开通    | NULL       |     0 |    0 |
| 69 | sqoopincr02 | 113      | 男  | 未开通    | NULL       |     0 |    0 |
| 70 | sqoopincr03 | 113      | 男  | 未开通    | NULL       |     0 |    0 |
+----+-------------+----------+-----+-----------+------------+-------+------+
13 rows in set (0.00 sec)

IV. Running a Basic Import
[hadoopUser@secondmgt ~]$ sqoop import --hive-import --connect jdbc:mysql://secondmgt:3306/spice --username hive --password hive --table users --hive-table hiveusers
Warning: /usr/lib/hcatalog does not exist! HCatalog jobs will fail.
Please set $HCAT_HOME to the root of your HCatalog installation.
15/01/18 20:22:19 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
15/01/18 20:22:19 INFO tool.BaseSqoopTool: Using Hive-specific delimiters for output. You can override
15/01/18 20:22:19 INFO tool.BaseSqoopTool: delimiters with --fields-terminated-by, etc.
15/01/18 20:22:19 WARN tool.BaseSqoopTool: It seems that you've specified at least one of following:
15/01/18 20:22:19 WARN tool.BaseSqoopTool:      --hive-home
15/01/18 20:22:19 WARN tool.BaseSqoopTool:      --hive-overwrite
15/01/18 20:22:19 WARN tool.BaseSqoopTool:      --create-hive-table
15/01/18 20:22:19 WARN tool.BaseSqoopTool:      --hive-table
15/01/18 20:22:19 WARN tool.BaseSqoopTool:      --hive-partition-key
15/01/18 20:22:19 WARN tool.BaseSqoopTool:      --hive-partition-value
15/01/18 20:22:19 WARN tool.BaseSqoopTool:      --map-column-hive
15/01/18 20:22:19 WARN tool.BaseSqoopTool: Without specifying parameter --hive-import. Please note that
15/01/18 20:22:19 WARN tool.BaseSqoopTool: those arguments will not be used in this session. Either
15/01/18 20:22:19 WARN tool.BaseSqoopTool: specify --hive-import to apply them correctly or remove them
15/01/18 20:22:19 WARN tool.BaseSqoopTool: from command line to remove this warning.
15/01/18 20:22:19 INFO tool.BaseSqoopTool: Please note that --hive-home, --hive-partition-key,
15/01/18 20:22:19 INFO tool.BaseSqoopTool: hive-partition-value and --map-column-hive options are
15/01/18 20:22:19 INFO tool.BaseSqoopTool: are also valid for HCatalog imports and exports
15/01/18 20:22:19 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
15/01/18 20:22:19 INFO tool.CodeGenTool: Beginning code generation
15/01/18 20:22:19 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `users` AS t LIMIT 1
15/01/18 20:22:19 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `users` AS t LIMIT 1
15/01/18 20:22:19 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /home/hadoopUser/cloud/hadoop/programs/hadoop-2.2.0
Note: /tmp/sqoop-hadoopUser/compile/e5d129e2de5bcdea0f7e1db4beb24461/users.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
15/01/18 20:22:21 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-hadoopUser/compile/e5d129e2de5bcdea0f7e1db4beb24461/users.jar
15/01/18 20:22:21 WARN manager.MySQLManager: It looks like you are importing from mysql.
15/01/18 20:22:21 WARN manager.MySQLManager: This transfer can be faster! Use the --direct
15/01/18 20:22:21 WARN manager.MySQLManager: option to exercise a MySQL-specific fast path.
15/01/18 20:22:21 INFO manager.MySQLManager: Setting zero DATETIME behavior to convertToNull (mysql)
15/01/18 20:22:21 INFO mapreduce.ImportJobBase: Beginning import of users
15/01/18 20:22:21 INFO Configuration.deprecation: mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/hadoopUser/cloud/hadoop/programs/hadoop-2.2.0/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/hadoopUser/cloud/hbase/hbase-0.96.2-hadoop2/lib/slf4j-log4j12-1.6.4.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
15/01/18 20:22:21 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
15/01/18 20:22:21 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
15/01/18 20:22:22 INFO client.RMProxy: Connecting to ResourceManager at secondmgt/192.168.2.133:8032
15/01/18 20:22:23 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(`id`), MAX(`id`) FROM `users`
15/01/18 20:22:23 INFO mapreduce.JobSubmitter: number of splits:4
15/01/18 20:22:23 INFO Configuration.deprecation: mapred.job.classpath.files is deprecated. Instead, use mapreduce.job.classpath.files
15/01/18 20:22:23 INFO Configuration.deprecation: user.name is deprecated. Instead, use mapreduce.job.user.name
15/01/18 20:22:23 INFO Configuration.deprecation: mapred.cache.files.filesizes is deprecated. Instead, use mapreduce.job.cache.files.filesizes
15/01/18 20:22:23 INFO Configuration.deprecation: mapred.cache.files is deprecated. Instead, use mapreduce.job.cache.files
15/01/18 20:22:23 INFO Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
15/01/18 20:22:23 INFO Configuration.deprecation: mapred.output.value.class is deprecated. Instead, use mapreduce.job.output.value.class
15/01/18 20:22:23 INFO Configuration.deprecation: mapreduce.map.class is deprecated. Instead, use mapreduce.job.map.class
15/01/18 20:22:23 INFO Configuration.deprecation: mapred.job.name is deprecated. Instead, use mapreduce.job.name
15/01/18 20:22:23 INFO Configuration.deprecation: mapreduce.inputformat.class is deprecated. Instead, use mapreduce.job.inputformat.class
15/01/18 20:22:23 INFO Configuration.deprecation: mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
15/01/18 20:22:23 INFO Configuration.deprecation: mapreduce.outputformat.class is deprecated. Instead, use mapreduce.job.outputformat.class
15/01/18 20:22:23 INFO Configuration.deprecation: mapred.cache.files.timestamps is deprecated. Instead, use mapreduce.job.cache.files.timestamps
15/01/18 20:22:23 INFO Configuration.deprecation: mapred.output.key.class is deprecated. Instead, use mapreduce.job.output.key.class
15/01/18 20:22:23 INFO Configuration.deprecation: mapred.working.dir is deprecated. Instead, use mapreduce.job.working.dir
15/01/18 20:22:23 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1421373857783_0022
15/01/18 20:22:23 INFO impl.YarnClientImpl: Submitted application application_1421373857783_0022 to ResourceManager at secondmgt/192.168.2.133:8032
15/01/18 20:22:23 INFO mapreduce.Job: The url to track the job: http://secondmgt:8088/proxy/application_1421373857783_0022/
15/01/18 20:22:23 INFO mapreduce.Job: Running job: job_1421373857783_0022
15/01/18 20:22:36 INFO mapreduce.Job: Job job_1421373857783_0022 running in uber mode : false
15/01/18 20:22:36 INFO mapreduce.Job:  map 0% reduce 0%
15/01/18 20:22:46 INFO mapreduce.Job:  map 25% reduce 0%
15/01/18 20:22:50 INFO mapreduce.Job:  map 75% reduce 0%
15/01/18 20:22:56 INFO mapreduce.Job:  map 100% reduce 0%
15/01/18 20:22:56 INFO mapreduce.Job: Job job_1421373857783_0022 completed successfully
15/01/18 20:22:57 INFO mapreduce.Job: Counters: 27
        File System Counters
                FILE: Number of bytes read=0
                FILE: Number of bytes written=368076
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=401
                HDFS: Number of bytes written=521
                HDFS: Number of read operations=16
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=8
        Job Counters
                Launched map tasks=4
                Other local map tasks=4
                Total time spent by all maps in occupied slots (ms)=165592
                Total time spent by all reduces in occupied slots (ms)=0
        Map-Reduce Framework
                Map input records=13
                Map output records=13
                Input split bytes=401
                Spilled Records=0
                Failed Shuffles=0
                Merged Map outputs=0
                GC time elapsed (ms)=192
                CPU time spent (ms)=10500
                Physical memory (bytes) snapshot=588156928
                Virtual memory (bytes) snapshot=3525619712
                Total committed heap usage (bytes)=337641472
        File Input Format Counters
                Bytes Read=0
        File Output Format Counters
                Bytes Written=521
15/01/18 20:22:57 INFO mapreduce.ImportJobBase: Transferred 521 bytes in 35.2437 seconds (14.7828 bytes/sec)
15/01/18 20:22:57 INFO mapreduce.ImportJobBase: Retrieved 13 records.
15/01/18 20:22:57 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `users` AS t LIMIT 1
15/01/18 20:22:57 WARN hive.TableDefWriter: Column datetime had to be cast to a less precise type in Hive
15/01/18 20:22:57 INFO hive.HiveImport: Loading uploaded data into Hive
15/01/18 20:23:03 INFO hive.HiveImport: 15/01/18 20:23:03 INFO Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
15/01/18 20:23:03 INFO hive.HiveImport: 15/01/18 20:23:03 INFO Configuration.deprecation: mapred.min.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize
15/01/18 20:23:03 INFO hive.HiveImport: 15/01/18 20:23:03 INFO Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative
15/01/18 20:23:03 INFO hive.HiveImport: 15/01/18 20:23:03 INFO Configuration.deprecation: mapred.min.split.size.per.node is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize.per.node
15/01/18 20:23:03 INFO hive.HiveImport: 15/01/18 20:23:03 INFO Configuration.deprecation: mapred.input.dir.recursive is deprecated. Instead, use mapreduce.input.fileinputformat.input.dir.recursive
15/01/18 20:23:03 INFO hive.HiveImport: 15/01/18 20:23:03 INFO Configuration.deprecation: mapred.min.split.size.per.rack is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize.per.rack
15/01/18 20:23:03 INFO hive.HiveImport: 15/01/18 20:23:03 INFO Configuration.deprecation: mapred.max.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.maxsize
15/01/18 20:23:03 INFO hive.HiveImport: 15/01/18 20:23:03 INFO Configuration.deprecation: mapred.committer.job.setup.cleanup.needed is deprecated. Instead, use mapreduce.job.committer.setup.cleanup.needed
15/01/18 20:23:03 INFO hive.HiveImport:
15/01/18 20:23:03 INFO hive.HiveImport: Logging initialized using configuration in file:/home/hadoopUser/cloud/hive/apache-hive-0.13.1-bin/conf/hive-log4j.properties
15/01/18 20:23:03 INFO hive.HiveImport: SLF4J: Class path contains multiple SLF4J bindings.
15/01/18 20:23:03 INFO hive.HiveImport: SLF4J: Found binding in [jar:file:/home/hadoopUser/cloud/hadoop/programs/hadoop-2.2.0/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
15/01/18 20:23:03 INFO hive.HiveImport: SLF4J: Found binding in [jar:file:/home/hadoopUser/cloud/hbase/hbase-0.96.2-hadoop2/lib/slf4j-log4j12-1.6.4.jar!/org/slf4j/impl/StaticLoggerBinder.class]
15/01/18 20:23:03 INFO hive.HiveImport: SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
15/01/18 20:23:03 INFO hive.HiveImport: SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
15/01/18 20:23:06 INFO hive.HiveImport: OK
15/01/18 20:23:06 INFO hive.HiveImport: Time taken: 1.542 seconds
15/01/18 20:23:06 INFO hive.HiveImport: Loading data to table default.hiveusers
15/01/18 20:23:06 INFO hive.HiveImport: Table default.hiveusers stats: [numFiles=5, numRows=0, totalSize=521, rawDataSize=0]
15/01/18 20:23:06 INFO hive.HiveImport: OK
15/01/18 20:23:06 INFO hive.HiveImport: Time taken: 0.668 seconds
15/01/18 20:23:06 INFO hive.HiveImport: Hive import complete.
15/01/18 20:23:06 INFO hive.HiveImport: Export directory is empty, removing it.

The complete run above actually breaks down into three internal steps: (1) import the data into HDFS (the corresponding directory can be found on HDFS); (2) create a Hive table with the same name; (3) load the data from HDFS into the Hive table. If no database is specified, the data is imported into the default database; to import into a table in a particular database, use the "database.table" form. The default storage path is warehouse/<table name>/.
View the result:
hive> select * from hiveusers;
OK
56    hua            sqoop    男    开通      2013-12-02    0     1
58    feng           123456   男    开通      2013-11-22    0     0
59    test           123456   男    开通      2014-03-05    58    0
60    user1          123456   男    开通      2014-06-26    66    0
61    user2          123      男    开通      2013-12-13    56    0
62    user3          123456   男    开通      2013-12-14    0     0
64    kai.zhou       123456   ?     ??        2014-03-05    65    0
65    test1          111      男    未开通    null          0     0
66    test2          111      男    未开通    null          0     0
67    test3          113      男    未开通    null          0     0
68    sqoopincr01    113      男    未开通    null          0     0
69    sqoopincr02    113      男    未开通    null          0     0
70    sqoopincr03    113      男    未开通    null          0     0
Time taken: 0.46 seconds, Fetched: 13 row(s)

View the table schema (desc or describe both work):

hive> desc hiveusers;
OK
id          int
username    string
password    string
sex         string
content     string
datetime    string
vm_id       int
isad        int
Time taken: 0.133 seconds, Fetched: 8 row(s)
The query above shows that NULL was treated as a string and imported into the Hive table as the literal "null".
V. Handling NULL Values

[hadoopUser@secondmgt ~]$ sqoop import --connect jdbc:mysql://secondmgt:3306/spice --username hive --password hive --table users --hive-table test.hiveusers --hive-import --hive-overwrite --null-string '\\N' --null-non-string '\\N'

Query the result again:
hive> select * from hiveusers;
OK
56    hua            sqoop    男    开通      2013-12-02    0     1
58    feng           123456   男    开通      2013-11-22    0     0
59    test           123456   男    开通      2014-03-05    58    0
60    user1          123456   男    开通      2014-06-26    66    0
61    user2          123      男    开通      2013-12-13    56    0
62    user3          123456   男    开通      2013-12-14    0     0
64    kai.zhou       123456   ?     ??        2014-03-05    65    0
65    test1          111      男    未开通    NULL          0     0
66    test2          111      男    未开通    NULL          0     0
67    test3          113      男    未开通    NULL          0     0
68    sqoopincr01    113      男    未开通    NULL          0     0
69    sqoopincr02    113      男    未开通    NULL          0     0
70    sqoopincr03    113      男    未开通    NULL          0     0
Time taken: 0.064 seconds, Fetched: 13 row(s)

The NULL values are now handled correctly.
VI. A Complete Example
Here is a complete example. Note that --fields-terminated-by and --lines-terminated-by set the delimiters of the imported output, while the --input-* variants describe how existing input files are parsed. I have not verified the compression option; if you are interested, feel free to verify it and let me know.
[hadoopUser@secondmgt ~]$ sqoop import --connect jdbc:mysql://secondmgt:3306/spice --username hive --password hive --table users --hive-table test.hiveusers --hive-import --hive-overwrite --null-string '\\N' --null-non-string '\\N' --fields-terminated-by '\t' --lines-terminated-by '\n' --compression-codec "com.hadoop.compression.lzo.LzopCodec"