Five Ways to Load Data into Hive
Source: Internet    Editor: 程序博客网    Date: 2024/06/05 14:34
Questions addressed:
1. What are the five ways of loading data into Hive?
2. What is the difference between a LOAD command with and without the LOCAL keyword?
3. What do the OVERWRITE and INTO keywords do, and how do they differ?
4. What restriction applies to the INPATH path?
5. What is dynamic partition insert?
6. What settings does dynamic partition insert require?
Content:
Since Hive has no row-level insert, update, or delete operations, the only way to get data into a table is through some kind of "bulk" load operation. The following file is used to demonstrate five ways of loading data into Hive:
Tom    24    NanJing    Nanjing University
Jack    29    NanJing    Southeast China University
Mary Kake    21    SuZhou    Suzhou University
John Doe    24    YangZhou    YangZhou University
Bill King    23    XuZhou    Xuzhou Normal University
The fields are separated by \t and represent, in order: name, age, address, school.
1. Loading data from the local file system

(1) Create the test1 table
hive> CREATE TABLE test1(name STRING, age INT, address STRING, school STRING)
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    > STORED AS TEXTFILE;
OK
Time taken: 0.078 seconds
(2) Load the data from the local file system
hive> LOAD DATA LOCAL INPATH "/home/hadoopUser/data/test1.txt"
    > INTO TABLE test1;
Copying data from file:/home/hadoopUser/data/test1.txt
Copying file: file:/home/hadoopUser/data/test1.txt
Loading data to table hive.test1
Table hive.test1 stats: [numFiles=1, numRows=0, totalSize=201, rawDataSize=0]
OK
Time taken: 0.364 seconds

(3) Check the result

hive> select * from test1;
OK
Tom    24    NanJing    Nanjing University
Jack    29    NanJing    Southeast China University
Mary Kake    21    SuZhou    Suzhou University
John Doe    24    YangZhou    YangZhou University
Bill King    23    XuZhou    Xuzhou Normal University
Time taken: 0.533 seconds, Fetched: 5 row(s)

Note: The LOCAL keyword here means the data is loaded into Hive from the local file system. Since there is no OVERWRITE keyword, the new file is simply added to the target directory without deleting any existing data. With OVERWRITE, the existing data in the target directory is deleted first.
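For comparison, the replacing variant just described would look like the following sketch (same hypothetical local file as above):

```sql
-- OVERWRITE deletes whatever is currently in test1's directory first,
-- then loads the file; without it, the file is merely added alongside
-- the existing data.
LOAD DATA LOCAL INPATH "/home/hadoopUser/data/test1.txt"
OVERWRITE INTO TABLE test1;
```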
2. Loading data from HDFS into Hive

(1) Clear the data in the table created above

hive> INSERT OVERWRITE TABLE test1 SELECT * FROM test1 WHERE 1=0;   -- empties the table; this approach is generally not recommended

(2) Load the data from HDFS
hive> LOAD DATA INPATH "/input/test1.txt"
    > OVERWRITE INTO TABLE test1;
Loading data to table hive.test1
rmr: DEPRECATED: Please use 'rm -r' instead.
Deleted hdfs://secondmgt:8020/hive/warehouse/hive.db/test1
Table hive.test1 stats: [numFiles=1, numRows=0, totalSize=201, rawDataSize=0]
OK
Time taken: 0.355 seconds

(3) Query the result
hive> select * from test1;
OK
Tom    24.0    NanJing    Nanjing University
Jack    29.0    NanJing    Southeast China University
Mary Kake    21.0    SuZhou    Suzhou University
John Doe    24.0    YangZhou    YangZhou University
Bill King    23.0    XuZhou    Xuzhou Normal University
Time taken: 0.054 seconds, Fetched: 5 row(s)
Note: There is no LOCAL keyword here, so the path refers to the distributed file system; that is the main difference from the first method. The log also shows that, because OVERWRITE was specified, a Deleted operation ran first: the previously stored data was removed before the load was performed. Also note that when loading from HDFS, the source file is moved into the table's warehouse directory rather than copied.
In addition, the file path given in the INPATH clause has one restriction: it may not contain any subdirectories.
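INPATH can also point at a directory, in which case every file inside it is loaded; the restriction above means that directory must contain files only. A sketch, assuming a hypothetical HDFS directory /input/people/ holding several data files:

```sql
-- Load every file under the directory into the table in one statement.
-- /input/people/ is hypothetical and must contain no subdirectories.
LOAD DATA INPATH "/input/people/" OVERWRITE INTO TABLE test1;
```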
3. Inserting data into a table from a query

(1) Create the test4 table

hive> CREATE TABLE test4(name STRING, age FLOAT, address STRING, school STRING)
    > ROW FORMAT DELIMITED
    > FIELDS TERMINATED BY '\t'
    > STORED AS TEXTFILE;
OK
Time taken: 0.251 seconds

Creating the table is basically the same as before, so it is not covered in detail here.
(2) Insert data from a query result
hive> INSERT INTO TABLE test4 SELECT * FROM test1;
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1419317102229_0032, Tracking URL = http://secondmgt:8088/proxy/application_1419317102229_0032/
Kill Command = /home/hadoopUser/cloud/hadoop/programs/hadoop-2.2.0/bin/hadoop job -kill job_1419317102229_0032
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2014-12-29 10:49:25,409 Stage-1 map = 0%, reduce = 0%
2014-12-29 10:49:36,900 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.83 sec
MapReduce Total cumulative CPU time: 2 seconds 830 msec
Ended Job = job_1419317102229_0032
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://secondmgt:8020/hive/scratchdir/hive_2014-12-29_10-49-10_009_3039854442660229613-1/-ext-10000
Loading data to table hive.test4
Table hive.test4 stats: [numFiles=1, numRows=5, totalSize=211, rawDataSize=206]
MapReduce Jobs Launched:
Job 0: Map: 1   Cumulative CPU: 2.83 sec   HDFS Read: 415 HDFS Write: 278 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 830 msec
OK
Time taken: 28.495 seconds
Note: The number of columns in the new table must match the number of columns in the SELECT query, and the data types must be compatible. Since test4 has four columns (name, age, address, and school), the SELECT result must also provide those four columns.
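Hive can also scan the source table once and write to several tables in a single statement (a multi-table insert). A sketch, assuming a hypothetical second table test5 with the same four columns as test4:

```sql
-- One scan of test1 populates two targets; test5 is hypothetical
-- and must have the same schema as test4.
FROM test1
INSERT OVERWRITE TABLE test4 SELECT name, age, address, school
INSERT OVERWRITE TABLE test5 SELECT name, age, address, school WHERE age > 22;
```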
(3) Check the result

hive> select * from test4;
OK
Tom    24.0    NanJing    Nanjing University
Jack    29.0    NanJing    Southeast China University
Mary Kake    21.0    SuZhou    Suzhou University
John Doe    24.0    YangZhou    YangZhou University
Bill King    23.0    XuZhou    Xuzhou Normal University
Time taken: 0.066 seconds, Fetched: 5 row(s)

4. Partition inserts
There are two kinds of partition insert: static and dynamic. If static and dynamic partitions are mixed, the static partition columns must come before the dynamic ones. The two kinds are introduced in turn below.

(1) Static partition insert

① Create a partitioned table
hive> CREATE TABLE test2(name STRING, address STRING, school STRING)
    > PARTITIONED BY (age FLOAT)
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    > STORED AS TEXTFILE;
OK
Time taken: 0.144 seconds
This creates a partitioned table, test2, partitioned by age.

② Insert data from a query result
hive> INSERT INTO TABLE test2 PARTITION (age='24') SELECT * FROM test1;
FAILED: SemanticException [Error 10044]: Line 1:19 Cannot insert into target table because column number/types are different ''24'': Table insclause-0 has 3 columns, but query has 4 columns.

This reports an error: because test2 is partitioned by age, it has only three regular columns, while SELECT * returns four. The correct statement is:
hive> INSERT INTO TABLE test2 PARTITION (age='24') SELECT name,address,school FROM test1;

③ Check the result
hive> select * from test2;
OK
Tom    NanJing    Nanjing University    24.0
Jack    NanJing    Southeast China University    24.0
Mary Kake    SuZhou    Suzhou University    24.0
John Doe    YangZhou    YangZhou University    24.0
Bill King    XuZhou    Xuzhou Normal University    24.0
Time taken: 0.079 seconds, Fetched: 5 row(s)

As the query result shows, every record has age 24.0, so the insert succeeded.
(2) Dynamic partition insert

With static partitioning, creating a large number of partitions means the user has to write a correspondingly large number of SQL statements. Hive therefore provides dynamic partitioning, which infers the names of the partitions to create from the query itself.

① Create the partitioned table; this is the same as for static partitioning, so it is omitted here.

② Set the parameters

hive> set hive.exec.dynamic.partition=true;
hive> set hive.exec.dynamic.partition.mode=nonstrict;

Note: Dynamic partitioning is disabled by default. Once enabled, it runs in "strict" mode by default, which requires at least one partition column to be static; this helps prevent a badly designed query from creating a huge number of partitions. Since no static partition column is needed here, the mode is set to nonstrict.
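As noted above, static and dynamic partition columns can be mixed, with the static ones listed first; in strict mode this form would already be allowed. A sketch, assuming a hypothetical table test2_multi partitioned by both country and age:

```sql
-- country is fixed statically; age is resolved dynamically from the
-- last column of the SELECT list. test2_multi is hypothetical.
INSERT INTO TABLE test2_multi PARTITION (country='CN', age)
SELECT name, address, school, age FROM test1;
```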
③ Insert the data dynamically
hive> insert into table test2 partition (age) select name,address,school,age from test1;
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1419317102229_0029, Tracking URL = http://secondmgt:8088/proxy/application_1419317102229_0029/
Kill Command = /home/hadoopUser/cloud/hadoop/programs/hadoop-2.2.0/bin/hadoop job -kill job_1419317102229_0029
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2014-12-28 20:45:07,996 Stage-1 map = 0%, reduce = 0%
2014-12-28 20:45:21,488 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.67 sec
2014-12-28 20:45:32,926 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 7.32 sec
MapReduce Total cumulative CPU time: 7 seconds 320 msec
Ended Job = job_1419317102229_0029
Loading data to table hive.test2 partition (age=null)
  Loading partition {age=29.0}
  Loading partition {age=23.0}
  Loading partition {age=21.0}
  Loading partition {age=24.0}
Partition hive.test2{age=21.0} stats: [numFiles=1, numRows=1, totalSize=35, rawDataSize=34]
Partition hive.test2{age=23.0} stats: [numFiles=1, numRows=1, totalSize=42, rawDataSize=41]
Partition hive.test2{age=24.0} stats: [numFiles=1, numRows=2, totalSize=69, rawDataSize=67]
Partition hive.test2{age=29.0} stats: [numFiles=1, numRows=1, totalSize=40, rawDataSize=39]
MapReduce Jobs Launched:
Job 0: Map: 1  Reduce: 1   Cumulative CPU: 7.32 sec   HDFS Read: 415 HDFS Write: 375 SUCCESS
Total MapReduce CPU Time Spent: 7 seconds 320 msec
OK
Time taken: 41.846 seconds

Note: The age column in the SELECT list must come last so that it lines up with the partition column; otherwise the result will be wrong.
④ Check the result
hive> select * from test2;
OK
Mary Kake    SuZhou    Suzhou University    21.0
Bill King    XuZhou    Xuzhou Normal University    23.0
John Doe    YangZhou    YangZhou University    24.0
Tom    NanJing    Nanjing University    24.0
Jack    NanJing    Southeast China University    29.0

5. Creating a table and loading data in a single query
In practice, a query may return too many rows to display conveniently on the console; in that case it is very handy to store the query output directly in a new table. This is known as CTAS (CREATE TABLE ... AS SELECT).

(1) Create the table
hive> CREATE TABLE test3
    > AS
    > SELECT name,age FROM test1;
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1419317102229_0030, Tracking URL = http://secondmgt:8088/proxy/application_1419317102229_0030/
Kill Command = /home/hadoopUser/cloud/hadoop/programs/hadoop-2.2.0/bin/hadoop job -kill job_1419317102229_0030
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2014-12-28 20:59:59,375 Stage-1 map = 0%, reduce = 0%
2014-12-28 21:00:10,795 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.68 sec
MapReduce Total cumulative CPU time: 2 seconds 680 msec
Ended Job = job_1419317102229_0030
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://secondmgt:8020/hive/scratchdir/hive_2014-12-28_20-59-45_494_6763514583931347886-1/-ext-10001
Moving data to: hdfs://secondmgt:8020/hive/warehouse/hive.db/test3
Table hive.test3 stats: [numFiles=1, numRows=5, totalSize=63, rawDataSize=58]
MapReduce Jobs Launched:
Job 0: Map: 1   Cumulative CPU: 2.68 sec   HDFS Read: 415 HDFS Write: 129 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 680 msec
OK
Time taken: 26.583 seconds

(2) Check the result
hive> select * from test3;
OK
Tom    24.0
Jack    29.0
Mary Kake    21.0
John Doe    24.0
Bill King    23.0
Time taken: 0.045 seconds, Fetched: 5 row(s)
The CTAS operation is atomic: if the SELECT fails for any reason, the new table is not created.
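A CTAS statement can also set storage properties for the new table and filter the rows it keeps. A sketch, assuming the same source table as above and a hypothetical table name:

```sql
-- Create the new table in RCFile format instead of the default text
-- format, keeping only rows that match the filter. test3_rc is hypothetical.
CREATE TABLE test3_rc
STORED AS RCFILE
AS
SELECT name, age FROM test1 WHERE age > 22;
```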
That is all!