An Introduction to Five Ways of Importing Data into Hive


Guide questions:

        1. What are the five ways of importing data into Hive?

        2. What is the difference between a load command with the LOCAL keyword and one without it?

        3. What do the OVERWRITE and INTO keywords do, and how do they differ?

        4. What restriction applies to the INPATH path?

        5. What is dynamic partition insertion?

        6. What settings does dynamic partition insertion require?

Content:

       Since Hive has no row-level insert, update, or delete operations, the only way to load data into a table is through a "bulk" loading operation. We will use a file in the following format to demonstrate the five ways of importing data into Hive:

Tom         24    NanJing   Nanjing University
Jack        29    NanJing   Southeast China University
Mary Kake   21    SuZhou    Suzhou University
John Doe    24    YangZhou  YangZhou University
Bill King   23    XuZhou    Xuzhou Normal University

The fields are delimited by \t and represent, in order: name, age, address, and school.

1. Importing Data from the Local File System

       (1) Create the test1 table

hive> CREATE TABLE test1(name STRING,age INT, address STRING,school STRING)
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    > STORED AS TEXTFILE;
OK
Time taken: 0.078 seconds

     (2) Load data from the local file system

hive> LOAD DATA LOCAL INPATH "/home/hadoopUser/data/test1.txt"
    > INTO TABLE test1;
Copying data from file:/home/hadoopUser/data/test1.txt
Copying file: file:/home/hadoopUser/data/test1.txt
Loading data to table hive.test1
Table hive.test1 stats: [numFiles=1, numRows=0, totalSize=201, rawDataSize=0]
OK
Time taken: 0.364 seconds
     (3) Check the result

hive> select * from test1;
OK
Tom     24    NanJing Nanjing University
Jack    29    NanJing Southeast China University
Mary Kake       21    SuZhou  Suzhou University
John Doe        24    YangZhou        YangZhou University
Bill King       23    XuZhou  Xuzhou Normal University
Time taken: 0.533 seconds, Fetched: 5 row(s)
        Note: LOCAL here means the data is loaded from the local file system into Hive. Because there is no OVERWRITE keyword, the new file is simply added to the target directory without removing any existing data. With the OVERWRITE keyword, the data already in the target directory would be deleted first.
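       For comparison, here is a minimal sketch of the same load using OVERWRITE (same file path as above); this would replace whatever data the table already holds instead of appending to it:

hive> LOAD DATA LOCAL INPATH "/home/hadoopUser/data/test1.txt"
    > OVERWRITE INTO TABLE test1;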

2. Loading Data from HDFS into Hive

        (1) Empty the table created earlier

insert overwrite table test1 select * from test1 where 1=0;  -- empties the table; generally not recommended
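       If your Hive version supports it (TRUNCATE TABLE was added in Hive 0.11), a more direct way to empty a non-partitioned managed table is:

hive> TRUNCATE TABLE test1;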
        (2) Load data from HDFS

hive> LOAD DATA INPATH "/input/test1.txt"
    > OVERWRITE INTO TABLE test1;
Loading data to table hive.test1
rmr: DEPRECATED: Please use 'rm -r' instead.
Deleted hdfs://secondmgt:8020/hive/warehouse/hive.db/test1
Table hive.test1 stats: [numFiles=1, numRows=0, totalSize=201, rawDataSize=0]
OK
Time taken: 0.355 seconds
(3) Query the result

hive> select * from test1;
OK
Tom     24.0    NanJing Nanjing University
Jack    29.0    NanJing Southeast China University
Mary Kake       21.0    SuZhou  Suzhou University
John Doe        24.0    YangZhou        YangZhou University
Bill King       23.0    XuZhou  Xuzhou Normal University
Time taken: 0.054 seconds, Fetched: 5 row(s)

        Note: there is no LOCAL keyword here, so the path refers to the distributed file system; this is the main difference from the first method. Also, as the log shows, because the OVERWRITE keyword was used, a Deleted operation ran first, removing the previously stored data before the new data was loaded.

       Also, the file path used in the INPATH clause has a restriction: the path may not contain any subdirectories.
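       For reference, a sketch of how the file might have been staged into that HDFS path in the first place (assuming the same local file used in method one; the Hive CLI can run HDFS commands directly):

hive> dfs -mkdir -p /input;
hive> dfs -put /home/hadoopUser/data/test1.txt /input/;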

3. Inserting Data into a Table from a Query

        (1) Create the test4 table

hive> CREATE TABLE test4(name STRING,age FLOAT,address STRING,school STRING)
    > ROW FORMAT DELIMITED
    > FIELDS TERMINATED BY '\t'
    > STORED AS TEXTFILE;
OK
Time taken: 0.251 seconds
Creating the table is essentially the same as before, so we won't go into detail here.

       (2) Import data from a query result

hive> INSERT INTO TABLE test4 SELECT * FROM test1;
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1419317102229_0032, Tracking URL = http://secondmgt:8088/proxy/application_1419317102229_0032/
Kill Command = /home/hadoopUser/cloud/hadoop/programs/hadoop-2.2.0/bin/hadoop job  -kill job_1419317102229_0032
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2014-12-29 10:49:25,409 Stage-1 map = 0%,  reduce = 0%
2014-12-29 10:49:36,900 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.83 sec
MapReduce Total cumulative CPU time: 2 seconds 830 msec
Ended Job = job_1419317102229_0032
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://secondmgt:8020/hive/scratchdir/hive_2014-12-29_10-49-10_009_3039854442660229613-1/-ext-10000
Loading data to table hive.test4
Table hive.test4 stats: [numFiles=1, numRows=5, totalSize=211, rawDataSize=206]
MapReduce Jobs Launched:
Job 0: Map: 1   Cumulative CPU: 2.83 sec   HDFS Read: 415 HDFS Write: 278 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 830 msec
OK
Time taken: 28.495 seconds

        Note: the new table's column count must match the number of columns in the SELECT, and the data types must be compatible. Since test4 has four columns (name, age, address, and school), the SELECT must produce exactly those four columns.
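       If the source table had extra or differently ordered columns, you would list the columns explicitly rather than using SELECT *; a sketch:

hive> INSERT INTO TABLE test4
    > SELECT name, age, address, school FROM test1;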

       (3) Check the result

hive> select * from test4;
OK
Tom     24.0    NanJing Nanjing University
Jack    29.0    NanJing Southeast China University
Mary Kake       21.0    SuZhou  Suzhou University
John Doe        24.0    YangZhou        YangZhou University
Bill King       23.0    XuZhou  Xuzhou Normal University
Time taken: 0.066 seconds, Fetched: 5 row(s)
4. Partitioned Inserts

        There are two kinds of partitioned inserts: static and dynamic. If static and dynamic partitions are mixed, the static partitions must come before the dynamic ones (a sketch of this appears at the end of this section). Both kinds are introduced below.

       (1) Static partition insert

Create a partitioned table

hive> CREATE TABLE test2(name STRING,address STRING,school STRING)
    > PARTITIONED BY(age float)
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    > STORED AS TEXTFILE;
OK
Time taken: 0.144 seconds

       This creates a partitioned table test2, partitioned by age.

Import data from a query result

hive> INSERT INTO TABLE test2 PARTITION (age='24') SELECT * FROM test1;
FAILED: SemanticException [Error 10044]: Line 1:19 Cannot insert into target table because column number/types are different ''24'': Table insclause-0 has 3 columns, but query has 4 columns.
This fails because test2 is partitioned by age and has only three regular columns, while SELECT * returns four. The correct statement is:

hive> INSERT INTO TABLE test2 PARTITION (age='24') SELECT name,address,school FROM test1;
 Check the insert result

hive> select * from test2;
OK
Tom     NanJing Nanjing University      24.0
Jack    NanJing Southeast China University      24.0
Mary Kake       SuZhou  Suzhou University       24.0
John Doe        YangZhou        YangZhou University     24.0
Bill King       XuZhou  Xuzhou Normal University        24.0
Time taken: 0.079 seconds, Fetched: 5 row(s)
As the query result shows, every record has age 24.0, so the insert succeeded.
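Note, though, that loading each age this way takes one statement per partition, normally filtered so that only matching rows land in it (a sketch):

hive> INSERT INTO TABLE test2 PARTITION (age='29')
    > SELECT name, address, school FROM test1 WHERE age = 29;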

(2) Dynamic partition insert

       With static partitioning, a table with many partitions forces the user to write a correspondingly large number of SQL statements. Hive therefore provides dynamic partitioning, which infers the names of the partitions to create from the values produced by the query.

 Create a partitioned table. This is the same as creating the table for static partitioning, so it is omitted here.

 Parameter settings

hive> set hive.exec.dynamic.partition=true;
hive> set hive.exec.dynamic.partition.mode=nonstrict;
Note: dynamic partitioning is disabled by default. Once enabled, it runs in "strict" mode by default, which requires at least one partition column to be static. This helps prevent a poorly designed query from generating a huge number of partitions. Since we do not want a static partition column here, we set the mode to nonstrict.
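Two related settings cap how many dynamic partitions a single statement may create: one limits the total for the whole job, the other the number created on any single mapper or reducer node. Raise them if a query legitimately needs more (the values below are the usual defaults, but check your Hive version):

hive> set hive.exec.max.dynamic.partitions=1000;
hive> set hive.exec.max.dynamic.partitions.pernode=100;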

 Insert the data dynamically

hive> insert into table test2 partition (age) select name,address,school,age from test1;
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1419317102229_0029, Tracking URL = http://secondmgt:8088/proxy/application_1419317102229_0029/
Kill Command = /home/hadoopUser/cloud/hadoop/programs/hadoop-2.2.0/bin/hadoop job  -kill job_1419317102229_0029
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2014-12-28 20:45:07,996 Stage-1 map = 0%,  reduce = 0%
2014-12-28 20:45:21,488 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.67 sec
2014-12-28 20:45:32,926 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 7.32 sec
MapReduce Total cumulative CPU time: 7 seconds 320 msec
Ended Job = job_1419317102229_0029
Loading data to table hive.test2 partition (age=null)
        Loading partition {age=29.0}
        Loading partition {age=23.0}
        Loading partition {age=21.0}
        Loading partition {age=24.0}
Partition hive.test2{age=21.0} stats: [numFiles=1, numRows=1, totalSize=35, rawDataSize=34]
Partition hive.test2{age=23.0} stats: [numFiles=1, numRows=1, totalSize=42, rawDataSize=41]
Partition hive.test2{age=24.0} stats: [numFiles=1, numRows=2, totalSize=69, rawDataSize=67]
Partition hive.test2{age=29.0} stats: [numFiles=1, numRows=1, totalSize=40, rawDataSize=39]
MapReduce Jobs Launched:
Job 0: Map: 1  Reduce: 1   Cumulative CPU: 7.32 sec   HDFS Read: 415 HDFS Write: 375 SUCCESS
Total MapReduce CPU Time Spent: 7 seconds 320 msec
OK
Time taken: 41.846 seconds
Note: the age column must come last in the SELECT, so that it lines up with the partition column; otherwise the results will be wrong.
 Check the insert result

hive> select * from test2;
OK
Mary Kake       SuZhou  Suzhou University       21.0
Bill King       XuZhou  Xuzhou Normal University        23.0
John Doe        YangZhou        YangZhou University     24.0
Tom     NanJing Nanjing University      24.0
Jack    NanJing Southeast China University      29.0
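Finally, to illustrate the ordering rule mentioned at the start of this section: when static and dynamic partitions are mixed, the static keys must come first. Here is a sketch using a hypothetical table logs with the same data columns as test2 but partitioned by (dt STRING, age FLOAT); dt is fixed statically while age is inferred per row, and the dynamic column again comes last in the SELECT:

hive> INSERT INTO TABLE logs PARTITION (dt='2014-12-28', age)
    > SELECT name, address, school, age FROM test1;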
5. Creating a Table and Loading Data in a Single Query

         In practice, a query may return too many rows to display usefully on the console. In such cases it is convenient to store the output of a Hive query directly in a new table. This pattern is called CTAS (CREATE TABLE ... AS SELECT).

        (1) Create the table

hive> CREATE TABLE test3
    > AS
    > SELECT name,age FROM test1;
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1419317102229_0030, Tracking URL = http://secondmgt:8088/proxy/application_1419317102229_0030/
Kill Command = /home/hadoopUser/cloud/hadoop/programs/hadoop-2.2.0/bin/hadoop job  -kill job_1419317102229_0030
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2014-12-28 20:59:59,375 Stage-1 map = 0%,  reduce = 0%
2014-12-28 21:00:10,795 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.68 sec
MapReduce Total cumulative CPU time: 2 seconds 680 msec
Ended Job = job_1419317102229_0030
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://secondmgt:8020/hive/scratchdir/hive_2014-12-28_20-59-45_494_6763514583931347886-1/-ext-10001
Moving data to: hdfs://secondmgt:8020/hive/warehouse/hive.db/test3
Table hive.test3 stats: [numFiles=1, numRows=5, totalSize=63, rawDataSize=58]
MapReduce Jobs Launched:
Job 0: Map: 1   Cumulative CPU: 2.68 sec   HDFS Read: 415 HDFS Write: 129 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 680 msec
OK
Time taken: 26.583 seconds
(2) Check the result

hive> select * from test3;
OK
Tom     24.0
Jack    29.0
Mary Kake       21.0
John Doe        24.0
Bill King       23.0
Time taken: 0.045 seconds, Fetched: 5 row(s)

       A CTAS operation is atomic: if the SELECT fails for any reason, the new table is not created.

That's all!
