Several Ways to Import Data into Hive


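It has been a while since I last wrote about Hive, so let me pick it up again today. This post summarizes the common ways to import data into Hive. I group them into four:
(1) importing data from the local file system into a Hive table;
(2) importing data from HDFS into a Hive table;
(3) querying data from another table and inserting it into a Hive table;
(4) creating a table and populating it in one step from a query on another table.
I will walk through each one with a hands-on example, because plain text alone is dry to read and too abstract to learn from. Let's get started!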

1. Importing data from the local file system into a Hive table
First create the table in Hive, as follows:

hive> create table wyp
    > (id int, name string,
    > age int, tel string)
    > ROW FORMAT DELIMITED
    > FIELDS TERMINATED BY '\t'
    > STORED AS TEXTFILE;
OK
Time taken: 2.832 seconds

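The table is simple, with only four columns, so I won't explain what each one means. On the local file system there is a file /home/wyp/wyp.txt with the following contents: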

[wyp@master ~]$ cat wyp.txt
1       wyp     25      13188888888888
2       test    30      13888888888888
3       zs      34      899314121

The columns in wyp.txt are separated by tabs (\t). The following statement loads the file's data into the wyp table:

hive> load data local inpath 'wyp.txt' into table wyp;
Copying data from file:/home/wyp/wyp.txt
Copying file: file:/home/wyp/wyp.txt
Loading data to table default.wyp
Table default.wyp stats:
[num_partitions: 0, num_files: 1, num_rows: 0, total_size: 67]
OK
Time taken: 5.967 seconds

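This loads the contents of wyp.txt into the wyp table (for what happens under the hood, see the post 《Hive表与外部表》 (Hive Tables and External Tables) on this blog). You can check the table's data directory with the following command: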

hive> dfs -ls /user/hive/warehouse/wyp ;
Found 1 items
-rw-r--r--   3 wyp supergroup         67 2014-02-19 18:23 /hive/warehouse/wyp/wyp.txt

The data has indeed been loaded into the wyp table.
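The load above appends a file to the table's data directory. As a hedged aside, LOAD DATA also accepts the OVERWRITE keyword, which replaces the table's existing files instead of adding to them. A minimal sketch, assuming the same wyp table and the same local file path as above:

```sql
-- Sketch only: replaces whatever is already in wyp with the contents of the file.
-- LOCAL means the path is on the machine running the Hive client, not on HDFS.
LOAD DATA LOCAL INPATH '/home/wyp/wyp.txt' OVERWRITE INTO TABLE wyp;

-- Quick check that the rows are readable through the table definition.
SELECT * FROM wyp;
```

Using an absolute path avoids depending on the directory the Hive CLI was started from.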

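Unlike the relational databases we are used to, the Hive version used in this post cannot take a set of literal rows directly in an insert statement. In other words, it does not support statements of the form INSERT INTO ... VALUES. (Support for INSERT INTO ... VALUES was added later, in Hive 0.14.)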
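For readers on a newer release, here is a minimal sketch of the literal-row form, assuming Hive 0.14 or later and the wyp table defined above. The row values are made up for illustration and do not come from the original data set:

```sql
-- Assumes Hive 0.14+; on the older version used in this post this statement fails.
-- The values below are hypothetical sample rows.
INSERT INTO TABLE wyp VALUES
  (4, 'ls', 28, '13900000000'),
  (5, 'ww', 31, '13700000000');
```

Each such statement launches a job and writes at least one new file, so for bulk data the LOAD DATA and INSERT ... SELECT approaches in this post remain the better fit.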

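2. Importing data from HDFS into a Hive table
When data is imported from the local file system into a Hive table, it is first copied to a temporary directory on HDFS (typically the uploading user's HDFS home directory, such as /home/wyp/) and then moved (note: moved, not copied!) from that temporary directory into the data directory of the Hive table. Given that, Hive naturally also supports moving data directly from an HDFS directory into a table's data directory. Suppose we have the file /home/wyp/add.txt; the steps are as follows: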

[wyp@master /home/q/hadoop-2.2.0]$ bin/hadoop fs -cat /home/wyp/add.txt
5       wyp1    23      131212121212
6       wyp2    24      134535353535
7       wyp3    25      132453535353
8       wyp4    26      154243434355

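That is the data to be inserted. The file is stored on HDFS in the /home/wyp directory (unlike section 1, where the file was on the local file system). The following command loads its contents into the Hive table: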

hive> load data inpath '/home/wyp/add.txt' into table wyp;
Loading data to table default.wyp
Table default.wyp stats:
[num_partitions: 0, num_files: 2, num_rows: 0, total_size: 215]
OK
Time taken: 0.47 seconds

hive> select * from wyp;
OK
5       wyp1    23      131212121212
6       wyp2    24      134535353535
7       wyp3    25      132453535353
8       wyp4    26      154243434355
1       wyp     25      13188888888888
2       test    30      13888888888888
3       zs      34      899314121
Time taken: 0.096 seconds, Fetched: 7 row(s)

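The result shows that the data has indeed been loaded into the wyp table. Note that load data inpath '/home/wyp/add.txt' into table wyp; does not contain the keyword local; that is the difference from section 1.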
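Because a load from HDFS moves the file rather than copying it, the source path should be empty afterwards. The following is a rough sketch of how one might observe that from the Hive CLI, assuming the paths used in this post and that the table lives under /user/hive/warehouse/wyp (your warehouse location may differ):

```sql
-- Before the load: add.txt is listed in its original HDFS directory.
dfs -ls /home/wyp/;

LOAD DATA INPATH '/home/wyp/add.txt' INTO TABLE wyp;

-- After the load: add.txt should no longer appear here...
dfs -ls /home/wyp/;

-- ...and should appear in the table's data directory instead.
dfs -ls /user/hive/warehouse/wyp;
```

If other jobs still need the file at its original location, copy it to a staging directory first and load the copy.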

3. Querying data from another table and inserting it into a Hive table
Suppose Hive has a table named test, created with the following statement:

hive> create table test(
    > id int, name string
    > ,tel string)
    > partitioned by
    > (age int)
    > ROW FORMAT DELIMITED
    > FIELDS TERMINATED BY '\t'
    > STORED AS TEXTFILE;
OK
Time taken: 0.261 seconds

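This is largely the same as the create statement for wyp, except that test uses age as its partition column (for what a partition column is, see 《Hive的数据存储模式》 (Hive's Data Storage Model) on this blog; a more detailed treatment will follow in later posts, so stay tuned!). The following statement inserts the result of a query on wyp into test: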

hive> insert into table test
    > partition (age='25')
    > select id, name, tel
    > from wyp;
#####################################################################
           (a pile of MapReduce job output appeared here; omitted)
#####################################################################
Total MapReduce CPU Time Spent: 1 seconds 310 msec
OK
Time taken: 19.125 seconds

hive> select * from test;
OK
5       wyp1    131212121212    25
6       wyp2    134535353535    25
7       wyp3    132453535353    25
8       wyp4    154243434355    25
1       wyp     13188888888888  25
2       test    13888888888888  25
3       zs      899314121       25
Time taken: 0.126 seconds, Fetched: 7 row(s)

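The output shows that the rows selected from wyp were inserted into test. If the target table (test) has no partition column, drop the partition (age='25') clause. We can also let the partition be chosen dynamically from a value in the select list: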

hive> set hive.exec.dynamic.partition.mode=nonstrict;
hive> insert into table test
    > partition (age)
    > select id, name,
    > tel, age
    > from wyp;
#####################################################################
           (a pile of MapReduce job output appeared here; omitted)
#####################################################################
Total MapReduce CPU Time Spent: 1 seconds 510 msec
OK
Time taken: 17.712 seconds

hive> select * from test;
OK
5       wyp1    131212121212    23
6       wyp2    134535353535    24
7       wyp3    132453535353    25
1       wyp     13188888888888  25
8       wyp4    154243434355    26
2       test    13888888888888  30
3       zs      899314121       34
Time taken: 0.399 seconds, Fetched: 7 row(s)

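This is called dynamic partition insert. In the Hive version used here it runs in strict mode by default, which rejects an insert whose partition columns are all dynamic, so hive.exec.dynamic.partition.mode must be set to nonstrict first (on older releases hive.exec.dynamic.partition may also need to be set to true). Hive also supports insert overwrite. As the name suggests, overwrite replaces existing data: when the statement finishes, the data in the corresponding data directory has been overwritten, whereas insert into appends. Keep the difference in mind. For example: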

hive> insert overwrite table test
    > PARTITION (age)
    > select id, name, tel, age
    > from wyp;
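To make the scope of an overwrite concrete, here is a small sketch, assuming the test and wyp tables above, that rewrites a single static partition and then lists the partitions that exist. Only the named partition is replaced; the other partitions of test are left untouched:

```sql
-- Static partition: only the age=25 partition of test is replaced.
-- The partition column is not in the select list because its value is fixed.
INSERT OVERWRITE TABLE test PARTITION (age = 25)
SELECT id, name, tel
FROM wyp
WHERE age = 25;

-- List the partitions of test, one line per partition (for example age=25).
SHOW PARTITIONS test;
```

With a static partition value no dynamic-partition setting is required, which makes this form a safer default when only one partition needs refreshing.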

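Even better, Hive supports multi-table inserts. What does that mean? In Hive you can turn an insert statement around and put the from clause first; it behaves the same as when from comes last. For example: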

hive> show create table test3;
OK
CREATE  TABLE test3(
  id int,
  name string)
Time taken: 0.277 seconds, Fetched: 18 row(s)

hive> from wyp
    > insert into table test
    > partition(age)
    > select id, name, tel, age
    > insert into table test3
    > select id, name
    > where age>25;

hive> select * from test3;
OK
8       wyp4
2       test
3       zs
Time taken: 4.308 seconds, Fetched: 3 row(s)

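A single query can contain several insert clauses. The benefit is that the source table is scanned only once to produce multiple disjoint outputs. Pretty cool!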
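Each insert clause in a multi-table insert can carry its own where filter, which is how one scan can split the source rows into non-overlapping outputs. A minimal sketch, assuming the wyp, test, and test3 tables from this post (the age threshold of 25 is just the one used above):

```sql
-- Needed because the only partition column of test (age) is dynamic here.
SET hive.exec.dynamic.partition.mode=nonstrict;

-- One scan of wyp, two outputs with complementary filters.
FROM wyp
INSERT OVERWRITE TABLE test PARTITION (age)
  SELECT id, name, tel, age
  WHERE age <= 25
INSERT OVERWRITE TABLE test3
  SELECT id, name
  WHERE age > 25;
```

This sketch uses overwrite so that rerunning it gives the same result for test3 and for the partitions it writes; switch to insert into if the rows should be appended instead.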

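4. Creating a table and populating it from a query on another table
In practice a query result may be too large to display on the console. In that case it is convenient to store the result of a Hive query directly in a new table. This is called CTAS (create table .. as select), as follows: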

hive> create table test4
    > as
    > select id, name, tel
    > from wyp;

hive> select * from test4;
OK
5       wyp1    131212121212
6       wyp2    134535353535
7       wyp3    132453535353
8       wyp4    154243434355
1       wyp     13188888888888
2       test    13888888888888
3       zs      899314121
Time taken: 0.089 seconds, Fetched: 7 row(s)

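The data is now in test4. CTAS is atomic, so if the select query fails for any reason, the new table is not created.
OK, it's late, so that's all for today. Time for bed! February 20, 2014, 00:59:17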
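A plain CTAS takes its column names and types from the select list and uses the default storage settings. As a hedged sketch, the row format and file format can also be spelled out before the as keyword. The table name test5 and the filter are hypothetical and only serve the illustration:

```sql
-- test5 is a made-up name; column names and types come from the select list.
CREATE TABLE test5
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
AS
SELECT id, name, tel
FROM wyp
WHERE age > 25;
```

Stating the format explicitly keeps the new table's files tab-delimited like wyp, which helps if other tools read them directly from HDFS.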

0 0