Four ways to export data from a Hive table, and how to customize the export column delimiter
This article is reproduced from http://blog.csdn.net/niityzu/article/details/42238483
Guiding questions:
1. What are the four ways to export Hive table data?
2. What does LOCAL do in the export command, and what changes when it is omitted?
3. Can INTO be used in the export command, as it is in the import command?
4. How do you customize the column delimiter of the exported file?
5. What do hive's -e and -f options do, and how can they be used to export data?
6. What does the source command do in the hive shell?
We use table test1 to demonstrate the export methods. Its contents are as follows:
- hive> select * from test1;
- OK
- Tom 24.0 NanJing Nanjing University
- Jack 29.0 NanJing Southeast China University
- Mary Kake 21.0 SuZhou Suzhou University
- John Doe 24.0 YangZhou YangZhou University
- Bill King 23.0 XuZhou Xuzhou Normal University
- Time taken: 0.064 seconds, Fetched: 5 row(s)
1. Exporting data to the local file system
(1) Export the data
- hive> insert overwrite local directory "/home/hadoopUser/data"
- > select name,age,address
- > from test1;
/home/hadoopUser/data is a directory; one or more files will be written into it, the exact number depending on how many reducers are invoked. LOCAL here means the data is exported to the local file system; without LOCAL, it is exported to the distributed file system instead.
(2) View the exported data
- Tom^A24.0^ANanJing
- Jack^A29.0^ANanJing
- Mary Kake^A21.0^ASuZhou
- John Doe^A24.0^AYangZhou
- Bill King^A23.0^AXuZhou
A file named 000000_0 was generated in the local directory /home/hadoopUser/data, with the contents shown above. Columns are separated by ^A (octal \001).
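A file delimited by ^A can be inspected and converted with standard Unix tools. A minimal sketch (the sample line is recreated from the output above; the /tmp path is purely illustrative):

```shell
# Recreate one line of Hive's default output: columns joined by ^A (octal \001).
printf 'Tom\00124.0\001NanJing\n' > /tmp/000000_0

# Make the invisible delimiter visible (control-A is rendered as ^A):
cat -v /tmp/000000_0

# Convert ^A to tabs for easier downstream processing:
tr '\001' '\t' < /tmp/000000_0
```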
2. Exporting data to the distributed file system
(1) Export the data
- hive> insert overwrite directory "/output"
- > select name,age,address
- > from test1;
The only difference from the first method is the absence of the LOCAL keyword. The directory /output is on the distributed file system; it can also be written out in full as hdfs://master-server/output.
(2) View the exported data
The result is the same as with the first method.
Note: what happens if we change OVERWRITE to INTO, as when importing data into a table?
- hive> insert into local directory "/home/hadoopUser/data"
- > select name,age,address
- > from test1;
- NoViableAltException(85@[184:1: tableName : (db= identifier DOT tab= identifier -> ^( TOK_TABNAME $db $tab) |tab= identifier -> ^( TOK_TABNAME $tab) );])
- at org.antlr.runtime.DFA.noViableAlt(DFA.java:158)
- at org.antlr.runtime.DFA.predict(DFA.java:116)
- at org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.tableName(HiveParser_FromClauseParser.java:4945)
- at org.apache.hadoop.hive.ql.parse.HiveParser.tableName(HiveParser.java:40208)
- at org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.tableOrPartition(HiveParser_IdentifiersParser.java:10233)
- at org.apache.hadoop.hive.ql.parse.HiveParser.tableOrPartition(HiveParser.java:40210)
- at org.apache.hadoop.hive.ql.parse.HiveParser.insertClause(HiveParser.java:39685)
- at org.apache.hadoop.hive.ql.parse.HiveParser.regularBody(HiveParser.java:37647)
- at org.apache.hadoop.hive.ql.parse.HiveParser.queryStatementExpressionBody(HiveParser.java:36898)
- at org.apache.hadoop.hive.ql.parse.HiveParser.queryStatementExpression(HiveParser.java:36774)
- at org.apache.hadoop.hive.ql.parse.HiveParser.execStatement(HiveParser.java:1338)
- at org.apache.hadoop.hive.ql.parse.HiveParser.statement(HiveParser.java:1036)
- at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:199)
- at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:166)
- at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:404)
- at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:322)
- at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:975)
- at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1040)
- at org.apache.hadoop.hive.ql.Driver.run(Driver.java:911)
- at org.apache.hadoop.hive.ql.Driver.run(Driver.java:901)
- at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:268)
- at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:220)
- at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:423)
- at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:792)
- at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:686)
- at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:625)
- at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
- at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
- at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
- at java.lang.reflect.Method.invoke(Method.java:606)
- at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
- FAILED: ParseException line 1:12 missing TABLE at 'local' near 'into' in table name
Unlike importing data into a Hive table, you cannot use an INSERT INTO command to export table data.
3. Exporting data from one table into another
This method is similar to one covered in another post of mine (inserting query results into a table), so it is not repeated here. Link: inserting data into a table via a query statement.
4. Using hive -e and -f to export data
(1) Exporting data with hive -e
Sometimes you want to run one or more queries (separated by semicolons) and have the hive CLI exit as soon as they finish. The -e option lets hive execute one or more statements directly from the terminal; the statements are enclosed in quotes, and the query result can be redirected to a file.
Note: this is not run inside the Hive shell; run it directly from the terminal.
You must first run "use hive" to switch to the corresponding database (here the database is named hive), otherwise an error is raised. For example:
- [hadoopUser@secondmgt ~]$ hive -e "use hive;select * from test1" > /home/hadoopUser/data/myquery
- 14/12/29 18:53:18 INFO Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
- 14/12/29 18:53:18 INFO Configuration.deprecation: mapred.min.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize
- 14/12/29 18:53:18 INFO Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative
- 14/12/29 18:53:18 INFO Configuration.deprecation: mapred.min.split.size.per.node is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize.per.node
- 14/12/29 18:53:18 INFO Configuration.deprecation: mapred.input.dir.recursive is deprecated. Instead, use mapreduce.input.fileinputformat.input.dir.recursive
- 14/12/29 18:53:18 INFO Configuration.deprecation: mapred.min.split.size.per.rack is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize.per.rack
- 14/12/29 18:53:18 INFO Configuration.deprecation: mapred.max.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.maxsize
- 14/12/29 18:53:18 INFO Configuration.deprecation: mapred.committer.job.setup.cleanup.needed is deprecated. Instead, use mapreduce.job.committer.setup.cleanup.needed
-
- Logging initialized using configuration in file:/home/hadoopUser/cloud/hive/apache-hive-0.13.1-bin/conf/hive-log4j.properties
- OK
- Time taken: 0.597 seconds
- OK
- Time taken: 0.66 seconds, Fetched: 5 row(s)
View the exported file:
- Tom 24.0 NanJing Nanjing University
- Jack 29.0 NanJing Southeast China University
- Mary Kake 21.0 SuZhou Suzhou University
- John Doe 24.0 YangZhou YangZhou University
- Bill King 23.0 XuZhou Xuzhou Normal University
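Because the redirected output of hive -e is tab-delimited, multi-word values such as "Mary Kake" remain a single field, so the file can be processed with awk straight away. A small sketch (a simplified two-column sample recreated from the output above; the /tmp path is illustrative):

```shell
# Recreate a few tab-delimited name/age rows like those in the exported file:
printf 'Tom\t24.0\nJack\t29.0\nMary Kake\t21.0\n' > /tmp/myquery

# Average of the age column; -F'\t' keeps "Mary Kake" in a single field:
awk -F'\t' '{ sum += $2; n++ } END { printf "%.1f\n", sum / n }' /tmp/myquery
```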
Note: the default column delimiter is \t.
(2) Executing a query file with hive -f
Sometimes we put multiple query statements into a single file; hive -f executes the statement or statements in that file. By convention, such Hive query files are saved with a .q or .hql extension.
Edit query.hql with the following contents:
- use hive;
- select name,age from test1;
Run the command; the process is as follows:
- [hadoopUser@secondmgt ~]$ hive -f data/query.hql
- 14/12/29 19:07:38 INFO Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
- 14/12/29 19:07:38 INFO Configuration.deprecation: mapred.min.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize
- 14/12/29 19:07:38 INFO Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative
- 14/12/29 19:07:38 INFO Configuration.deprecation: mapred.min.split.size.per.node is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize.per.node
- 14/12/29 19:07:38 INFO Configuration.deprecation: mapred.input.dir.recursive is deprecated. Instead, use mapreduce.input.fileinputformat.input.dir.recursive
- 14/12/29 19:07:38 INFO Configuration.deprecation: mapred.min.split.size.per.rack is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize.per.rack
- 14/12/29 19:07:38 INFO Configuration.deprecation: mapred.max.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.maxsize
- 14/12/29 19:07:38 INFO Configuration.deprecation: mapred.committer.job.setup.cleanup.needed is deprecated. Instead, use mapreduce.job.committer.setup.cleanup.needed
-
- Logging initialized using configuration in file:/home/hadoopUser/cloud/hive/apache-hive-0.13.1-bin/conf/hive-log4j.properties
- OK
- Time taken: 0.564 seconds
- Total jobs = 1
- Launching Job 1 out of 1
- Number of reduce tasks is set to 0 since there's no reduce operator
- Starting Job = job_1419317102229_0038, Tracking URL = http://secondmgt:8088/proxy/application_1419317102229_0038/
- Kill Command = /home/hadoopUser/cloud/hadoop/programs/hadoop-2.2.0/bin/hadoop job -kill job_1419317102229_0038
- Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
- 2014-12-29 19:07:55,168 Stage-1 map = 0%, reduce = 0%
- 2014-12-29 19:08:06,671 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.93 sec
- MapReduce Total cumulative CPU time: 2 seconds 930 msec
- Ended Job = job_1419317102229_0038
- MapReduce Jobs Launched:
- Job 0: Map: 1 Cumulative CPU: 2.93 sec HDFS Read: 415 HDFS Write: 63 SUCCESS
- Total MapReduce CPU Time Spent: 2 seconds 930 msec
- OK
- Tom 24.0
- Jack 29.0
- Mary Kake 21.0
- John Doe 24.0
- Bill King 23.0
- Time taken: 27.17 seconds, Fetched: 5 row(s)
(3) Executing a script file with the source command in the Hive shell
- hive> source /home/hadoopUser/data/query.hql;
- OK
- Time taken: 0.394 seconds
- Total jobs = 1
- Launching Job 1 out of 1
- Number of reduce tasks is set to 0 since there's no reduce operator
- Starting Job = job_1419317102229_0039, Tracking URL = http://secondmgt:8088/proxy/application_1419317102229_0039/
- Kill Command = /home/hadoopUser/cloud/hadoop/programs/hadoop-2.2.0/bin/hadoop job -kill job_1419317102229_0039
- Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
- 2014-12-29 19:13:03,718 Stage-1 map = 0%, reduce = 0%
- 2014-12-29 19:13:15,459 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.7 sec
- MapReduce Total cumulative CPU time: 2 seconds 700 msec
- Ended Job = job_1419317102229_0039
- MapReduce Jobs Launched:
- Job 0: Map: 1 Cumulative CPU: 2.7 sec HDFS Read: 415 HDFS Write: 63 SUCCESS
- Total MapReduce CPU Time Spent: 2 seconds 700 msec
- OK
- Tom 24.0
- Jack 29.0
- Mary Kake 21.0
- John Doe 24.0
- Bill King 23.0
- Time taken: 27.424 seconds, Fetched: 5 row(s)
Appendix:
Customizing the export column delimiter
By default, files exported from a Hive table use ^A as the column delimiter, which is hard to read; Hive therefore allows the column delimiter to be specified. We again use the first export method above as the example.
(1) Export the table data
- hive> insert overwrite local directory "/home/hadoopUser/data"
- > row format delimited
- > fields terminated by '\t'
- > select * from test1;
Here \t is used as the column delimiter.
(2) View the export result
- Tom 24.0 NanJing Nanjing University
- Jack 29.0 NanJing Southeast China University
- Mary Kake 21.0 SuZhou Suzhou University
- John Doe 24.0 YangZhou YangZhou University
- Bill King 23.0 XuZhou Xuzhou Normal University
Compared with the first method's default ^A delimiter, this is much more readable.
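With a tab delimiter, the exported file also works directly with column-oriented tools such as cut. A minimal sketch (sample rows recreated from the result above; the /tmp path is illustrative):

```shell
# Recreate two rows of the tab-delimited export:
printf 'Tom\t24.0\tNanJing\tNanjing University\n' > /tmp/export
printf 'Mary Kake\t21.0\tSuZhou\tSuzhou University\n' >> /tmp/export

# Select just the name and city columns:
cut -f1,3 /tmp/export
```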