Hive data import approach: storing Hive data in ORC format

Source: Internet. Editor: 程序博客网. Time: 2024/06/05 00:16

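Goal: import web access logs into Hive with fast loading, a high compression ratio, fast queries, and tables that are easy to maintain. Storing the data in ORC-format tables is recommended.

Approach: a Hive table stored in a columnar format such as RCFile or ORC cannot be filled by loading text files directly, because LOAD DATA only moves files and does not convert them. The data has to go through a TEXTFILE table and an insert. So: first create an internal temporary table tmp_testp in TEXTFILE format, copy the data into the tmp_testp table path with hadoop fs -put (not load), then create an external table http_orc in ORC format, use insert to copy tmp_testp into http_orc, and finally drop the temporary table and its data. The time is spent in two places: (1) uploading the files to HDFS with put, and (2) the insert, during which Hive converts the format and compresses the data.

Steps:

1. Create the internal temporary table, with its location pointing at the folder that holds the log files: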

create table IF NOT EXISTS tmp_testp(p_id INT,tm BIGINT,idate BIGINT,phone BIGINT)
partitioned by (dt string)
row format delimited fields terminated by '\,'
location '/hdfs/incoming';

2. Upload the 124 GB of files to HDFS and add the partition mapping by hand so the table can see the data.

ALTER TABLE tmp_testp ADD PARTITION(dt='2013-09-30');

hadoop fs -put /hdfs/incoming/*d /hdfs/incoming/dt=2013-09-30

Elapsed time: 12:44 to 14:58, i.e. 2 hours 14 minutes.

The upload is slow and memory usage on the client is very high:

Mem: 3906648k total, 3753584k used, 153064k free, 54088k buffers

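Memory utilization is 96% (3753584k of 3906648k; note that the "used" figure on this line includes page cache).

3. Check that the temporary table can read the data directly: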

select * from tmp_testp where dt='2013-09-30';

4. Create the external table in ORC format:

-- No ROW FORMAT DELIMITED clause here: ORC data is handled by the ORC SerDe, so a text field delimiter does not apply.
create external table IF NOT EXISTS http_orc(p_id INT,tm BIGINT,idate BIGINT,phone BIGINT)
partitioned by (dt string)
stored as orc;

5. Copy the temporary table into the ORC table:

insert overwrite table http_orc partition(dt='2013-09-30') select p_id,tm,idate,phone from tmp_testp where dt='2013-09-30';

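Elapsed time: Time taken: 3511.626 seconds, about 59 minutes.

Note that the insert step can select only the fields it needs when writing to the ORC table. This trims the columns and lets one temporary table feed several analysis tables with different dimensions, without preprocessing the raw log files. The drawback is that uploading the raw files to HDFS takes too long.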
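The column-selection idea can be sketched as a small shell helper. This is only an illustration under assumptions: the function name `convert_partition` and the `HIVE_CMD` variable are hypothetical and not part of the original workflow. It relies on `hive -e`, which runs a query string from the command line.

```shell
# Sketch: copy selected columns of one partition from the text staging
# table tmp_testp into an ORC table.
# HIVE_CMD is a hypothetical override so the command can be previewed
# (HIVE_CMD="echo hive") before it is run for real.
convert_partition() {
  local dt="$1"        # partition value, e.g. 2013-09-30
  local target="$2"    # destination ORC table (must already exist)
  local columns="$3"   # comma-separated list of columns to keep
  local sql="INSERT OVERWRITE TABLE ${target} PARTITION (dt='${dt}') SELECT ${columns} FROM tmp_testp WHERE dt='${dt}';"
  ${HIVE_CMD:-hive} -e "$sql"
}
```

For example, `convert_partition 2013-09-30 http_orc "p_id, tm, idate, phone"` would reproduce the insert from step 5. Passing a shorter column list only works if the target table was created with matching columns.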

6. Compute the compression ratio of the ORC table:

HDFS Read: 134096430275 HDFS Write: 519817638 SUCCESS

Compression ratio: 519817638 / 134096430275 ≈ 0.388%. Wow, it has been compressed down to almost nothing.
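The ratio can be rechecked from the two job counters. This is a minimal sketch with the byte counts copied from the job output above; the variable names are just illustrative.

```shell
# Recompute the ORC compression ratio from the job counters.
read_bytes=134096430275    # "HDFS Read": text input
write_bytes=519817638      # "HDFS Write": ORC output

# ORC size as a percentage of the text input
ratio_pct=$(awk -v w="$write_bytes" -v r="$read_bytes" \
  'BEGIN { printf "%.3f", w / r * 100 }')
# How many times smaller the ORC output is
factor=$(awk -v w="$write_bytes" -v r="$read_bytes" \
  'BEGIN { printf "%.0f", r / w }')

echo "ORC output is ${ratio_pct}% of the text input (about ${factor}x smaller)"
# prints: ORC output is 0.388% of the text input (about 258x smaller)
```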

7. Drop the internal temporary table, so that HDFS keeps only the compressed ORC copy:

drop table tmp_testp;

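8. Run a simple query to compare the compressed ORC table with the uncompressed TEXTFILE table (the TEXTFILE query has to run before the temporary table is dropped in step 7):

On the ORC table: select count(*) from http_orc;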

469407190
Time taken: 669.639 seconds, Fetched: 1 row(s)

On the TEXTFILE table: select count(*) from tmp_testp;

469407190
Time taken: 727.944 seconds, Fetched: 1 row(s)

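ORC does well: the count takes 669.6 s against 727.9 s on the TEXTFILE table, about 8% faster.

Summary: average end-to-end throughput (upload plus conversion) is 124 GB / (2 h 14 min + 59 min) ≈ 11 MB/s.

Uploading the files to HDFS clearly wasted most of the time: it accounts for about 70% of the total.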
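The summary figures can be recomputed from the timings reported in steps 2 and 5. This is a rough sketch and it assumes that "124G" means 124 * 1024 MB.

```shell
# Recompute throughput from the timings reported in steps 2 and 5.
# Assumption: "124G" is treated as 124 * 1024 MB.
size_mb=$((124 * 1024))
upload_s=$(( (2 * 60 + 14) * 60 ))   # 12:44 to 14:58 = 2 h 14 min
insert_s=3511.626                    # "Time taken" of the insert step

# Throughput over upload plus conversion
end_to_end=$(awk -v s="$size_mb" -v u="$upload_s" -v i="$insert_s" \
  'BEGIN { printf "%.1f", s / (u + i) }')
# Throughput of the upload alone
upload_only=$(awk -v s="$size_mb" -v u="$upload_s" \
  'BEGIN { printf "%.1f", s / u }')
# Share of the total time spent uploading, in percent
upload_share=$(awk -v u="$upload_s" -v i="$insert_s" \
  'BEGIN { printf "%.0f", u / (u + i) * 100 }')

echo "end-to-end: ${end_to_end} MB/s, upload only: ${upload_only} MB/s, upload share: ${upload_share}%"
# prints: end-to-end: 11.0 MB/s, upload only: 15.8 MB/s, upload share: 70%
```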

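Optimization: how to make uploads to HDFS faster

1. Keep individual files from getting too large (the test files ranged from 200 MB to 1 GB), and start several clients to upload files in parallel.

2. Consider lowering the replication factor of the Hive data to 2.

3. Tune MapReduce and the Hadoop cluster to raise I/O throughput and reduce memory usage.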
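The first two suggestions could be combined roughly as follows. This is a sketch under assumptions, not a measured improvement: the function name `upload_parallel`, the `HADOOP_CMD` variable and the default of 4 concurrent uploads are all hypothetical. It uses `xargs -P` to run several `hadoop fs -put` clients at once, and the generic option `-D dfs.replication=2` so the uploaded files are written with two replicas.

```shell
# Sketch: upload every file in a local directory with several concurrent
# `hadoop fs -put` clients, writing with replication factor 2.
# HADOOP_CMD is a hypothetical override (e.g. HADOOP_CMD="echo hadoop")
# for previewing the commands without touching the cluster.
upload_parallel() {
  local src_dir="$1"     # local directory holding the log files
  local dest="$2"        # HDFS target directory (the partition path)
  local jobs="${3:-4}"   # concurrent uploads; 4 is an arbitrary default
  find "$src_dir" -maxdepth 1 -type f -print0 |
    xargs -0 -P "$jobs" -I{} ${HADOOP_CMD:-hadoop} fs -D dfs.replication=2 -put {} "$dest"
}
```

A call might look like `upload_parallel /data/logs /hdfs/incoming/dt=2013-09-30 4`, where `/data/logs` is a made-up local path. Whether more clients actually help depends on where the bottleneck is (local disk, network, or the DataNodes), so the job count is worth testing rather than assuming.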

References:

Why create internal and external temporary tables

http://anyoneking.com/archives/127

Why put the data manually instead of letting Hive load it:

The extra Distcp step during Load Data in Hive, and optimizing cluster I/O

http://blog.sina.com.cn/s/blog_4112736d0101cxeh.html

Hadoop MapReduce: uploading files to HDFS

http://blog.csdn.net/shallowgrave/article/details/7818133

Uploading files to HDFS

http://blog.csdn.net/royesir/article/details/5747399

0 0