Hive 1.2.2 + Hadoop 2.7.3: Importing the Miqi (米骑) Test Logs and Optimizing the Data (Part 5)

Source: Internet. Editor: 程序博客网. Time: 2024/06/05 02:08

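Hive is a data warehouse component built on top of Hadoop. It provides SQL-like operations (insert, delete, update, query) over data stored in Hadoop.

A Hive table usually corresponds to files under an HDFS path. Commonly used Hive commands are listed below.

Start the Hive shell: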

./bin/hive shell

List all tables:

show tables;

Create a table:

create table t_1(a int, b int, c int) row format delimited fields terminated by '\t';

Alter a table (add a column):

alter table t_1 add columns(d String);

Load data from the local file system:

load data local inpath '/testdata/words.txt' overwrite into table t_1;

Load a file that is already in HDFS:

load data inpath 'hdfs://master:9000/testdata/words.txt' overwrite into table t_1;

And so on.
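After a load, it is worth checking what actually landed in the table. The statements below are a minimal sketch that reuses the table t_1 from the commands above. Note that load data local inpath copies the local file into the table's directory, whereas load data inpath (without local) moves the HDFS file, so the source file no longer sits at its original HDFS path afterwards.

```sql
-- Show the column names and types of the table
describe t_1;

-- Peek at a few rows to confirm the '\t' delimiter split the fields as intended;
-- columns that come back as NULL usually indicate a delimiter or type mismatch
select * from t_1 limit 5;

-- Count the loaded rows (this launches a MapReduce job in Hive 1.x)
select count(*) from t_1;
```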

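Next, we import the KPI figures and other statistics computed from the Miqi test server's access logs into Hive tables.

(1) Download link for the program that computes the KPIs from the Miqi access logs: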

http://download.csdn.net/detail/cafebar123/9889939

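(2) Create the Hive tables

First create two tables: t_ip, which holds the number of visits per IP, and t_remote_user, which holds the number of visits per previous (referring) link.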
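The post does not show the DDL for these two tables. The following is only a sketch, assuming each line of the MapReduce output is a key and a count separated by a tab (the default separator of Hadoop's TextOutputFormat). The column names ip, referer and cnt are hypothetical and not taken from the original.

```sql
-- Hypothetical schema: one row per client IP with its visit count
create table t_ip(ip string, cnt int)
row format delimited fields terminated by '\t';

-- Hypothetical schema: one row per previous (referring) link with its count
create table t_remote_user(referer string, cnt int)
row format delimited fields terminated by '\t';
```

If the job writes a different separator, the terminated by clause has to match it, otherwise the columns will load as NULL.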

Then load the data produced by the Hadoop job:

load data inpath 'hdfs://master:9000/user/hadoop/ipCountOutput/part-r-00000' overwrite into table t_ip;

At this point, the table t_ip corresponds to the t_ip directory in HDFS. t_remote_user is handled in the same way.
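One way to check this mapping from inside the Hive CLI is sketched below. The path /user/hive/warehouse is the default value of hive.metastore.warehouse.dir and is an assumption here. The Location line printed by describe formatted shows the real directory for your setup.

```sql
-- Print the table metadata, including the "Location:" line with the HDFS directory
describe formatted t_ip;

-- List the files in the table directory (default warehouse path assumed)
dfs -ls /user/hive/warehouse/t_ip;

-- Peek at a few of the loaded rows
select * from t_ip limit 10;
```

Because load data inpath moves the file, part-r-00000 should now appear in the table directory rather than under ipCountOutput.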

(3) Table optimization

1) Next, try a partitioned table and load the entire Miqi test server log into it.

Create a new table and give it a partition column:

create table t_log(ip String,remote_user String,block1 String,local_time String,time_field String,tie_zone String,request_type String,request String,req_status String,resp_status int,body_bytes_sent String,http_referer String,user_agent String,req_language String) partitioned by(req_month String) row format delimited fields terminated by ' ';

The table has 14 regular columns, and req_month is the partition column.

Load the log data:

load data inpath 'hdfs://master:9000/user/hadoop/miqiLog10000Input/miqizuche10000.log' overwrite into table t_log partition(req_month='0709');

Result: the statement fails with the following error:

ValidationFailureSemanticException table is not partitioned but partition spec exists

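This happens because the table has no such partition column. If the table was created without a partition column whose name matches the one in the partition spec, adding a partition raises this error.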
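A sketch of how one might diagnose this from the Hive CLI follows. One possible cause, which is an assumption and not confirmed by the post, is that the t_log definition stored in the metastore was created without a partitioned by clause, for example by an earlier attempt.

```sql
-- Inspect the stored definition: a partitioned table lists req_month
-- under a "# Partition Information" section
describe formatted t_log;

-- Alternatively, print the DDL that Hive has recorded for the table
show create table t_log;

-- If req_month is missing, the table has to be recreated with
-- partitioned by(req_month String). Dropping a managed table also
-- deletes its data, so check the Location first.

-- After a successful load into the partition, these should work:
show partitions t_log;
select count(*) from t_log where req_month='0709';
```

Filtering on req_month lets Hive read only the matching partition directory instead of scanning the whole table, which is the point of partitioning the log by month.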
