Hive study notes

Source: Internet. Editor: 程序博客网. Time: 2024/06/03 15:46

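//When to use mapjoin: 1. one of the tables in the join is very small; 2. the join uses a non-equality condition
//a is the small table, b is the large table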
select /*+ MAPJOIN(a) */ a.gid,a.ip,b.bfd_gid,b.cid from TB_A as a join TB_B as b on (b.l_date='2016-08-03' and a.gid=b.gid)

//Copying a table's structure into a new, empty table in Hive:
CREATE TABLE table1 LIKE table2;
CREATE TABLE TB_A LIKE TB_B;

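//Compression comparison
lz4 is without doubt the fastest overall, and snappy is also quite good.
snappy compresses faster than lz4, but lz4 decompresses far faster than snappy.
Next is zlib, which has the highest compression ratio of the group but is slower than the two above.
Last is aplib, the slowest; it seems to do well at compressing PE files, but its algorithm is closed source.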

//ORC compression in Hive
stored as orc tblproperties ('orc.compress'='SNAPPY')
Allowed values: NONE (no compression), ZLIB, SNAPPY; ZLIB is the default.
Reference: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC
http://www.iteblog.com/archives/1014
http://www.tuicool.com/articles/Jreaei

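//Using the collect_set function in Hive
select col1,col2,concat_ws(',',collect_set(col3)) from TB_N group by col1,col2;
collect_set gathers the distinct col3 values of each group into an array, and concat_ws joins them with commas.
Reference: http://my.oschina.net/repine/blog/295961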
http://wangjunle23.blog.163.com/blog/static/117838171201310222309391/
http://stackoverflow.com/questions/6445339/collect-set-in-hive-keep-duplicates

//Adding a partition in Hive
alter table TB_N add IF NOT EXISTS partition (l_date='$DATE');
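Since the statement above takes its date from a shell variable, adding several partitions at once can be sketched as a small loop. This is only an illustration: the function name build_add_partitions and the sample dates are made up, and the hive call is left commented out so the snippet runs without a cluster.

```shell
#!/bin/sh
# build_add_partitions TABLE DATE...: print one ADD PARTITION statement per date.
# (Hypothetical helper; the table name and dates below are sample values.)
build_add_partitions() {
  table=$1
  shift
  for d in "$@"; do
    printf "alter table %s add IF NOT EXISTS partition (l_date='%s');\n" "$table" "$d"
  done
}

hql=$(build_add_partitions TB_N 2016-08-01 2016-08-02)
printf '%s\n' "$hql"

# Uncomment to actually run the statements:
# hive -e "$hql"
```

IF NOT EXISTS makes the statements safe to re-run for dates whose partitions already exist.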

//like in Hive
Reference: http://www.xuebuyuan.com/1932675.html
'_' matches any single character, and '%' matches any number of characters.
select 1 from lxw_dual where 'football' like 'foot%';

//Setting the Hive execution engine
set hive.execution.engine=mr;

//Dropping a table partition
ALTER TABLE TB_N DROP IF EXISTS PARTITION (dt='2008-08-08');

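//Loading data in Hive
Notes: 1. When loading data, try to load the files under a directory rather than the directory itself.
2. Hive stores table names in lowercase by default, and the underlying file names are lowercase as well.
3. For fields stored as an empty string (not \N), query them with: select * from TableName where em=''
LOAD DATA INPATH '$inPutPath/part-*' OVERWRITE INTO TABLE TB_N
LOAD DATA LOCAL INPATH '/home/admin/test/test.txt' OVERWRITE INTO TABLE test_1 PARTITION (l_date='$date')
[OVERWRITE] replaces the data already in the table; without it, existing data is kept.
[LOCAL] means the source file is on the local file system; without it, the path refers to a file on HDFS.
Reference: http://blog.csdn.net/wacthamu/article/details/40744217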
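Note 1 above (load files, not a directory) can be enforced before the statement is ever sent to Hive. The sketch below is one way to do that for local files; build_load_stmt is a hypothetical helper, and the table name and date are sample values.

```shell
#!/bin/sh
# build_load_stmt FILE TABLE DATE: print a LOAD DATA LOCAL statement,
# or fail if FILE is not a regular file (for example, a directory).
# (Hypothetical helper, not part of Hive.)
build_load_stmt() {
  file=$1
  table=$2
  l_date=$3
  if [ ! -f "$file" ]; then
    echo "refusing to load '$file': not a regular file" >&2
    return 1
  fi
  printf "LOAD DATA LOCAL INPATH '%s' OVERWRITE INTO TABLE %s PARTITION (l_date='%s');\n" \
    "$file" "$table" "$l_date"
}

# Demo with a throwaway file and with a directory.
tmp_file=$(mktemp)
build_load_stmt "$tmp_file" test_1 2016-08-03
build_load_stmt /tmp test_1 2016-08-03 || echo "directory rejected"
rm -f "$tmp_file"
```

The printed statement can then be passed to hive -e in the same way as the other examples in these notes.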

//Replacing | with , in Hive
regexp_replace(t1.dg_info[3],'\\\\|',',')

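//Checking for null and '' (empty values)
1. A number can be compared with a numeric column using !=
2. A quoted number can be compared with a numeric column using !=
3. A quoted number can be compared with a quoted-number (string) column using !=
4. A string cannot be compared with a numeric column using !=
5. A string cannot be compared with a numeric column using <>
6. Null check: IS NULL. Applies to: all types
7. Non-null check: IS NOT NULL. Applies to: all types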

// Showing the data file path behind rows of a Hive table
select INPUT__FILE__NAME,BLOCK__OFFSET__INSIDE__FILE from TB_A where l_date='2016-05-06' limit 200;

// Hive create-table statement ('\t' below is the separator between fields)
hql="create table if not exists TB_N(
gid string comment 'gid',
method string comment 'method',
cust_item string comment 'customer+iid',
time_stamp string comment 'timestamp'
)
partitioned by (l_date string)
row format delimited
fields terminated by '\t'
collection items terminated by ',';"

//Put limit at the very end of the statement, not inside the nested query
Form that caused problems:
hql="select * from
( select concat('ck:',gid) as gid,method,concat(customer,'+',iid) as cust_item,
unix_timestamp(concat('${date}',' 23:59:59')) as time_stamp
from TB_N where l_date='${date}' limit 100
) as a "
hive -e "$hql"
Form that works:
hql="select * from
( select concat('ck:',gid) as gid,method,concat(customer,'+',iid) as cust_item,
unix_timestamp(concat('${date}',' 23:59:59')) as time_stamp
from TB_N where l_date='${date}'
) as a limit 100"
hive -e "$hql"

//Passing shell variables into a Hive statement
MEvent="MEvent"
hql="select * from TB_N where method='${MEvent}'"
hive -e "$hql"
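The substitution above happens in the shell, not in Hive, so it depends on how the statement is quoted. A minimal sketch of the difference, using the same variable name as the example:

```shell
#!/bin/sh
MEvent="MEvent"

# Double quotes: the shell expands ${MEvent} before Hive sees the statement.
hql_expanded="select * from TB_N where method='${MEvent}'"

# Single quotes: nothing is expanded, so Hive receives the literal text ${MEvent}.
hql_literal='select * from TB_N where method="${MEvent}"'

printf '%s\n' "$hql_expanded" "$hql_literal"
```

If a statement has to stay single-quoted, Hive's own variable substitution (for example hive --hivevar MEvent=... together with ${hivevar:MEvent} in the statement) is an alternative worth checking in the Hive documentation.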

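//Inserting into a partitioned table: the partition column
The key point about partitions: the partition column is not one of the table's data columns. It is not stored in the data files at all and exists only for partitioning.
A directory named after the partition value is created under the table's HDFS directory, so the partition column never needs to be selected or inserted. TB_A has 4 columns (gid, method, cust_item, time_stamp) plus one partition column, dt.
INSERT into TABLE TB_A PARTITION (dt='...') select gid,method,cust_item,time_stamp from TB_B
Reference: http://www.2cto.com/kf/201210/160777.html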

//Emptying a non-partitioned table and emptying one partition of a partitioned table
insert overwrite table TB_a select * from TB_A where 1=0;
insert overwrite table TB_a PARTITION (dt='2015-08-30') select gid,method ,cust_item,time_stamp from TB_a where 1=0

//Dates and timestamps
select unix_timestamp('2016-05-04 23:59:59') as time_stamp from TB_N limit 10
select count(*) from TB_N where from_unixtime(cast(update_time as int),'yyyy-MM-dd') between '2016-04-26' and '2016-07-26'

//Functions for adding and subtracting days
date_add(), date_sub()
e.g. select date_sub('2012-12-08',10) from TB_N;
e.g. select date_sub('2012-12-08 10:03:01',10) from TB_N;
Both return "2012-11-28".

//if expression
e.g. select if(category <> '','2',null) as cate_level

//String concatenation: the concat_ws() and concat() functions
e.g. select concat_ws('$',split(a.category,',')[0],split(a.category,',')[1]) as category
e.g. select concat(split(a.category,',')[0],'$',split(a.category,',')[1]) as category

//Querying one partition, limited to 10 rows
select * from TB_N where l_date='2015-08-30' limit 10;

//Rows where the value is not empty
select * from TB_N where category <> '' limit 100
select * from TB_N where category is not null limit 100

//Inserting the result of a query
insert overwrite table TB_N select * from TB_B where XXX limit 30;

//Dropping a table in Hive
DROP TABLE IF EXISTS employee;

//Deleting part of the data in a Hive table
insert overwrite table TB_N select * from TB_N where XXXX;
XXXX is the condition that selects the rows you want to keep.

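//About the escapes "\\" and "\\\\"
http://blog.csdn.net/lxpbs8851/article/details/18712407
Inside hive -e "..." the dot has to be written as '\\\\.', as in hive -e ".... split('192.168.0.1','\\\\.') ... "; otherwise the value you get back is null.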
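Why four backslashes are needed can be checked in the shell alone, without running Hive. In the sketch below (the variable names are only for illustration), shell double quotes collapse each pair of backslashes into one, so four become two; Hive's string literal parsing is then expected to turn those two into one, leaving the regular expression \. that matches a literal dot.

```shell
#!/bin/sh
# Written with four backslashes inside shell double quotes.
pattern_dq="\\\\."
# Written with two backslashes inside shell single quotes (nothing is collapsed).
pattern_sq='\\.'

# Both lines print the same text, two backslashes and a dot: this is what Hive receives.
printf '%s\n' "$pattern_dq" "$pattern_sq"

# The pattern can be spliced into a statement; expanding a variable
# does not process its backslashes again.
hql="select split('192.168.0.1','${pattern_sq}')"
printf '%s\n' "$hql"
```

printf '%s' is used instead of echo because some shells' echo interprets backslashes itself, which would hide the effect being shown.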

//Wildcards
hql="select * from TB_N where category like '%$%$%'"
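Because this pattern contains $ inside a double-quoted shell string, it is worth being explicit about what the shell does with it. Shells such as bash and dash leave a $ followed by % alone, but a $ followed by a name is expanded, so escaping it as \$ is the safer habit. A small sketch; the variable cat is a made-up example:

```shell
#!/bin/sh
# Escaped dollar signs: Hive receives the pattern '%$%$%' unchanged.
hql_escaped="select * from TB_N where category like '%\$%\$%'"
printf '%s\n' "$hql_escaped"

# What goes wrong without escaping when a name follows the dollar sign:
unset cat
risky="like '%$cat%'"    # $cat is expanded by the shell (to nothing here)
safe="like '%\$cat%'"    # the text $cat is passed through literally
printf '%s\n' "$risky" "$safe"
```

In a like pattern the $ character has no special meaning, so '%$%$%' matches values that contain at least two $ characters.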

//Create-table statement (external table)
hive -e "
CREATE EXTERNAL TABLE IF NOT EXISTS TB_N(
gid string ,
latend_category_list String
)
PARTITIONED BY (l_date string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
stored as TextFile
LOCATION '/user/bre/dtf/latend_result_out';"

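//Difference between insert into and insert overwrite
insert into table TB_N only appends data;
insert overwrite table TB_N first deletes all data in the targeted storage directory (when a partition is specified, only that partition's data is deleted and other partitions are untouched), then loads the new data.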

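//Checking whether a Hive table is partitioned and getting the partition column names
show create table table_name;
If the table is partitioned, the output contains a PARTITIONED BY clause, and the columns listed in it are the partition columns.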
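Reading the partition columns out of that output can be scripted. The sketch below assumes the layout Hive typically prints (a PARTITIONED BY line, then one backquoted column per line, with the last one closed by a parenthesis); the sample DDL is made up for the demo, and partition_columns is a hypothetical helper.

```shell
#!/bin/sh
# partition_columns: read "show create table" output on stdin and print the
# column names listed in its PARTITIONED BY clause, one per line.
# Assumes one backquoted column per line after the clause header.
partition_columns() {
  awk '
    /^PARTITIONED BY/ { in_clause = 1; next }     # clause header line
    in_clause && match($0, /`[^`]+`/) {
      print substr($0, RSTART + 1, RLENGTH - 2)   # name without the backquotes
    }
    in_clause && /\)[ \t]*$/ { in_clause = 0 }    # closing parenthesis ends the clause
  '
}

# Demo on sample output; prints l_date.
partition_columns <<'EOF'
CREATE TABLE `tb_n`(
  `gid` string COMMENT 'gid',
  `method` string COMMENT 'method')
PARTITIONED BY (
  `l_date` string)
ROW FORMAT DELIMITED
EOF

# Against a real table (not run here):
# hive -e 'show create table TB_N' | partition_columns
```

An empty result means no PARTITIONED BY clause was found, which under the same layout assumption indicates a non-partitioned table.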

//Testing in SQL that a column is not equal to a string
column_name<>'string'
column_name!='string'

//Showing a table's columns
desc mid_up_item_profile

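//What is the difference between insert overwrite and insert into
insert overwrite replaces the existing data: if N existing rows are identical to the row being inserted, only the one newly inserted row remains afterwards, because the old rows are deleted rather than deduplicated;
insert into simply appends without any duplicate check: if N existing rows are identical to the row being inserted, there are N+1 such rows afterwards.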

//What is the difference between Union and Union All
UNION removes duplicate rows; UNION ALL keeps them.
Reference: http://blog.csdn.net/yenange/article/details/7169654

0 0