Hive Notes, Part 3 (Table Property Operations, HQL)

Source: Internet | Editor: 程序博客网 | Time: 2024/04/28 08:13

1. Table property operations

1.1 Renaming a table

alter table oldtable rename to newtable;

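1.2 Changing a column's name (or comment)

alter table tablename change column oldcol newcol string comment 'description' after col0; (or use first instead of after col0 to move the column to the first position)

1.3 Adding columns

alter table tablename add columns (col1 string, col2 int comment 'column 2');

1.4 Changing table properties (to view them, run desc formatted tablename and look under Table Parameters)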

alter table tablename set tblproperties('comment'='xxxxxxx');
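The statements in 1.2 to 1.4 can be tried end to end on a throwaway table. The following is a minimal sketch; the table demo_user and its columns are hypothetical names chosen for illustration.

```sql
-- Hypothetical scratch table used only for this walkthrough.
create table if not exists demo_user (id int, name string);

-- 1.2: rename column "name" to "user_name" and attach a comment.
alter table demo_user change column name user_name string comment 'user name';

-- 1.3: append two new columns at the end of the schema.
alter table demo_user add columns (age int comment 'age', city string);

-- 1.4: set a table-level property.
alter table demo_user set tblproperties ('comment' = 'demo user table');

-- Inspect the result: columns, comments, and the Table Parameters section.
desc formatted demo_user;

-- List only the table properties.
show tblproperties demo_user;
```

These alter statements change metadata only; existing data files are not rewritten.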

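1.5 Changing the delimiter works differently for non-partitioned and partitioned tables

Change the delimiter (non-partitioned): alter table tablename set serdeproperties('field.delim' = '\t');

Change the delimiter (partitioned): alter table tablename partition(dt='20160916') set serdeproperties('field.delim' = '\t');

1.6 Changing a table's location (for an internal table, dropping the table after the change also deletes the files at the new location)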

alter table tablename set location 'hdfs://localhost:9000/location';

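Partitioned table: alter table tablename [partition(...)] set location 'hdfs://localhost:9000/location';

1.7 Converting between internal and external tables

alter table tablename set tblproperties('EXTERNAL' = 'TRUE'); -- internal table to external table

alter table tablename set tblproperties('EXTERNAL' = 'FALSE'); -- external table to internal table

Note that the property key EXTERNAL must be written in uppercase; a lowercase 'external' is stored as an ordinary property and does not change the table type.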
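One way to confirm that a conversion took effect is to check the Table Type field reported by desc formatted. A small sketch, assuming a hypothetical table demo_ext:

```sql
-- Hypothetical scratch table; created as an internal (managed) table.
create table if not exists demo_ext (id int, name string);

-- Convert it to an external table.
alter table demo_ext set tblproperties ('EXTERNAL' = 'TRUE');

-- The "Table Type" line should now read EXTERNAL_TABLE instead of MANAGED_TABLE.
desc formatted demo_ext;

-- Convert it back to an internal table.
alter table demo_ext set tblproperties ('EXTERNAL' = 'FALSE');
```

While the table is external, dropping it removes only the metadata and leaves the files in place.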

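1.8 Dynamic partitions

Check Hive's mode with set hive.exec.dynamic.partition.mode;. The default is hive.exec.dynamic.partition.mode=strict, under which an insert must specify at least one static partition; with nonstrict, no partition value needs to be specified and all partition columns can be dynamic.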
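A dynamic-partition insert might look like the sketch below. The tables logs_raw (assumed to exist with columns uid, url, dt) and logs_part are hypothetical.

```sql
-- Enable dynamic partitioning and allow every partition column to be dynamic.
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

-- Hypothetical target table, partitioned by day.
create table if not exists logs_part (uid string, url string)
partitioned by (dt string);

-- No dt value is given in the partition clause; Hive derives it per row.
-- The dynamic partition column must be the last column of the select list.
insert overwrite table logs_part partition (dt)
select uid, url, dt
from logs_raw;
```

Under strict mode the same insert would be rejected, because the partition clause names no static partition value.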

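2. Advanced HQL queries

2.1 Query operations: group by, order by, join, distribute by, sort by, cluster by (distributes by the given key, then sorts), union all; under the hood they are all implemented as MapReduce jobs

2.2 Simple aggregations

count(*), count(1), count(col): counting (count(col) counts only rows where col is not NULL)

sum: summation over values that can be converted to numbers; integer input returns bigint, e.g. sum(col) + cast(1 as bigint)

avg: average over values that can be converted to numbers; returns double

distinct: number of distinct values, as in count(distinct col)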
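The aggregates above can be combined in a single query. A minimal sketch, assuming a hypothetical table orders with a string column uid and an int column amount:

```sql
select
  count(*)            as total_rows,     -- every row, including rows with NULLs
  count(amount)       as rows_with_amt,  -- rows where amount is not NULL
  count(distinct uid) as distinct_users, -- number of different uid values
  sum(amount)         as total_amount,   -- bigint for an int column
  avg(amount)         as avg_amount      -- double
from orders;
```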

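2.3 order by: sorts by the given columns (any where condition is evaluated on the map side; avoid order by when possible). It can sort on multiple columns and sorts strings lexicographically by default. order by is a global sort: it needs a reduce phase and always runs with a single reducer, regardless of configuration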

select * from tablename order by col1 desc,col2 asc;

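2.4 group by: groups rows by the values of the given columns, putting rows with the same value together (note: every non-aggregate column in the select list must appear in group by, so the select list holds only grouping columns and aggregates; group by can also take an expression, e.g. substr(col, 1, 4))

Characteristics: it uses a reduce phase and is bound by the number of reducers, which can be set with set mapred.reduce.tasks=5;. The number of output files equals the number of reducers, and each file's size depends on how much data its reducer processed

Problems: heavy network load; data skew, which can be mitigated with set hive.groupby.skewindata=true

select col, count(1) as num from tablename group by col;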
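Grouping by an expression together with the two settings mentioned above could be sketched as follows. The column dt (a string such as '20160916') is an assumption made for illustration.

```sql
-- Use five reducers, so the job writes up to five output files.
set mapred.reduce.tasks=5;

-- Ask Hive to spread skewed keys over an extra aggregation stage.
set hive.groupby.skewindata=true;

-- Group by an expression: the first six characters of dt (year and month).
-- The same expression appears in both the select list and group by.
select substr(dt, 1, 6) as month, count(1) as num
from tablename
group by substr(dt, 1, 6);
```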

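2.5 Join

Two tables m and n are joined according to the on condition: one record from m and one record from n are combined into a new record. join is an equi-join, and a row is produced only when the value exists in both m and n;

left outer join: every row of the left table is output whether or not it has a match in the right table, while values from the right table are output only when they match a row in the left table;

left semi join is similar to exists; mapjoin completes the join on the map side without a reduce phase, joining in memory, and is an optimization.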
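The three join types can be compared side by side on the tables m and n used in these notes. This is a sketch; the column names col, col2 and col3 follow the later examples and are otherwise assumptions.

```sql
-- Inner equi-join: only keys present in both m and n produce rows.
select m.col, m.col2, n.col3
from m join n on m.col = n.col;

-- Left outer join: every row of m is kept; n.col3 is NULL when n has no match.
select m.col, m.col2, n.col3
from m left outer join n on m.col = n.col;

-- Left semi join: rows of m that have at least one match in n, like exists.
-- Only columns of the left table may appear in the select list.
select m.col, m.col2
from m left semi join n on m.col = n.col;
```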

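2.6 MapJoin

On the map side the small table is loaded into memory; the large table is then read and joined against the in-memory small table. This relies on the distributed cache;

Pros and cons: it consumes no reduce resources on the cluster, speeds up execution, and lowers network load;

it occupies some memory, so the table loaded into memory must not be too large, because every compute node loads its own copy; it also produces a larger number of small files;

Set the following parameter so that Hive automatically chooses between a common join and a map join based on the SQL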

set hive.auto.convert.join = true;

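hive.mapjoin.smalltable.filesize defaults to 25 MB

The second way is to specify it manually with a hint:

select /*+ mapjoin(n) */ m.col, m.col2, n.col3 from m join n on m.col = n.col;

To summarize, mapjoin fits two scenarios: 1. one of the joined tables is very small; 2. non-equi joins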
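To see which join strategy Hive actually picks, the plan can be inspected with explain. The sketch below assumes the tables m and n from the examples above; whether the hint is honored depends on the Hive version and on hive.ignore.mapjoin.hint, so treat that setting as something to verify on your cluster.

```sql
-- Let Hive convert a common join to a map join when the small table fits.
set hive.auto.convert.join=true;

-- Size threshold for the small table, in bytes (25 MB).
set hive.mapjoin.smalltable.filesize=25000000;

-- If the conversion applies, the plan contains a Map Join Operator.
explain
select m.col, m.col2, n.col3
from m join n on m.col = n.col;

-- In releases where hints are ignored by default, re-enable them first.
set hive.ignore.mapjoin.hint=false;

explain
select /*+ mapjoin(n) */ m.col, m.col2, n.col3
from m join n on m.col = n.col;
```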

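2.7 Distribute by and sort by

Distribute spreads data: distribute by col sends rows to different reducers according to col (the number of reducers can be set, so this can spread the data over several files and can also merge many files into one)

Sort: sort by col sorts the data by col

select col1,col2 from m distribute by col1 sort by col1 asc,col2 desc; Used together, the two clauses ensure that each reducer's output is sorted

distribute by compared with group by:

Both partition the data by key and both use a reduce phase. The only difference is that distribute by merely spreads the data, while group by gathers rows with the same key and must be followed by an aggregation

order by compared with sort by:

order by is a global sort; sort by only guarantees that the output of each reducer is sorted. With a single reducer it behaves the same as order by

Use cases: map output files of uneven size; reduce output files of uneven size; too many small files; oversized files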
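For the small-files and uneven-files cases, one common pattern is to rewrite the data through a fixed number of reducers. A sketch, assuming the table m from above has string columns col1 and col2; the target table m_sorted is hypothetical.

```sql
-- Hypothetical target table with the same columns as m.
create table if not exists m_sorted (col1 string, col2 string);

-- Four reducers, so the rewrite produces up to four output files.
set mapred.reduce.tasks=4;

-- Rows with the same col1 go to the same reducer (and the same file);
-- inside each file rows are ordered by col1 ascending, then col2 descending.
insert overwrite table m_sorted
select col1, col2
from m
distribute by col1
sort by col1 asc, col2 desc;
```

Setting mapred.reduce.tasks=1 instead would merge everything into a single file, at the cost of running the whole rewrite on one reducer.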

2.8 cluster by

Gathers rows with the same value together and sorts them; cluster by col is equivalent to distribute by col sort by col
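The equivalence can be written out as two queries that distribute and sort the same way. A minimal sketch on the table m used earlier:

```sql
-- Short form: distribute by col1 and sort by col1 within each reducer.
select col1, col2
from m
cluster by col1;

-- Long form with the same effect.
select col1, col2
from m
distribute by col1
sort by col1;
```

The long form is needed when the distribution key and the sort key differ, or when a descending sort is wanted.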

2.9 union all (merges the data of several tables into one result; Hive releases before 1.2.0 support only union all, not plain union)

select col from (select a as col from t1 union all select b as col from t2) tmp;

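Requirements for union all:

1. The column names are the same

2. The column types are the same

3. The number of columns is the same

4. The individual subqueries must not have aliases

5. To query data from the merged result, the merged result must have an alias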
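The requirements above can be met with column aliases and casts when the source tables differ. A sketch, assuming hypothetical tables t1 (string column a) and t2 (int column b):

```sql
select tmp.col, count(1) as num
from (
  -- Both branches expose one column with the same name and type.
  select a as col from t1
  union all
  -- b is assumed to be an int, so it is cast to match the string column a.
  select cast(b as string) as col from t2
) tmp                 -- the merged result needs an alias to be queried
group by tmp.col;
```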

0 0

Hive知识点三(表的属性操作、HQL)

1、表的属性操作

1.1、修改表名

alter table oldtable rename to newtable;

1.2、修改表的列名(或描述信息)

alter table tablename change column oldcol newclo string comment '描述’ after col0;(或者使用first放到第一位)

1.3、添加字段

alter table tablename add columns(col1 string,col2 int commect '字段2');

1.4、修改表的属性(查看desc formatted tablename table parameters)

alter table tablename set tblproperties('comment'='xxxxxxx');

1.5、针对无分区表与有分区表修改分隔符不同

修改分隔符(无分区): alter table tablename set serdeproperties('field.delim' = '\t');

修改分隔符(有分区): alter table tablename partition(dt='20160916') set serdeproperties('field.delim' = '\t');

1.6、修改表的 location(修改后删除表时，文件也会被删除)

alter table tablename set location 'hdfs://localhost:9000/location';

分区表:alter table tablename [partition(...)] set location 'hdfs://localhost:9000/location';

1.7、内部表与外部表转换

alter table tablename set tblproperties('external' = 'true')；//内部表转外部表

alter table tablename set tblproperties('external' = 'false')；//外部表转内部表

1.8、动态分区

查看hive的模式: set hive.exec.dynamic.partition.mode;默认是hive.exec.dynamic.partition.mode=strict;（nonstrict不需要指定分区）;当是strict时，插入数据时需要指定分区

2、HQL高级查询

2.1、查询操作group by 、order by、join、distribute by、sort by、cluster by(按照指定键聚合后排序)、union all；底层的实现都是 MapReduce

2.2、简单的聚合操作

count(*) count(1) count(col)计数

sum求和（可转成数字的值返回bigint->sum(col) + cast(1 as bigint)）

avg求平均值（可转成数字的值返回double）

distinct不同值个数count(distinct)

2.3、order by（按照某些字段进行排序，where条件放在map计算尽量不要使用）可以多列进行排序，默认按字典排序 、order by为全局排序、order by需要reduce操作，且只有一个reduce，与配置无关

select * from tablename order by col1 desc,col2 asc;

2.4、group by按照某些字段的值进行分组，有相同值放到一起（注意:select后面非聚合列必须出现在group by中除了普通的列就是一些聚合操作,group by后面也可以跟表达式，比如substr(col)）

特性:使用reduce操作，受限于reduce数量，设置reduce参数set mapred.reduce.tasks=5;输出文件个数与reduce数相同，文件大小与reduce处理的数据量有关

问题:网络负载过重；数据倾斜，优化参数set hive.groupby.skewindata=true

slect col,count(1) as num from tablename group by col;

2.5、Join

两个表m，n之间按照on条件连接，m中的一记录和n中的一条记录组成一条新记录；join等值连接，只有某个值在mhen中同时存在时；

left outer join 左外连接，左边表中的值无论是否存在b中存在时，都输出，右边表中的值只有在左边表中存在时才输出；

left semi join类似exists；mapjoin 在Map端完成join操作，不需要用Reduce，基于 内存做join，属于优化操作。

2.6、MapJoin

在map端把小表加载到内存中，然后读取大表，和内存中的小表完成连接操作，其中使用了分布式缓存技术；

优缺点:不消耗集群的Reduce资源；加快程序执行；降低网络负载；

占用部分内存，所以加载到内存中的表不能过大，因为每个计算节点都会加载一次；生成较多的小文件；

配置以下参数，是hive自动根据sql，选择使用common join或者map join

set hive.auto.convert.join = true;

hive.mapjoin.smalltable.filesize默认是25m

第二种方式，手动指定:

select /*+mapjoin(n)*/ m.col,m.col2,n.col3 from m join n on m.col=n.col;

简单总结一下mapjoin的使用场景： 1、关联操作中有一张表非常小；2、不等值的链接操作；

2.7、Distribute by和sort by

Distribute分散数据： distribute by col 按照col列把数据分散到不同reduce(可以设置reduce数量，可以把数据分散到不同文件，同时可以把多个文件合并成一个文件)

Sort排序： sort by col 按照co列l把 数据排序

select col1,col2 from m distribute by col1 sort by col1 asc,col2 desc;两者结合出现，确保每个reduce的输出都是有序的

distribute by 与group by对比：

都是按照key值划分数据；都是使用reduce操作；唯一不同，distribute by只是单纯的分散数据，而group by把相同key的数据聚集到一起，后续必须是聚合操作

order by 与sort by

order by是全局排序；sort by只是确保某个reduce上面输出的数据有序，如果只有一个reduce时，和order 不要作用一样

应用场景: map输出的文件大小不均匀；reduce输出文件大小不均匀；小文件过多；文件超大

2.8、cluster by

把有相同值的数据聚集到一起，并排序；效果 cluster by col - > distribute by col order by col

2.9、union all(多个表的数据合并放一个表，hive不支持union)

select col from (select a as col from t1 union all select b as col from t2) tmp;

union all的要求:

1、字段名字一样

2、字段类型一样

3、字段个数一样

4、子表不能有别名

5、 如果需要从合并之后的表中查询数据，那么合并的表必须要有别名

2.3、order by（按照某些字段进行排序，where条件放在map计算尽量不要使用）可以多列进行排序，默认按字典排序、order by为全局排序、order by需要reduce操作，且只有一个reduce，与配置无关

left semi join类似exists；mapjoin 在Map端完成join操作，不需要用Reduce，基于内存做join，属于优化操作。

select /+mapjoin(n)/ m.col,m.col2,n.col3 from m join n on m.col=n.col;

Sort排序： sort by col 按照co列l把数据排序

5、如果需要从合并之后的表中查询数据，那么合并的表必须要有别名