HiveQL Optimization


Hive optimization techniques:

Column Pruning

As the name suggests: discard columns that are not needed.

> select a, b from t where e < 10;

> t contains 5 columns (a, b, c, d, e)

Columns c and d are discarded

Select only the relevant columns

Enabled by default:

  > hive.optimize.cp=true
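A minimal sketch of this in a session, assuming the five-column table t above (the set statement only makes the default explicit):

  > set hive.optimize.cp=true;
  > select a, b from t where e < 10;   -- only a, b and e are read; c and d are pruned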


Predicate Pushdown

Move predicates as close to the table scan as possible.

Enabled by default:

hive.optimize.ppd=true

Predicates are pushed across joins; the following two queries are equivalent under PPD:

select * from t1 join t2 on (t1.c1=t2.c2 and t1.c1<10)

select * from t1 join t2 on (t1.c1=t2.c2) where t1.c1<10

Special needs for outer joins:

Left outer join: predicates on the left side aliases are pushed

Right outer join: predicates on the right side aliases are pushed

Full outer join: none of the predicates are pushed
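A small sketch of the left-outer-join rule, reusing the hypothetical tables t1 and t2 from above; the WHERE predicate on the left alias can be pushed to the scan of t1, while an equivalent predicate on the right alias could not be pushed below the join without changing the result:

select * from t1 left outer join t2 on (t1.c1=t2.c2)
where t1.c1 < 10;    -- predicate on the left alias: pushed to the scan of t1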

Non-deterministic functions (e.g. rand()) are not pushed.

Use annotation:

> @UDFType(deterministic=false)

The entire expression containing a non-deterministic function is not pushed, e.g.:

> c1>10 and c2<rand()
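A sketch of that rule in the join context from above, assuming for illustration that t1 also has a column c2; the whole conjunction stays above the scan, even though t1.c1 > 10 on its own would be pushable:

select * from t1 join t2 on (t1.c1=t2.c2)
where t1.c1 > 10 and t1.c2 < rand();   -- nothing here is pushed to the scan of t1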


Partition Pruning

Reduce list of partitions to be scanned

Works on the parse tree currently; some known bugs.

select * from 

     (select c1, count(1) from t group by c1) subq

where subq.prtn=100;

select * from t1 join

     (select * from t2) subq on(t1.c1=subq.c2)

where subq.prtn=100;

hive.mapred.mode=nonstrict

In strict mode, a scan of a complete partitioned table (with no partition predicate) fails.
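A minimal sketch of the strict-mode behaviour, assuming t is partitioned on prtn as in the examples above:

set hive.mapred.mode=strict;
select c1, count(1) from t group by c1;                  -- rejected: no partition predicate
select c1, count(1) from t where prtn=100 group by c1;   -- allowed: only partition prtn=100 is scanned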


Hive QL - Join

INSERT OVERWRITE TABLE pv_users

SELECT pv.pageid,u.age

FROM page_view pv

JOIN user u

ON (pv.userid=u.userid);

The rightmost table is streamed, whereas the other (inner) tables' data is kept in memory for a given key. Use the largest table as the rightmost table.
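If reordering the FROM clause is awkward, Hive also accepts a STREAMTABLE hint to choose the streamed table explicitly. A small sketch with the same tables, assuming page_view is the largest:

SELECT /*+ STREAMTABLE(pv) */ pv.pageid, u.age
FROM page_view pv JOIN user u
ON (pv.userid=u.userid);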

hive.mapred.mode=nonstrict

In strict mode, Cartesian products are not allowed.


INSERT OVERWRITE TABLE pv_users

SELECT pv.pageid, u.age

FROM page_view pv JOIN user u

ON (pv.userid=u.userid)

JOIN newuser x ON (u.userid=x.userid);

Same join key: merged into 1 map-reduce job. This holds for any number of tables with the same join key.

1 map-reduce job instead of 'n'

The merging happens for OUTER joins also
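One way to confirm the merge is to inspect the plan; an EXPLAIN on the query above should show a single map-reduce join stage (sketch, output omitted):

EXPLAIN
SELECT pv.pageid, u.age
FROM page_view pv JOIN user u ON (pv.userid=u.userid)
JOIN newuser x ON (u.userid=x.userid);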


INSERT OVERWRITE TABLE pv_users

SELECT pv.pageid, u.age

FROM page_view pv JOIN user u

    ON (pv.userid=u.userid)

    JOIN newuser x ON (u.age=x.age);

Different join keys - 2 map-reduce jobs

Same as:

INSERT OVERWRITE TABLE tmptable SELECT *

FROM page_view pv JOIN user u

  ON (pv.userid=u.userid);

INSERT OVERWRITE TABLE pv_users

SELECT x.pageid, x.age

FROM tmptable x JOIN newuser y ON (x.age=y.age);


Join Optimizations

Map Joins

 > User-specified small tables are stored in hash tables on the mapper, backed by JDBM

 > No reducer needed

INSERT INTO TABLE pv_users

SELECT /*+ MAPJOIN(pv) */ pv.pageid, u.age

FROM page_view pv JOIN user u

ON(pv.userid=u.userid);


Map Join

>>Optimization phase

>> n-way map-join if (n-1) tables are map-side readable

>> Mapper reads all (n-1) tables before processing the main table under consideration

>>Map-side readable tables are cached in memory and backed by JDBM persistent hash tables

Parameters

>>hive.join.emit.interval=1000

>>hive.mapjoin.size.key=10000

>>hive.mapjoin.cache.numrows=10000
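A sketch of how these might be combined with the MAPJOIN hint from the previous example; the values shown are just the defaults listed above made explicit:

set hive.mapjoin.size.key=10000;
set hive.mapjoin.cache.numrows=10000;
SELECT /*+ MAPJOIN(pv) */ pv.pageid, u.age
FROM page_view pv JOIN user u ON (pv.userid=u.userid);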


Future

>>Sizes/statistics to determine join order

>>Sizes to enforce map-join

>>Better techniques for handling skew for a given key

>>Using sorted properties of the table

>>Fragmented joins

>>n-way joins for different join keys by replicating data


Hive QL - Group By

SELECT pageid, age, count(1) 

FROM pv_users

GROUP BY pageid, age;


Group by Optimizations

>>Map side partial aggregations

     > Hash-based aggregates

     > Serialized key/values in hash tables

     > 90% speed improvement on the query

         select count(1) from t;

>> Load balancing for data skew

Parameters

>> hive.map.aggr=true

>> hive.groupby.skewindata=false

>> hive.groupby.mapaggr.checkinterval=100000

>> hive.map.aggr.hash.percentmemory=0.5

>> hive.map.aggr.hash.min.reduction=0.5
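A minimal sketch combining these with the pv_users query above; hive.groupby.skewindata=true is shown only to illustrate the skew-handling (two-job) plan, the default listed above being false:

set hive.map.aggr=true;
set hive.groupby.skewindata=true;
SELECT pageid, age, count(1)
FROM pv_users
GROUP BY pageid, age;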


Multi GroupBy

FROM pv_users

        INSERT OVERWRITE TABLE pv_gender_sum

             SELECT gender, count(DISTINCT userid),count(userid)

                   GROUP BY gender

        INSERT OVERWRITE TABLE pv_age_sum

              SELECT age, count(DISTINCT userid)

                   GROUP BY age

>>n+1 map-reduce jobs instead of 2n

>> Single scan of input table

>> Same distinct key across all group-bys

>> Always use multi-group-by


Merging of small files

>> Lots of small files create problems for downstream jobs

     > SELECT * FROM T WHERE x<10;

>> hive.merge.mapfiles=true

>> hive.merge.mapredfiles=false

>> hive.merge.size.per.task=256*1000*1000

>> Increases time for current query
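A sketch of these settings around the small-output query above; hive.merge.mapredfiles is flipped on so reduce-side outputs are merged too, and t_small is a hypothetical target table used only for illustration:

set hive.merge.mapfiles=true;
set hive.merge.mapredfiles=true;
set hive.merge.size.per.task=256000000;
INSERT OVERWRITE TABLE t_small   -- t_small: hypothetical target table
SELECT * FROM T WHERE x < 10;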