Hash Join Optimization
hash join -- http://lizhen3708693.iteye.com/blog/1631360
access vs. filter -- http://blog.csdn.net/kkdelta/article/details/7938653
"access" means the required rows are located by some lookup mechanism, and only that result set is read.
"filter" means all rows are visited first, and the unwanted ones are then discarded.
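To make the distinction concrete, here is a minimal Python sketch (the dict plays the role of a hash/index lookup; the data and names are made up for illustration):

```python
# Illustrative sketch of "access" vs. "filter" on hypothetical rows.
rows = [{"id": i, "cat": i % 3} for i in range(9)]

# access: a lookup structure (here a dict) locates only the matching rows.
index = {}
for r in rows:
    index.setdefault(r["cat"], []).append(r)
access_result = index[1]                     # touches only rows with cat == 1

# filter: visit every row, then discard the ones that do not match.
filter_result = [r for r in rows if r["cat"] == 1]

print(len(access_result), len(filter_result))  # -> 3 3 (same rows found)
```

Both paths return the same rows; they differ only in how many rows are visited along the way.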
A few days ago I worked through a SQL tuning case. The statement is shown below; BIG_TABLE is about 150 GB, while SMALL_TABLE is tiny: just over 9,000 rows, less than 1 MB. hash_area_size and sort_area_size were both set large enough to allow an optimal (in-memory) hash join and an in-memory sort.
select
/*+ leading(b) use_hash(a b) */
distinct
a.ID
from
BIG_TABLE a, SMALL_TABLE b
where
(a.category = b.from_cat
or
a.category2 = b.from_cat)
and
a.site_id = b.site_id
and
a.sale_end >= sysdate;
The execution plan:
--------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)|
--------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 2 | 174 | 18 (17)|
| 1 | SORT UNIQUE | | 2 | 174 | 18 (17)|
|* 2 | HASH JOIN | | 2 | 174 | 17 (12)|
| 3 | TABLE ACCESS FULL | SMALL_TABLE | 1879 | 48854 | 14 (8)|
|* 4 | TABLE ACCESS FULL | BIG_TABLE | 4 | 244 | 3 (34)|
--------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - access("A"."SITE_ID"="B"."SITE_ID")
filter("A"."CATEGORY"="B"."FROM_CAT" OR
"A"."CATEGORY2"="B"."FROM_CAT")
4 - filter("A"."SALE_END">=SYSDATE@!)
At first glance the plan looks perfect and the SQL hints are spot on: the small table builds the hash table, and the big table probes it. Experience suggests the total run time should be close to the time of a full table scan (FTS) of BIG_TABLE.
But an FTS of BIG_TABLE takes about 8 minutes, while the whole SQL ran for 3 to 4 hours.
So where does the problem lie?
The FTS time should not vary much, so the problem must be in the hash join itself. Set an event to trace the hash join:
alter session set events '10104 trace name context forever, level 2';
### Hash table ###
# NOTE: The calculated number of rows in non-empty buckets may be smaller
# than the true number.
Number of buckets with 0 rows: 16373
Number of buckets with 1 rows: 0
Number of buckets with 2 rows: 0
Number of buckets with 3 rows: 1
Number of buckets with 4 rows: 0
Number of buckets with 5 rows: 0
Number of buckets with 6 rows: 0
Number of buckets with 7 rows: 1
Number of buckets with 8 rows: 0
Number of buckets with 9 rows: 0
Number of buckets with between 10 and 19 rows: 1
Number of buckets with between 20 and 29 rows: 1
Number of buckets with between 30 and 39 rows: 3
Number of buckets with between 40 and 49 rows: 0
Number of buckets with between 50 and 59 rows: 0
Number of buckets with between 60 and 69 rows: 0
Number of buckets with between 70 and 79 rows: 0
Number of buckets with between 80 and 89 rows: 0
Number of buckets with between 90 and 99 rows: 0
Number of buckets with 100 or more rows: 4
### Hash table overall statistics ###
Total buckets: 16384 Empty buckets: 16373 Non-empty buckets: 11
Total number of rows: 9232
Maximum number of rows in a bucket: 2531
Average number of rows in non-empty buckets: 839.272705
Look closely: the fullest bucket holds 2,531 rows. A bucket is a linked list, so those thousands of rows are all chained together on one list.
This suggests the hash key used to build the hash table has too few distinct values and too many duplicates; otherwise this many rows should not land in the same bucket.
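The effect of a skewed build key can be reproduced in a few lines of Python: a toy 16-bucket hash table with invented (site_id, from_cat) rows, where most rows share one site_id, in the spirit of SMALL_TABLE here. Oracle's internals differ; this only illustrates the chaining.

```python
# Toy chained hash table: how a skewed build key produces long bucket chains.
NBUCKETS = 16
# Invented build rows (site_id, from_cat); site_id is heavily skewed.
rows = [(0, c) for c in range(100)] + [(2, c) for c in range(50)]

def longest_chain(key_fn):
    buckets = [[] for _ in range(NBUCKETS)]
    for r in rows:
        buckets[hash(key_fn(r)) % NBUCKETS].append(r)  # collisions chain up
    return max(len(b) for b in buckets)

# Keying on site_id alone piles all duplicates into two chains...
print(longest_chain(lambda r: r[0]))         # -> 100 (one huge chain)
# ...while the composite key spreads rows across the buckets.
print(longest_chain(lambda r: (r[0], r[1]))) # far shorter chains
```

Every probe against the hot key must walk that one huge chain, which is exactly what the trace above shows at full scale.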
The join condition involves two columns, from_cat and site_id, so by enumeration there are three possibilities:
1. Build hash table based on (site_id, from_cat):
SQL> select site_id, from_cat, count(*)
     from SMALL_TABLE
     group by site_id, from_cat
     having count(*) > 100;

no rows selected
2. Build hash table based on (from_cat):
SQL> select from_cat, count(*)
     from SMALL_TABLE
     group by from_cat
     having count(*) > 100;

no rows selected
3. Build hash table based on (site_id):
SQL> select site_id, count(*)
     from SMALL_TABLE
     group by site_id
     having count(*) > 100;

   SITE_ID   COUNT(*)
---------- ----------
         0       2531
         2       2527
       146       1490
       210       2526
At this point we can see that case 3, building on site_id, matches these two lines in the trace file:
Number of buckets with 100 or more rows: 4
Maximum number of rows in a bucket: 2531
Note: the same conclusion can be read from the "Predicate Information" section of the plan:
access("A"."SITE_ID"="B"."SITE_ID")
So the hash table must have been built on site_id alone. BIG_TABLE contains a huge number of rows with site_id = 0; every one of them probes the bucket with the longest linked list, and most of them walk the entire chain only to be discarded at the end. This makes the hash join extremely inefficient, degenerating almost into a nested-loop operation.
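The "almost a nested loop" claim can be put in rough numbers. Assuming, purely for illustration, one million BIG_TABLE rows with site_id = 0 (an invented figure; the table size is real, the row count is not), each one pays for the full 2,531-entry chain in the worst case:

```python
# Back-of-the-envelope probe cost against the 2531-row bucket from the trace.
chain_len = 2531          # longest chain, from the 10104 trace output
probing_rows = 1_000_000  # hypothetical count of site_id = 0 rows in BIG_TABLE

# A probe row that matches nothing still compares against every chain entry
# before it can be discarded, so the worst case is:
comparisons = probing_rows * chain_len
print(comparisons)        # -> 2531000000
```

Billions of predicate evaluations for a single hot key is nested-loop-scale work, which explains the 3-4 hour runtime.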
With the root cause found, the problem solves itself.
Ideally the hash table should be built on (site_id, from_cat), so the culprit must be the OR. Rewrite the OR as a UNION:
select
/*+ leading(b) use_hash(a b) */
distinct
a.ID
from
BIG_TABLE a, SMALL_TABLE b
where
a.category = b.from_cat
and
a.site_id = b.site_id
and
a.sale_end >= sysdate
UNION
select
/*+ leading(b) use_hash(a b) */
distinct
a.ID
from
BIG_TABLE a, SMALL_TABLE b
where
a.category2 = b.from_cat
and
a.site_id = b.site_id
and
a.sale_end >= sysdate;
--------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)|
--------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 2 | 148 | 36 (59)|
| 1 | SORT UNIQUE | | 2 | 148 | 36 (59)|
| 2 | UNION-ALL | | | | |
|* 3 | HASH JOIN | | 1 | 74 | 17 (12)|
| 4 | TABLE ACCESS FULL| SMALL_TABLE | 1879 | 48854 | 14 (8)|
|* 5 | TABLE ACCESS FULL| BIG_TABLE | 4 | 192 | 3 (34)|
|* 6 | HASH JOIN | | 1 | 74 | 17 (12)|
| 7 | TABLE ACCESS FULL| SMALL_TABLE | 1879 | 48854 | 14 (8)|
|* 8 | TABLE ACCESS FULL| BIG_TABLE | 4 | 192 | 3 (34)|
--------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
3 - access("A"."CATEGORY"="B"."FROM_CAT" AND
"A"."SITE_ID"="B"."SITE_ID")
5 - filter("A"."SALE_END">=SYSDATE@!)
6 - access("A"."CATEGORY2"="B"."FROM_CAT" AND
"A"."SITE_ID"="B"."SITE_ID")
8 - filter("A"."SALE_END">=SYSDATE@!)
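As a sanity check on the rewrite itself: because the query selects DISTINCT a.ID and UNION also deduplicates, splitting the OR into two joins returns the same set of IDs. A small Python sketch with made-up miniature tables:

```python
# Hypothetical miniature tables to check the OR -> UNION rewrite.
big = [  # (id, category, category2, site_id)
    (1, 10, 20, 0), (2, 20, 10, 0), (3, 30, 30, 2), (4, 99, 99, 2),
]
small = [  # (from_cat, site_id)
    (10, 0), (30, 2),
]

# Original: one join with an OR-ed predicate, then DISTINCT.
with_or = {a[0] for a in big for b in small
           if (a[1] == b[0] or a[2] == b[0]) and a[3] == b[1]}

# Rewrite: two joins (category, then category2); UNION deduplicates.
branch1 = {a[0] for a in big for b in small if a[1] == b[0] and a[3] == b[1]}
branch2 = {a[0] for a in big for b in small if a[2] == b[0] and a[3] == b[1]}

assert with_or == branch1 | branch2   # same result set
print(sorted(with_or))                # -> [1, 2, 3]
```

The rewrite trades one scan of BIG_TABLE for two, but in exchange each join gets an equality predicate on both columns, so both can be used as the hash key.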
At first sight this plan looks worse than the first one, since BIG_TABLE is scanned twice. But look at the hash table structure now:
### Hash table ###
# NOTE: The calculated number of rows in non-empty buckets may be smaller
# than the true number.
Number of buckets with 0 rows: 9306
Number of buckets with 1 rows: 5310
Number of buckets with 2 rows: 1436
Number of buckets with 3 rows: 285
Number of buckets with 4 rows: 43
Number of buckets with 5 rows: 4
Number of buckets with 6 rows: 0
Number of buckets with 7 rows: 0
Number of buckets with 8 rows: 0
Number of buckets with 9 rows: 0
Number of buckets with between 10 and 19 rows: 0
Number of buckets with between 20 and 29 rows: 0
Number of buckets with between 30 and 39 rows: 0
Number of buckets with between 40 and 49 rows: 0
Number of buckets with between 50 and 59 rows: 0
Number of buckets with between 60 and 69 rows: 0
Number of buckets with between 70 and 79 rows: 0
Number of buckets with between 80 and 89 rows: 0
Number of buckets with between 90 and 99 rows: 0
Number of buckets with 100 or more rows: 0
### Hash table overall statistics ###
Total buckets: 16384 Empty buckets: 9306 Non-empty buckets: 7078
Total number of rows: 9232
Maximum number of rows in a bucket: 5
Average number of rows in non-empty buckets: 1.304323
This is the hash table we want: the longest chain holds only five rows.
Total execution time dropped from three or four hours to 16 minutes, far exceeding the developer's expectations.
The problem could not be seen from the plan alone; it took an understanding of the hash join mechanism and a deeper level of analysis.