Postgresql数据库count(distinct)优化

来源：互联网发布：淘宝差评多久会不显示编辑：程序博客网时间：2024/06/15 08:41

基本信息

基本情况
表共800W数据，从260W的结果集中计算出不同的案件数量(130万)，需要执行20多秒
原SQL内容

select count(distinct  c_bh_aj) as ajcount     from db_znspgl.t_zlglpt_wt     where d_cjrq between '20160913' and '20170909';

表信息和数据量

znspgl=# \d+ db_znspgl.t_zlglpt_wt                            Table "db_znspgl.t_zlglpt_wt" Column  |          Type          | Modifiers | Storage  | Stats target | Description ---------+------------------------+-----------+----------+--------------+------------- c_bh    | character(32)          | not null  | extended |              | 编号 c_bh_aj | character(32)          |           | extended |              | 案件编号 n_ajbs  | numeric(15,0)          |           | main     |              | 案件标识 c_zjgz  | character varying(600) |           | extended |              | 质检规则 c_zjxm  | character varying(300) |           | extended |              | 质检项目 d_cjrq  | date                   |           | plain    |              | 创建日期Indexes:    "pk_zlglpt_wt" PRIMARY KEY, btree (c_bh)    "i_t_zlglpt_wt_ajbs" btree (n_ajbs)    "i_t_zlglpt_wt_bh_aj" btree (c_bh_aj)    "i_t_zlglpt_wt_cjrq" btree (d_cjrq)znspgl=# select count(*) from db_znspgl.t_zlglpt_wtznspgl-# ;  count  --------- 8000000(1 row)

数据库版本信息

znspgl=# select version();                                                                 version                                                      -------------------------------------------------------------------------------------------- PostgreSQL 9.5.5 (ArteryBase 3.5.3, Thunisoft). on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-17), 64-bit(1 row)

执行计划

znspgl=# explain analyze select count(distinct  c_bh_aj) as ajcount from db_znspgl.t_zlglpt_wt where d_cjrq between '20160913' and '20170909';                                                                     QUERY PLAN                                               ------------------------------------------------------------------------------------------------------------------------------ Aggregate  (cost=313357.40..313357.41 rows=1 width=33) (actual time=23478.562..23478.563 rows=1 loops=1)   ->  Bitmap Heap Scan on t_zlglpt_wt  (cost=55811.21..306782.09 rows=2630125 width=33) (actual time=366.909..3946.452 rows=2644330 loops=1)         Recheck Cond: ((d_cjrq >= '2016-09-13'::date) AND (d_cjrq <= '2017-09-09'::date))         Rows Removed by Index Recheck: 2670504         Heap Blocks: exact=105741 lossy=105694         ->  Bitmap Index Scan on i_t_zlglpt_wt_cjrq  (cost=0.00..55153.68 rows=2630125 width=0) (actual time=341.468..341.468 rows=2644330 loops=1)               Index Cond: ((d_cjrq >= '2016-09-13'::date) AND (d_cjrq <= '2017-09-09'::date)) Planning time: 0.143 ms Execution time: 23478.624 ms

尝试增加覆盖索引

增加索引

create index i_zlglpt_wt_zh01 on db_znspgl.t_zlglpt_wt (d_cjrq,c_bh_aj);

再次查看执行计划

znspgl=# explain analyze select count(distinct  c_bh_aj) as ajcount from db_znspgl.t_zlglpt_wt where d_cjrq between '20160913' and '20170909';                                                                          QUERY PLAN                                          -------------------------------------------------------------------------------------------------------------------------------------------------------------- Aggregate  (cost=134006.11..134006.12 rows=1 width=33) (actual time=21696.556..21696.557 rows=1 loops=1)   ->  Index Only Scan using i_zlglpt_wt_zh01 on t_zlglpt_wt  (cost=0.56..127480.16 rows=2610380 width=33) (actual time=0.055..2684.807 rows=2644330 loops=1)         Index Cond: ((d_cjrq >= '2016-09-13'::date) AND (d_cjrq <= '2017-09-09'::date))         Heap Fetches: 0 Planning time: 0.318 ms Execution time: 21696.604 ms

思考
1、SQL速度提升很少！
2、时间主要话费在Aggregate上了，时间从2648一下子升级到21696。
3、理论上200W的count(distinct) 不应该花费19秒那么长时间，而且c_bh_aj还是有序的（建立索引了）

伪loose index scan

从网上看到一片帖子《分析MySQL中优化distinct的技巧》，count distinct 慢的原因是因为扫描编号时会扫描到很多重复的项，可以通过loose index scan避免这些重复的扫描（前提distinct项是有序的！），mysql 和 abase虽然不支持原生的loose index scan（oracle支持），但是可以通过改写SQL达到！

重新建立索引

drop index db_znspgl.i_zlglpt_wt_zh01;create index i_zlglpt_wt_zh01 on db_znspgl.t_zlglpt_wt (c_bh_aj,d_cjrq);

改写SQL

select count(*) from  (   select distinct(c_bh_aj)         from db_znspgl.t_zlglpt_wt        where d_cjrq between '20160913' and '20170909'    ) t;

查看执行计划

znspgl=# explain analyze select count(*) from  (select distinct(c_bh_aj)  from db_znspgl.t_zlglpt_wt where d_cjrq between '20160913' and '20170909' ) t;                                                                             QUERY PLAN                                       ------------------------------------------------------------------------------------------------------------------------------ Aggregate  (cost=347567.23..347567.24 rows=1 width=0) (actual time=6954.845..6954.846 rows=1 loops=1)   ->  Unique  (cost=0.56..343310.31 rows=340554 width=33) (actual time=0.034..5969.209 rows=1322165 loops=1)         ->  Index Only Scan using i_zlglpt_wt_zh01 on t_zlglpt_wt  (cost=0.56..336784.36 rows=2610380 width=33) (actual time=0.031..2840.502 rows=2644330 loops=1)               Index Cond: ((d_cjrq >= '2016-09-13'::date) AND (d_cjrq <= '2017-09-09'::date))               Heap Fetches: 0 Planning time: 0.172 ms Execution time: 6954.890 ms(7 rows)

通过timing 计算SQL执行时间

znspgl=# \timing onTiming is on.znspgl=#  select count(*) from  (select distinct(c_bh_aj)  from db_znspgl.t_zlglpt_wt where d_cjrq between '20160913' and '20170909' ) t;  count  --------- 1322165(1 row)Time: 1322.715 ms

总结

通过伪loose index scan的SQL处理可以有效提高count(distinct)的执行速度！

阅读全文

0 0