HAWQ取代传统数仓实践(十一)——维度表技术之维度合并

来源:互联网 发布:安卓python编程软件 编辑:程序博客网 时间:2024/06/08 16:53
        有一种合并维度的情况,就是本来属性相同的维度,因为某种原因被设计成重复的维度属性。例如,在销售订单示例中,随着数据仓库中维度的增加,我们会发现有些通用的数据存在于多个维度中。客户维度的客户地址相关信息、送货地址相关信息里都有邮编、城市和省份。下面说明如何把客户维度里的两个邮编相关信息合并到一个新的维度中。

一、修改数据仓库表结构

        为了合并维度,需要改变数据仓库表结构。图1显示了修改后的结构。新增了一个zip_code_dim邮编信息维度表,sales_order_fact事实表的结构也做了相应的修改。

图1

        zip_code_dim维度表与销售订单事实表相关联。这个关系替换了事实表与客户维度的关系。sales_order_fact表需要两个关系,一个关联到客户地址邮编,另一个关联到送货地址邮编,相应的增加了两个外键字段。假设邮编相关信息不会修改,因此zip_code_dim表中没有是否删除、版本号、生效日期等SCD属性。
        下面的脚本用于修改数据仓库模式,所做的修改如下。
  • 创建邮编维度表zip_code_dim。
  • 初始装载邮编相关数据。
  • 基于zip_code_dim表创建v_customer_zip_code_dim和v_shipping_zip_code_dim视图。
  • 在sales_order_fact表上增加customer_zip_code_sk和shipping_zip_code_sk列。
  • 基于已有的客户邮编和送货邮编初始装载两个邮编代理键。
  • 在customer_dim表上删除客户和送货邮编及其它们的城市和州列。
  • 在pa_customer_dim上删除客户的城市、州和邮编列。
set search_path=tds; -- 建立邮编维度表    create table zip_code_dim ( zip_code_sk serial,     zip_code int,     city varchar(30),     state varchar(2));  comment on table zip_code_dim is '邮编维度表';      comment on column zip_code_dim.zip_code_sk is '邮编维度代理键';      comment on column zip_code_dim.zip_code is '邮编';      comment on column zip_code_dim.city is '城市';   comment on column zip_code_dim.state is '省份';      -- 初始装载邮编相关数据insert into zip_code_dim (zip_code, city, state)select distinct *   from (select customer_zip_code, customer_city, customer_state          from customer_dim         where customer_zip_code is not null         union all         select shipping_zip_code, shipping_city, shipping_state           from customer_dim         where shipping_zip_code is not null) t1; -- 创建视图    create view v_customer_zip_code_dim (customer_zip_code_sk, customer_zip_code, customer_city, customer_state) as    select * from zip_code_dim;        create view v_shipping_zip_code_dim (shipping_zip_code_sk, shipping_zip_code, shipping_city, shipping_state) as    select * from zip_code_dim;-- 添加邮编代理键alter table sales_order_fact add column customer_zip_code_sk int default null;alter table sales_order_fact add column shipping_zip_code_sk int default null;comment on column sales_order_fact.customer_zip_code_sk is '客户邮编代理键';    comment on column sales_order_fact.shipping_zip_code_sk is '送货邮编代理键';        -- 初始装载两个邮编代理键 create table sales_order_fact_bak as select * from sales_order_fact;truncate table sales_order_fact;insert into sales_order_factselect t1.order_number,       t1.customer_sk,       t1.product_sk,       t1.order_date_sk,       t1.year_month,       t1.order_amount,       t1.order_quantity,       t1.request_delivery_date_sk,       t1.sales_order_attribute_sk,       t2.customer_zip_code_sk,       t3.shipping_zip_code_sk     from sales_order_fact_bak t1  left join   (select a.order_number order_number,c.customer_zip_code_sk customer_zip_code_sk      from sales_order_fact_bak a,           customer_dim b,           v_customer_zip_code_dim c     where a.customer_sk = b.customer_sk       and b.customer_zip_code = c.customer_zip_code) t2 on t1.order_number = t2.order_number  left join  (select a.order_number order_number,c.shipping_zip_code_sk shipping_zip_code_sk      from sales_order_fact_bak a,           customer_dim b,           v_shipping_zip_code_dim c     where a.customer_sk = b.customer_sk       and b.shipping_zip_code = c.shipping_zip_code) t3 on t1.order_number = t3.order_number;   drop table sales_order_fact_bak; -- 在customer_dim表上删除客户和送货邮编及其它们的城市和州列。alter table customer_dim drop column customer_zip_code cascade;alter table customer_dim drop column customer_city;alter table customer_dim drop column customer_state;alter table customer_dim drop column shipping_zip_code;alter table customer_dim drop column shipping_city;alter table customer_dim drop column shipping_state; alter table pa_customer_dim drop column customer_zip_code;alter table pa_customer_dim drop column customer_city;alter table pa_customer_dim drop column customer_state;alter table pa_customer_dim drop column shipping_zip_code;alter table pa_customer_dim drop column shipping_city;alter table pa_customer_dim drop column shipping_state;  -- 重建相关视图create or replace view v_customer_dim_latest as   select customer_sk,         customer_number,          customer_name,         customer_street_address,         version,         effective_date,       shipping_address     from (select distinct on (customer_number) customer_number,                  customer_sk,                   customer_name,                 customer_street_address,                isdelete,                  version,                 effective_date,               shipping_address             from customer_dim           order by customer_number, customer_sk desc) as latest     where isdelete is false;  create or replace view v_customer_dim_his as   select *, date(lead(effective_date,1,date '2200-01-01') over (partition by customer_number order by effective_date)) expiry_date     from customer_dim;  create or replace view v_pa_customer_dim_latest as   select customer_sk,         customer_number,          customer_name,         customer_street_address,         version,         effective_date,       shipping_address     from (select distinct on (customer_number) customer_number,                  customer_sk,                   customer_name,                 customer_street_address,                isdelete,                  version,                 effective_date,               shipping_address             from pa_customer_dim           order by customer_number, customer_sk desc) as latest     where isdelete is false;  create or replace view v_pa_customer_dim_his as   select *, date(lead(effective_date,1,date '2200-01-01') over (partition by customer_number order by effective_date)) expiry_date     from pa_customer_dim;  
        说明:
  • 邮编维度的初始数据是从客户维度表中来,这只是为了演示数据装载的过程。客户的邮编信息很可能覆盖不到所有邮编,所以更好的方法是装载一个完整的邮编信息表。由于客户地址和送货地址可能存在交叉的情况,因此使用distinct去重。送货地址的三个字段是后加的,在此之前数据的送货地址为空,邮编维度表中不能含有NULL值,所以要加上where shipping_zip_code is not null过滤条件去除邮编信息为NULL的数据行。
  • 基于邮编维度表创建客户邮编和送货邮编视图,分别用作两个地理信息的角色扮演维度。
  • 把数据备份表sales_order_fact_bak中的数据装载回销售订单事实表,同时需要关联两个邮编角色维度视图,查询出两个代理键,装载到事实表中。注意老的事实表与新的邮编维度表是通过客户维度表关联起来的,所以在子查询中需要三表连接,然后用两个左外连接查询出所有原事实表数据,装载到新的增加了邮编维度代理键的事实表中。
  • 在customer_dim表上删除列时,需要使用cascade子句同时删除依赖它的视图,之后重建相关视图。

二、修改定期数据装载函数

        定期装载函数有三个地方的修改:
  •  删除客户维度装载里所有邮编信息相关的列,因为客户维度里不再有客户邮编和送货邮编相关信息。
  •  在事实表中引用客户邮编视图和送货邮编视图中的代理键。
  •  修改pa_customer_dim装载,因为需要从销售订单事实表的customer_zip_code_sk获取客户邮编。
        修改后的fn_regular_load函数如下。
create or replace function fn_regular_load ()              returns void as              $$              declare                  -- 设置scd的生效时间                v_cur_date date := current_date;                    v_pre_date date := current_date - 1;                v_last_load date;            begin                -- 分析外部表                analyze ext.customer;                analyze ext.product;                analyze ext.sales_order;                            -- 将外部表数据装载到原始数据表                truncate table rds.customer;                  truncate table rds.product;                             insert into rds.customer select * from ext.customer;                 insert into rds.product select * from ext.product;                insert into rds.sales_order           select order_number,                 customer_number,                 product_code,                 order_date,                 entry_date,                 order_amount,                 order_quantity,                 request_delivery_date,             verification_ind,             credit_check_flag,             new_customer_ind,             web_order_flag        from ext.sales_order;                                -- 分析rds模式的表                analyze rds.customer;                analyze rds.product;                analyze rds.sales_order;                            -- 设置cdc的上限时间                select last_load into v_last_load from rds.cdc_time;                truncate table rds.cdc_time;                insert into rds.cdc_time select v_last_load, v_cur_date;                            -- 装载客户维度                insert into tds.customer_dim                (customer_number,                 customer_name,                 customer_street_address,                 shipping_address,                isdelete,                 version,                 effective_date)                select case flag                             when 'D' then a_customer_number                            else b_customer_number                        end customer_number,                       case flag                             when 'D' then a_customer_name                            else b_customer_name                        end customer_name,                       case flag                             when 'D' then a_customer_street_address                            else b_customer_street_address                        end customer_street_address,                       case flag                             when 'D' then a_shipping_address                            else b_shipping_address                        end shipping_address,                     case flag                             when 'D' then true                            else false                        end isdelete,                       case flag                             when 'D' then a_version                            when 'I' then 1                            else a_version + 1                        end v,                       v_pre_date                  from (select a.customer_number a_customer_number,                               a.customer_name a_customer_name,                               a.customer_street_address a_customer_street_address,                               a.shipping_address a_shipping_address,                               a.version a_version,                               b.customer_number b_customer_number,                               b.customer_name b_customer_name,                               b.customer_street_address b_customer_street_address,                               b.shipping_address b_shipping_address,                               case when a.customer_number is null then 'I'                                    when b.customer_number is null then 'D'                                    else 'U'                                 end flag                          from v_customer_dim_latest a                           full join rds.customer b on a.customer_number = b.customer_number                          where a.customer_number is null -- 新增                            or b.customer_number is null -- 删除                            or (a.customer_number = b.customer_number                                 and not                                        (coalesce(a.customer_name,'') = coalesce(b.customer_name,'')                                     and coalesce(a.customer_street_address,'') = coalesce(b.customer_street_address,'')                                     and coalesce(a.shipping_address,'') = coalesce(b.shipping_address,'')                                     ))) t                         order by coalesce(a_customer_number, 999999999999), b_customer_number limit 999999999999;                         -- 装载产品维度                insert into tds.product_dim                (product_code,                 product_name,                 product_category,                      isdelete,                 version,                 effective_date)                select case flag                             when 'D' then a_product_code                            else b_product_code                        end product_code,                       case flag                             when 'D' then a_product_name                            else b_product_name                        end product_name,                       case flag                             when 'D' then a_product_category                            else b_product_category                        end product_category,                       case flag                             when 'D' then true                            else false                        end isdelete,                       case flag                             when 'D' then a_version                            when 'I' then 1                            else a_version + 1                        end v,                       v_pre_date                  from (select a.product_code a_product_code,                               a.product_name a_product_name,                               a.product_category a_product_category,                               a.version a_version,                               b.product_code b_product_code,                               b.product_name b_product_name,                               b.product_category b_product_category,                                              case when a.product_code is null then 'I'                                    when b.product_code is null then 'D'                                    else 'U'                                 end flag                          from v_product_dim_latest a                           full join rds.product b on a.product_code = b.product_code                          where a.product_code is null -- 新增                            or b.product_code is null -- 删除                            or (a.product_code = b.product_code                                 and not                                        (a.product_name = b.product_name                                     and a.product_category = b.product_category))) t                         order by coalesce(a_product_code, 999999999999), b_product_code limit 999999999999;                    -- 装载销售订单事实表                  insert into sales_order_fact                  select a.order_number,                         customer_sk,                         product_sk,              e.date_sk,                       e.year * 100 + e.month,                            order_amount,                     order_quantity,                 f.date_sk,             g.sales_order_attribute_sk,           h.customer_zip_code_sk,               i.shipping_zip_code_sk      from rds.sales_order a,                        v_customer_dim_his c,                         v_product_dim_his d,                         date_dim e,                  date_dim f,               sales_order_attribute_dim g,            v_customer_zip_code_dim h,               v_shipping_zip_code_dim i,               rds.customer j,           rds.cdc_time k       where a.customer_number = c.customer_number                     and a.order_date >= c.effective_date                   and a.order_date < c.expiry_date                      and a.product_code = d.product_code                     and a.order_date >= d.effective_date                   and a.order_date < d.expiry_date                      and date(a.order_date) = e.date               and date(a.request_delivery_date) = f.date         and a.verification_ind = g.verification_ind             and a.credit_check_flag = g.credit_check_flag             and a.new_customer_ind = g.new_customer_ind             and a.web_order_flag = g.web_order_flag        and a.customer_number = j.customer_number           and j.customer_zip_code = h.customer_zip_code       and j.shipping_zip_code = i.shipping_zip_code        and a.entry_date >= k.last_load and a.entry_date < k.current_load;                                  -- 重载PA客户维度              truncate table pa_customer_dim;                insert into pa_customer_dim                select distinct a.*                    from customer_dim a,             sales_order_fact b,             v_customer_zip_code_dim c          where c.customer_state = 'pa'          and b.customer_zip_code_sk = c.customer_zip_code_sk         and a.customer_sk = b.customer_sk;       -- 分析tds模式的表                analyze customer_dim;                analyze product_dim;                analyze sales_order_fact;     analyze pa_customer_dim;                -- 更新时间戳表的last_load字段                  truncate table rds.cdc_time;                insert into rds.cdc_time select v_cur_date, v_cur_date;                        end;              $$              language plpgsql;
        上面的函数需要注意两个地方。装载事实表数据时,除了关联两个邮编维度视图外,还要关联过渡区的rds.customer表。这是因为要取得邮编维度代理键,必须连接邮编代码字段,而邮编代码已经从客户维度表中删除,只有在源数据的客户表中保留。第二个改变是PA子维度的装载。州代码已经从客户维度表删除,被放到了新的邮编维度表中,而客户维度和邮编维度并没有直接关系,它们是通过事实表的客户代理键和邮编代理键产生联系,因此必须关联事实表、客户维度表、邮编维度表三个表才能取出PA子维度数据。这也就是把PA子维度的装载放到了事实表装载之后的原因。

三、测试

        按照以下步骤测试修改后的定期装载脚本。
  1. 对源数据的客户邮编相关信息做一些修改。
  2. 装载新的客户数据前,查询最后的客户和送货邮编,后面可以用改变后的信息和此查询的输出作对比。
  3. 新增销售订单源数据。
  4. 执行定期装载。
  5. 查询客户维度表、售订单事实表和PA子维度表,确认数据已经正确装载。
        执行下面的语句,对源数据的客户信息做以下两处修改:客户编号4的客户和送货邮编信息;新增一个编号15的客户。
update source.customer      set customer_street_address = '9999 louise dr.',         customer_zip_code = 17055,          customer_city = 'pittsburgh',         shipping_address = '9999 louise dr.',         shipping_zip_code = 17055,         shipping_city = 'pittsburgh'   where customer_number = 4;    insert into source.customer   values(15, 'super stores', '1000 woodland st.', 17055, 'pittsburgh', 'pa', '1000 woodland st.', 17055, 'pittsburgh', 'pa');    commit;
        现在在装载新的客户数据前查询最后的客户和送货邮编。后面可以用改变后的信息和此查询的输出作对比。查询语句如下。
select order_date_sk odsk,         customer_number cn,         customer_zip_code czc,         shipping_zip_code szc    from v_customer_zip_code_dim a,         v_shipping_zip_code_dim b,         sales_order_fact c,         customer_dim d   where a.customer_zip_code_sk = c.customer_zip_code_sk     and b.shipping_zip_code_sk = c.shipping_zip_code_sk     and d.customer_sk = c.customer_sk;  order by odsk;
        然后使用下面的语句新增两条销售订单。
set @order_date := from_unixtime(unix_timestamp('2017-05-30 00:00:01') + rand() * (unix_timestamp('2017-05-30 12:00:00') - unix_timestamp('2017-05-30 00:00:01')));  set @request_delivery_date := from_unixtime(unix_timestamp(date_add(current_date, interval 5 day)) + rand() * 86400);       set @amount := floor(1000 + rand() * 9000);       set @quantity := floor(10 + rand() * 90);    insert into source.sales_order values    (null, 4, 3, 'y', 'y', 'y', 'n',  @order_date, @request_delivery_date,          @order_date, @amount, @quantity);            set @order_date := from_unixtime(unix_timestamp('2017-05-30 12:00:00') + rand() * (unix_timestamp('2017-05-31 00:00:00') - unix_timestamp('2017-05-30 12:00:00')));        set @request_delivery_date := from_unixtime(unix_timestamp(date_add(current_date, interval 5 day)) + rand() * 86400);set @amount := floor(1000 + rand() * 9000);       set @quantity := floor(10 + rand() * 90);      insert into source.sales_order values    (null, 15, 4, 'y', 'n', 'y', 'n', @order_date, @request_delivery_date,         @order_date, @amount, @quantity);  commit;
        执行下面的命令定期装载。
~/regular_etl.sh
        查询customer_dim表,确认两个改变的客户,即编号4和15的客户,已经正确装载。
select customer_sk csk,         customer_number cnum,         customer_name cnam,         customer_street_address csd,         shipping_address sd,         version,         effective_date,         expiry_date    from v_customer_dim_his    where customer_number in (4, 15);
        查询结果如图2所示。
图2

        查询sales_order_fact表里的两条新销售订单,确认邮编已经正确装载。
select a.order_number onum,         e.customer_number cnum,         b.customer_zip_code czc,         c.shipping_zip_code szc,         f.product_code pc,         d.order_date od,         a.order_amount,         a.order_quantity    from sales_order_fact a,         v_customer_zip_code_dim b,         v_shipping_zip_code_dim c,         v_order_date_dim d,         customer_dim e,         product_dim f   where a.customer_sk = e.customer_sk     and a.product_sk = f.product_sk     and a.customer_zip_code_sk = b.customer_zip_code_sk     and a.shipping_zip_code_sk = c.shipping_zip_code_sk     and a.order_date_sk = d.order_date_sk   order by a.order_number desc  limit 2;  
        查询结果如图3所示。
图3

        查询v_pa_customer_dim_his视图,确认PA客户正确装载。
select customer_sk csk,         customer_number cnum,         customer_name cnam,         customer_street_address csa,         shipping_address sad,         version,         effective_date,         expiry_date    from v_pa_customer_dim_his order by customer_sk;
        查询结果如图4所示。
图4
阅读全文
0 0
原创粉丝点击