hive之异常值处理

来源:互联网 发布:ubuntu net snmp 编辑:程序博客网 时间:2024/06/06 22:42

NULL值类型

count(col_name) 如果col_name的值是NULL,那么COUNT是不会把它算进去的,所以想统计所有日志数要使用COUNT(1)

而想对非空列进行相关操作,需要使用col_name IS NOT NULL. 而不是LENGTH(col_name>1), 因为LENGTH(NULL)是没有结果的

---------------------------------------------------------20170608更新---------------------------------------------------------

除了上述问题,null值在逻辑统计方面也带来一些麻烦。楼主在使用时,还遇到了如下问题。

问题描述:有条件A、B、C,想统计全部满足任意不满足其中一种情况的下的数据量。

原始代码如下:

SELECT   COUNT(1) AS all_user,  SUM(CASE WHEN A AND B AND C THEN 1 ELSE 0 END) AS ok_user,  SUM(CASE WHEN A AND B AND C THEN 0 ELSE 1 END) AS case_user,  SUM(CASE WHEN !A THEN 1 ELSE 0 END) AS not_A,  SUM(CASE WHEN !B THEN 1 ELSE 0 END) AS not_B,  SUM(CASE WHEN !C THEN 1 ELSE 0 END) AS not_CFROM tb
按理说应该是not_A+not_B+not_C的值不小于case_user才对,但是楼主得到的数量是小于。一顿困惑后经大神点播发现了原因,还是NULL值作祟。
修改后的代码如下:
SELECT   COUNT(1) AS all_user,  SUM(CASE WHEN A AND B AND C THEN 1 ELSE 0 END) AS ok_user,  SUM(CASE WHEN A AND B AND C THEN 0 ELSE 1 END) AS case_user,  SUM(CASE WHEN !A OR A IS NULL THEN 1 ELSE 0 END) AS not_A,  SUM(CASE WHEN !B OR B IS NULL THEN 1 ELSE 0 END) AS not_B,  SUM(CASE WHEN !C OR C IS NULL THEN 1 ELSE 0 END) AS not_CFROM tb

上述查询只能是一个全局统计,并不能了解每一个user的情况。也就是说,如果用户存在多条记录,有的记录是满足A AND B AND C的,有的是不满足的,那怎么统计每种条件下每个用户的记录满足情况呢?~~~~~~~使用MIN和MAX~~~~~~~

SELECT  COUNT(1) AS all_user, SUM(ok) AS all_ok, SUM(not_A) AS all_not_A, SUM(not_B) AS all_not_B, SUM(not_C) AS all_not_CFROM(    SELECT      userid    , MIN(CASE WHEN A AND B AND C THEN 1 ELSE 0 END) AS ok    , MAX(CASE WHEN !A OR A IS NULL THEN 1 ELSE 0 END) AS not_A    , MAX(CASE WHEN !B OR B IS NULL THEN 1 ELSE 0 END) AS not_B    , MAX(CASE WHEN !C OR C IS NULL THEN 1 ELSE 0 END) AS not_C    FROM tb    GROUP BY userid) a

NaN值类型

利用sql进行一些数据处理操作时,有时会得到异常结果。比如当分母为0的时候,sql不会报错但是结果会是NaN。
可利用如下代码过滤这些异常值
select col_1, col_2, col_with_nanfrom my_tablewhere some_conditions  and cast(col_with_nan as String) <> 'NaN';


 
原创粉丝点击