Hive internals: how DISTINCT is implemented
Source: Internet. Editor: 程序博客网. Time: 2024/05/16 12:37
Preparing the data
hive> SELECT * FROM logs;
OK
a 苹果 3
a 橙子 3
a 烧鸡 1
b 烧鸡 3
Statement
hive> SELECT COUNT, COUNT(DISTINCT uid) FROM logs GROUP BY COUNT;
The query groups rows by the count column and computes the number of distinct users (uid) in each group.
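As a rough illustration of what the query should return on the four sample rows, here is a minimal Python sketch. It only emulates the semantics of the statement, not how Hive executes it. The column name `item` for the second column is an assumption, since the original does not name it.

```python
from collections import defaultdict

# Rows of the logs table as listed above: (uid, item, count).
# "item" is a hypothetical name; the source does not name this column.
rows = [
    ("a", "苹果", 3),
    ("a", "橙子", 3),
    ("a", "烧鸡", 1),
    ("b", "烧鸡", 3),
]


def count_distinct_uid_by_count(rows):
    """Emulate SELECT count, COUNT(DISTINCT uid) FROM logs GROUP BY count."""
    uids_per_count = defaultdict(set)
    for uid, _item, count in rows:
        # A set keeps each uid once per group, which is what DISTINCT means.
        uids_per_count[count].add(uid)
    return {count: len(uids) for count, uids in uids_per_count.items()}


result = count_distinct_uid_by_count(rows)
for count, distinct_users in sorted(result.items()):
    print(count, distinct_users)
```

In this sketch the group with count 3 contains users a and b, and the group with count 1 contains only user a.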
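Computation process
hive.map.aggr=true is set by default, so a first GROUP BY pass runs on the mapper side and the partial results are merged afterwards. The purpose is to reduce the amount of data the reducers have to process. Note that the mode shown by explain differs between the two sides: the mapper uses hash, the reducer uses mergepartial. If hive.map.aggr is set to false, the GROUP BY is done only in the reducer, and its mode is complete.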
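The difference between the three modes can be sketched in Python for the sum(count) GROUP BY uid query used in the explain output. This is a simplified model under stated assumptions, not Hive's actual code: the function names are hypothetical, and the split of the four sample rows across two mappers is invented for illustration.

```python
from collections import defaultdict


def map_side_hash_aggregate(split):
    """Model of mode: hash. Pre-aggregate one mapper's rows in a hash table."""
    partial = defaultdict(int)
    for uid, count in split:
        partial[uid] += count
    return list(partial.items())


def reduce_merge_partial(partials):
    """Model of mode: mergepartial. Merge partial sums coming from all mappers."""
    merged = defaultdict(int)
    for uid, partial_sum in partials:
        merged[uid] += partial_sum
    return dict(merged)


def reduce_complete(raw_rows):
    """Model of mode: complete. Aggregate raw rows entirely on the reducer."""
    total = defaultdict(int)
    for uid, count in raw_rows:
        total[uid] += count
    return dict(total)


# Hypothetical input splits: (uid, count) pairs from the sample table.
split_1 = [("a", 3), ("a", 3)]
split_2 = [("a", 1), ("b", 3)]

# hive.map.aggr=true: each mapper pre-aggregates, the reducer merges.
shuffled = map_side_hash_aggregate(split_1) + map_side_hash_aggregate(split_2)
with_map_aggr = reduce_merge_partial(shuffled)

# hive.map.aggr=false: raw rows are shuffled, the reducer does all the work.
raw_rows = split_1 + split_2
without_map_aggr = reduce_complete(raw_rows)

print(len(shuffled), len(raw_rows))
print(with_map_aggr == without_map_aggr)
```

Both paths give the same totals. The only difference in this model is how many records cross the shuffle: three pre-aggregated records instead of four raw rows.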
Operator
Explain
hive> explain SELECT uid, sum(count) FROM logs group by uid;
OK
ABSTRACT SYNTAX TREE:
  (TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAME logs))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_TABLE_OR_COL uid)) (TOK_SELEXPR (TOK_FUNCTION sum (TOK_TABLE_OR_COL count)))) (TOK_GROUPBY (TOK_TABLE_OR_COL uid))))

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        logs
          TableScan // scan the table
            alias: logs
            Select Operator // select the columns
              expressions:
                    expr: uid
                    type: string
                    expr: count
                    type: int
              outputColumnNames: uid, count
              Group By Operator // present because hive.map.aggr=true by default: the mapper aggregates once first, reducing the data the reducer has to process
                aggregations:
                      expr: sum(count) // aggregate function
                bucketGroup: false
                keys: // grouping keys
                      expr: uid
                      type: string
                mode: hash // hash mode, processHashAggr()
                outputColumnNames: _col0, _col1
                Reduce Output Operator // emits key/value pairs to the reducer
                  key expressions:
                        expr: _col0
                        type: string
                  sort order: +
                  Map-reduce partition columns:
                        expr: _col0
                        type: string
                  tag: -1
                  value expressions:
                        expr: _col1
                        type: bigint
      Reduce Operator Tree:
        Group By Operator
          aggregations:
                expr: sum(VALUE._col0) // aggregation
          bucketGroup: false
          keys:
                expr: KEY._col0
                type: string
          mode: mergepartial // merge the partial values
          outputColumnNames: _col0, _col1
          Select Operator // select the columns
            expressions:
                  expr: _col0
                  type: string
                  expr: _col1
                  type: bigint
            outputColumnNames: _col0, _col1
            File Output Operator // write to file
              compressed: false
              GlobalTableId: 0
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

  Stage: Stage-0
    Fetch Operator
      limit: -1
Reposted from: http://ju.outofmemory.cn/entry/784