Hive SQL Operations and Custom Functions (Part 1)


http://blog.csdn.net/u013980127/article/details/52604882


Features that behave the same as traditional SQL, or whose concepts are simple, are covered only briefly or not at all. This article is also not a word-for-word translation of the official documentation; it only distills what I consider necessary.

1 DDL

1.1 Create/Drop/Alter/Use Database

1.1.1 Create Database

CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] database_name
  [COMMENT database_comment]
  [LOCATION hdfs_path]
  [WITH DBPROPERTIES (property_name=property_value, ...)];


Note: DATABASE and SCHEMA are interchangeable.

1.1.2 Drop Database

DROP (DATABASE|SCHEMA) [IF EXISTS] database_name [RESTRICT|CASCADE];


Note: The default is RESTRICT; use CASCADE to drop a database that still contains tables.

1.1.3 Alter Database

ALTER (DATABASE|SCHEMA) database_name SET DBPROPERTIES (property_name=property_value, ...);   -- (Note: SCHEMA added in Hive 0.14.0)
ALTER (DATABASE|SCHEMA) database_name SET OWNER [USER|ROLE] user_or_role;   -- (Note: Hive 0.13.0 and later; SCHEMA added in Hive 0.14.0)

1.1.4 Use Database

USE database_name;
USE DEFAULT;

-- Get the current database
SELECT current_database();

1.2 Create/Drop/Truncate Table

1.2.1 Create Table

1.2.1.1 Syntax

CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name    -- (Note: TEMPORARY available in Hive 0.14.0 and later)
  [(col_name data_type [COMMENT col_comment], ...)]
  [COMMENT table_comment]
  [PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
  [CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
  [SKEWED BY (col_name, col_name, ...)                  -- (Note: Available in Hive 0.10.0 and later)
     ON ((col_value, col_value, ...), (col_value, col_value, ...), ...)
     [STORED AS DIRECTORIES]]
  [
   [ROW FORMAT row_format]
   [STORED AS file_format]
     | STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)]  -- (Note: Available in Hive 0.6.0 and later)
  ]
  [LOCATION hdfs_path]
  [TBLPROPERTIES (property_name=property_value, ...)]   -- (Note: Available in Hive 0.6.0 and later)
  [AS select_statement];   -- (Note: Available in Hive 0.5.0 and later; not supported for external tables)

CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
  LIKE existing_table_or_view_name
  [LOCATION hdfs_path];

data_type
  : primitive_type
  | array_type
  | map_type
  | struct_type
  | union_type  -- (Note: Available in Hive 0.7.0 and later)

primitive_type
  : TINYINT
  | SMALLINT
  | INT
  | BIGINT
  | BOOLEAN
  | FLOAT
  | DOUBLE
  | DOUBLE PRECISION -- (Note: Available in Hive 2.2.0 and later)
  | STRING
  | BINARY      -- (Note: Available in Hive 0.8.0 and later)
  | TIMESTAMP   -- (Note: Available in Hive 0.8.0 and later)
  | DECIMAL     -- (Note: Available in Hive 0.11.0 and later)
  | DECIMAL(precision, scale)  -- (Note: Available in Hive 0.13.0 and later)
  | DATE        -- (Note: Available in Hive 0.12.0 and later)
  | VARCHAR     -- (Note: Available in Hive 0.12.0 and later)
  | CHAR        -- (Note: Available in Hive 0.13.0 and later)

array_type
  : ARRAY < data_type >

map_type
  : MAP < primitive_type, data_type >

1.2.1.2 Row Format, Storage Format, and SerDe

You can use a custom SerDe or one of Hive's built-in SerDes. A built-in SerDe is used when no ROW FORMAT is specified, or when ROW FORMAT DELIMITED is specified.

'hive.default.fileformat' sets the default storage format.
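For instance, a minimal sketch of a table that relies on the built-in delimited SerDe (the table and column names here are made up for illustration):

CREATE TABLE user_log (
  uid    BIGINT,
  action STRING
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','   -- columns split on commas
STORED AS TEXTFILE;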

Reference:
[InputFormat, OutputFormat, and SerDe in Hive]
(https://www.coder4.com/archives/4031)

1.2.1.3 Partitioned Table

Partitioned tables are created with PARTITIONED BY. One or more partition columns can be specified; the table can additionally be bucketed with CLUSTERED BY columns, with the data in each bucket sorted via SORT BY.

-- Example: the table's original schema is (id int, date date, name varchar)
create table table_name (
  id                int,
  dtDontQuery       string,
  name              string
)
partitioned by (date string)

1.2.1.4 External Tables

LOCATION specifies where the data lives. When an external table is dropped, its data is not deleted.

CREATE EXTERNAL TABLE page_view(viewTime INT, userid BIGINT,
     page_url STRING, referrer_url STRING,
     ip STRING COMMENT 'IP Address of the User',
     country STRING COMMENT 'country of origination')
 COMMENT 'This is the staging page view table'
 ROW FORMAT DELIMITED FIELDS TERMINATED BY '\054'
 STORED AS TEXTFILE
 LOCATION '<hdfs_location>';

1.2.1.5 Create Table As Select (CTAS)

Restrictions:
1. The target table cannot be a partitioned table;
2. The target table cannot be an external table;
3. The target table cannot be a list-bucketing table.

Example

CREATE TABLE new_key_value_store
   ROW FORMAT SERDE "org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe"
   STORED AS RCFile
   AS
SELECT (key % 1024) new_key, concat(key, value) key_value_pair
FROM key_value_store
SORT BY new_key, key_value_pair;

1.2.1.6 Create Table Like

Copies the definition of an existing table to create a new table (without data).

CREATE TABLE empty_key_value_store
LIKE key_value_store;

1.2.1.7 Bucketed Sorted Tables

For each table or partition, Hive can further organize the data into buckets, which are a finer-grained division of the data range. Use the CLUSTERED BY clause to specify the bucketing column(s) and the number of buckets.

CREATE TABLE page_view(viewTime INT, userid BIGINT,
     page_url STRING, referrer_url STRING,
     ip STRING COMMENT 'IP Address of the User')
 COMMENT 'This is the page view table'
 PARTITIONED BY(dt STRING, country STRING)
 CLUSTERED BY(userid) SORTED BY(viewTime) INTO 32 BUCKETS
 ROW FORMAT DELIMITED
   FIELDS TERMINATED BY '\001'
   COLLECTION ITEMS TERMINATED BY '\002'
   MAP KEYS TERMINATED BY '\003'
 STORED AS SEQUENCEFILE;

Reference:
[Hive basics: partitions, buckets, and Sort Merge Bucket Join]
(http://blog.csdn.net/wisgood/article/details/17186107)

1.2.1.8 Skewed Tables

This feature optimizes tables in which one or more columns have skewed values.

CREATE TABLE list_bucket_single (key STRING, value STRING)
  SKEWED BY (key) ON (1,5,6) [STORED AS DIRECTORIES];

CREATE TABLE list_bucket_multiple (col1 STRING, col2 int, col3 STRING)
  SKEWED BY (col1, col2) ON (('s1',1), ('s3',3), ('s13',13), ('s78',78)) [STORED AS DIRECTORIES];

Reference:
[A summary of HIVE data-skew tuning]
(http://www.cnblogs.com/end/archive/2012/06/19/2554582.html)

1.2.1.9 Temporary Tables


Temporary tables have two restrictions:
1. Partitions are not supported;
2. Indexes are not supported.

Example

CREATE TEMPORARY TABLE temp_table (key STRING, value STRING);

1.2.2 Drop Table

DROP TABLE [IF EXISTS] table_name [PURGE];


Note: With PURGE, the data is deleted immediately instead of being moved to the trash.

1.2.3 Truncate Table

TRUNCATE TABLE table_name [PARTITION partition_spec];

partition_spec:
  : (partition_column = partition_col_value, partition_column = partition_col_value, ...)


Note: When PARTITION is specified, only that partition's data is deleted; otherwise all data in the table is deleted.

1.3 Alter Table/Partition/Column

1.3.1 Alter Table

-- Rename a table
ALTER TABLE table_name RENAME TO new_table_name;

-- Alter table properties
ALTER TABLE table_name SET TBLPROPERTIES table_properties;

table_properties:
  : (property_name = property_value, property_name = property_value, ... )

-- Alter table comment
ALTER TABLE table_name SET TBLPROPERTIES ('comment' = new_comment);

-- Add SerDe properties
ALTER TABLE table_name [PARTITION partition_spec] SET SERDE serde_class_name [WITH SERDEPROPERTIES serde_properties];
ALTER TABLE table_name [PARTITION partition_spec] SET SERDEPROPERTIES serde_properties;

serde_properties:
  : (property_name = property_value, property_name = property_value, ... )

-- Example
ALTER TABLE table_name SET SERDEPROPERTIES ('field.delim' = ',');

-- Alter storage properties
ALTER TABLE table_name CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name, ...)]
  INTO num_buckets BUCKETS;

-- Alter Table Skewed
ALTER TABLE table_name SKEWED BY (col_name1, col_name2, ...)
  ON ((col_name1_value, col_name2_value, ...) [, (col_name1_value, col_name2_value), ...])
  [STORED AS DIRECTORIES];

-- Alter Table Not Skewed
ALTER TABLE table_name NOT SKEWED;

-- Alter Table Not Stored as Directories
ALTER TABLE table_name NOT STORED AS DIRECTORIES;

-- Alter Table Set Skewed Location
ALTER TABLE table_name SET SKEWED LOCATION (col_name1="location1" [, col_name2="location2", ...]);

1.3.2 Alter Partition

Add partitions

ALTER TABLE table_name ADD [IF NOT EXISTS] PARTITION partition_spec
  [LOCATION 'location1'] partition_spec [LOCATION 'location2'] ...;

partition_spec:
  : (partition_column = partition_col_value, partition_column = partition_col_value, ...)

-- Example
ALTER TABLE page_view ADD PARTITION (dt='2008-08-08', country='us') location '/path/to/us/part080808'
                          PARTITION (dt='2008-08-09', country='us') location '/path/to/us/part080809';

Dynamic partitions

See: DynamicPartitions (Hive wiki)

Rename a partition

ALTER TABLE table_name PARTITION partition_spec RENAME TO PARTITION partition_spec;

Exchange partitions

Partitions can be moved between tables.

ALTER TABLE table_name_1 EXCHANGE PARTITION (partition_spec) WITH TABLE table_name_2;
-- multiple partitions
ALTER TABLE table_name_1 EXCHANGE PARTITION (partition_spec, partition_spec2, ...) WITH TABLE table_name_2;

Recover partitions (MSCK REPAIR TABLE)

Partitions added directly through HDFS commands are not recorded in the Hive metastore; the following command repairs the metadata automatically.

MSCK REPAIR TABLE table_name;

Drop partitions

ALTER TABLE table_name DROP [IF EXISTS] PARTITION partition_spec[, PARTITION partition_spec, ...]  [IGNORE PROTECTION] [PURGE];            -- (Note: PURGE available in Hive 1.2.0 and later, IGNORE PROTECTION not available 2.0.0 and later)

(Un)Archive Partition

ALTER TABLE table_name ARCHIVE PARTITION partition_spec;
ALTER TABLE table_name UNARCHIVE PARTITION partition_spec;

1.3.3 Alter Either Table or Partition

Change the file format of a table or partition

ALTER TABLE table_name [PARTITION partition_spec] SET FILEFORMAT file_format;

Change the storage location of a table or partition

ALTER TABLE table_name [PARTITION partition_spec] SET LOCATION "new location";

Alter Table/Partition Touch

ALTER TABLE table_name TOUCH [PARTITION partition_spec];

Alter Table/Partition Protections

ALTER TABLE table_name [PARTITION partition_spec] ENABLE|DISABLE NO_DROP [CASCADE];
ALTER TABLE table_name [PARTITION partition_spec] ENABLE|DISABLE OFFLINE;

Alter Table/Partition Compact

ALTER TABLE table_name [PARTITION (partition_key = 'partition_value' [, ...])]  COMPACT 'compaction_type'  [WITH OVERWRITE TBLPROPERTIES ("property"="value" [, ...])];

Alter Table/Partition Concatenate
If a table or partition contains many small RCFile or ORC files, the following command merges them into larger files.

ALTER TABLE table_name [PARTITION (partition_key = 'partition_value' [, ...])] CONCATENATE;

1.3.4 Alter Column

Change a column's name, type, position, or comment

ALTER TABLE table_name [PARTITION partition_spec] CHANGE [COLUMN] col_old_name col_new_name column_type
  [COMMENT col_comment] [FIRST|AFTER column_name] [CASCADE|RESTRICT];
-- Default is RESTRICT; CASCADE also changes the metadata of all partitions, in addition to the table metadata.

Add or replace columns

ALTER TABLE table_name
  [PARTITION partition_spec]                 -- (Note: Hive 0.14.0 and later)
  ADD|REPLACE COLUMNS (col_name data_type [COMMENT col_comment], ...)
  [CASCADE|RESTRICT];                        -- (Note: Hive 0.15.0 and later)
-- REPLACE removes all existing columns and adds the new set of columns.

Partial Partition Specification

-- With dynamic partitioning enabled, this statement:
SET hive.exec.dynamic.partition = true;
ALTER TABLE foo PARTITION (ds='2008-04-08', hr) CHANGE COLUMN dec_column_name dec_column_name DECIMAL(38,18);
-- is equivalent to:
ALTER TABLE foo PARTITION (ds='2008-04-08', hr=11) CHANGE COLUMN dec_column_name dec_column_name DECIMAL(38,18);
ALTER TABLE foo PARTITION (ds='2008-04-08', hr=12) CHANGE COLUMN dec_column_name dec_column_name DECIMAL(38,18);
...

Operations that support a partial partition specification: Change column, Add column, Replace column, File Format, and SerDe properties.

1.4 Create/Drop/Alter View

1.4.1 Create View

Syntax:

CREATE VIEW [IF NOT EXISTS] [db_name.]view_name [(column_name [COMMENT column_comment], ...) ]
  [COMMENT view_comment]
  [TBLPROPERTIES (property_name = property_value, ...)]
  AS SELECT ...;

Example:

CREATE VIEW onion_referrers(url COMMENT 'URL of Referring page')
  COMMENT 'Referrers to The Onion website'
  AS
  SELECT DISTINCT referrer_url
  FROM page_view
  WHERE page_url='http://www.theonion.com';

1.4.2 Drop View

DROP VIEW [IF EXISTS] [db_name.]view_name;

1.4.3 Alter View Properties

ALTER VIEW [db_name.]view_name SET TBLPROPERTIES table_properties;

table_properties:
  : (property_name = property_value, property_name = property_value, ...)

1.4.4 Alter View As Select

ALTER VIEW [db_name.]view_name AS select_statement;

1.5 Create/Drop/Alter Index

1.5.1 Create Index

CREATE INDEX index_name
  ON TABLE base_table_name (col_name, ...)
  AS index_type
  [WITH DEFERRED REBUILD]
  [IDXPROPERTIES (property_name=property_value, ...)]
  [IN TABLE index_table_name]
  [
     [ ROW FORMAT ...] STORED AS ...
     | STORED BY ...
  ]
  [LOCATION hdfs_path]
  [TBLPROPERTIES (...)]
  [COMMENT "index comment"];

Reference:
CREATE INDEX

1.5.2 Drop Index

DROP INDEX [IF EXISTS] index_name ON table_name;

1.5.3 Alter Index

ALTER INDEX index_name ON table_name [PARTITION partition_spec] REBUILD;

1.6 Create/Drop Macro

1.6.1 Create Temporary Macro

Macros are valid only in the current session.

CREATE TEMPORARY MACRO macro_name([col_name col_type, ...]) expression;

-- Examples
CREATE TEMPORARY MACRO fixed_number() 42;
CREATE TEMPORARY MACRO string_len_plus_two(x string) length(x) + 2;
CREATE TEMPORARY MACRO simple_add (x int, y int) x + y;

1.6.2 Drop Temporary Macro

DROP TEMPORARY MACRO [IF EXISTS] macro_name;

1.7 Create/Drop/Reload Function

1.7.1 Temporary Functions

-- Create
CREATE TEMPORARY FUNCTION function_name AS class_name;
-- Drop
DROP TEMPORARY FUNCTION [IF EXISTS] function_name;
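A minimal sketch (the jar path and UDF class name below are hypothetical; any class implementing a Hive UDF works):

-- hypothetical jar and class, for illustration only
ADD JAR /path/to/my_udfs.jar;
CREATE TEMPORARY FUNCTION my_lower AS 'com.example.hive.udf.Lower';
SELECT my_lower(value) FROM src;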

1.7.2 Permanent Functions

-- Create
CREATE FUNCTION [db_name.]function_name AS class_name
  [USING JAR|FILE|ARCHIVE 'file_uri' [, JAR|FILE|ARCHIVE 'file_uri'] ];
-- Drop
DROP FUNCTION [IF EXISTS] function_name;
-- Make functions created in other sessions visible
RELOAD FUNCTION;
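For example, a sketch that registers a permanent function from a jar on HDFS (the class name and path are hypothetical):

CREATE FUNCTION mydb.my_lower AS 'com.example.hive.udf.Lower'  -- hypothetical class
  USING JAR 'hdfs:///libs/my_udfs.jar';                        -- hypothetical path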

1.8 Create/Drop/Grant/Revoke Roles and Privileges

Reference:
Create/Drop/Grant/Revoke Roles and Privileges

1.9 Show

1.9.1 Show Databases

SHOW (DATABASES|SCHEMAS) [LIKE 'identifier_with_wildcards'];


Note: The LIKE clause filters with a pattern: '*' matches any number of characters and '|' means OR, e.g. 'emp*|*ees'.
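For instance, using the pattern from the note above:

SHOW DATABASES LIKE 'emp*|*ees';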

1.9.2 Show Tables/Partitions/Indexes

SHOW TABLES [IN database_name] ['identifier_with_wildcards'];

SHOW PARTITIONS table_name;
SHOW PARTITIONS [db_name.]table_name [PARTITION(partition_spec)];

SHOW TABLE EXTENDED [IN|FROM database_name] LIKE 'identifier_with_wildcards' [PARTITION(partition_spec)];

-- Show table properties
SHOW TBLPROPERTIES table_name;
SHOW TBLPROPERTIES table_name("foo");

-- Show the CREATE TABLE statement
SHOW CREATE TABLE ([db_name.]table_name|view_name);

-- Show indexes
SHOW [FORMATTED] (INDEX|INDEXES) ON table_with_index [(FROM|IN) db_name];

1.9.3 Show Columns

SHOW COLUMNS (FROM|IN) table_name [(FROM|IN) db_name];

1.9.4 Show Functions

-- Show functions starting with 'a'
SHOW FUNCTIONS "a.*";
-- Show all functions
SHOW FUNCTIONS ".*";

1.9.5 Show Granted Roles and Privileges

Reference:
Show Granted Roles and Privileges

1.9.6 Show Locks

SHOW LOCKS <table_name>;
SHOW LOCKS <table_name> EXTENDED;
SHOW LOCKS <table_name> PARTITION (<partition_spec>);
SHOW LOCKS <table_name> PARTITION (<partition_spec>) EXTENDED;
SHOW LOCKS (DATABASE|SCHEMA) database_name;     -- (Note: Hive 0.13.0 and later; SCHEMA added in Hive 0.14.0)

Reference: [HIVE-6460]
(https://issues.apache.org/jira/browse/HIVE-6460)

1.9.7 Show Conf

SHOW CONF <configuration_name>;
-- Returns: default value, required type, description

Reference: [Hive Configuration Properties]
(https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties)

1.9.8 Show Transactions

SHOW TRANSACTIONS;
-- Returns: transaction ID, transaction state, user who started the transaction, machine where the transaction was started

1.9.9 Show Compactions

SHOW COMPACTIONS;

1.10 Describe

1.10.1 Describe Database

DESCRIBE DATABASE [EXTENDED] db_name;
DESCRIBE SCHEMA [EXTENDED] db_name;     -- (Note: Hive 0.15.0 and later)

-- For example:
hive> desc database w1;
OK
w1      hdfs://hsm01:9000/hive/warehouse/w1.db  zkpk    USER
Time taken: 0.032 seconds, Fetched: 1 row(s)


Note: EXTENDED also shows the database properties.

1.10.2 Describe Table/View/Column

-- Hive 1.x.x and 0.x.x only.
DESCRIBE [EXTENDED|FORMATTED]
  table_name[.col_name ( [.field_name] | [.'$elem$'] | [.'$key$'] | [.'$value$'] )* ];
DESCRIBE [EXTENDED|FORMATTED]
  [db_name.]table_name[ col_name ( [.field_name] | [.'$elem$'] | [.'$key$'] | [.'$value$'] )* ];

-- Examples
hive> desc province.province;
OK
province                string                  from deserializer
hive> desc w1.province province;
OK
province                string                  from deserializer

1.10.3 Display Column Statistics

-- (Note: Hive 0.14.0 and later)
DESCRIBE FORMATTED [db_name.]table_name column_name;
-- (Note: Hive 0.14.0 to 1.x.x)
DESCRIBE FORMATTED [db_name.]table_name column_name PARTITION (partition_spec);

1.10.4 Describe Partition

-- (Hive 1.x.x and 0.x.x only)
DESCRIBE [EXTENDED|FORMATTED] table_name[.column_name] PARTITION partition_spec;
DESCRIBE [EXTENDED|FORMATTED] [db_name.]table_name [column_name] PARTITION partition_spec;

1.10.5 Hive 2.0+

DESCRIBE [EXTENDED | FORMATTED]    [db_name.]table_name [PARTITION partition_spec] [col_name ( [.field_name] | [.'$elem$'] | [.'$key$'] | [.'$value$'] )* ];

1.11 Abort

Aborts the specified transactions.

ABORT TRANSACTIONS transactionID [, ...];

2 Statistics

Reference:
StatsDev

Collects statistics about Hive tables and partitions.

ANALYZE

ANALYZE TABLE [db_name.]tablename [PARTITION(partcol1[=val1], partcol2[=val2], ...)]  -- (Note: Fully support qualified table name since Hive 1.2.0, see HIVE-10007.)
  COMPUTE STATISTICS
  [FOR COLUMNS]          -- (Note: Hive 0.10.0 and later.)
  [CACHE METADATA]       -- (Note: Hive 2.1.0 and later.)
  [NOSCAN];

-- Examples
ANALYZE TABLE Table1 COMPUTE STATISTICS;
ANALYZE TABLE Table1 PARTITION(ds='2008-04-09', hr) COMPUTE STATISTICS NOSCAN;

-- View the statistics
DESCRIBE EXTENDED TABLE1;
DESCRIBE EXTENDED TABLE1 PARTITION(ds='2008-04-09', hr=11);

3 Indexes

Reference:
Indexing Resources

Examples:

-- Create/build, show, and drop index:
CREATE INDEX table01_index ON TABLE table01 (column2) AS 'COMPACT';
SHOW INDEX ON table01;
DROP INDEX table01_index ON table01;

-- Create then build, show formatted (with column names), and drop index:
CREATE INDEX table02_index ON TABLE table02 (column3) AS 'COMPACT' WITH DEFERRED REBUILD;
ALTER INDEX table02_index ON table02 REBUILD;
SHOW FORMATTED INDEX ON table02;
DROP INDEX table02_index ON table02;

-- Create bitmap index, build, show, and drop:
CREATE INDEX table03_index ON TABLE table03 (column4) AS 'BITMAP' WITH DEFERRED REBUILD;
ALTER INDEX table03_index ON table03 REBUILD;
SHOW FORMATTED INDEX ON table03;
DROP INDEX table03_index ON table03;

-- Create index in a new table:
CREATE INDEX table04_index ON TABLE table04 (column5) AS 'COMPACT' WITH DEFERRED REBUILD IN TABLE table04_index_table;

-- Create index stored as RCFile:
CREATE INDEX table05_index ON TABLE table05 (column6) AS 'COMPACT' STORED AS RCFILE;

-- Create index stored as text file:
CREATE INDEX table06_index ON TABLE table06 (column7) AS 'COMPACT' ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;

-- Create index with index properties:
CREATE INDEX table07_index ON TABLE table07 (column8) AS 'COMPACT' IDXPROPERTIES ("prop1"="value1", "prop2"="value2");

-- Create index with table properties:
CREATE INDEX table08_index ON TABLE table08 (column9) AS 'COMPACT' TBLPROPERTIES ("prop3"="value3", "prop4"="value4");

-- Drop index if exists:
DROP INDEX IF EXISTS table09_index ON table09;

-- Rebuild index on a partition:
ALTER INDEX table10_index ON table10 PARTITION (columnX='valueQ', columnY='valueR') REBUILD;

4 Archiving

Hadoop Archives can bundle many files into a single archive file while keeping each file transparently accessible, which reduces the number of files in a partition.

To enable archiving, set three configuration properties:

hive> set hive.archive.enabled=true;
hive> set hive.archive.har.parentdir.settable=true;
hive> set har.partfile.size=1099511627776;

Syntax:

ALTER TABLE table_name ARCHIVE|UNARCHIVE PARTITION (partition_col = partition_col_value, partition_col = partition_col_value, ...)

Examples:

-- Archive
ALTER TABLE srcpart ARCHIVE PARTITION(ds='2008-04-08', hr='12')
-- Unarchive
ALTER TABLE srcpart UNARCHIVE PARTITION(ds='2008-04-08', hr='12')

5 DML

5.1 Loading Files into Tables

5.1.1 Syntax

LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]

5.1.2 Examples

-- Load from the local filesystem, appending to the table:
LOAD DATA LOCAL INPATH '/tmp/pv_2013-06-08_us.txt' INTO TABLE c02 PARTITION(date='2013-06-08', country='US')
-- Load from HDFS, overwriting the existing data:
LOAD DATA INPATH '/user/admin/SqlldrDat/CnClickstat/20131101/18/clickstat_gp_fatdt0/0' OVERWRITE INTO TABLE c02_clickstat_fatdt1 PARTITION (dt='20131201');

5.2 Inserting into Hive Tables from Queries

5.2.1 Standard Insert

Syntax:

INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]] select_statement1 FROM from_statement;
INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1 FROM from_statement;
INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] (z,y) select_statement1 FROM from_statement;

Example:

FROM B INSERT INTO TABLE A SELECT 1, 'abc' LIMIT 1;

5.2.2 Hive Extension (Multiple Inserts)

Syntax:

FROM from_statement
INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]] select_statement1
[INSERT OVERWRITE TABLE tablename2 [PARTITION ... [IF NOT EXISTS]] select_statement2]
[INSERT INTO TABLE tablename2 [PARTITION ...] select_statement2] ...;

FROM from_statement
INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1
[INSERT INTO TABLE tablename2 [PARTITION ...] select_statement2]
[INSERT OVERWRITE TABLE tablename2 [PARTITION ... [IF NOT EXISTS]] select_statement2] ...;

5.2.3 Hive Extension (Dynamic Partition Inserts)

5.2.3.1 Syntax

INSERT OVERWRITE TABLE tablename PARTITION (partcol1[=val1], partcol2[=val2] ...) select_statement FROM from_statement;
INSERT INTO TABLE tablename PARTITION (partcol1[=val1], partcol2[=val2] ...) select_statement FROM from_statement;

5.2.3.2 Notes

Dynamic partition inserts are disabled by default. Related configuration properties:

hive.exec.dynamic.partition (default false): whether dynamic partition inserts are enabled
hive.exec.dynamic.partition.mode (default strict): in strict mode, at least one static partition must be specified
hive.exec.max.dynamic.partitions.pernode (default 100): maximum number of dynamic partitions each mapper/reducer node may create
hive.exec.max.dynamic.partitions (default 1000): maximum total number of dynamic partitions that may be created
hive.exec.max.created.files (default 100000): maximum number of HDFS files all mappers/reducers of a MapReduce job may create
hive.error.on.empty.partition (default false): whether to throw an exception when a dynamic partition insert produces an empty result set

5.2.3.3 Example

FROM page_view_stg pvs
INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country)
       SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip, pvs.cnt

5.3 Writing Data to the Filesystem

5.3.1 Syntax

-- Standard
INSERT OVERWRITE [LOCAL] DIRECTORY directory1
  [ROW FORMAT row_format] [STORED AS file_format]  -- (Note: Only available starting with Hive 0.11.0)
  SELECT ... FROM ...

-- Hive extension (multiple inserts):
FROM from_statement
INSERT OVERWRITE [LOCAL] DIRECTORY directory1 select_statement1
[INSERT OVERWRITE [LOCAL] DIRECTORY directory2 select_statement2] ...

row_format
  : DELIMITED [FIELDS TERMINATED BY char [ESCAPED BY char]] [COLLECTION ITEMS TERMINATED BY char]
        [MAP KEYS TERMINATED BY char] [LINES TERMINATED BY char]
        [NULL DEFINED AS char]  -- (Note: Only available starting with Hive 0.13)

5.3.2 Examples

-- Export to the local filesystem:
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/local_out' SELECT a.* FROM pokes a;

-- Export to HDFS:
INSERT OVERWRITE DIRECTORY '/user/admin/SqlldrDat/CnClickstat/20131101/19/clickstat_gp_fatdt0/0' SELECT a.* FROM c02_clickstat_fatdt1 a WHERE dt='20131201';

-- One source can feed several target tables or files at once; a multi-target insert is a single statement:
FROM src
  INSERT OVERWRITE TABLE dest1 SELECT src.* WHERE src.key < 100
  INSERT OVERWRITE TABLE dest2 SELECT src.key, src.value WHERE src.key >= 100 and src.key < 200
  INSERT OVERWRITE TABLE dest3 PARTITION(ds='2013-04-08', hr='12') SELECT src.key WHERE src.key >= 200 and src.key < 300
  INSERT OVERWRITE LOCAL DIRECTORY '/tmp/dest4.out' SELECT src.value WHERE src.key >= 300;

5.4 Inserting Values into Tables

5.4.1 Syntax

INSERT INTO TABLE tablename [PARTITION (partcol1[=val1], partcol2[=val2] ...)] VALUES values_row [, values_row ...]

where values_row is:
  ( value [, value ...] )
and a value is either null or any valid SQL literal.

5.4.2 Example

INSERT INTO TABLE students
  VALUES ('fred flintstone', 35, 1.28), ('barney rubble', 32, 2.32);

5.5 Update

Syntax:

UPDATE tablename SET column = value [, column = value ...] [WHERE expression]
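A minimal example, assuming students is a transactional (ACID) table, since UPDATE only works on such tables; the column names are assumed from the INSERT example in 5.4.2:

UPDATE students SET age = 36 WHERE name = 'fred flintstone';  -- assumes columns name and age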

5.6 Delete

DELETE FROM tablename [WHERE expression]
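Likewise, a sketch under the same assumption of an ACID students table:

DELETE FROM students WHERE age < 30;  -- assumes an age column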

6 Import/Export

6.1 Syntax

-- EXPORT
EXPORT TABLE tablename [PARTITION (part_column="value"[, ...])]
  TO 'export_target_path' [ FOR replication('eventid') ]

-- IMPORT
IMPORT [[EXTERNAL] TABLE new_or_original_tablename [PARTITION (part_column="value"[, ...])]]
  FROM 'source_path'
  [LOCATION 'import_target_path']

6.2 Examples

-- Simple export and import:
export table department to 'hdfs_exports_location/department';
import from 'hdfs_exports_location/department';

-- Rename the table on import:
export table department to 'hdfs_exports_location/department';
import table imported_dept from 'hdfs_exports_location/department';

-- Export and import a partition:
export table employee partition (emp_country="in", emp_state="ka") to 'hdfs_exports_location/employee';
import from 'hdfs_exports_location/employee';

-- Export a table, import a partition:
export table employee to 'hdfs_exports_location/employee';
import table employee partition (emp_country="us", emp_state="tn") from 'hdfs_exports_location/employee';

-- Specify the import location:
export table department to 'hdfs_exports_location/department';
import table department from 'hdfs_exports_location/department'
       location 'import_target_location/department';

-- Import as an external table:
export table department to 'hdfs_exports_location/department';
import external table department from 'hdfs_exports_location/department';

7 Explain Plan

7.1 Syntax

EXPLAIN [EXTENDED|DEPENDENCY|AUTHORIZATION] query

Example:

hive> explain
    > select * from student;
OK
STAGE DEPENDENCIES:
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        TableScan
          alias: student
          Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
          Select Operator
            expressions: classno (type: string), stuno (type: string), score (type: float)
            outputColumnNames: _col0, _col1, _col2
            Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
            ListSink

Time taken: 0.106 seconds, Fetched: 17 row(s)

7.2 DEPENDENCY

More information about the query's inputs is displayed.

7.3 AUTHORIZATION

Authorization-related information is displayed.


When hive.explain.user=true (the default is false), the plan is shown in a more detailed, user-oriented form.

8 Select

8.1 Syntax

[WITH CommonTableExpression (, CommonTableExpression)*]    -- (Note: Only available starting with Hive 0.13.0)
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
  FROM table_reference
  [WHERE where_condition]
  [GROUP BY col_list]
  [ORDER BY col_list]
  [CLUSTER BY col_list
    | [DISTRIBUTE BY col_list] [SORT BY col_list]
  ]
  [LIMIT number]

8.2 REGEX Columns

When the configuration property hive.support.quoted.identifiers (default: column) is set to none, regular expressions may be used to select columns. The regex syntax is Java's.

-- Return all columns except ds and hr
SELECT `(ds|hr)?+.+` FROM sales

8.3 Group by

8.3.1 Syntax

groupByClause: GROUP BY groupByExpression (, groupByExpression)*
groupByExpression: expression
groupByQuery: SELECT expression (, expression)* FROM src groupByClause?

Example (Multi-Group-By Inserts)

FROM pv_users
INSERT OVERWRITE TABLE pv_gender_sum
  SELECT pv_users.gender, count(DISTINCT pv_users.userid)
  GROUP BY pv_users.gender
INSERT OVERWRITE DIRECTORY '/user/facebook/tmp/pv_age_sum'
  SELECT pv_users.age, count(DISTINCT pv_users.userid)
  GROUP BY pv_users.age;

8.3.2 Enhanced Aggregation

8.3.2.1 GROUPING SETS

GROUPING SETS is equivalent to several GROUP BY queries connected by UNION.

Example

The GROUPING SETS query

SELECT a, b, SUM(c) FROM tab1 GROUP BY a, b GROUPING SETS ((a, b), a, b, ())

is equivalent to

SELECT a, b, SUM(c) FROM tab1 GROUP BY a, b
UNION
SELECT a, null, SUM(c) FROM tab1 GROUP BY a, null
UNION
SELECT null, b, SUM(c) FROM tab1 GROUP BY null, b
UNION
SELECT null, null, SUM(c) FROM tab1

8.3.2.2 The GROUPING__ID Function

Indicates which grouping set a result row belongs to.

SELECT month, day, COUNT(DISTINCT cookieid) AS uv, GROUPING__ID
FROM lxw1234
GROUP BY month, day
GROUPING SETS (month, day, (month, day))
ORDER BY GROUPING__ID;

month         day             uv      GROUPING__ID
------------------------------------------------
2015-03       NULL            5       1
2015-04       NULL            6       1
NULL          2015-03-10      4       2
NULL          2015-03-12      1       2
NULL          2015-04-12      2       2
NULL          2015-04-13      3       2
NULL          2015-04-15      2       2
NULL          2015-04-16      2       2
2015-03       2015-03-10      4       3
2015-03       2015-03-12      1       3
2015-04       2015-04-12      2       3
2015-04       2015-04-13      3       3
2015-04       2015-04-15      2       3
2015-04       2015-04-16      2       3

8.3.2.3 CUBE

CUBE performs a GROUP BY over every possible combination of the listed columns.

GROUP BY a, b, c WITH CUBE

is equivalent to

GROUP BY a, b, c GROUPING SETS ((a, b, c), (a, b), (b, c), (a, c), (a), (b), (c), ())

8.3.2.4 ROLLUP

A special case of GROUPING SETS, used for hierarchical aggregation along one dimension.

GROUP BY a, b, c WITH ROLLUP

is equivalent to

GROUP BY a, b, c GROUPING SETS ((a, b, c), (a, b), (a), ())

8.3.2.5 hive.new.job.grouping.set.cardinality

When the cardinality of the grouping sets exceeds this value, an additional map-reduce job is added to reduce the data volume on the map side.

8.4 Order By, Sort By, Cluster By, Distribute By, and Transform

8.4.1 Order By

Syntax:

colOrder: ( ASC | DESC )
colNullOrder: (NULLS FIRST | NULLS LAST)           -- (Note: Available in Hive 2.1.0 and later)
orderBy: ORDER BY colName colOrder? colNullOrder? (',' colName colOrder? colNullOrder?)*
query: SELECT expression (',' expression)* FROM src orderBy

Note:
When hive.mapred.mode=strict, ORDER BY must be accompanied by a LIMIT clause. The reason is that ORDER BY forces all output through a single reducer to produce the final ordering, which can take a very long time on large data sets.

8.4.2 Sort By

Syntax:

colOrder: ( ASC | DESC )
sortBy: SORT BY colName colOrder? (',' colName colOrder?)*
query: SELECT expression (',' expression)* FROM src sortBy

Notes:
Rows are sorted by the given columns before being handed to each reducer; with multiple reducers, each reducer sorts its own portion independently (there is no total ordering).

8.4.3 Distribute By

Hive uses Distribute By on columns to spread rows across reducers: all rows with the same Distribute By column values go to the same reducer.
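A small sketch combining the two, reusing the src(key, value) table that appears in other examples in this article:

SELECT key, value
FROM src
DISTRIBUTE BY key    -- rows with the same key go to the same reducer
SORT BY key, value;  -- each reducer then sorts its own rows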

8.4.4 Cluster By

Cluster By combines Distribute By and Sort By. The sort is ascending only; a sort direction of ASC or DESC cannot be specified.
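For instance, the following two queries are equivalent:

SELECT key, value FROM src CLUSTER BY key;

SELECT key, value FROM src DISTRIBUTE BY key SORT BY key;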

8.4.5 Transform

Calls a user-defined Map/Reduce script from within Hive.

Given the table student(classNo, stuNo, score):

#!/usr/bin/env python
import sys

# Keep only records with a passing score (>= 60).
for line in sys.stdin:
    (classNo, stuNo, score) = line.strip().split('\t')
    if int(score) >= 60:
        print "%s\t%s\t%s" % (classNo, stuNo, score)
add file /home/user/score_pass.py;
select
  transform(classNo, stuNo, score)
  using 'score_pass.py'
  as classNo, stuNo, score
from student;

8.4.6 References

  1. [Hive's Transform Feature]
    (http://www.tuicool.com/articles/fuimmmQ)

  2. [Hive Queries]
    (http://blog.csdn.net/zythy/article/details/18814781)

8.5 Join

8.5.1 Syntax

join_table:
    table_reference JOIN table_factor [join_condition]
  | table_reference {LEFT|RIGHT|FULL} [OUTER] JOIN table_reference join_condition
  | table_reference LEFT SEMI JOIN table_reference join_condition
  | table_reference CROSS JOIN table_reference [join_condition] (as of Hive 0.10)

table_reference:
    table_factor
  | join_table

table_factor:
    tbl_name [alias]
  | table_subquery alias
  | ( table_references )

join_condition:
    ON equality_expression ( AND equality_expression )*

equality_expression:
    expression = expression


Note: Hive does not support non-equality join conditions.

8.5.2 Examples

If every table in the join uses the same join key, the join is compiled into a single map/reduce job:

SELECT a.val, b.val, c.val FROM a JOIN b
    ON (a.key = b.key1) JOIN c
    ON (c.key = b.key1)

The following join is compiled into two map/reduce jobs, because b.key1 is used in the first join condition while b.key2 is used in the second:

SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1)  JOIN c ON (c.key = b.key2)


Note: In each map/reduce stage of a join, the reducer buffers the rows of every table in the join sequence except the last one, then streams the last table through while serializing results to the filesystem. This reduces memory usage on the reduce side, so in practice the largest table should be written last.

The streamed table can be changed with a hint:

SELECT /*+ STREAMTABLE(a) */ a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key1)

8.5.3 LEFT, RIGHT, and FULL OUTER

SELECT a.val, b.val FROM a LEFT OUTER JOIN b ON (a.key=b.key)

LEFT: all rows of a are preserved;
RIGHT: all rows of b are preserved;
FULL: all rows of both a and b are preserved.

8.5.4 LEFT SEMI JOIN

A more efficient implementation of IN/EXISTS subqueries. The table on the right side of the JOIN may only be referenced through filter conditions in the ON clause.

SELECT a.key, a.value
FROM a
WHERE a.key in
 (SELECT b.key
  FROM B);

can be rewritten as:

SELECT a.key, a.val
FROM a LEFT SEMI JOIN b ON (a.key = b.key)

8.5.5 MapJoin

When hive.auto.convert.join is set to true, joins are automatically converted to mapjoins at execution time where possible.
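On versions where automatic conversion is off, the small table can also be named explicitly with a hint; a minimal sketch assuming b is the small table:

SELECT /*+ MAPJOIN(b) */ a.key, a.value
FROM a JOIN b ON a.key = b.key;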

Bucket mapjoin

set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;

8.5.6 Optimization

JoinOptimization

SQL Performance Tuning in Data Warehouses: the Hive Chapter (深入浅出数据仓库中SQL性能优化之Hive篇)

8.6 UNION

8.6.1 Syntax

select_statement UNION [ALL | DISTINCT] select_statement UNION [ALL | DISTINCT] select_statement ...


Note:
1. Before Hive 1.2.0, only UNION ALL is supported;
2. When UNION and UNION ALL are mixed, a UNION (DISTINCT) overrides any UNION ALL to its left;
3. The column aliases of every SELECT must match, including their types.

8.6.2 Examples

Using UNION in a FROM clause

SELECT u.id, actions.date
FROM (
    SELECT av.uid AS uid
    FROM action_video av
    WHERE av.date = '2008-06-03'
    UNION ALL
    SELECT ac.uid AS uid
    FROM action_comment ac
    WHERE ac.date = '2008-06-03'
 ) actions JOIN users u ON (u.id = actions.uid)

UNION can also be used in DDL and INSERT statements.

To apply ORDER BY, SORT BY, CLUSTER BY, DISTRIBUTE BY, or LIMIT to an individual SELECT, wrap that SELECT in a parenthesized subquery:

SELECT key FROM (SELECT key FROM src ORDER BY key LIMIT 10) subq1
UNION
SELECT key FROM (SELECT key FROM src1 ORDER BY key LIMIT 10) subq2

To apply ORDER BY, SORT BY, CLUSTER BY, DISTRIBUTE BY, or LIMIT to the UNION as a whole, place the clause after the last SELECT:

SELECT key FROM src
UNION
SELECT key FROM src1
ORDER BY key LIMIT 10

8.7 TABLESAMPLE

TABLESAMPLE draws a sample of the data from a Hive table according to a specified rule.

8.7.1 Sampling Bucketed Tables

table_sample: TABLESAMPLE (BUCKET x OUT OF y [ON colname])

-- Randomly divide table lxw1 into 10 buckets and sample the first bucket:
SELECT COUNT(1) FROM lxw1 TABLESAMPLE (BUCKET 1 OUT OF 10 ON rand());

8.7.2 Block Sampling

block_sample: TABLESAMPLE (n PERCENT)
SELECT * FROM source TABLESAMPLE(0.1 PERCENT) s;

block_sample: TABLESAMPLE (ByteLengthLiteral)
ByteLengthLiteral : (Digit)+ ('b' | 'B' | 'k' | 'K' | 'm' | 'M' | 'g' | 'G')
SELECT * FROM source TABLESAMPLE(100M) s;

block_sample: TABLESAMPLE (n ROWS)
SELECT * FROM source TABLESAMPLE(10 ROWS);

8.8 Subqueries

Subqueries can be used in the FROM clause and in the WHERE clause.
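Two minimal sketches (reusing the src and src1 tables from earlier examples; IN subqueries in WHERE require Hive 0.13.0 or later):

-- Subquery in the FROM clause
SELECT t.key FROM (SELECT key, value FROM src) t;

-- Subquery in the WHERE clause
SELECT * FROM src WHERE src.key IN (SELECT key FROM src1);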

8.9 Virtual Columns

INPUT__FILE__NAME: the name of the mapper's input file
BLOCK__OFFSET__INSIDE__FILE: the current global file position (byte offset)
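For example:

SELECT INPUT__FILE__NAME, BLOCK__OFFSET__INSIDE__FILE, key FROM src;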

8.10 UDF

See Chapter 9.

8.11 Lateral View

8.11.1 Syntax

lateralView: LATERAL VIEW udtf(expression) tableAlias AS columnAlias (',' columnAlias)*
fromClause: FROM baseTable (lateralView)*

A lateral view makes it convenient to combine the rows generated by a UDTF with the columns of the base table.

8.11.2 Example

Data

Array<int> col1      Array<string> col2
[1, 2]               ["a", "b", "c"]
[3, 4]               ["d", "e", "f"]

SELECT myCol1, col2 FROM baseTable
LATERAL VIEW explode(col1) myTable1 AS myCol1;

int mycol1 | Array<string> col2
---------- | ------------------
1          | ["a", "b", "c"]
2          | ["a", "b", "c"]
3          | ["d", "e", "f"]
4          | ["d", "e", "f"]

8.11.3 Multiple LATERAL VIEWs

SELECT myCol1, myCol2 FROM baseTable
LATERAL VIEW explode(col1) myTable1 AS myCol1
LATERAL VIEW explode(col2) myTable2 AS myCol2;

8.11.4 Outer Lateral Views

Analogous to an outer join: with the OUTER keyword, base-table rows are kept even when the UDTF produces no output rows, and the UDTF's columns are NULL for those rows.

SELECT * FROM src LATERAL VIEW explode(array()) C AS a limit 1;
-- returns no rows
SELECT * FROM src LATERAL VIEW OUTER explode(array()) C AS a limit 1;
-- returns: 河北省  NULL

8.12 Windowing and Analytics

8.12.1 Window Functions

Data preparation

CREATE EXTERNAL TABLE demo_win_fun (
  cookieid string,
  createtime string,  -- page visit time
  url STRING          -- page visited
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
stored as textfile location '/hw/hive/win/';

cookie1,2015-04-10 10:00:02,url2
cookie1,2015-04-10 10:00:00,url1
cookie1,2015-04-10 10:03:04,1url3
cookie1,2015-04-10 10:50:05,url6
cookie1,2015-04-10 11:00:00,url7
cookie1,2015-04-10 10:10:00,url4
cookie1,2015-04-10 10:50:01,url5
cookie2,2015-04-10 10:00:02,url22
cookie2,2015-04-10 10:00:00,url11
cookie2,2015-04-10 10:03:04,1url33
cookie2,2015-04-10 10:50:05,url66
cookie2,2015-04-10 11:00:00,url77
cookie2,2015-04-10 10:10:00,url44
cookie2,2015-04-10 10:50:01,url55

8.12.1.1 LEAD

Syntax:

LEAD(col, n, DEFAULT) returns the value of the n-th row after the current row within the window. The first argument is the column name; the second is the offset n (optional, default 1); the third is the default value to use when the n-th following row does not exist (optional; NULL if not specified).

Example

SELECT cookieid, createtime, url,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
LEAD(createtime,1,'1970-01-01 00:00:00') OVER(PARTITION BY cookieid ORDER BY createtime) AS next_1_time,
LEAD(createtime,2) OVER(PARTITION BY cookieid ORDER BY createtime) AS next_2_time
FROM demo_win_fun;

cookieid createtime             url    rn       next_1_time             next_2_time
-------------------------------------------------------------------------------------------
cookie1 2015-04-10 10:00:00     url1    1       2015-04-10 10:00:02     2015-04-10 10:03:04
cookie1 2015-04-10 10:00:02     url2    2       2015-04-10 10:03:04     2015-04-10 10:10:00
cookie1 2015-04-10 10:03:04     1url3   3       2015-04-10 10:10:00     2015-04-10 10:50:01
cookie1 2015-04-10 10:10:00     url4    4       2015-04-10 10:50:01     2015-04-10 10:50:05
cookie1 2015-04-10 10:50:01     url5    5       2015-04-10 10:50:05     2015-04-10 11:00:00
cookie1 2015-04-10 10:50:05     url6    6       2015-04-10 11:00:00     NULL
cookie1 2015-04-10 11:00:00     url7    7       1970-01-01 00:00:00     NULL
cookie2 2015-04-10 10:00:00     url11   1       2015-04-10 10:00:02     2015-04-10 10:03:04
cookie2 2015-04-10 10:00:02     url22   2       2015-04-10 10:03:04     2015-04-10 10:10:00
cookie2 2015-04-10 10:03:04     1url33  3       2015-04-10 10:10:00     2015-04-10 10:50:01
cookie2 2015-04-10 10:10:00     url44   4       2015-04-10 10:50:01     2015-04-10 10:50:05
cookie2 2015-04-10 10:50:01     url55   5       2015-04-10 10:50:05     2015-04-10 11:00:00
cookie2 2015-04-10 10:50:05     url66   6       2015-04-10 11:00:00     NULL
cookie2 2015-04-10 11:00:00     url77   7       1970-01-01 00:00:00     NULL

8.12.1.2 LAG

Syntax

LAG(col, n, DEFAULT) returns the value of the n-th row before the current row within the window. The first argument is the column name; the second is the offset n (optional, default 1); the third is the default value to use when the n-th preceding row does not exist (optional; NULL if not specified).

Example

SELECT cookieid, createtime, url,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
LAG(createtime,1,'1970-01-01 00:00:00') OVER(PARTITION BY cookieid ORDER BY createtime) AS last_1_time,
LAG(createtime,2) OVER(PARTITION BY cookieid ORDER BY createtime) AS last_2_time
FROM demo_win_fun;

cookieid createtime             url    rn       last_1_time             last_2_time
-------------------------------------------------------------------------------------------
cookie1 2015-04-10 10:00:00     url1    1       1970-01-01 00:00:00     NULL
cookie1 2015-04-10 10:00:02     url2    2       2015-04-10 10:00:00     NULL
cookie1 2015-04-10 10:03:04     1url3   3       2015-04-10 10:00:02     2015-04-10 10:00:00
cookie1 2015-04-10 10:10:00     url4    4       2015-04-10 10:03:04     2015-04-10 10:00:02
cookie1 2015-04-10 10:50:01     url5    5       2015-04-10 10:10:00     2015-04-10 10:03:04
cookie1 2015-04-10 10:50:05     url6    6       2015-04-10 10:50:01     2015-04-10 10:10:00
cookie1 2015-04-10 11:00:00     url7    7       2015-04-10 10:50:05     2015-04-10 10:50:01
cookie2 2015-04-10 10:00:00     url11   1       1970-01-01 00:00:00     NULL
cookie2 2015-04-10 10:00:02     url22   2       2015-04-10 10:00:00     NULL
cookie2 2015-04-10 10:03:04     1url33  3       2015-04-10 10:00:02     2015-04-10 10:00:00
cookie2 2015-04-10 10:10:00     url44   4       2015-04-10 10:03:04     2015-04-10 10:00:02
cookie2 2015-04-10 10:50:01     url55   5       2015-04-10 10:10:00     2015-04-10 10:03:04
cookie2 2015-04-10 10:50:05     url66   6       2015-04-10 10:50:01     2015-04-10 10:10:00
cookie2 2015-04-10 11:00:00     url77   7       2015-04-10 10:50:05     2015-04-10 10:50:01

8.12.1.3 FIRST_VALUE

Syntax

FIRST_VALUE(col): after ordering within the partition, returns the first value up to the current row.

Example

SELECT cookieid, createtime, url,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
FIRST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime) AS first1
FROM demo_win_fun;

cookieid  createtime            url     rn      first1
---------------------------------------------------------
cookie1 2015-04-10 10:00:00     url1    1       url1
cookie1 2015-04-10 10:00:02     url2    2       url1
cookie1 2015-04-10 10:03:04     1url3   3       url1
cookie1 2015-04-10 10:10:00     url4    4       url1
cookie1 2015-04-10 10:50:01     url5    5       url1
cookie1 2015-04-10 10:50:05     url6    6       url1
cookie1 2015-04-10 11:00:00     url7    7       url1
cookie2 2015-04-10 10:00:00     url11   1       url11
cookie2 2015-04-10 10:00:02     url22   2       url11
cookie2 2015-04-10 10:03:04     1url33  3       url11
cookie2 2015-04-10 10:10:00     url44   4       url11
cookie2 2015-04-10 10:50:01     url55   5       url11
cookie2 2015-04-10 10:50:05     url66   6       url11
cookie2 2015-04-10 11:00:00     url77   7       url11

8.12.1.4 LAST_VALUE

Syntax

LAST_VALUE(col): after ordering within the partition, returns the last value up to the current row.

Example

SELECT cookieid, createtime, url,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
LAST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime) AS last1
FROM demo_win_fun;

cookieid  createtime            url    rn       last1
-----------------------------------------------------------------
cookie1 2015-04-10 10:00:00     url1    1       url1
cookie1 2015-04-10 10:00:02     url2    2       url2
cookie1 2015-04-10 10:03:04     1url3   3       1url3
cookie1 2015-04-10 10:10:00     url4    4       url4
cookie1 2015-04-10 10:50:01     url5    5       url5
cookie1 2015-04-10 10:50:05     url6    6       url6
cookie1 2015-04-10 11:00:00     url7    7       url7
cookie2 2015-04-10 10:00:00     url11   1       url11
cookie2 2015-04-10 10:00:02     url22   2       url22
cookie2 2015-04-10 10:03:04     1url33  3       1url33
cookie2 2015-04-10 10:10:00     url44   4       url44
cookie2 2015-04-10 10:50:01     url55   5       url55
cookie2 2015-04-10 10:50:05     url66   6       url66
cookie2 2015-04-10 11:00:00     url77   7       url77

8.12.2 The OVER Clause

Data preparation

CREATE EXTERNAL TABLE demo_over (
  cookieid string,
  createtime string,   -- day
  pv INT
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
stored as textfile location '/hw/hive/over/';

cookie1,2015-04-10,1
cookie1,2015-04-11,5
cookie1,2015-04-12,7
cookie1,2015-04-13,3
cookie1,2015-04-14,2
cookie1,2015-04-15,4
cookie1,2015-04-16,4

If ROWS BETWEEN is not specified, the default window runs from the start of the partition to the current row;
if ORDER BY is not specified, all rows of the partition are aggregated.
The key is understanding ROWS BETWEEN, also known as the WINDOW clause:
PRECEDING: rows before the current row
FOLLOWING: rows after the current row
CURRENT ROW: the current row
UNBOUNDED: the partition boundary;
UNBOUNDED PRECEDING means from the start of the partition,
UNBOUNDED FOLLOWING means to the end of the partition

8.12.2.1 SUM

Syntax

SUM(col): the result depends on the window implied by ORDER BY; the default sort order is ascending.

Example

SELECT cookieid, createtime, pv,
SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime) AS pv1,  -- default: from partition start to current row
SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS pv2,  -- from start to current row; same as pv1
SUM(pv) OVER(PARTITION BY cookieid) AS pv3,  -- all rows in the partition
SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS pv4,  -- current row + 3 preceding rows
SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS pv5,  -- current row + 3 preceding + 1 following
SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS pv6  -- current row + all following rows
FROM demo_over
order by createtime;

cookieid createtime     pv      pv1     pv2     pv3     pv4     pv5      pv6
-----------------------------------------------------------------------------
cookie1  2015-04-10      1       1       1       26      1       6       26
cookie1  2015-04-11      5       6       6       26      6       13      25
cookie1  2015-04-12      7       13      13      26      13      16      20
cookie1  2015-04-13      3       16      16      26      16      18      13
cookie1  2015-04-14      2       18      18      26      17      21      10
cookie1  2015-04-15      4       22      22      26      16      20      8
cookie1  2015-04-16      4       26      26      26      13      13      4

8.12.2.2 COUNT

Example

SELECT cookieid, createtime, pv,
COUNT(pv) OVER(PARTITION BY cookieid ORDER BY createtime) AS pv1,  -- default: from partition start to current row
COUNT(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS pv2,  -- from start to current row; same as pv1
COUNT(pv) OVER(PARTITION BY cookieid) AS pv3,  -- all rows in the partition
COUNT(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS pv4,  -- current row + 3 preceding rows
COUNT(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS pv5,  -- current row + 3 preceding + 1 following
COUNT(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS pv6  -- current row + all following rows
FROM demo_over
order by createtime;

cookieid createtime     pv      pv1     pv2     pv3     pv4     pv5      pv6
-----------------------------------------------------------------------------
cookie1 2015-04-10  1   1   1   7   1   2   7
cookie1 2015-04-11  5   2   2   7   2   3   6
cookie1 2015-04-12  7   3   3   7   3   4   5
cookie1 2015-04-13  3   4   4   7   4   5   4
cookie1 2015-04-14  2   5   5   7   4   5   3
cookie1 2015-04-15  4   6   6   7   4   5   2
cookie1 2015-04-16  4   7   7   7   4   4   1

8.12.2.3 AVG

Example

SELECT cookieid, createtime, pv,
AVG(pv) OVER(PARTITION BY cookieid ORDER BY createtime) AS pv1,  -- default: from partition start to current row
AVG(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS pv2,  -- from start to current row; same as pv1
AVG(pv) OVER(PARTITION BY cookieid) AS pv3,  -- all rows in the partition
AVG(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS pv4,  -- current row + 3 preceding rows
AVG(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS pv5,  -- current row + 3 preceding + 1 following
AVG(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS pv6  -- current row + all following rows
FROM demo_over
order by createtime;

cookieid createtime     pv      pv1     pv2     pv3     pv4     pv5      pv6
-----------------------------------------------------------------------------
cookie1 2015-04-10      1       1.0     1.0     3.7142857142857144      1.0     3.0     3.7142857142857144
cookie1 2015-04-11      5       3.0     3.0     3.7142857142857144      3.0     4.333333333333333       4.166666666666667
cookie1 2015-04-12      7       4.333333333333333       4.333333333333333       3.7142857142857144      4.333333333333333       4.0     4.0
cookie1 2015-04-13      3       4.0     4.0     3.7142857142857144      4.0     3.6     3.25
cookie1 2015-04-14      2       3.6     3.6     3.7142857142857144      4.25    4.2     3.3333333333333335
cookie1 2015-04-15      4       3.6666666666666665      3.6666666666666665      3.7142857142857144      4.0     4.0     4.0
cookie1 2015-04-16      4       3.7142857142857144      3.7142857142857144      3.7142857142857144      3.25    3.25    4.0

8.12.2.4 MIN

Example

SELECT cookieid, createtime, pv,
MIN(pv) OVER(PARTITION BY cookieid ORDER BY createtime) AS pv1,  -- default: from partition start to current row
MIN(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS pv2,  -- from start to current row; same as pv1
MIN(pv) OVER(PARTITION BY cookieid) AS pv3,  -- all rows in the partition
MIN(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS pv4,  -- current row + 3 preceding rows
MIN(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS pv5,  -- current row + 3 preceding + 1 following
MIN(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS pv6  -- current row + all following rows
FROM demo_over
order by createtime;

cookieid createtime     pv      pv1     pv2     pv3     pv4     pv5      pv6
-----------------------------------------------------------------------------
cookie1 2015-04-10      1       1       1       1       1       1       1
cookie1 2015-04-11      5       1       1       1       1       1       2
cookie1 2015-04-12      7       1       1       1       1       1       2
cookie1 2015-04-13      3       1       1       1       1       1       2
cookie1 2015-04-14      2       1       1       1       2       2       2
cookie1 2015-04-15      4       1       1       1       2       2       4
cookie1 2015-04-16      4       1       1       1       2       2       4

8.12.2.5 MAX

Example

SELECT cookieid, createtime, pv,
MAX(pv) OVER(PARTITION BY cookieid ORDER BY createtime) AS pv1,  -- default: from partition start to current row
MAX(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS pv2,  -- from start to current row; same as pv1
MAX(pv) OVER(PARTITION BY cookieid) AS pv3,  -- all rows in the partition
MAX(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS pv4,  -- current row + 3 preceding rows
MAX(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS pv5,  -- current row + 3 preceding + 1 following
MAX(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS pv6  -- current row + all following rows
FROM demo_over
order by createtime;

cookieid createtime     pv      pv1     pv2     pv3     pv4     pv5      pv6
-----------------------------------------------------------------------------
cookie1 2015-04-10      1       1       1       7       1       5       7
cookie1 2015-04-11      5       5       5       7       5       7       7
cookie1 2015-04-12      7       7       7       7       7       7       7
cookie1 2015-04-13      3       7       7       7       7       7       4
cookie1 2015-04-14      2       7       7       7       7       7       4
cookie1 2015-04-15      4       7       7       7       7       7       4
cookie1 2015-04-16      4       7       7       7       4       4       4

8.12.3 Analytic Functions

Data preparation

d1,user1,1000
d1,user2,2000
d1,user3,3000
d2,user4,4000
d2,user5,5000

CREATE EXTERNAL TABLE analytics_1 (
  dept STRING,
  userid string,
  sal INT
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
stored as textfile location '/hw/hive/analytics/1';

cookie1,2015-04-10,1
cookie1,2015-04-11,5
cookie1,2015-04-12,7
cookie1,2015-04-13,3
cookie1,2015-04-14,2
cookie1,2015-04-15,4
cookie1,2015-04-16,4
cookie2,2015-04-10,2
cookie2,2015-04-11,3
cookie2,2015-04-12,5
cookie2,2015-04-13,6
cookie2,2015-04-14,3
cookie2,2015-04-15,9
cookie2,2015-04-16,7

CREATE EXTERNAL TABLE analytics_2 (
  cookieid string,
  createtime string,   -- day
  pv INT
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
stored as textfile location '/hw/hive/analytics/2';

8.12.3.1 RANK

Syntax

RANK() generates the rank of each row within its partition; rows with equal values share a rank and leave a gap in the sequence after them.

8.12.3.2 DENSE_RANK

Syntax

DENSE_RANK() generates the rank of each row within its partition; rows with equal values share a rank and leave no gap after them.

Example

SELECT cookieid, createtime, pv,
RANK() OVER(PARTITION BY cookieid ORDER BY pv desc) AS rn1,
DENSE_RANK() OVER(PARTITION BY cookieid ORDER BY pv desc) AS rn2,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY pv DESC) AS rn3
FROM analytics_2
WHERE cookieid = 'cookie1';

cookieid day           pv       rn1     rn2     rn3
--------------------------------------------------
cookie1 2015-04-12      7       1       1       1
cookie1 2015-04-11      5       2       2       2
cookie1 2015-04-15      4       3       3       3
cookie1 2015-04-16      4       3       3       4
cookie1 2015-04-13      3       5       4       5
cookie1 2015-04-14      2       6       5       6
cookie1 2015-04-10      1       7       6       7

rn1: the 04-15 and 04-16 rows tie for 3rd, and the 04-13 row ranks 5th
rn2: the 04-15 and 04-16 rows tie for 3rd, and the 04-13 row ranks 4th
rn3: ties are given distinct, consecutive numbers; the order among equal values is effectively arbitrary

8.12.3.3 ROW_NUMBER

Syntax

ROW_NUMBER() generates a sequence number, starting at 1, for the rows within each partition in the given order, such as the per-cookie rank of daily pv in descending order. ROW_NUMBER() has many applications, e.g. picking the top-ranked row per group, or the first referer of a session.

Example

SELECT cookieid, createtime, pv,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY pv desc) AS rn
FROM analytics_2;

cookieid day           pv       rn
-------------------------------------------
cookie1 2015-04-12      7       1
cookie1 2015-04-11      5       2
cookie1 2015-04-15      4       3
cookie1 2015-04-16      4       4
cookie1 2015-04-13      3       5
cookie1 2015-04-14      2       6
cookie1 2015-04-10      1       7
cookie2 2015-04-15      9       1
cookie2 2015-04-16      7       2
cookie2 2015-04-13      6       3
cookie2 2015-04-12      5       4
cookie2 2015-04-14      3       5
cookie2 2015-04-11      3       6
cookie2 2015-04-10      2       7

8.12.3.4 CUME_DIST

Syntax

CUME_DIST(): (number of rows less than or equal to the current value) / (total number of rows in the partition). For example, the fraction of people whose salary is less than or equal to the current salary.

Example

SELECT dept, userid, sal,
CUME_DIST() OVER(ORDER BY sal) AS rn1,
CUME_DIST() OVER(PARTITION BY dept ORDER BY sal) AS rn2
FROM analytics_1;

dept    userid   sal   rn1       rn2
-------------------------------------------
d1      user1   1000    0.2     0.3333333333333333
d1      user2   2000    0.4     0.6666666666666666
d1      user3   3000    0.6     1.0
d2      user4   4000    0.8     0.5
d2      user5   5000    1.0     1.0

rn1: no PARTITION BY, so all 5 rows form one group:
     row 1: 1 row <= 1000, so 1/5 = 0.2
     row 3: 3 rows <= 3000, so 3/5 = 0.6
rn2: grouped by dept; dept=d1 has 3 rows:
     row 2: 2 rows <= 2000, so 2/3 = 0.6666666666666666

8.12.3.5 PERCENT_RANK

Syntax

PERCENT_RANK(): (RANK of the current row within the partition - 1) / (total rows in the partition - 1).

Example

SELECT dept, userid, sal,
PERCENT_RANK() OVER(ORDER BY sal) AS rn1,   -- over the whole set
RANK() OVER(ORDER BY sal) AS rn11,          -- RANK value over the whole set
SUM(1) OVER(PARTITION BY NULL) AS rn12,     -- total number of rows
PERCENT_RANK() OVER(PARTITION BY dept ORDER BY sal) AS rn2
FROM analytics_1;

dept    userid   sal    rn1    rn11     rn12    rn2
---------------------------------------------------
d1      user1   1000    0.0     1       5       0.0
d1      user2   2000    0.25    2       5       0.5
d1      user3   3000    0.5     3       5       1.0
d2      user4   4000    0.75    4       5       0.0
d2      user5   5000    1.0     5       5       1.0

rn1: rn1 = (rn11-1) / (rn12-1)
     row 1: (1-1)/(5-1) = 0/4 = 0
     row 2: (2-1)/(5-1) = 1/4 = 0.25
     row 4: (4-1)/(5-1) = 3/4 = 0.75
rn2: grouped by dept; dept=d1 has 3 rows:
     row 1: (1-1)/(3-1) = 0
     row 3: (3-1)/(3-1) = 1

8.12.3.6 NTILE

Syntax

NTILE(n) splits the ordered rows of a partition into n slices and returns the slice number of the current row. NTILE does not support ROWS BETWEEN. If the rows do not divide evenly, the earlier slices receive the extra rows.

Example

SELECT cookieid, createtime, pv,
NTILE(2) OVER(PARTITION BY cookieid ORDER BY createtime) AS rn1,  -- split each partition into 2 slices
NTILE(3) OVER(PARTITION BY cookieid ORDER BY createtime) AS rn2,  -- split each partition into 3 slices
NTILE(4) OVER(ORDER BY createtime) AS rn3                         -- split all rows into 4 slices
FROM analytics_2
ORDER BY cookieid, createtime;

cookieid day           pv       rn1     rn2     rn3
-------------------------------------------------
cookie1 2015-04-10      1       1       1       1
cookie1 2015-04-11      5       1       1       1
cookie1 2015-04-12      7       1       1       2
cookie1 2015-04-13      3       1       2       2
cookie1 2015-04-14      2       2       2       3
cookie1 2015-04-15      4       2       3       3
cookie1 2015-04-16      4       2       3       4
cookie2 2015-04-10      2       1       1       1
cookie2 2015-04-11      3       1       1       1
cookie2 2015-04-12      5       1       1       2
cookie2 2015-04-13      6       1       2       2
cookie2 2015-04-14      3       2       2       3
cookie2 2015-04-15      9       2       3       4
cookie2 2015-04-16      7       2       3       4

8.12.4 Distinct (Hive 2.1.0 and later)

COUNT(DISTINCT a) OVER (PARTITION BY c)

8.12.5 Aggregate Functions in the OVER Clause (Hive 2.1.0 and later)

SELECT rank() OVER (ORDER BY sum(b))
FROM T
GROUP BY a;

8.12.6 References

Hive analytic window functions (1): SUM, AVG, MIN, MAX

Hive analytic window functions (2): NTILE, ROW_NUMBER, RANK, DENSE_RANK

Hive analytic window functions (3): CUME_DIST, PERCENT_RANK

Hive analytic window functions (4): LAG, LEAD, FIRST_VALUE, LAST_VALUE

8.13 WITH Expressions (Common Table Expressions)

A WITH clause defines a temporary result set produced by a query. It can be used in SELECT, INSERT, CREATE TABLE AS SELECT, or CREATE VIEW AS SELECT statements.


1. WITH clauses are not supported inside subqueries;
2. Recursive queries are not supported.

8.13.1 Syntax

withClause: cteClause (, cteClause)*
cteClause: cte_name AS (select statement)

8.13.2 Examples

with q1 as ( select key from src where key = '5')
select *
from q1;

-- from style
with q1 as (select * from src where key= '5')
from q1
select *;

-- chaining CTEs
with q1 as ( select key from q2 where key = '5'),
q2 as ( select key from src where key = '5')
select * from (select key from q1) a;

-- union example
with q1 as (select * from src where key= '5'),
q2 as (select * from src s2 where key = '4')
select * from q1 union all select * from q2;

-- insert example
create table s1 like src;
with q1 as ( select key, value from src where key = '5')
from q1
insert overwrite table s1
select *;

-- ctas example
create table s2 as
with q1 as ( select key from src where key = '4')
select * from q1;

-- view example
create view v1 as
with q1 as ( select key from src where key = '5')
select * from q1;
select * from v1;

-- view example, name collision
create view v1 as
with q1 as ( select key from src where key = '5')
select * from q1;
with q1 as ( select key from src where key = '4')
select * from v1;

8.14 References

  1. [Official: LanguageManual Select]
    (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Select)