Hive Study Notes


Study notes on O'Reilly's "Programming Hive".

I. Data Types
1. Primitive Types

tinyint, smallint, int, bigint, boolean, float, double, string, timestamp, binary

As in other databases, type names are case-insensitive. All of these types are implemented on top of Java types. binary is similar to varbinary in other databases, but unlike blob, binary stores raw bytes only.

Type conversion
Hive converts types automatically, but only in the widening direction (e.g., tinyint to smallint). For an explicit conversion, use cast:
select cast(s as int) from table_name;


2. Collection Types
struct, map, array


3. Text File Encoding
Hive can read CSV-style delimited text files.
Hive's default delimiters are: \n for records, ^A (\001) for fields, ^B (\002) for collection items, and ^C (\003) for map keys.

Example:

CREATE TABLE employees (
  name STRING,
  salary FLOAT,
  subordinates ARRAY<STRING>,
  deductions MAP<STRING, FLOAT>,
  address STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
COLLECTION ITEMS TERMINATED BY '\002'
MAP KEYS TERMINATED BY '\003'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;


II. DDL
If no database is specified, the default database is used.

1. Creating a Database

create database database_name;
create database if not exists database_name;

The keyword schema can be used instead of database.

2. Viewing Databases

show databases;
describe database database_name;

show databases supports regular expressions:
show databases like 'def*';

3. Storage Location
By default, database directories are created under the directory given by the property hive.metastore.warehouse.dir. You can also specify a different location explicitly:

create database database_name
location '/home/mclaren/databases';


4. Adding Comments

create database database_name
comment 'the database_name i create';


Another way to attach metadata is to add key-value pairs to the database; these are visible only via describe database extended database_name:

create database database_name
with dbproperties('creator'='mclaren pan', 'date'='2013-12-01');


describe database extended database_name;

5. Switching the Working Database
use database_name;

6. Setting Properties
Properties can be set from the Hive CLI; for example, to show the current database in the prompt:
set hive.cli.print.current.db=true;

7. Dropping a Database
drop database if exists database_name;

8. Altering a Database
alter database database_name set dbproperties ('key1'='value1');

9. Creating Tables

CREATE TABLE IF NOT EXISTS mydb.employees (
  name STRING COMMENT 'Employee name',
  salary FLOAT COMMENT 'Employee salary',
  subordinates ARRAY<STRING> COMMENT 'Names of subordinates',
  deductions MAP<STRING, FLOAT> COMMENT 'Keys are deductions names, values are percentages',
  address STRUCT<street:STRING, city:STRING, state:STRING, zip:INT> COMMENT 'Home address'
)
COMMENT 'Description of the table'
TBLPROPERTIES ('creator'='me', 'created_at'='2012-01-02 10:00:00', ...)
LOCATION '/user/hive/warehouse/mydb.db/employees';

10. External Tables
"External" means the table's data lives at a location Hive does not own; it can be anywhere in HDFS.
Example:

CREATE EXTERNAL TABLE IF NOT EXISTS stocks (
  exchange STRING,
  symbol STRING,
  ymd STRING,
  price_open FLOAT,
  price_high FLOAT,
  price_low FLOAT,
  price_close FLOAT,
  volume INT,
  price_adj_close FLOAT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/stocks';

Here the external location is /data/stocks.
To see whether a table is managed or external, use describe extended table_name.
Some properties of external tables:
1. Because the data is external, Hive does not own it: dropping the table deletes the metadata but leaves the data in place.
2. Some HiveQL constructs cannot be used with external tables.

As with managed tables, the schema (but not the data) of an external table can be copied.
For example:

create external table if not exists mydb.employee2
like mydb.employee
location '/home/mclaren/data';

11. Partitioned Managed Tables
Partitioning is used mainly to distribute load horizontally and to move frequently accessed data closer to the users who need it, among other purposes.
Each partition column becomes a subdirectory named after the column and its value. For example (note that partition columns are declared only in the partitioned by clause, never repeated in the column list):

create table table1 (
  col2 string,
  col3 int,
  col4 struct<subcol1:string, subcol2:string>
)
partitioned by (col1 string, subcol2 string);

The following directory structure is created:
%WAREHOUSE%/table1
%WAREHOUSE%/table1/col1=**/subcol2=**
%WAREHOUSE%/table1/col1=**/subcol2=***
%WAREHOUSE%/table1/col1=***/subcol2=**
%WAREHOUSE%/table1/col1=***/subcol2=***
The benefit is query efficiency: a WHERE filter on partition columns lets Hive scan only the matching subdirectories. Without such a filter, every subdirectory is scanned; that is rare, but when it happens it triggers an enormous, resource-hungry MapReduce job. To guard against it, set:
set hive.mapred.mode=strict;

With this setting, a query without a partition predicate fails with:
FAILED: Error in semantic analysis: No partition predicate found for
Alias "*" Table "****"
To revert:
set hive.mapred.mode=nonstrict;

Viewing partitions:

show partitions table_name;
show partitions table_name partition(column='***');
describe extended table_name;

12. Partitioned External Tables

create external table if not exists ext_part_tbl (
  col1 string,
  col2 string,
  col3 string
)
partitioned by (col4 string, col5 string)
row format delimited fields terminated by '\t';

alter table table_name add partition(col4='**', col5='**');
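For an external partitioned table you normally supply the partition's location yourself; a minimal sketch, where the partition values and the HDFS path are illustrative:

-- the values 'a'/'b' and the path below are hypothetical examples
alter table ext_part_tbl
add partition(col4='a', col5='b')
location 'hdfs://namenode/data/ext_part_tbl/a/b';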

13. Customizing Table Storage Formats

CREATE TABLE employees (
  name STRING,
  salary FLOAT,
  subordinates ARRAY<STRING>,
  deductions MAP<STRING, FLOAT>,
  address STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
COLLECTION ITEMS TERMINATED BY '\002'
MAP KEYS TERMINATED BY '\003'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

TEXTFILE means each line is one record. Besides TEXTFILE you can use SEQUENCEFILE or RCFILE, both of which are optimized for disk and network I/O. You can also plug in custom input and output format classes:
CREATE TABLE kst
PARTITIONED BY (ds string)
ROW FORMAT SERDE 'com.linkedin.haivvreo.AvroSerDe'
WITH SERDEPROPERTIES ('schema.url'='http://schema_provider/kst.avsc')
STORED AS
INPUTFORMAT 'com.linkedin.haivvreo.AvroContainerInputFormat'
OUTPUTFORMAT 'com.linkedin.haivvreo.AvroContainerOutputFormat';

14. Dropping Tables
drop table if exists table_name;

15. Altering Tables

ALTER TABLE log_messages ADD IF NOT EXISTS
PARTITION (year = 2011, month = 1, day = 1) LOCATION '/logs/2011/01/01'
PARTITION (year = 2011, month = 1, day = 2) LOCATION '/logs/2011/01/02'
PARTITION (year = 2011, month = 1, day = 3) LOCATION '/logs/2011/01/03'
...;
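An existing partition can also be pointed at a new location without moving any data; a sketch against the same table (the target path is illustrative):

alter table log_messages
partition (year = 2011, month = 1, day = 1)
-- hypothetical archive path:
set location '/logs_archive/2011/01/01';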

16. Changing Columns

alter table table_name
change column old_col new_col string
comment 'rename column old_col to new_col'
after other_col;

This renames old_col to new_col and moves it after other_col. To move the column to the first position instead, use first (which takes no column name):

alter table table_name
change column old_col new_col string
comment 'rename column old_col to new_col'
first;

17. Adding Columns

alter table table_name add columns (
  col1 string comment 'first col',
  col2 string comment 'second col'
);

18. Removing Columns
replace columns keeps only the columns you list, effectively deleting the rest:

alter table table_name replace columns (
  col1 string comment 'first col',
  col2 string comment 'second col'
);

19. Changing Table Properties

alter table table_name set tblproperties (
  'prop1'='this is prop1'
);


20. Changing Storage Properties

alter table table_name
partition (col1=**, col2=*)
set fileformat sequencefile;

ALTER TABLE table_using_JSON_storage
SET SERDE 'com.example.JSONSerDe'
WITH SERDEPROPERTIES (
  'prop1' = 'value1',
  'prop2' = 'value2'
);

To protect a partition from being dropped:

alter table table_name
partition(col1=**, col2=**) enable no_drop;

To protect a partition from being queried:

alter table table_name
partition(col1=*, col2=*) enable offline;

III. DML
1. Loading Data into Managed Tables

load data local inpath '${env:HOME}/california-employees'
overwrite into table employees
partition (country = 'US', state = 'CA');

This command first creates the partition directory, then copies the data into it.
Notes:
1. If the table is not partitioned, omit the partition clause.
2. With the local keyword the file is copied from the local filesystem; without it, the path refers to the distributed filesystem and the data is moved to the target directory.
3. You cannot copy from one cluster to another this way.

2. Inserting Data from Queries

insert overwrite table employees
partition (country = 'US', state = 'OR')
select * from staged_employees se
where se.cnty = 'US' and se.st = 'OR';

Several partitions can be populated in one scan of the source table:

from staged_employees se
insert overwrite table employees
partition (country = 'US', state = 'OR')
select * where se.cnty = 'US' and se.st = 'OR'
insert overwrite table employees
partition (country = 'US', state = 'CA')
select * where se.cnty = 'US' and se.st = 'CA';

3. Dynamic Partition Inserts
The multi-insert form above still needs one clause per partition. With a dynamic partition insert, Hive derives the partition values from the last columns of the SELECT:

insert overwrite table employees
partition (country = 'US', state)
select ..., se.cnty, se.st
from staged_employees se
where se.cnty = 'US';
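Dynamic partitioning is disabled or restricted by default; a sketch of the usual settings to enable it (the per-node cap value is illustrative):

set hive.exec.dynamic.partition=true;
-- nonstrict allows every partition column to be dynamic:
set hive.exec.dynamic.partition.mode=nonstrict;
-- cap on the number of dynamic partitions each node may create:
set hive.exec.max.dynamic.partitions.pernode=100;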

4. Creating a Table and Loading It from a Query

create table ca_employees
as select name, salary, address
from employees se
where se.state = 'CA';


5. Exporting Data
Method one: if the data files are already formatted the way you want, simply copy them with a hadoop command:

hadoop fs -cp source_path target_path

Method two: use INSERT OVERWRITE [LOCAL] DIRECTORY:

insert overwrite local directory '/home/mclaren/data'
select name, salary, address
from employees se
where se.state = 'CA';

Several directories can be written in a single pass:

FROM staged_employees se
INSERT OVERWRITE DIRECTORY '/tmp/or_employees'
SELECT * WHERE se.cty = 'US' and se.st = 'OR'
INSERT OVERWRITE DIRECTORY '/tmp/ca_employees'
SELECT * WHERE se.cty = 'US' and se.st = 'CA'
INSERT OVERWRITE DIRECTORY '/tmp/il_employees'
SELECT * WHERE se.cty = 'US' and se.st = 'IL';

6. Deleting Data While Keeping the Table Structure
That is, delete the table's subdirectory under the warehouse directory. From the Hive CLI:

dfs -rmr /home/mclaren/datascope/hive/warehouse/t_order;

IV. HiveQL Queries
If a column is an ARRAY, use column[index] to fetch an element.
If a column is a MAP, use column[key] to fetch a value.
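For example, against the employees table defined earlier (struct fields are reached with dot notation):

SELECT name, subordinates[0], deductions["Federal Taxes"], address.street
FROM employees;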

Columns can also be selected with a regular expression, for example:

select symbol, `price.*` from stocks;

Column values can be computed from other columns with functions and arithmetic, as the examples below show.


Aggregation performance can be improved by setting:
set hive.map.aggr=true;
This enables top-level aggregation in the map phase, at the cost of extra memory.
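For example, an aggregate query that benefits from map-side aggregation:

SELECT count(*), avg(salary) FROM employees;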


Limiting results

SELECT upper(name), salary, deductions["Federal Taxes"], round(salary * (1 - deductions["Federal Taxes"])) FROM employees LIMIT 2;

Column aliases
SELECT upper(name), salary, deductions["Federal Taxes"] as fed_taxes, round(salary * (1 - deductions["Federal Taxes"])) as salary_minus_fed_taxes FROM employees LIMIT 2;

Nested queries

FROM (
  SELECT upper(name) AS name, salary, deductions["Federal Taxes"] AS fed_taxes,
    round(salary * (1 - deductions["Federal Taxes"])) AS salary_minus_fed_taxes
  FROM employees
) e
SELECT e.name, e.salary_minus_fed_taxes
WHERE e.salary_minus_fed_taxes > 70000;

CASE ... WHEN ... THEN

SELECT name, salary,
  CASE
    WHEN salary < 50000.0 THEN 'low'
    WHEN salary >= 50000.0 AND salary < 70000.0 THEN 'middle'
    WHEN salary >= 70000.0 AND salary < 100000.0 THEN 'high'
    ELSE 'very high'
  END AS bracket
FROM employees;

Whether MapReduce is triggered
Some Hive queries can be satisfied without MapReduce; this is so-called local mode. For example:
select * from employees;

Queries whose WHERE clause filters only on partition columns (with or without a LIMIT) also avoid MapReduce.
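For example, against the partitioned employees table above:

SELECT * FROM employees
WHERE country = 'US' AND state = 'CA'
LIMIT 100;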


Furthermore, when the following is set:
set hive.exec.mode.local.auto=true;
Hive will try to run other queries in local mode as well; otherwise MapReduce is used.

WHERE clauses
Note: a column alias cannot be referenced in the WHERE clause; wrap the query in a nested SELECT instead, as sketched below.
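A minimal sketch of the nested-SELECT workaround, reusing the salary_minus_fed_taxes alias from the earlier examples:

SELECT e.* FROM (
  SELECT name, salary, deductions["Federal Taxes"] as ded,
    salary * (1 - deductions["Federal Taxes"]) as salary_minus_fed_taxes
  FROM employees
) e
WHERE round(e.salary_minus_fed_taxes) > 70000;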


LIKE and RLIKE
RLIKE matches against a regular expression:

SELECT name, address.street FROM employees WHERE address.street RLIKE '.*(Chicago|Ontario).*';

GROUP BY statements
SELECT year(ymd), avg(price_close) FROM stocks WHERE exchange = 'NASDAQ' AND symbol = 'AAPL' GROUP BY year(ymd);

HAVING statements
SELECT year(ymd), avg(price_close) FROM stocks WHERE exchange = 'NASDAQ' AND symbol = 'AAPL' GROUP BY year(ymd) HAVING avg(price_close) > 50.0;

JOIN statements
SELECT a.ymd, a.price_close, b.price_close FROM stocks a JOIN stocks b ON a.ymd = b.ymd WHERE a.symbol = 'AAPL' AND b.symbol = 'IBM';

SELECT a.ymd, a.price_close, b.price_close , c.price_close FROM stocks a JOIN stocks b ON a.ymd = b.ymd JOIN stocks c ON a.ymd = c.ymd WHERE a.symbol = 'AAPL' AND b.symbol = 'IBM' AND c.symbol = 'GE';

Left outer join
SELECT s.ymd, s.symbol, s.price_close, d.dividend FROM stocks s LEFT OUTER JOIN dividends d ON s.ymd = d.ymd AND s.symbol = d.symbol WHERE s.symbol = 'AAPL';

Right outer join
SELECT s.ymd, s.symbol, s.price_close, d.dividend FROM dividends d RIGHT OUTER JOIN stocks s ON d.ymd = s.ymd AND d.symbol = s.symbol WHERE s.symbol = 'AAPL';

Full outer join
SELECT s.ymd, s.symbol, s.price_close, d.dividend FROM dividends d FULL OUTER JOIN stocks s ON d.ymd = s.ymd AND d.symbol = s.symbol WHERE s.symbol = 'AAPL';

Left semi join
SELECT s.ymd, s.symbol, s.price_close FROM stocks s LEFT SEMI JOIN dividends d ON s.ymd = d.ymd AND s.symbol = d.symbol;

