Hive Study Notes

Source: Internet | Editor: 程序博客网 | Time: 2024/06/13 09:14

Environment:

Hadoop cluster version: hadoop-1.2.1

Hive version: hive-0.10.0

Hive only needs to be installed on one node.

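I. Hive Installation and Configuration

1. Upload the Hive archive (hive-0.10.0-bin.tar.gz) to one node of the Hadoop cluster and extract it:
tar -zxvf hive-0.10.0-bin.tar.gz -C /home/suh/

2. Edit the Hive environment file hive-env.sh and add the following setting to specify the Hadoop installation path (in testing, Hive also appeared to work without it):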

export HADOOP_HOME=/home/suh/hadoop-1.2.1

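3. Configure Hive to store its metastore in a MySQL database

Copy the default configuration template to hive-site.xml, then add the required settings:

cp hive-default.xml.template hive-site.xml

Edit hive-site.xml: delete all of the existing <property></property> pairs, then add the following inside the <configuration> element: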
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://boss:3306/hive_test?createDatabaseIfNotExist=true</value>
<description>JDBC connect string for a JDBC metastore</description>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
<description>Driver class name for a JDBC metastore</description>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>root</value>
<description>username to use against metastore database</description>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>123456</value>
<description>password to use against metastore database</description>
</property>
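
A quick sanity check can confirm that all four metastore properties made it into the file. The following is only a sketch: the function name missing_metastore_props is hypothetical, and it assumes the file lives at $HIVE_HOME/conf/hive-site.xml with each <name> element on a single line, as in the snippet above.

```shell
#!/bin/sh
# Print the name of each required metastore property that is missing
# from the file given as the first argument (hypothetical helper).
missing_metastore_props() {
  for name in \
    javax.jdo.option.ConnectionURL \
    javax.jdo.option.ConnectionDriverName \
    javax.jdo.option.ConnectionUserName \
    javax.jdo.option.ConnectionPassword
  do
    # -F treats the pattern as a fixed string, so the dots are literal.
    grep -qF "<name>$name</name>" "$1" || printf '%s\n' "$name"
  done
}

# ASSUMPTION: the configuration file is at $HIVE_HOME/conf/hive-site.xml.
# The check is skipped when HIVE_HOME is unset or the file does not exist.
if [ -n "${HIVE_HOME:-}" ] && [ -f "$HIVE_HOME/conf/hive-site.xml" ]; then
  missing_metastore_props "$HIVE_HOME/conf/hive-site.xml"
fi
```

No output means none of the four names is missing.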

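4. Once the configuration above is complete, copy the MySQL JDBC driver jar into the $HIVE_HOME/lib directory.
If a permission error occurs, grant privileges in MySQL (run this on the machine where MySQL is installed):
mysql -uroot -p
# Run the statements below. *.* means all tables in all databases; % means any IP address or host may connect. The password must match the one set in hive-site.xml.
GRANT ALL PRIVILEGES ON *.* TO 'root'@'%' IDENTIFIED BY '123456' WITH GRANT OPTION;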
FLUSH PRIVILEGES;

Note: change the character set of the MySQL metastore database to latin1; otherwise errors appear as soon as you run show tables.

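II. Using Hive

Go to the $HIVE_HOME/bin directory and run ./hive to enter the Hive shell. From there, operations are similar to MySQL.

1. Create a table (an internal table by default)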
create table trade_detail(id bigint, account string, income double, expenses double, time string)
row format delimited fields terminated by '\t';

Create a partitioned table
create table trade(tradedate string,tradetime string,securityid string,bidpx1 double,bidsize1 string,offerpx1 double,offersize1 string)
partitioned by (trade_date string)
row format delimited fields terminated by ',';

Create an external table
create external table td_ext(id bigint, account string, income double, expenses double, time string)
row format delimited fields terminated by '\t'
location '/td_ext';

2. Three ways to export data from Hive
(1) Export to the local file system:
insert overwrite local directory '/home/suh/hive/trade_01' select * from trade where tradedate='20130726';

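PS: This HQL statement runs as a MapReduce job. When it finishes, files are written to the /home/suh/hive/trade_01 directory on the local file system.
The file is the output of that job (here it is named 000000_0), and its columns are separated by ^A (ASCII code 1, written \001 in octal).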

(2) Export to HDFS:
insert overwrite directory '/user/trade02' select * from trade where tradedate='20130725';

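PS: The exported data is saved under /user/trade02 in HDFS (the file here is also named 000000_0), with columns again separated by ^A (\001 in octal).
The only difference from the local export statement is the missing local keyword, which changes where the data is stored.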
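
Because ^A is a non-printing character, the exported files are awkward to read directly. A minimal sketch of converting the delimiter to a tab with tr follows; the sample rows are made up for illustration and stand in for a real 000000_0 file.

```shell
#!/bin/sh
# The Ctrl-A byte (octal 001) that Hive uses as its default column delimiter.
sep=$(printf '\001')

# Hypothetical sample standing in for an exported 000000_0 file.
sample=$(mktemp)
printf '%s\n' "20130726${sep}09:30:00${sep}204001${sep}3.5" > "$sample"
printf '%s\n' "20130726${sep}09:31:00${sep}204001${sep}3.6" >> "$sample"

# Replace every Ctrl-A byte with a tab so ordinary tools can split the columns.
converted=$(tr '\001' '\t' < "$sample")
printf '%s\n' "$converted"

# Extract the third column (securityid) to confirm the fields split correctly.
third_column=$(printf '%s\n' "$converted" | cut -f3)
rm -f "$sample"
```

For a real export, the same tr command can be pointed at the file under /home/suh/hive/trade_01.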

(3) Export into another Hive table:
insert into table trade_test partition(trade_date='20130724') select tradedate,tradetime,securityid,bidpx1,bidsize1,offerpx1,offersize1 from trade where tradedate='20130724';

select tradedate,tradetime,securityid,bidpx1,offerpx1 from trade_test where tradedate='20130724';
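PS: trade_test must already exist.

(4) Further notes on exporting:
Hive 0.11.0 introduced a new feature: when writing query results to a file, the user can specify the column delimiter. Earlier versions did not allow this.
For example: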
insert overwrite local directory '/home/suh/hive/trade_01' row format delimited fields terminated by '\t' select * from trade;

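Data can also be exported with the hive -e and -f options. -e is followed directly by a SQL statement in double quotes; -f is followed by a file whose content is SQL. Note that >> appends to the output file rather than overwriting it. For example:
Run: ./hive -e "select * from trade" >> /home/suh/hive/trade001.txt
or
Run: ./hive -f /home/suh/hive/SQL.sql >> /home/suh/hive/trade002.txt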

III. A Practical Business Case:
(1) Create the trade data table and a temporary table:
create table trade(tradedate string,tradetime string,securityid string,bidpx1 double,bidsize1 string,offerpx1 double,offersize1 string)partitioned by(trade_date string) row format delimited fields terminated by ',';

create table trade_tmp(tradedate string,tradetime string,securityid string,bidpx1 double,bidsize1 string,offerpx1 double,offersize1 string) row format delimited fields terminated by ',';

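(2) Import the trade data file total.csv into Hive, using the date as the partition key of the partitioned table:
Because total.csv contains records for several dates, first load it into the temporary table trade_tmp, then insert from the temporary table into the partitioned trade table.
Load into the temporary table trade_tmp: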
load data local inpath '/home/suh/hive/total.csv' overwrite into table trade_tmp;

Insert from the temporary table into the partitioned trade table:
insert into table trade partition(trade_date='20130724') select tradedate,tradetime,securityid,bidpx1,bidsize1,offerpx1,offersize1 from trade_tmp where tradedate='20130724';
insert into table trade partition(trade_date='20130725') select tradedate,tradetime,securityid,bidpx1,bidsize1,offerpx1,offersize1 from trade_tmp where tradedate='20130725';
insert into table trade partition(trade_date='20130726') select tradedate,tradetime,securityid,bidpx1,bidsize1,offerpx1,offersize1 from trade_tmp where tradedate='20130726';
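
The three statements above differ only in the date, so they can be generated rather than typed by hand. The following is a sketch only; the output file name load_trade_partitions.sql is hypothetical, and the date list is taken from the statements above.

```shell
#!/bin/sh
# Column list copied from the trade_tmp table definition.
cols="tradedate,tradetime,securityid,bidpx1,bidsize1,offerpx1,offersize1"

# Hypothetical output file; any writable path would do.
sql_file="${TMPDIR:-/tmp}/load_trade_partitions.sql"
: > "$sql_file"

# Emit one insert statement per trading date.
for d in 20130724 20130725 20130726; do
  printf "insert into table trade partition(trade_date='%s') select %s from trade_tmp where tradedate='%s';\n" \
    "$d" "$cols" "$d" >> "$sql_file"
done

cat "$sql_file"
# To execute the generated statements, pass the file to the Hive CLI:
# "$HIVE_HOME"/bin/hive -f "$sql_file"
```

Adding another date to the loop is then a one-word change.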

(3) Group by securityid and compute each product's daily highest and lowest prices:
select tradedate,securityid,max(bidpx1),min(bidpx1),max(offerpx1),min(offerpx1) from trade group by tradedate,securityid;

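(4) Group by securityid and, with one minute as the smallest unit, compute the per-minute average price of 204001 for any one day. Note that the query below groups only by date and security, so as written it returns one average per day; a per-minute result would also need to group by the minute part of tradetime: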
select tradedate,securityid,AVG(bidpx1),AVG(offerpx1) from trade where securityid='204001' group by tradedate,securityid;
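
One possible way to get per-minute averages is to group by the minute portion of tradetime as well. The sketch below writes such a query to a file for use with hive -f. It assumes, without confirmation from the data, that tradetime is formatted as HH:mm:ss, so its first five characters identify the minute; the file name minute_avg.sql and the chosen date 20130724 are illustrative only.

```shell
#!/bin/sh
# Hypothetical file name for the query.
query_file="${TMPDIR:-/tmp}/minute_avg.sql"

# ASSUMPTION: tradetime looks like 'HH:mm:ss', so substr(tradetime, 1, 5)
# yields 'HH:mm'. Adjust the substr() arguments if the real format differs.
# The trade_date filter restricts the scan to a single day's partition.
cat > "$query_file" <<'EOF'
select tradedate, securityid, substr(tradetime, 1, 5) as trade_minute,
       avg(bidpx1), avg(offerpx1)
from trade
where securityid = '204001' and trade_date = '20130724'
group by tradedate, securityid, substr(tradetime, 1, 5);
EOF

cat "$query_file"
# To run it:
# "$HIVE_HOME"/bin/hive -f "$query_file"
```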

0 0