Hive Installation, Configuration, and Exercises


Contents

I. Installing Hive

  1. Hive configuration steps
    • Step 1
    • Step 2
    • Step 3
    • Step 4
    • Step 5
  2. Start Hive
  3. Start the metastore service
  4. Start the HiveServer2 service
  5. Start the Hive client
  6. Query the metadata in the MySQL database

II. Basic Usage of Hive

  1. Enter the Hive console
  2. Create a table: create a new table; load t_hive.txt into the t_hive table
  3. View tables and data: view tables; pattern-match table names; view table data; view the table schema
  4. Alter a table: add a column; rename a table
  5. Drop a table

III. Hive Interactive Mode

IV. Importing Data

  1. Load data from the local filesystem (LOCAL); find the loaded data in HDFS; load data from HDFS

V. Exporting Data

  1. Copy from HDFS to another HDFS location
  2. Export to the local filesystem via Hive
  3. Inspect the result on the local OS

VI. Querying with HiveQL

  1. Basic queries: sorting, column aliases, nested subqueries
  2. Join queries: JOIN
  3. Aggregate query 1: count, avg
  4. Aggregate query 2: count, distinct
  5. Aggregate query 3: GROUP BY, HAVING

VII. Hive Views

  - Drop a view

VIII. Hive Partitioned Tables

  1. Create the data
  2. Create the table
  3. Create the partitioned table
  4. Load the data
  5. View the partitions
  6. Query the data

Preface

Hive is a programming interface on top of Hadoop. Its SQL-like syntax lets data analysts get productive quickly, keeps the Java side of the Hadoop world simple and lightweight, and has brought Hadoop to people beyond programmers.
Starting with Hive, analysts too can work with big data.
I. Installing Hive

System environment

With a working Hadoop environment, we can install Hive on the NameNode machine (c1).
For setting up Hadoop, see the series "Running Hadoop in the Cloud" and "RHadoop in Practice, Part 1: Setting Up a Hadoop Environment".
Download: apache-hive-1.2.2-bin.tar.gz
Place it at /home/cos/apache-hive-1.2.2-bin.tar.gz (which directory you extract into is up to you).

Hive configuration

1. Step 1
(1) Unpack the archive:
tar -zxvf apache-hive-1.2.2-bin.tar.gz
(2) Add the environment variables:
[user@user ~]$ vim .bashrc
Add the lines below, then run source ~/.bashrc to apply them:

export HIVE_HOME=/path/to/your/hive   // your Hive extraction path
export PATH=$PATH:$HIVE_HOME/bin

Before installing Hive itself, install MySQL first. By default Hive uses the embedded Derby as its metastore database, but Derby allows only a single session, so we use MySQL as the metastore instead.
Check whether MySQL is already installed:
netstat -tap | grep mysql
If this prints anything, MySQL is already present; otherwise it needs to be installed.

2. Step 2
Install MySQL (skip this step if MySQL is already installed):
[user@user home]# wget http://dev.mysql.com/get/mysql-community-release-el7-5.noarch.rpm
[user@user home]# rpm -ivh mysql-community-release-el7-5.noarch.rpm
[user@user home]# yum install mysql-community-server

Loaded plugins: fastestmirror, langpacks
base | 3.6 kB 00:00
extras | 3.4 kB 00:00
mysql-connectors-community | 2.5 kB 00:00
mysql-tools-community | 2.5 kB 00:00
mysql56-community | 2.5 kB 00:00
updates | 3.4 kB 00:00
(1/4): mysql-connectors-community/x86_64/primary_db | 14 kB 00:03
(2/4): mysql-tools-community/x86_64/primary_db | 33 kB 00:03
(3/4): mysql56-community/x86_64/primary_db | 168 kB 00:04
(4/4): updates/7/x86_64/primary_db | 4.8 MB 00:14
Loading mirror speeds from cached hostfile
* base: mirrors.aliyun.com
* extras: mirrors.aliyun.com
* updates: mirrors.aliyun.com
Resolving Dependencies
--> Running transaction check
……
……
Verifying : mysql-community-libs-5.6.36-2.el7.x86_64 11/11

Installed:
mysql-community-server.x86_64 0:5.6.36-2.el7

Dependency Installed:
mysql-community-client.x86_64 0:5.6.36-2.el7
mysql-community-common.x86_64 0:5.6.36-2.el7
mysql-community-libs.x86_64 0:5.6.36-2.el7
perl-Compress-Raw-Bzip2.x86_64 0:2.061-3.el7
perl-Compress-Raw-Zlib.x86_64 1:2.061-4.el7
perl-DBI.x86_64 0:1.627-4.el7
perl-Data-Dumper.x86_64 0:2.145-3.el7
perl-IO-Compress.noarch 0:2.061-2.el7
perl-Net-Daemon.noarch 0:0.48-5.el7
perl-PlRPC.noarch 0:0.2020-14.el7

Complete!

3. Step 3
Start the MySQL service:
[user@localhost ~]$ service mysqld start

[user@localhost ~]$ mysql -u root

Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 5
Server version: 5.6.36 MySQL Community Server (GPL)

Copyright (c) 2000, 2017, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> use mysql;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed

// set a password for the root user
mysql> UPDATE user SET password = PASSWORD('123456') WHERE user = 'root';

Query OK, 4 rows affected (0.00 sec)
Rows matched: 4 Changed: 4 Warnings: 0

mysql> flush privileges;
Query OK, 0 rows affected (0.00 sec)

mysql> exit
Bye

[user@localhost ~]$ mysql -u root -p
Enter password:
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 10
Server version: 5.6.36 MySQL Community Server (GPL)

Copyright (c) 2000, 2017, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

Create the hive user, database, and grants:
mysql> use mysql;
mysql> GRANT USAGE ON *.* TO 'hive'@'localhost' IDENTIFIED BY 'hive' WITH GRANT OPTION;
mysql> grant all on hive.* to hive@'%' identified by 'hive';
mysql> grant all on hive.* to hive@'localhost' identified by 'hive';
mysql> create database hive;
mysql> flush privileges;
mysql> ALTER DATABASE hive CHARACTER SET latin1;
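
As an extra sanity check (not part of the original steps), confirm the new account and database are usable before wiring up Hive:

[user@localhost ~]$ mysql -u hive -phive -e "SHOW DATABASES LIKE 'hive';"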

4. Step 4
Configure Hive's files:

[user@localhost conf]$ sudo cp hive-default.xml.template hive-site.xml
[sudo] password for user:
[user@localhost conf]$ ls
beeline-log4j.properties.template    hive-log4j.properties.template
hive-default.xml.template            hive-site.xml
hive-env.sh.template                 ivysettings.xml
hive-exec-log4j.properties.template
[user@localhost conf]$ sudo vim hive-site.xml
[sudo] password for user:

The main parameters to change are the following:

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://localhost:3306/hive</value>
  <!-- "hive" is the database you created above -->
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
  <!-- the JDBC driver -->
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hive</value>
  <!-- the database password -->
</property>
<property>
  <name>datanucleus.autoCreateSchema</name>
  <value>true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
  <!-- the database user -->
  <description>Username to use against metastore database</description>
</property>
<property>
  <name>hive.exec.local.scratchdir</name>
  <value>/home/hdpsrc/hive/iotmp</value>
  <!-- a directory of your choosing -->
  <description>Local scratch space for Hive jobs</description>
</property>
<property>
  <name>hive.downloaded.resources.dir</name>
  <value>/home/hdpsrc/hive/iotmp</value>
  <!-- a directory of your choosing -->
  <description>Temporary local directory for added resources in the remote file system.</description>
</property>
<property>
  <name>hive.querylog.location</name>
  <value>/home/hdpsrc/hive/iotmp</value>
  <!-- a directory of your choosing -->
  <description>Location of Hive run time structured log file</description>
</property>
<property>
  <name>hive.server2.logging.operation.log.location</name>
  <value>/your/directory/operation_logs</value>
  <!-- note: you must create the operation_logs directory yourself -->
  <description>Top level directory where operation logs are stored if logging functionality is enabled</description>
</property>

Then set up the log4j properties file:

[user@localhost conf]$ sudo cp hive-log4j.properties.template hive-log4j.properties
[sudo] password for user:
[user@localhost conf]$ sudo vim hive-log4j.properties

In hive-log4j.properties, replace the EventCounter appender:

#log4j.appender.EventCounter=org.apache.hadoop.hive.shims.HiveEventCounter
log4j.appender.EventCounter=org.apache.hadoop.log.metrics.EventCounter

After editing the files above, copy the MySQL Connector/J jar into Hive's lib directory; this step must not be skipped.
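
For example (a sketch; the connector version and paths are placeholders for whatever you downloaded and wherever you unpacked Hive):

[user@localhost ~]$ cp mysql-connector-java-5.1.41-bin.jar /home/apache-hive-1.2.2-bin/lib/

Hive 1.2 also ships a schematool utility; if you prefer not to rely on datanucleus.autoCreateSchema, you can initialize the metastore schema explicitly:

[user@localhost apache-hive-1.2.2-bin]$ bin/schematool -dbType mysql -initSchema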

5. Step 5
Start Hadoop:

[user@localhost hadoop-2.7.3]$ ./sbin/start-all.sh
This script is Deprecated. Instead use start-dfs.sh and start-yarn.sh
Starting namenodes on [localhost]
localhost: namenode running as process 3931. Stop it first.
localhost: datanode running as process 4055. Stop it first.
Starting secondary namenodes [0.0.0.0]
0.0.0.0: secondarynamenode running as process 4255. Stop it first.
starting yarn daemons
resourcemanager running as process 4451. Stop it first.
localhost: nodemanager running as process 4566. Stop it first.
[user@localhost hadoop-2.7.3]$ jps
4451 ResourceManager
4566 NodeManager
4055 DataNode
3931 NameNode
10061 Jps
4255 SecondaryNameNode
[user@localhost apache-hive-1.2.2-bin]$ hive
hive> show databases;
OK
default
Time taken: 1.926 seconds, Fetched: 1 row(s)

Possible problems:
1. While installing MySQL: "/var/run/yum.pid is locked; another program with PID 1610 is running."
Fix: rm -f /var/run/yum.pid
After deleting the file, yum works again.
2. After Hadoop is up, typing hive throws a RuntimeException.
Try a different version of the mysql-connector jar. Dragging the jar from Windows straight into the virtual machine can corrupt it; download the jar inside the VM instead.

Next, create the directories on HDFS:

$HADOOP_HOME/bin/hadoop fs -mkdir /tmp
$HADOOP_HOME/bin/hadoop fs -mkdir /user/warehouse
// if this errors, run: hadoop fs -mkdir -p /user/warehouse
$HADOOP_HOME/bin/hadoop fs -chmod g+w /tmp            // grant group write permission
$HADOOP_HOME/bin/hadoop fs -chmod g+w /user/warehouse // grant group write permission

Start Hive
Start the metastore service:
~ bin/hive --service metastore &

Output: Starting Hive Metastore Server…

Start the HiveServer2 service:
~ bin/hive --service hiveserver2 &
Output: Starting Hive Thrift Server…
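
Once HiveServer2 is up, you can connect to it with the bundled Beeline client (a quick check; the default port 10000 assumes a stock configuration):

~ bin/beeline -u jdbc:hive2://localhost:10000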

Start the Hive client:
~ bin/hive shell

hive> show tables;
OK

Query the metadata in the MySQL database:

~ mysql -u root -p
mysql> use hive;
Database changed
mysql> show tables;
+-------------------------+
| Tables_in_hive          |
+-------------------------+
| BUCKETING_COLS          |
| CDS                     |
| COLUMNS_V2              |
| DATABASE_PARAMS         |
| DBS                     |
| IDXS                    |
| INDEX_PARAMS            |
| PARTITIONS              |
| PARTITION_KEYS          |
| PARTITION_KEY_VALS      |
| PARTITION_PARAMS        |
| PART_COL_PRIVS          |
| PART_PRIVS              |
| SDS                     |
| SD_PARAMS               |
| SEQUENCE_TABLE          |
| SERDES                  |
| SERDE_PARAMS            |
| SORT_COLS               |
| TABLE_PARAMS            |
| TBLS                    |
| TBL_COL_PRIVS           |
| TBL_PRIVS               |
+-------------------------+
23 rows in set (0.00 sec)

Hive is now installed and working; what follows is a usage guide.

II. Basic Usage of Hive

1. Enter the Hive console
~ cd /home/apache-hive-1.2.2-bin/
~ bin/hive shell

Output: Logging initialized using configuration in ...

2. Create a table

Create the data file (text, tab-separated):

sudo vim /home/hive/t_hive.txt // create the t_hive.txt file; the location is up to you

Enter the data below, with columns separated by tab characters (pasting directly may cause problems when the data is loaded into Hive; a shell alternative follows the data):

16      2       3
61      12      13
41      2       31
17      21      3
71      2       31
1       12      34
11      2       34
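
If typing literal tabs is awkward in your editor, you can generate the file from the shell instead (a sketch; printf expands each \t into a real tab character):

~ printf '16\t2\t3\n61\t12\t13\n41\t2\t31\n17\t21\t3\n71\t2\t31\n1\t12\t34\n11\t2\t34\n' | sudo tee /home/hive/t_hive.txt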

Create the new table:

hive> CREATE TABLE t_hive (a int, b int, c int) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
OK
Time taken: 0.489 seconds

Load t_hive.txt into the t_hive table:

hive> LOAD DATA LOCAL INPATH '/home/hive/t_hive.txt' OVERWRITE INTO TABLE t_hive;
Loading data to table default.t_hive
OK
Time taken: 0.397 seconds

3. View tables and data

View tables:

hive> show tables;
OK
t_hive
Time taken: 0.099 seconds

Pattern-match table names:

hive> show tables '*t*';
OK
t_hive
Time taken: 0.065 seconds

View table data:

hive> select * from t_hive;
OK
16      2       3
61      12      13
41      2       31
17      21      3
71      2       31
1       12      34
11      2       34

Time taken: 0.264 seconds

View the table schema:

hive> desc t_hive;
OK
a       int
b       int
c       int

Time taken: 0.1 seconds

4. Alter a table

Add a column:

hive> ALTER TABLE t_hive ADD COLUMNS (new_col String);
OK
Time taken: 0.186 seconds
hive> desc t_hive;
OK
a       int
b       int
c       int
new_col string
Time taken: 0.086 seconds

Rename the table:

hive> ALTER TABLE t_hive RENAME TO t_hadoop;
OK
Time taken: 0.45 seconds
hive> show tables;
OK
t_hadoop
Time taken: 0.07 seconds

5. Drop a table

hive> DROP TABLE t_hadoop;
OK
Time taken: 0.767 seconds
hive> show tables;
OK
Time taken: 0.064 seconds

III. Hive Interactive Mode

• quit, exit: leave the interactive shell
• reset: reset configuration to the defaults
• set <key>=<value>: set the value of a particular variable (note: misspelled variable names do not raise an error)
• set: print the configuration variables overridden by the user
• set -v: print all Hadoop and Hive configuration variables
• add FILE[S] *, add JAR[S] *, add ARCHIVE[S] *: add one or more files, jars, or archives to the distributed cache
• list FILE[S], list JAR[S], list ARCHIVE[S]: list the resources already added to the distributed cache
• list FILE[S] *, list JAR[S] *, list ARCHIVE[S] *: check whether the given resources are in the distributed cache
• delete FILE[S] *, delete JAR[S] *, delete ARCHIVE[S] *: remove the given resources from the distributed cache
• ! <command>: run a shell command from the Hive shell
• dfs <dfs command>: run a dfs command from the Hive shell
• <query string>: run a Hive query and print the results to standard output
• source FILE <filepath>: run a Hive script file inside the CLI
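
A short sample session exercising a few of these commands (the property value and script path are placeholders):

hive> set hive.cli.print.header=true;
hive> dfs -ls /user/warehouse/;
hive> ! date;
hive> source /home/hive/my_script.hql;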

IV. Importing Data
We continue with the t_hive table from above.

Create the table schema:
hive> CREATE TABLE t_hive (a int, b int, c int) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
1. Load data from the local filesystem (LOCAL)

hive> LOAD DATA LOCAL INPATH '/home/hive/t_hive.txt' OVERWRITE INTO TABLE t_hive;
Loading data to table default.t_hive
OK
Time taken: 0.612 seconds

Find the just-loaded data in HDFS:
[user@localhost hadoop-2.7.3]$ cd bin/
[user@localhost bin]$ hadoop fs -cat /user/warehouse/t_hive/t_hive.txt
16 2 3
61 12 13
41 2 31
17 21 3
71 2 31
1 12 34
11 2 34
Load data from HDFS.
Create table t_hive2:
hive> CREATE TABLE t_hive2 (a int, b int, c int) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

Load the data from HDFS:
hive> LOAD DATA INPATH '/user/warehouse/t_hive/t_hive.txt' OVERWRITE INTO TABLE t_hive2;
Loading data to table default.t_hive2
Table default.t_hive2 stats: [numFiles=1, numRows=0, totalSize=68, rawDataSize=0]
OK
Time taken: 2.649 seconds
hive>

View the data:

hive> select * from t_hive2;
OK
16      2       3
61      12      13
41      2       31
17      21      3
71      2       31
1       12      34
11      2       34
Time taken: 0.287 seconds

Import data from another table:

hive> INSERT OVERWRITE TABLE t_hive2 SELECT * FROM t_hive;
Total MapReduce jobs = 2
Launching Job 1 out of 2
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_201307131407_0002, Tracking URL = http://c1.wtmart.com:50030/jobdetails.jsp?jobid=job_201307131407_0002
Kill Command = /home/cos/toolkit/hadoop-1.0.3/libexec/../bin/hadoop job  -Dmapred.job.tracker=hdfs://c1.wtmart.com:9001 -kill job_201307131407_0002
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2013-07-16 10:32:41,979 Stage-1 map = 0%,  reduce = 0%
2013-07-16 10:32:48,034 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.03 sec
2013-07-16 10:32:49,050 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.03 sec
2013-07-16 10:32:50,068 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.03 sec
2013-07-16 10:32:51,082 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.03 sec
2013-07-16 10:32:52,093 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.03 sec
2013-07-16 10:32:53,102 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.03 sec
2013-07-16 10:32:54,112 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 1.03 sec
MapReduce Total cumulative CPU time: 1 seconds 30 msec
Ended Job = job_201307131407_0002
Ended Job = -314818888, job is filtered out (removed at runtime).
Moving data to: hdfs://c1.wtmart.com:9000/tmp/hive-cos/hive_2013-07-16_10-32-31_323_5732404975764014154/-ext-10000
Loading data to table default.t_hive2
Deleted hdfs://c1.wtmart.com:9000/user/hive/warehouse/t_hive2
Table default.t_hive2 stats: [num_partitions: 0, num_files: 1, num_rows: 0, total_size: 56, raw_data_size: 0]
7 Rows loaded to t_hive2
MapReduce Jobs Launched:
Job 0: Map: 1   Cumulative CPU: 1.03 sec   HDFS Read: 273 HDFS Write: 56 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 30 msec
OK
Time taken: 23.227 seconds
hive> select * from t_hive2;
OK
16      2       3
61      12      13
41      2       31
17      21      3
71      2       31
1       12      34
11      2       34
Time taken: 0.134 seconds

Create a table and import data from another table (CTAS).

First drop the existing table:

hive> DROP TABLE t_hive;

Then recreate it from t_hive2:

hive> CREATE TABLE t_hive AS SELECT * FROM t_hive2;
Total MapReduce jobs = 2
Launching Job 1 out of 2
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_201307131407_0003, Tracking URL = http://c1.wtmart.com:50030/jobdetails.jsp?jobid=job_201307131407_0003
Kill Command = /home/cos/toolkit/hadoop-1.0.3/libexec/../bin/hadoop job  -Dmapred.job.tracker=hdfs://c1.wtmart.com:9001 -kill job_201307131407_0003
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2013-07-16 10:36:48,612 Stage-1 map = 0%,  reduce = 0%
2013-07-16 10:36:54,648 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.13 sec
2013-07-16 10:36:55,657 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.13 sec
2013-07-16 10:36:56,666 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.13 sec
2013-07-16 10:36:57,673 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.13 sec
2013-07-16 10:36:58,683 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.13 sec
2013-07-16 10:36:59,691 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 1.13 sec
MapReduce Total cumulative CPU time: 1 seconds 130 msec
Ended Job = job_201307131407_0003
Ended Job = -670956236, job is filtered out (removed at runtime).
Moving data to: hdfs://c1.wtmart.com:9000/tmp/hive-cos/hive_2013-07-16_10-36-39_986_1343249562812540343/-ext-10001
Moving data to: hdfs://c1.wtmart.com:9000/user/hive/warehouse/t_hive
Table default.t_hive stats: [num_partitions: 0, num_files: 1, num_rows: 0, total_size: 56, raw_data_size: 0]
7 Rows loaded to hdfs://c1.wtmart.com:9000/tmp/hive-cos/hive_2013-07-16_10-36-39_986_1343249562812540343/-ext-10000
MapReduce Jobs Launched:
Job 0: Map: 1   Cumulative CPU: 1.13 sec   HDFS Read: 272 HDFS Write: 56 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 130 msec
OK
Time taken: 20.13 seconds
hive> select * from t_hive;
OK
16      2       3
61      12      13
41      2       31
17      21      3
71      2       31
1       12      34
11      2       34
Time taken: 0.109 seconds

Copy only the schema, without the data:

hive> CREATE TABLE t_hive3 LIKE t_hive;
hive> select * from t_hive3;
OK
Time taken: 0.077 seconds

Importing data from a MySQL database
will be covered when we introduce Sqoop.
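
As a preview (a sketch only: the connection string, table names, and credentials are placeholders, and Sqoop must be installed separately):

~ sqoop import --connect jdbc:mysql://localhost:3306/testdb \
      --username hive --password hive \
      --table some_table --hive-import --hive-table t_from_mysql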

V. Exporting Data
1. Copy from HDFS to another HDFS location:

~ hadoop fs -cp /user/warehouse/t_hive /
~ hadoop fs -ls /t_hive
Found 1 items
-rw-r--r--   1 cos supergroup         56 2013-07-16 10:41 /t_hive/000000_0
~ hadoop fs -cat /t_hive/000000_0
16236112134123117213712311123411234

(The numbers look run together because Hive wrote this file with its invisible default field delimiter, \001 / Ctrl-A, between columns.)

2. Export to the local filesystem via Hive:

hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/t_hive' SELECT * FROM t_hive;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_201307131407_0005, Tracking URL = http://c1.wtmart.com:50030/jobdetails.jsp?jobid=job_201307131407_0005
Kill Command = /home/cos/toolkit/hadoop-1.0.3/libexec/../bin/hadoop job  -Dmapred.job.tracker=hdfs://c1.wtmart.com:9001 -kill job_201307131407_0005
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2013-07-16 10:46:24,774 Stage-1 map = 0%,  reduce = 0%
2013-07-16 10:46:30,823 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.87 sec
2013-07-16 10:46:31,833 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.87 sec
2013-07-16 10:46:32,844 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.87 sec
2013-07-16 10:46:33,856 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.87 sec
2013-07-16 10:46:34,865 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.87 sec
2013-07-16 10:46:35,873 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.87 sec
2013-07-16 10:46:36,884 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 0.87 sec
MapReduce Total cumulative CPU time: 870 msec
Ended Job = job_201307131407_0005
Copying data to local directory /tmp/t_hive
Copying data to local directory /tmp/t_hive
7 Rows loaded to /tmp/t_hive
MapReduce Jobs Launched:
Job 0: Map: 1   Cumulative CPU: 0.87 sec   HDFS Read: 271 HDFS Write: 56 SUCCESS
Total MapReduce CPU Time Spent: 870 msec
OK
Time taken: 23.369 seconds

3. Inspect the exported file on the local operating system:
hive> ! cat /tmp/t_hive/000000_0;
1623
611213
41231
17213
71231
11234
11234

Again, each row's columns are separated by the default \001 (Ctrl-A) delimiter, which is invisible, so the values appear run together.

VI. Querying with HiveQL
Note: the examples below omit the map/reduce job-progress output.
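
To silence that output in your own runs, the CLI's silent mode helps (a convenience, not required for these examples):

~ bin/hive -S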
1. Basic queries: sorting, column aliases, nested subqueries

hive> FROM (
        SELECT b, c AS c2 FROM t_hive
      ) t
      SELECT t.b, t.c2
      WHERE b > 2
      LIMIT 2;
12 13
21 3
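
The heading also mentions sorting; a minimal ORDER BY example (not in the original article) would be:

hive> SELECT a, b, c FROM t_hive ORDER BY a DESC LIMIT 2;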
2. Join queries: JOIN

hive> SELECT t1.a, t1.b, t2.a, t2.b
      FROM t_hive t1 JOIN t_hive2 t2 ON t1.a = t2.a
      WHERE t1.c > 10;

1 12 1 12
11 2 11 2
41 2 41 2
61 12 61 12
71 2 71 2
3. Aggregate query 1: count, avg

hive> SELECT count(*), avg(a) FROM t_hive;
7 31.142857142857142
4. Aggregate query 2: count, distinct

hive> SELECT count(DISTINCT b) FROM t_hive;
3
5. Aggregate query 3: GROUP BY, HAVING

GROUP BY:

hive> SELECT avg(a), b, sum(c) FROM t_hive GROUP BY b, c;
16.0 2 3
56.0 2 62
11.0 2 34
61.0 12 13
1.0 12 34
17.0 21 3

HAVING:

hive> SELECT avg(a), b, sum(c) FROM t_hive GROUP BY b, c HAVING sum(c) > 30;
56.0 2 62
11.0 2 34
1.0 12 34
VII. Hive Views
Hive views work the same way as views in a relational database; we again use t_hive as the example.

hive> CREATE VIEW v_hive AS SELECT a,b FROM t_hive where c>30;
hive> select * from v_hive;
41 2
71 2
1 12
11 2
Drop the view:

hive> DROP VIEW IF EXISTS v_hive;
OK
Time taken: 0.495 seconds
VIII. Hive Partitioned Tables
Partitioning is a basic database concept, and with small data volumes it is often unnecessary. Hive, however, is OLAP data-warehouse software built for very large data volumes, so partitioned tables matter a great deal in that setting.
Below we define a new table structure, t_hft.
1. Create the data

~ sudo vim /home/cos/demo/t_hft_20130627.csv
000001,092023,9.76
000002,091947,8.99
000004,092002,9.79
000005,091514,2.2
000001,092008,9.70
000001,092059,9.45

~ vi /home/cos/demo/t_hft_20130628.csv
000001,092023,9.76
000002,091947,8.99
000004,092002,9.79
000005,091514,2.2
000001,092008,9.70
000001,092059,9.45
2. Create the table

DROP TABLE IF EXISTS t_hft;
CREATE TABLE t_hft(
SecurityID STRING,
tradeTime STRING,
PreClosePx DOUBLE
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
3. Create the partitioned table
Based on the business requirements, the partition design is by trading day and stock ID; the DDL below partitions by day only (a sketch that also partitions by stock ID follows it).

DROP TABLE IF EXISTS t_hft;
CREATE TABLE t_hft(
SecurityID STRING,
tradeTime STRING,
PreClosePx DOUBLE
) PARTITIONED BY (tradeDate INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
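
If you also wanted the stock ID as a partition key, a two-level layout is possible (a sketch, not the design used below; SecurityID moves out of the column list into the partition spec, and very high-cardinality partition keys can strain the metastore):

DROP TABLE IF EXISTS t_hft_by_id;
CREATE TABLE t_hft_by_id(
tradeTime STRING,
PreClosePx DOUBLE
) PARTITIONED BY (tradeDate INT, SecurityID STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';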
4. Load the data

20130627

hive> LOAD DATA LOCAL INPATH '/home/cos/demo/t_hft_20130627.csv' OVERWRITE INTO TABLE t_hft PARTITION (tradeDate=20130627);
Copying data from file:/home/cos/demo/t_hft_20130627.csv
Copying file: file:/home/cos/demo/t_hft_20130627.csv
Loading data to table default.t_hft partition (tradedate=20130627)

20130628

hive> LOAD DATA LOCAL INPATH '/home/cos/demo/t_hft_20130628.csv' OVERWRITE INTO TABLE t_hft PARTITION (tradeDate=20130628);
Copying data from file:/home/cos/demo/t_hft_20130628.csv
Copying file: file:/home/cos/demo/t_hft_20130628.csv
Loading data to table default.t_hft partition (tradedate=20130628)

5. View the partitions

hive> SHOW PARTITIONS t_hft;
tradedate=20130627
tradedate=20130628
Time taken: 0.082 seconds
6. Query the data

hive> select * from t_hft where securityid='000001';
000001 092023 9.76 20130627
000001 092008 9.7 20130627
000001 092059 9.45 20130627
000001 092023 9.76 20130628
000001 092008 9.7 20130628
000001 092059 9.45 20130628

hive> select * from t_hft where tradedate=20130627 and PreClosePx<9;
000002 091947 8.99 20130627
000005 091514 2.2 20130627
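
Because tradedate is the partition column, the second query scans only the 20130627 partition. You can confirm the pruning with EXPLAIN (a quick check, not part of the original walkthrough):

hive> EXPLAIN select * from t_hft where tradedate=20130627 and PreClosePx<9;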
That completes the basic use of Hive; these are the everyday operations. In a later post I will cover HiveQL optimization and Hive operations.

Problems you may encounter
1. hive --service hiveserver &
Exception in thread "main" java.lang.ClassNotFoundException: org.apache.hadoop.hive.service.HiveServer
The original HiveServer was removed from Hive releases after 1.0, so in Hive 1.2 start the Thrift service with hive --service hiveserver2 instead.

  2. Here is the original table schema:
    hive> desc t_hive;
    OK
    a int
    b int
    c int
    Time taken: 3.201 seconds, Fetched: 3 row(s)

Trying to add a new column then fails:
hive> ALTER TABLE t_hive ADD COLUMNS (new_col String);
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Unable to alter table.
For direct MetaStore DB connections, we don't support retries at the client level.

Error content in hive.log:
2017-05-19 14:53:41,036 ERROR [main]: ql.Driver (SessionState.java:printError(960)) - FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Unable to alter table. For direct MetaStore DB connections, we don't support retries at the client level.
2017-05-19 14:53:41,036 INFO [main]: metadata.Hive (Hive.java:logDumpPhase(3294)) - Dumping metastore api call timing information for : execution phase
2017-05-19 14:53:41,037 INFO [main]: metadata.Hive (Hive.java:dumpAndClearMetaCallTiming(3285)) - Total time spent in this metastore function was greater than 1000ms : getTable_(String, String, )=2108
2017-05-19 14:53:41,037 INFO [main]: log.PerfLogger (PerfLogger.java:PerfLogEnd(148)) -
2017-05-19 14:53:41,037 INFO [main]: log.PerfLogger (PerfLogger.java:PerfLogBegin(121)) -
2017-05-19 14:53:41,037 INFO [main]: log.PerfLogger (PerfLogger.java:PerfLogEnd(148)) -
2017-05-19 14:53:41,037 INFO [main]: log.PerfLogger (PerfLogger.java:PerfLogBegin(121)) -
2017-05-19 14:53:41,037 INFO [main]: log.PerfLogger (PerfLogger.java:PerfLogEnd(148)) -

Solution: swap in a different version of the MySQL JDBC driver, replacing the old jar in Hive's lib directory. I switched to version 5.1.41 and it worked.
