Hive I


Data warehouse

OLAP        //online analytical processing
            //large data volumes, low concurrency, high latency
hive        //built on hadoop MR; spares you writing MR jobs by hand
sql         //SQL-like statements

Database

mysql, OLTP       //online transaction processing
acid              //concurrency anomalies: dirty read | non-repeatable read | phantom read
                  //isolation levels: read uncommitted | read committed | repeatable read | serializable
                  //low latency
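For reference, the isolation level is chosen per session; a minimal MySQL sketch (standard SQL, nothing Hive-specific):

    -- pick one of the four levels listed above for the current session
    set session transaction isolation level repeatable read;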

hive

Hive, as in "beehive". A data warehouse: software for reading, writing, and managing large datasets residing in distributed storage using SQL. Accessed from the command line or through a driver. Originally developed by facebook to process structured data.
[is not]
    a relational database
    OLTP
    real-time queries or row-level updates
[features]
    stores the schema file in a database while processing data that lives on hdfs
    OLAP
    provides SQL-like statements, known as HiveQL / HQL
    scalable and extensible

hive concepts

1.user interface
    shell, web ui
2.metastore
    the metadata store: holds database, table, column, and type information.
    kept in a relational database.
3.HiveQL process engine
    the query-processing engine; replaces writing traditional MR jobs.
4.execution engine
    processes the query and produces the result via MR.
5.hdfs | hbase
    where the data is stored.

How hive interacts with hadoop

client sends a query to the hive driver (select id,name,age from customers ;)
    -->the compiler compiles the query
            -->consulting the metastore
    -->the execution engine
            -->hadoop MR

hive installation

1.download
2.untar
    $>tar -xzvf apache-hive-2.1.0-bin.tar.gz -C /soft/
    $>cd /soft ; ln -s apache-hive-2.1.0-bin hive
3.environment variables
    $>sudo nano /etc/profile
    ...
    HIVE_HOME=/soft/hive
    PATH=$PATH:$HIVE_HOME/bin
4.source /etc/profile
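A quick sanity check that the binaries and environment are wired up, assuming the layout above:

    $>source /etc/profile
    $>hive --version        //should report Hive 2.1.0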

Configure the hive metastore in mysql instead of the default derby database.

1.copy the mysql driver into hive's lib directory
    cp mysql-connector-java-5.1.17.jar /soft/hive/lib
2.create the hive configuration file
    [/soft/hive/conf/hive-site.xml]
        <configuration>
            <property>
                <name>javax.jdo.option.ConnectionPassword</name>
                <value>root</value>
            </property>
            <property>
                <name>javax.jdo.option.ConnectionUserName</name>
                <value>root</value>
            </property>
            <property>
                <name>javax.jdo.option.ConnectionURL</name>
                <value>jdbc:mysql://192.168.231.1:3306/hive6</value>
            </property>
            <property>
                <name>javax.jdo.option.ConnectionDriverName</name>
                <value>com.mysql.jdbc.Driver</value>
            </property>
            <property>
                <name>hive.server2.authentication</name>
                <value>NONE</value>
            </property>
            <property>
                <name>hive.server2.enable.doAs</name>
                <value>false</value>
            </property>
        </configuration>

Initialize the hive metastore schema in mysql


$>bin/schematool -dbType mysql -initSchema
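To verify the schema landed in mysql, schematool can report what it finds:

$>bin/schematool -dbType mysql -info       //prints the connection URL and schema version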

hive components

database                //hdfs directory, /user/hive/warehouse/${dbname}.db
table                   //hdfs directory, /user/hive/warehouse/${dbname}.db/${table}
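You can see this layout directly on hdfs; a sketch, assuming a database mydb with a customers table already exists:

    $>hdfs dfs -ls /user/hive/warehouse                      //one .db directory per database
    $>hdfs dfs -ls /user/hive/warehouse/mydb.db/customers    //the table's data files live here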

Entering the shell

//enter the hive command line
$>hive
//hive operations
$hive>create database mydb ;                                   //create a database
$hive>use mydb ;
$hive>create table customers(id int,name string , age int) ;   //create a table
$hive>show databases ;
$hive>show tables ;
$hive>desc customers ;

Using the hiveserver2 service for client/server access to hive

1.start hiveserver2
    $>hive/bin/hiveserver2 &
2.check whether hiveserver2 has finished starting
    $>netstat -anop | grep 10000
3.start the beeline command line
    $>hive/bin/beeline
    $beeline>!help
    $beeline>!connect jdbc:hive2://localhost:10000/mydb
    $beeline>show databases ;

Connecting to hiveserver2 remotely over jdbc to operate on the hive warehouse

1.create a module and add the dependencies
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.it18zhang</groupId>
    <artifactId>my-hive-day01</artifactId>
    <version>1.0-SNAPSHOT</version>
    <dependencies>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>4.11</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hive</groupId>
            <artifactId>hive-jdbc</artifactId>
            <version>2.1.0</version>
        </dependency>
    </dependencies>
</project>
2.write the test code
package com.hive.test;

import org.junit.Test;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

/**
 * hive jdbc test
 */
public class TestHive {
    @Test
    public void getConn() throws Exception {
        String driver = "org.apache.hive.jdbc.HiveDriver";
        Class.forName(driver);
        Connection conn = DriverManager.getConnection("jdbc:hive2://s100:10000/mydb");
        Statement st = conn.createStatement();
        ResultSet rs = st.executeQuery("select id,name,age from customers order by id desc");
        while (rs.next()) {
            int id = rs.getInt(1);
            String name = rs.getString(2);
            int age = rs.getInt(3);
            System.out.println(id + "," + name + "," + age);
        }
        rs.close();
        st.close();
        conn.close();
    }
}
3.run it and watch the MR job in the web ui.

hive data types

1.basic types
    TINYINT         //byte      1
    SMALLINT        //short     2
    INT             //int       4
    BIGINT          //long      8
    FLOAT           //float     4
    DOUBLE          //double    8
    DECIMAL         //precision and scale, e.g. decimal(10,3)
    BINARY          //binary
    BOOLEAN         //TRUE | FALSE
    STRING          //string
    CHAR            //fixed length     <= 255
    VARCHAR         //variable length  <= 65535
    DATE            //date        '2013-01-01'
    TIMESTAMP       //timestamp   '2013-01-01 12:00:01.345'
2.complex types
    ARRAY           //array         ['apple','orange','mango']
    MAP             //map           {1:"apple",2:"orange"}
    STRUCT          //struct        {1,"apple"}
    NAMED STRUCT    //named struct  {"apple":"gala","weightkg":1}
    UNION           //union         {2:["apple","orange"]}
3.prepare the sample data

Michael|Montreal,Toronto|Male,30|DB:80|Product:Developer Lead
Will|Montreal|Male,35|Perl:85|Product:Lead,Test:Lead
Shelley|New York|Female,27|Python:80|Test:Lead,COE:Architect
Lucy|Vancouver|Female,57|Sales:89,HR:94|Sales:Lead

4.create a table with complex columns
    CREATE TABLE employee
    (
    name string,
    arr ARRAY<string>,
    struc STRUCT<sex:string,age:int>,
    map1 MAP<string,int>,
    map2 MAP<string,ARRAY<string>>
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '|'
    COLLECTION ITEMS TERMINATED BY ','
    MAP KEYS TERMINATED BY ':'
    LINES TERMINATED BY '\n'
    STORED AS TEXTFILE;
5.load local data into hive
    $hive>load data local inpath '/home/centos/employee.txt' into table employee ;
    //
    $hive>select * from employee ;
    //rename the table
    $hive>alter table employee rename to emp ;
    $hive>select name,arr from emp ;
    //query a specific array element
    $hive>select name,arr[0] from emp ;
    //query a struct field
    $hive>select name,struc.sex from emp ;
    //query a specific map key
    $hive>select name,map1["DB"] from emp ;
    //query a specific key of map2
    $hive>select map2["Product"][0] from emp ;
6.word count with hive
    6.1)create the table
        CREATE TABLE docs
        (
        line string
        )
        ROW FORMAT DELIMITED
        FIELDS TERMINATED BY ',' ;
    6.2)load data into docs
        load data local inpath '/home/centos/1.txt' into table docs ;
    6.3)word count
        //word count
        select t.word,count(*) from (select explode(split(line," ")) word from docs) t group by t.word ;
        //topN query in descending order
        select t.word,count(*) cnt from (select explode(split(line," ")) word from docs) t group by t.word order by cnt desc limit 3 ;
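To experiment with the complex types without loading a file, values can be built inline with hive's standard constructor functions (array, map, named_struct); a minimal sketch:

    //build complex values inline, no table required
    select array('apple','orange','mango')        arr,
           map('DB',80,'Perl',85)                 m,
           named_struct('sex','Male','age',30)    st ;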

Full CREATE TABLE syntax

CREATE TABLE employee
(
name string,
arr ARRAY<string>,
struc STRUCT<sex:string,age:int>,
map1 MAP<string,int>,
map2 MAP<string,ARRAY<string>>
)
ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '|'                //default \001
    COLLECTION ITEMS TERMINATED BY ','      //default \002
    MAP KEYS TERMINATED BY ':'              //default \003
    LINES TERMINATED BY '\n'                //line terminator
    STORED AS TEXTFILE;                     //storage format

explode

UDTF, a table-generating function. Can be applied to an array or a map.
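The usual idiom for keeping other columns alongside the exploded values is lateral view; a sketch against the emp table from above:

    //one output row per array element; name is repeated for each
    select name, city from emp lateral view explode(arr) t as city ;
    //exploding a map yields a key column and a value column
    select name, skill, score from emp lateral view explode(map1) t as skill, score ;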

Type conversion functions

cast('124' as int);         //convert to an integer
concat('12','12','23','35') //string concatenation function

DDL

    create database if not exists xx ;    //create a database; a location, properties, and a comment can also be given, as below:
    CREATE DATABASE IF NOT EXISTS myhivebook
    COMMENT 'hive database demo'
    LOCATION '/hdfs/directory'
    WITH DBPROPERTIES ('creator'='dayongd','date'='2015-01-01') ;
    //show database info
    desc database myhivebook ;
    //drop the database
    drop database myhivebook ;

load data

    //1.load from the local filesystem: the file is copied
        load data local inpath '/x/x/x/x/1.txt' into table docs ;
    //2.load from hdfs: the file is moved
        load data inpath '/x/x/x/x/1.txt' into table docs ;
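The copy-vs-move difference is easy to observe from the hdfs side; a sketch, assuming the docs table from earlier and a /user/centos staging path:

    $>hdfs dfs -put /home/centos/1.txt /user/centos/
    $hive>load data inpath '/user/centos/1.txt' into table docs ;
    $>hdfs dfs -ls /user/centos/1.txt      //gone: the file was moved into the warehouse directory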

Internal (managed) and external tables

1.internal table
    a managed table; dropping it deletes both the data and the table structure. Tables are internal by default.
2.external table
    dropping it deletes only the table structure; the data remains.
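An external table just adds the EXTERNAL keyword and, typically, a LOCATION; a sketch, with a hypothetical /data/custs directory:

    CREATE EXTERNAL TABLE ext_custs
    (
    id int,
    name string,
    age int
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
    LOCATION '/data/custs';         //dropping ext_custs leaves these files untouched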

CTAS: CREATE TABLE AS SELECT

    //the data comes with it
    create table emp2 as select name from emp ;

Creating a table with LIKE: only the structure, no data

create table emp3 like emp ;

truncate

    Quickly empties a table.
    truncate table emp2 ;

Partitioned tables: a partition is a directory.

//create a partitioned table
CREATE TABLE custs
    (
    id int,
    name string ,
    age int
    )
    PARTITIONED BY (prov string, city string)
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
    LINES TERMINATED BY '\n'
    STORED AS TEXTFILE;
    //add partitions
    alter table custs add PARTITION (prov='hebei', city='baoding') PARTITION (prov='hebei', city='shijiazhuang');
    //show partitions
    SHOW PARTITIONS custs;
    //drop a partition
    alter table custs drop partition (prov='hebei',city='shijiazhuang') ;
    //load data into a partition
    load data local inpath '/home/centos/cust.txt' into table custs partition(prov='hebei',city='baoding') ;
    //query by partition
    select * from custs where city = 'baoding' ;
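Partitions can also be populated dynamically from a query instead of being named one by one; a sketch, assuming a hypothetical staging table custs_stage that carries prov and city as ordinary columns:

    set hive.exec.dynamic.partition=true;
    set hive.exec.dynamic.partition.mode=nonstrict;     //allow all partition columns to be dynamic
    insert into table custs partition(prov, city)
    select id, name, age, prov, city from custs_stage;  //partition columns go last in the select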

Bucketed tables: a bucket is a file.

//create a bucketed table
    CREATE TABLE buck
    (
    id int,
    name string ,
    age int
    )
    CLUSTERED BY (id) INTO 3 BUCKETS
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
    LINES TERMINATED BY '\n'
    STORED AS TEXTFILE;
//Choosing the number of buckets: avoid buckets whose data volume is too large or too small; roughly twice the number of data blocks is a reasonable rule of thumb.
//set the number of reduce tasks and enforce bucketing
set mapred.reduce.tasks = 2;
set hive.enforce.bucketing = true;
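A bucketed table is populated through a query (load data does not hash rows into buckets); a sketch using the custs table from above as the source:

    //rows are hashed on id into 3 files
    insert into table buck select id, name, age from custs;
    //bucketing enables sampling by bucket
    select * from buck tablesample(bucket 1 out of 3 on id);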

Joins

//create the orders table.
    CREATE TABLE orders
    (
    id int,
    orderno string ,
    price float,
    cid int
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
    LINES TERMINATED BY '\n'
    STORED AS TEXTFILE;
    //cartesian product
    select a.*,b.* from custs a , orders b ;
    //inner join
    select a.*,b.* from custs a inner join orders b on a.id = b.cid ;
    //left outer join
    select a.*,b.* from custs a left outer join orders b on a.id = b.cid ;
    //left semi join: returns only the left table's columns, for rows that have a match
    select a.* from custs a left semi join orders b on a.id = b.cid ;
    //right outer join
    select a.*,b.* from custs a right outer join orders b on a.id = b.cid ;
    //full outer join
    select a.*,b.* from custs a full outer join orders b on a.id = b.cid ;

Map-side joins

//the join hint /*+ MAPJOIN(...) */ names the table to load into memory
    SELECT /*+ MAPJOIN(custs) */ c.* FROM custs c CROSS JOIN orders o WHERE c.id <> o.cid;
    //or let hive convert to a map-side join automatically
    set hive.auto.convert.join=true;
    SELECT c.* FROM custs c CROSS JOIN orders o WHERE c.id <> o.cid;

union queries

//union (removes duplicates; use union all to keep them)
    select id,name from custs union select id,orderno from orders ;
//deduplicate
    select distinct cid from orders ;