Hive Basics

Source: Internet. Published by: c语言程序的基本模块. Editor: 程序博客网. Date: 2024/06/03 21:43

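Ways to start Hive:

1. CLI: run the hive command, or hive --service cli, to enter the Hive CLI.

2. Remote service: run hive --service hiveserver2 & to start the remote service in the background, so that JDBC or Thrift clients can query the Hive data warehouse.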

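Hive data types

All data types are listed in the Hive wiki on the official site.

Hive data types fall broadly into two groups: primitive types and complex types. Only the commonly used ones are covered here.

1. Primitive types.

1) Numeric types

Integer: tinyint/smallint/int/integer/bigint
Floating-point and decimal: float/double/decimal

2) Date/time types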

timestamp/date/interval

3) String types

string/varchar/char

4) Other types

boolean/binary

2. Complex types.

arrays: ARRAY<data_type> (Note: negative values and non-constant expressions are allowed as of Hive 0.14.)
maps: MAP<primitive_type, data_type> (Note: negative values and non-constant expressions are allowed as of Hive 0.14.)
structs: STRUCT<col_name : data_type [COMMENT col_comment], ...>
union: UNIONTYPE<data_type, data_type, ...> (Note: Only available starting with Hive 0.7.0.)

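Hive tables

Tables in Hive can be divided into internal (managed) tables, external tables, partitioned tables and bucketed tables.

Internal tables:

Hive manages the data of an internal table together with its definition: dropping the table also deletes its data. An ordinary create statement produces an internal table, for example create table test (id int);

External tables:

The data is independent of the table definition: dropping the table leaves the data in place.

An external table points at data that already exists in HDFS. Its metadata is organized the same way as an internal table's, but the actual data is stored quite differently. Creating an external table and attaching its data happen in a single step: the data is not moved into the warehouse directory, and the table merely records a link to the external location. Dropping an external table removes only that link. In this respect it is somewhat like a view.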

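Syntax:

create external table external_test (id int, name string)
row format delimited fields terminated by ','
location '/input';

Contents of /input/input.txt on HDFS:

1,Tom
2,Mary

select * from external_test;
OK
NULL    NULL
NULL    NULL
NULL    NULL
1       Tom
2       Mary
Time taken: 0.496 seconds, Fetched: 5 row(s)

Hive reads every file under the table's location, so the NULL rows presumably come from other lines or files in /input that do not match the table's schema.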

Dropping the table does not affect the data:

hive> drop table external_test;
OK
Time taken: 0.11 seconds

The data is still there.

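Partitioned tables:

A Hive select query normally scans the whole table, which wastes a lot of time on unnecessary work. Often only part of the table is of interest, and that is what partitioned tables are for.

Partitioning a table makes it easy to query only part of the data.

Syntax:

create table testPartition (id int)
partitioned by (name string)
row format delimited fields terminated by ',';

Describe the table:

hive> desc testPartition;
OK
id                      int
name                    string

# Partition Information
# col_name              data_type               comment

name                    string
Time taken: 0.089 seconds, Fetched: 7 row(s)

The table has two columns, and name is used as the partition column.

That alone is not enough: we also need to define the concrete values that act as partitions, for example:

alter table testPartition add partition (name='Mary');
alter table testPartition add partition (name='Tom');

Here the names Mary and Tom are used as partitions.

In HDFS, each partition now has its own directory (for example name=Mary), so a query on one partition value only has to read the files in that directory, which is what makes partial queries fast.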
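To make the storage layout concrete, here is a minimal Java sketch of how a partition value maps to a directory in HDFS. It assumes the default warehouse directory /user/hive/warehouse and that Hive stores the table under its lowercased name; the class and method names (PartitionPathSketch, partitionPath) are hypothetical and exist only for illustration.

```java
final class PartitionPathSketch {
    // Builds the directory that would hold the files of one partition.
    // warehouseDir is an assumption: /user/hive/warehouse is only the default setting.
    static String partitionPath(String warehouseDir, String table, String partCol, String partValue) {
        String tableDir = table.toLowerCase(java.util.Locale.ROOT); // Hive lowercases table names
        return warehouseDir + "/" + tableDir + "/" + partCol + "=" + partValue;
    }

    public static void main(String[] args) {
        String warehouse = "/user/hive/warehouse"; // assumed default location
        System.out.println(partitionPath(warehouse, "testPartition", "name", "Mary"));
        System.out.println(partitionPath(warehouse, "testPartition", "name", "Tom"));
    }
}
```

Under these assumptions, a query such as select * from testPartition where name='Mary' only needs to read the files under the name=Mary directory.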

Partitioning by name as above is of course a poor choice in practice; it is only a demo.

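Bucketed tables:

Hive computes a hash of the bucketing column and then takes that hash modulo the number of buckets, for example mod 10. After the modulo step, each bucket ends up with roughly the same amount of data.

In other words, the data is divided automatically, and the division is driven by the hash.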
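The hash-then-modulo step can be sketched in a few lines of Java. This is only an illustration: String.hashCode() stands in for Hive's own hash function, which may differ between Hive versions, and the names BucketSketch and bucketFor are hypothetical.

```java
final class BucketSketch {
    // Maps a key to a bucket index in the range [0, numBuckets).
    // String.hashCode() is a stand-in for Hive's hash function (an assumption).
    static int bucketFor(String key, int numBuckets) {
        // Clear the sign bit first so the modulo result is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numBuckets;
    }

    public static void main(String[] args) {
        for (String name : new String[] {"Tom", "Mary", "King"}) {
            System.out.println(name + " -> bucket " + bucketFor(name, 5));
        }
    }
}
```

The same key always lands in the same bucket, and with a reasonably uniform hash the rows spread evenly across the buckets.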

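Advantage:

When Hadoop runs the map tasks, each map task takes roughly the same amount of time, so the load is balanced.

Disadvantage:

It does little to speed up business-logic queries that filter with where.

Syntax:

create table testBucket (id int, name string)
clustered by (name) into 5 buckets;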

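Besides tables, we can also create views, just as in a traditional database.

Syntax:

create view testView as
select * from external_test
union
select * from testBucket;

The view combines the results of the two tables. Views can simplify our queries.

We now know how to create tables and views in Hive.

But tables and views need data, so how do we import it?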

Importing data

Method 1:

Use a traditional insert into statement to insert data.

Syntax:

insert into table test2 (id, name) values(3, "King");

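However, not every Hive version supports insert into ... values: it is available only as of Hive 0.14, and older versions reject it. Each insert also generates a MapReduce job, so inserting this way is slow.

Method 2:

Use the load statement to import data.

Syntax: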

LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]

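LOCAL means the file is read from the local file system; without LOCAL, the data is loaded from HDFS.

OVERWRITE controls whether the table's existing data is replaced.

Note, however, that the loaded file must be in a format the table can read.

That is, besides the column values themselves, the delimiter between columns must also match.

If the file uses ',' as its delimiter, the table must be created with that delimiter specified.

Use

row format delimited fields terminated by ',';

to set the table's field delimiter to ','.

By default, the field delimiter is Ctrl-A (\001), not a comma or a tab.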
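Why the delimiter matters can be shown with a small Java sketch that imitates how a line of text is mapped onto the columns (id int, name string). It is a simplified imitation, not Hive's actual SerDe code, and the names DelimitedRowSketch and parse are hypothetical.

```java
final class DelimitedRowSketch {
    // Splits one text line on the table's delimiter and maps it to (id, name).
    // A value that does not fit its column type becomes null instead of raising an error.
    static Object[] parse(String line, String delimiter) {
        String[] fields = line.split(java.util.regex.Pattern.quote(delimiter), -1);
        Integer id = null;
        try {
            id = Integer.valueOf(fields[0]);
        } catch (NumberFormatException e) {
            // Not an integer: the column is read as NULL.
        }
        // A missing second field is also read as NULL.
        String name = fields.length > 1 ? fields[1] : null;
        return new Object[] {id, name};
    }

    public static void main(String[] args) {
        // The delimiter matches the file: prints [1, Tom]
        System.out.println(java.util.Arrays.toString(parse("1,Tom", ",")));
        // Tab-separated file read with ',' as delimiter: prints [null, null]
        System.out.println(java.util.Arrays.toString(parse("1\tTom", ",")));
    }
}
```

With the wrong delimiter the whole line lands in the first column, fails to parse as an integer, and the row shows up as NULL values rather than as a load error.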

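Importing data with load is very fast, because Hive only moves or copies the file into the table's directory without parsing it.

Most other Hive queries, such as where and having filters, order by sorting, group by grouping, aggregate functions like count() and avg(), self joins and outer joins, are much the same as in SQL, so they are not covered here.

What deserves a mention is Hive's conditional expression:

case...when...

For example:

select *,
  case name
    when 'King' then concat(name, " hi")
    else "what"
  end
from test2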
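For readers more familiar with Java, the case expression above behaves roughly like the following method. This is only an analogy for the branching logic; the names CaseWhenSketch and greeting are hypothetical.

```java
final class CaseWhenSketch {
    // Mirrors: case name when 'King' then concat(name, " hi") else "what" end
    static String greeting(String name) {
        // A null name never equals "King", so it falls through to the else branch.
        return "King".equals(name) ? name + " hi" : "what";
    }

    public static void main(String[] args) {
        System.out.println(greeting("King")); // prints "King hi"
        System.out.println(greeting("Tom"));  // prints "what"
    }
}
```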

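Hive as a remote service

Hive can be accessed from JDBC or Thrift clients.

Connecting to Hive over JDBC:

To connect, the remote service must first be started on the Hive side.

hive --service hiveserver2

In later versions, hiveserver2 replaced the original hiveserver.

Create a Maven project.

Add the following dependencies: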

<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>2.6.4</version>
</dependency>
<dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-exec</artifactId>
    <version>1.1.0</version>
    <exclusions>
        <exclusion>
            <artifactId>pentaho-aggdesigner-algorithm</artifactId>
            <groupId>org.pentaho</groupId>
        </exclusion>
    </exclusions>
</dependency>
<dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-jdbc</artifactId>
    <version>1.2.0</version>
</dependency>

Query example:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

/**
 * Created by haoye on 17-7-18.
 */
public class JDBC {
    private static final String DRIVER_NAME = "org.apache.hive.jdbc.HiveDriver"; // JDBC driver class
    private static final String URL = "jdbc:hive2://127.0.0.1:10000/default";   // Hive address + database name

    public static void main(String[] args) {
        String tableName = "test2"; // Hive table name
        String sql = "select * from " + tableName;
        // try-with-resources closes the ResultSet, Statement and Connection in the right order
        try (Connection conn = getConn();
             Statement stmt = conn.createStatement()) {
            System.out.println("Running:" + sql);
            try (ResultSet res = stmt.executeQuery(sql)) {
                System.out.println("Result of select * query:");
                while (res.next()) {
                    System.out.println(res.getInt(1) + "\t" + res.getString(2));
                }
            }
        } catch (ClassNotFoundException | SQLException e) {
            e.printStackTrace();
            System.exit(1);
        }
    }

    private static Connection getConn() throws ClassNotFoundException, SQLException {
        Class.forName(DRIVER_NAME);
        return DriverManager.getConnection(URL, "", ""); // empty user name and password
    }
}

Hive is used exactly like any other JDBC data source; only the driver differs.
Running the program gives the result:

Running:select * from test2
Result of select * query:
3       King
0       null
0       null

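Finally, let us look at Hive's user-defined functions.

Hive ships with many built-in functions, such as count() and concat(), and we can also define our own functions for more complex tasks.

A Hive user-defined function only needs a class that extends UDF and implements an evaluate method.

For example: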

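import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class ConcatString extends UDF {
    public Text evaluate(Text a, Text b) {
        if (a == null || b == null) return null; // NULL in, NULL out
        return new Text(a.toString() + "------" + b.toString());
    }
}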

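The parameters of evaluate must be types that Hive can map to its column types, for example Hadoop Writable types such as Text.
Once the class is implemented, package it with Maven.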

mvn package

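After packaging, the jar can be found under target/ in the project directory.

Then enter the Hive CLI.

hive> add jar <your-project-dir>/target/<your-jar>.jar;

Create a temporary user-defined function: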

create temporary function myconcat as 'ConcatString';

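Now the user-defined function can be called.

hive> select myconcat('hello', 'world');
OK
hello------world
Time taken: 0.207 seconds, Fetched: 1 row(s)

A function created this way is temporary: it is gone in the next session and is not permanent.

What if we want a permanent user-defined function?

First, upload the jar to HDFS.

hadoop fs -put thriftTest-1.0-SNAPSHOT.jar 'hdfs:///input/'

Then enter the Hive CLI and create the permanent function.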

create function myconcat2 as 'ConcatString' using jar 'hdfs:///input/thriftTest-1.0-SNAPSHOT.jar';

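Once created, it can be called.

You can verify that it is still callable after exiting and starting a new session.

Dropping a permanent user-defined function: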

drop function myconcat2;

Dropping a temporary user-defined function:

drop temporary function myconcat;
