Hive Notes

Source: Internet | Published by: 知乎rss订阅地址 | Editor: 程序博客网 | Time: 2024/05/30 04:09

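Hive was originally developed at Facebook.

• Hive is a data warehouse infrastructure built on Hadoop and a data mining tool (it uses MapReduce to mine data stored on HDFS). It provides a set of tools for extract-transform-load (ETL) work, and a mechanism for storing, querying, and analyzing large-scale data held in Hadoop. Hive defines a simple SQL-like query language, called QL, that lets users familiar with SQL query the data. The language also lets developers familiar with MapReduce plug in custom mappers and reducers for complex analysis that the built-in mappers and reducers cannot handle.

• Hive as a data warehouse: it stores big data, runs analysis and computation over it, and follows a write-once, read-many pattern. Drawback: it does not support real-time updates or deletes.

• Hive is a SQL parsing engine: it translates SQL statements into M/R jobs and runs them on Hadoop.

• A Hive table is really an HDFS directory (the table's data corresponds to files in HDFS), with one directory per table name. For a partitioned table, each partition value is a subdirectory. The data can be used directly in an M/R job.

• HDFS is used for storage and MapReduce for computation. Metadata (descriptions of tables, databases, partitions, and so on) is stored in MySQL, while the actual data being computed on lives in HDFS.

• The most common way to work with Hive is the CLI (the shell command line), because JDBC/ODBC support for concurrency and connection pooling is not very good.

• Hive data is stored in HDFS and most queries are executed by MapReduce (a plain select * query, such as select * from table, does not generate a MapReduce job).

• Hive automatically converts query statements into MapReduce jobs (a batch computation model), except for select * from table. Latency is very high, so Hive cannot serve online real-time applications. With large data volumes, response times range from tens of minutes to several hours.

• Hive's default metastore database is Derby. Its drawback is that it supports only one client connection; to get multiple connections you have to start Hive from different directories, because Hive creates a metastore_db directory in whichever directory it is run from. If Hive is run from the bin directory, a metastore_db folder is created there to hold Hive's metadata. In real development, MySQL is used as the metastore database.

(Storage path: /apache-hive-1.2.1-bin/bin/metastore_db/seg0)

• Hive operations: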

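hive> show tables; // list the tables in Hive

hive> show databases; // list the databases in Hive (the default database is used unless you switch)

hive> create table student (id int, name string); // create the student table

hive> load data local inpath '/root/student.txt' into table student; // load local data into the Hive table

hive> select count(*) from student; // count the rows in the table; this is converted into a MapReduce job

hive> create table teacher (id bigint, name string) row format delimited fields terminated by '\t'; // create a teacher table whose fields within a row are separated by '\t'

hive> select * from teacher order by id desc; // return the results sorted by id in descending order

hive> select * from student limit 2;

hive> create database itcast; // create the database itcast

hive> use itcast; // switch the current database to itcast

hive> select sum(id) from student; // compute the sum of id

hive> create external table ext_student (id int, name string) row format delimited fields terminated by '\t' location '/data'; // create an external table ext_student that points at the HDFS directory /data. From then on, files can be uploaded straight to HDFS and analyzed with Hive right away. (With a partitioned table the data is not discovered automatically, because the table declares a partition column.)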
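The teacher and ext_student tables above expect text files with one record per line and fields separated by a tab. As a rough illustration, the sketch below produces text in that layout. The function name `to_delimited` and the sample rows are made up for this example and are not part of Hive.

```python
def to_delimited(rows, delimiter="\t"):
    """Serialize rows as text: one record per line, fields joined by delimiter.

    Hypothetical helper for illustration. It rejects values that contain the
    delimiter or a newline, since a plain delimited text table would split
    such a value into extra fields or rows.
    """
    lines = []
    for row in rows:
        fields = [str(value) for value in row]
        for field in fields:
            if delimiter in field or "\n" in field:
                raise ValueError(f"field {field!r} contains a separator")
        lines.append(delimiter.join(fields))
    return "\n".join(lines) + "\n"


# Made-up sample rows matching teacher (id bigint, name string).
sample = to_delimited([(1, "zhang"), (2, "li")])
print(sample, end="")
```

Writing the returned text to a local file gives something that `load data local inpath` can load into a table declared with `fields terminated by '\t'`. The student table above was created without a row format clause, so it uses Hive's default field delimiter (the \001 control character) rather than a tab.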

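Installing a MySQL metastore for Hive

1) Install MySQL

rpm -qa | grep mysql // list the installed mysql packages

rpm -e mysql-libs-5.1.66-2.el6_3.i686 --nodeps // remove the package, ignoring dependencies

Note that the MySQL root password is empty after a fresh install. To set the password for the first time:

Press Enter, then type your previous password; if the previous password was empty, just press Enter.

2) Edit the Hive configuration files

[root@thdp1 conf]# vi hive-env.sh.template // (optional: startup parameters; left at the defaults here)

[root@thdp1 conf]# mv hive-default.xml.template hive-site.xml // rename the configuration file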

Delete everything except the configuration element, then add the following inside it:

<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://thdp2:3306/hive?createDatabaseIfNotExist=true</value>
<description>JDBC connect string for a JDBC metastore</description>
</property>

<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
<description>Driver class name for a JDBC metastore</description>
</property>

<property>
<name>javax.jdo.option.ConnectionUserName</name>
<description>username to use against metastore database</description>
</property>

<property>
<name>javax.jdo.option.ConnectionPassword</name>
<description>password to use against metastore database</description>
</property>

The ConnectionUserName and ConnectionPassword properties each also need a <value> element holding the MySQL user name and password that Hive connects with.
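A hand-edited hive-site.xml is easy to get wrong, for example an unclosed tag or a property with no value. The sketch below is one possible sanity check, assuming the file has a `configuration` root with `property` children that each hold a `name` and a `value`. The helper names `read_properties` and `missing_values` are invented for this example. The embedded XML is a shortened stand-in for the real file.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for hive-site.xml; user name and password are left out
# on purpose so the check below has something to report.
HIVE_SITE = """<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://thdp2:3306/hive?createDatabaseIfNotExist=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
</configuration>"""

REQUIRED = [
    "javax.jdo.option.ConnectionURL",
    "javax.jdo.option.ConnectionDriverName",
    "javax.jdo.option.ConnectionUserName",
    "javax.jdo.option.ConnectionPassword",
]


def read_properties(xml_text):
    """Parse the XML (raises ParseError if malformed) into a name -> value dict."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        if name is not None:
            props[name] = prop.findtext("value")
    return props


def missing_values(props):
    """Return the required property names that are absent or have no value."""
    return [name for name in REQUIRED if not props.get(name)]


props = read_properties(HIVE_SITE)
print(missing_values(props))
```

To check a real file, read its contents and pass the text to `read_properties` in place of `HIVE_SITE`.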

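3) Copy mysql-connector-java.jar, the driver that Hive uses to connect to MySQL, into Hive's lib directory:

[root@thdp1 lib]# cp mysql-connector-java.jar /usr/apache-hive-1.2.1-bin/lib/

4) Change the MySQL access privileges on thdp2 (so that thdp1 can access MySQL on thdp2 remotely)

# (Run the statements below. *.* means all tables in all databases; % means any IP address or host may connect [to allow only thdp1, replace % with thdp1])

GRANT ALL PRIVILEGES ON *.* TO 'root'@'%' IDENTIFIED BY '123' WITH GRANT OPTION;

FLUSH PRIVILEGES;

5) Start Hive on thdp1.

hive> create table people (id int, name string);

On thdp2, inspect the Hive metadata that MySQL now stores: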

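Hive partitioned tables

A partitioned table stores its data in separate locations according to some rule, commonly by province, year, and so on. This makes queries and analysis more efficient.

(Each folder under the table directory represents one partition.)

hive> create external table beauties (id bigint, name string, size double) partitioned by (nation string) row format delimited fields terminated by '\t' location '/beauty'; // create the partitioned table beauties, partitioned by the nation column

hive> load data local inpath '/root/b.c' into table beauties partition (nation='China'); // load local data into the partition nation=China

hive> alter table beauties add partition (nation='Japan') location "/beauty/nation=Japan"; // register a folder that already exists on HDFS as a partition in the metastore (MySQL)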
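The partition examples above follow a column=value directory naming pattern under the table location, as in /beauty/nation=Japan. The sketch below models only that naming pattern. The function names `partition_dir` and `parse_partition` are invented for illustration, and the sketch ignores the escaping Hive applies to special characters in partition values.

```python
import posixpath


def partition_dir(table_location, spec):
    """Build the directory for a partition as <location>/<col>=<value>/...

    spec is a list of (column, value) pairs in partition-column order.
    Hypothetical helper: no escaping of special characters is attempted.
    """
    parts = [f"{column}={value}" for column, value in spec]
    return posixpath.join(table_location, *parts)


def parse_partition(path, table_location):
    """Recover the (column, value) pairs from a partition directory path."""
    relative = posixpath.relpath(path, table_location)
    return [tuple(segment.split("=", 1)) for segment in relative.split("/")]


print(partition_dir("/beauty", [("nation", "China")]))
print(parse_partition("/beauty/nation=Japan", "/beauty"))
```

A table partitioned by more than one column nests the directories in column order, which is why `spec` is an ordered list rather than a dict.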

Join: joining against a subquery

hive> select t.account, u.name, t.income, t.expenses, t.surplus from user_info u join (select account, sum(income) as income, sum(expenses) as expenses, sum(income-expenses) as surplus from trade_detail group by account) t on u.account=t.account;

hive> create table trade_detail (id bigint, account string, income double, expenses double, times string) row format delimited fields terminated by '\t';
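To see what the join against the aggregating subquery returns, the same query shape can be tried on a few rows with Python's built-in sqlite3 module. This is only a small-scale stand-in for Hive, not Hive itself. The user_info columns (account, name) are inferred from the query, and the sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table user_info (account text, name text);
    create table trade_detail (
        id integer, account text, income real, expenses real, times text
    );
    -- Made-up sample data.
    insert into user_info values ('a@x', 'Alice');
    insert into user_info values ('b@x', 'Bob');
    insert into trade_detail values (1, 'a@x', 100.0, 40.0, '2024-01-01');
    insert into trade_detail values (2, 'a@x', 50.0, 10.0, '2024-01-02');
    insert into trade_detail values (3, 'b@x', 20.0, 30.0, '2024-01-01');
""")

# The subquery t collapses trade_detail to one row per account; the outer
# join then attaches each account's name from user_info.
rows = conn.execute("""
    select t.account, u.name, t.income, t.expenses, t.surplus
    from user_info u
    join (select account,
                 sum(income) as income,
                 sum(expenses) as expenses,
                 sum(income - expenses) as surplus
          from trade_detail
          group by account) t
      on u.account = t.account
    order by t.account
""").fetchall()

for row in rows:
    print(row)
```

Aggregating inside the subquery first means the join sees one row per account instead of one row per trade.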

Importing MySQL tables into Hive with sqoop

Import data from MySQL directly into Hive:

sqoop import --connect jdbc:mysql://192.168.1.10:3306/itcast --username root --password 123 --table trade_detail --hive-import --hive-overwrite --hive-table trade_detail --fields-terminated-by '\t'

sqoop import --connect jdbc:mysql://192.168.1.10:3306/itcast --username root --password 123 --table user_info --hive-import --hive-overwrite --hive-table user_info --fields-terminated-by '\t'

Note that sqoop invokes the hive command when importing into Hive and fails if it cannot find it, so add Hive to the environment variables beforehand.
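The two sqoop commands above differ only in the table name, so when several tables need importing the argument list can be generated. The sketch below only assembles and prints the commands, using the same flags as above; it does not run sqoop. The helper names `sqoop_import_args` and `hive_on_path` are invented for this example.

```python
import shlex
import shutil


def sqoop_import_args(table, jdbc_url, username, password, delimiter=r"\t"):
    """Assemble the sqoop import arguments for one table (hypothetical helper).

    The Hive table gets the same name as the MySQL table. The default
    delimiter is the two characters backslash and t, which is what the shell
    passes to sqoop for a single-quoted '\\t'.
    """
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--username", username,
        "--password", password,
        "--table", table,
        "--hive-import",
        "--hive-overwrite",
        "--hive-table", table,
        "--fields-terminated-by", delimiter,
    ]


def hive_on_path():
    """Return True if a hive executable can be found on PATH."""
    return shutil.which("hive") is not None


if not hive_on_path():
    print("warning: hive is not on PATH; sqoop --hive-import would fail")

for table in ("trade_detail", "user_info"):
    args = sqoop_import_args(
        table, "jdbc:mysql://192.168.1.10:3306/itcast", "root", "123"
    )
    print(shlex.join(args))
```

Building a list of arguments rather than one string keeps the quoting of the delimiter explicit; the list could be handed to `subprocess.run` on a machine where sqoop is installed.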

Hive UDF programming

0. Extend the org.apache.hadoop.hive.ql.exec.UDF class and implement an evaluate method.

Steps for calling a custom function:

1. Add the jar (run this inside the Hive command line)

hive> add jar /root/NUDF.jar;

2. Create a temporary function

hive> create temporary function getNation as 'cn.itcast.hive.udf.NationUDF';

3. Call it

hive> select id, name, getNation(nation) from beauty;

4. Save the query results to HDFS

hive> create table result row format delimited fields terminated by '\t' as select * from beauty order by id desc;

hive> select id, getAreaName(id) as name from tel_rec;

hive> create table result row format delimited fields terminated by '\t' as select id, getNation(nation) from beauties;
