Hadoop: The Definitive Guide, Third Edition: Supplementary Notes on Chapter 12, A First Look at Hive

Source: Internet. Published by: linux系统环境变量设置. Editor: 程序博客网. Time: 2024/05/21 09:53

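Introduction to Hive

Hive is a data warehouse infrastructure built on top of Hadoop. It provides a set of tools for data extraction, transformation, and loading (ETL), and a mechanism for storing, querying, and analyzing large-scale data held in Hadoop. Hive defines a simple SQL-like query language called HQL, which lets users who know SQL query the data. The language also lets developers who know MapReduce plug in custom mappers and reducers to handle complex analysis that the built-in mappers and reducers cannot.

Hive has no dedicated data format of its own. It works well on top of Thrift, lets you control delimiters, and also lets users specify their own data formats.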

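Main features:

Storage works by mapping structured data files onto database tables. Hive provides an SQL-like language that covers most everyday SQL query functionality (HQL is not a complete implementation of standard SQL). It translates SQL statements into MapReduce jobs, which makes it well suited to statistical analysis in a data warehouse.

Shortcomings:

Data is stored and read row by row (for example, as SequenceFile). This is inefficient: to read a single column of a table, all of the data has to be read first and the column extracted afterwards. It also takes up more disk space.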

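Current optimization:

Because of these shortcomings, a different layout has been proposed (presented by Dr. 查礼): change the storage structure of the distributed data processing system from record-oriented to column-oriented, which reduces the number of disk accesses and improves query performance. Since values of the same attribute share a data type and have similar characteristics, compressing the data attribute by attribute also achieves a higher compression ratio and saves more storage space.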
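To make the row-versus-column trade-off concrete, here is a minimal Python sketch. It does not use Hive or any real file format; the three-column sample table and the helper names are made up for illustration. It simply counts how many bytes have to be scanned to read one column under each layout.

```python
# Toy model: bytes scanned to read ONE column under two layouts.
# The sample rows and function names are illustrative assumptions,
# not part of Hive or of any real file format.

rows = [
    ("2008-06-08", "US", "page_view"),
    ("2008-06-08", "CN", "click"),
    ("2008-06-09", "US", "click"),
]


def bytes_scanned_row_layout(rows):
    """Row layout: every field of every record is read, whichever column is wanted."""
    return sum(len(field.encode("utf-8")) for record in rows for field in record)


def bytes_scanned_column_layout(rows, column_index):
    """Column layout: only the values of the requested column are read."""
    return sum(len(record[column_index].encode("utf-8")) for record in rows)


row_cost = bytes_scanned_row_layout(rows)
col_cost = bytes_scanned_column_layout(rows, 1)  # column 1 is the country code
print(row_cost, col_cost)
```

On this tiny table the column layout touches only the country codes, while the row layout has to walk through every field. Real formats add headers, indexes, and compression, so the numbers are only indicative.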

Installing and Configuring Hive

Requirements

Java 1.6
Hadoop

Download a Hive release from the official site and unpack it in the directory of your choice.

$ tar -xzvf hive-x.y.z.tar.gz

Set the system environment variables (on Unix, in the /etc/profile file):

export HIVE_HOME=.../hive-x.y.z
export PATH=$PATH:$HIVE_HOME/bin

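Working with Hive

I. Two modes:

1. Non-interactive:

Create a table:

Load data:

Query:

2. Interactive:

Create a table:

Load data:

Query 1:

Query 2:

II. Syntax walkthrough (the English parts come from the Hive wiki page LanguageManual DML):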

1. Loading files into tables

Hive does not do any transformation while loading data into tables. Load operations are currently pure copy/move operations that move data files into the locations corresponding to Hive tables.

Syntax:

LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]

Explanation:

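The load operation is a pure copy/move: it moves the data files into the location that corresponds to the Hive table.

filepath can be:
- a relative path, for example: project/data1
- an absolute path, for example: /user/hive/project/data1
- a full URI with a scheme, for example: hdfs://namenode:9000/user/hive/project/data1
The load target can be a table or a partition. If the table is partitioned, you must name a specific partition by giving a value for every partition column.
filepath can refer to a file (in which case Hive moves the file into the table's directory) or to a directory (in which case Hive moves all files in that directory into the table's directory).
If LOCAL is specified:
- the load command looks for filepath in the local filesystem. A relative path is interpreted relative to the user's current working directory. The user can also give a full URI for a local file, for example: file:///user/hive/project/data1
- the load command copies the files addressed by filepath to the target filesystem. The target filesystem is determined by the table's location attribute. The copied data files are then moved to the table's data location.
If LOCAL is not specified and filepath is a full URI, Hive uses that URI directly. Otherwise:
- if no scheme or authority is given, Hive uses the scheme and authority from the Hadoop configuration variable fs.default.name, which specifies the NameNode URI.
- if the path is not absolute, Hive interprets it relative to /user/<username>.
- Hive moves the files addressed by filepath into the path of the table (or partition).
If the OVERWRITE keyword is used, the existing contents (if any) of the target table (or partition) are deleted, and the files or directory contents that filepath refers to are then added to the table or partition.
If the target table (or partition) already has a file whose name collides with a filename in filepath, the existing file is replaced by the new file.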
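The resolution rules above can be summarized in a small Python model. This is only a sketch of the rules as listed here, not Hive's actual code: resolve_load_path and all of its parameters are hypothetical names, and the user name, working directory, and default filesystem in the usage lines are made-up values.

```python
# Illustrative model of how LOAD DATA interprets `filepath`, following the
# rules listed above. This is NOT Hive's code; resolve_load_path and all of
# its parameters are hypothetical names chosen for this sketch.
import posixpath
from urllib.parse import urlparse


def resolve_load_path(filepath, local, default_fs, username, cwd):
    """Return the full URI that the rules above would give for `filepath`."""
    if urlparse(filepath).scheme:
        # A full URI (hdfs://... or file:///...) is used as-is.
        return filepath
    if local:
        # LOCAL: relative paths are resolved against the current directory.
        return "file://" + posixpath.normpath(posixpath.join(cwd, filepath))
    # No LOCAL: relative paths are resolved against /user/<username>; scheme
    # and authority come from the Hadoop configuration (fs.default.name).
    absolute = posixpath.normpath(posixpath.join("/user", username, filepath))
    return default_fs.rstrip("/") + absolute


# Assumed environment for the demonstration.
fs = "hdfs://namenode:9000"
print(resolve_load_path("project/data1", False, fs, "alice", "/home/alice"))
print(resolve_load_path("./examples/files/kv1.txt", True, fs, "alice", "/home/alice"))
```

An absolute path such as /user/hive/project/data1 keeps its path and only gains the default scheme and authority, because posixpath.join discards earlier components when a later one is absolute.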

Notes

filepath cannot contain subdirectories.
If the keyword LOCAL is not given, filepath must refer to files within the same filesystem as the table's (or partition's) location.
Hive does some minimal checks to make sure that the files being loaded match the target table. Currently it checks that if the table is stored in sequencefile format, the files being loaded are also sequencefiles, and vice versa.
Please read CompressedStorage if your datafile is compressed

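Examples:

Load data from the local filesystem into a table partition, appending to what is already there:

LOAD DATA LOCAL INPATH '/tmp/pv_2008-06-08_us.txt' INTO TABLE c02 PARTITION (date='2008-06-08', country='US');

Load data from the local filesystem into a table, appending records:

LOAD DATA LOCAL INPATH './examples/files/kv1.txt' INTO TABLE pokes;

Load data from HDFS into a table partition, overwriting what is already there:

LOAD DATA INPATH '/user/admin/SqlldrDat/CnClickstat/20101101/18/clickstat_gp_fatdt0/0' OVERWRITE INTO TABLE c02_clickstat_fatdt1 PARTITION (dt='20101201');

Field delimiters in source text data
To load a file that uses a custom delimiter, declare the delimiter in the CREATE TABLE statement for the target table. After that, a plain LOAD DATA into the table is enough.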

2、Inserting data into Hive Tables from queries

Query Results can be inserted into tables by using the insert clause

Syntax:

Standard syntax:
INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]] select_statement1 FROM from_statement;
INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1 FROM from_statement;
 
Hive extension (multiple inserts):
FROM from_statement
INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]] select_statement1
[INSERT OVERWRITE TABLE tablename2 [PARTITION ... [IF NOT EXISTS]] select_statement2] 
[INSERT INTO TABLE tablename2 [PARTITION ...] select_statement2] ...;
FROM from_statement
INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1
[INSERT INTO TABLE tablename2 [PARTITION ...] select_statement2] 
[INSERT OVERWRITE TABLE tablename2 [PARTITION ... [IF NOT EXISTS]] select_statement2] ...;
 
Hive extension (dynamic partition inserts):
INSERT OVERWRITE TABLE tablename PARTITION (partcol1[=val1], partcol2[=val2] ...) select_statement FROM from_statement;
INSERT INTO TABLE tablename PARTITION (partcol1[=val1], partcol2[=val2] ...) select_statement FROM from_statement;

Explanation:

INSERT OVERWRITE will overwrite any existing data in the table or partition
- unless IF NOT EXISTS is provided for a partition (as of Hive 0.9.0)
INSERT INTO will append to the table or partition, keeping the existing data intact. (Note: INSERT INTO syntax is only available starting in version 0.8.)

Inserts can be done to a table or a partition. If the table is partitioned, then one must specify a specific partition of the table by specifying values for all of the partitioning columns.
Multiple insert clauses (also known as Multi Table Insert) can be specified in the same query
The output of each of the select statements is written to the chosen table (or partition). Currently the OVERWRITE keyword is mandatory and implies that the contents of the chosen table or partition are replaced with the output of corresponding select statement.
The output format and serialization class is determined by the table's metadata (as specified via DDL commands on the table)
In the dynamic partition inserts, users can give partial partition specifications, which means just specifying the list of partition column names in the PARTITION clause. The column values are optional. If a partition column value is given, we call this a static partition, otherwise it is a dynamic partition. Each dynamic partition column has a corresponding input column from the select statement. This means that the dynamic partition creation is determined by the value of the input column. The dynamic partition columns must be specified last among the columns in the SELECT statement and in the same order in which they appear in the PARTITION() clause.
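The static/dynamic distinction and the "dynamic columns come last" rule can be illustrated with a short Python model. It is a sketch of the rules in the paragraph above, not Hive code: split_partition_spec and bind_dynamic_columns are hypothetical helpers, and the column names in the usage lines are invented.

```python
# Toy model of the static/dynamic partition rules described above.
# split_partition_spec and bind_dynamic_columns are hypothetical helpers,
# not Hive APIs.


def split_partition_spec(spec):
    """spec: ordered list of (column, value-or-None) from the PARTITION clause."""
    static = {col: val for col, val in spec if val is not None}
    dynamic = [col for col, val in spec if val is None]
    return static, dynamic


def bind_dynamic_columns(select_columns, dynamic):
    """Dynamic partition columns take the LAST select columns, in clause order."""
    if len(dynamic) > len(select_columns):
        raise ValueError("not enough select columns for the dynamic partitions")
    if not dynamic:
        return {}
    tail = select_columns[-len(dynamic):]
    return dict(zip(dynamic, tail))


# Models PARTITION (dt='2010-12-01', country) with three select columns.
static, dynamic = split_partition_spec([("dt", "2010-12-01"), ("country", None)])
binding = bind_dynamic_columns(["a.foo", "a.bar", "a.country_code"], dynamic)
print(static, dynamic, binding)
```

Here dt is a static partition because a value is given, while country is dynamic and is fed by the last select column.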

Notes

Multi Table Inserts minimize the number of data scans required. Hive can insert data into multiple tables by scanning the input data just once (and applying different query operators) to the input data.
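The idea of feeding several outputs from one scan can be modeled in a few lines of Python. This is only an analogy for the note above, not how Hive executes the query: multi_insert, the sample rows, and the two predicates are made-up names and data.

```python
# Toy model of a multi-table insert: one pass over the input feeds
# several outputs. The rows, predicates, and names are made-up examples.


def multi_insert(rows, targets):
    """targets maps an output name to a predicate; the input is read once."""
    outputs = {name: [] for name in targets}
    rows_read = 0
    for row in rows:  # the single scan of the input
        rows_read += 1
        for name, predicate in targets.items():
            if predicate(row):
                outputs[name].append(row)
    return outputs, rows_read


outputs, rows_read = multi_insert(
    [1, 5, 10, 20],
    {"small": lambda r: r < 10, "large": lambda r: r >= 10},
)
print(outputs, rows_read)
```

Running two separate inserts would read the four input rows twice. The combined form reads them once and applies both filters as it goes.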

In an INSERT, the FROM clause can come either after the SELECT clause or before the INSERT clause. The following two statements are equivalent:

  hive> FROM invites a INSERT OVERWRITE TABLE events SELECT a.bar, count(*) WHERE a.foo > 0 GROUP BY a.bar;
  hive> INSERT OVERWRITE TABLE events SELECT a.bar, count(*) FROM invites a WHERE a.foo > 0 GROUP BY a.bar;

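Hive has no SQL statement that inserts a single row directly, but there is a workaround.
Suppose table B holds at least one row and we want to insert one row into table A (int, string). This can be done as follows (INSERT INTO requires Hive 0.8 or later):

FROM B INSERT INTO TABLE A SELECT 1, 'abc' LIMIT 1;

Hive does not appear to support inserting a single record as such: with INSERT OVERWRITE, every statement replaces the contents of the whole table. I suspect this comes from the storage layer. Hive stores its data in HDFS, where inserting one record would mean scanning the whole table, so replacing the whole table is quicker.
In the Hive versions covered here, row-by-row INSERT statements and UPDATE are not supported. Data gets into a pre-created table through LOAD. Once loaded, the data cannot be modified: you either drop the whole table, or create a new table and load new data into it.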

3. Writing data into the filesystem from queries

Syntax:

Standard syntax:
INSERT OVERWRITE [LOCAL] DIRECTORY directory1 SELECT ... FROM ...
 
Hive extension (multiple inserts):
FROM from_statement
INSERT OVERWRITE [LOCAL] DIRECTORY directory1 select_statement1
[INSERT OVERWRITE [LOCAL] DIRECTORY directory2 select_statement2] ...

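Explanation:

The directory can be a full URI. If no scheme or authority is specified, Hive uses the scheme and authority from the Hadoop configuration variable fs.default.name, which specifies the NameNode URI.
If the LOCAL keyword is used, Hive writes the data to a directory on the local filesystem.
Data written to the filesystem is serialized as text, with columns separated by ^A (Ctrl-A) and rows separated by newlines. Any column that is not of a primitive type is serialized to JSON.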
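As a rough illustration of that output format, the following Python sketch parses one line of such a file. It assumes the default Ctrl-A (\x01) column separator and JSON for non-primitive columns. parse_line and the sample line are invented for this example, and the caller has to say which columns hold JSON, because the file itself carries no schema.

```python
# Sketch: parsing one line of INSERT OVERWRITE DIRECTORY output.
# Assumes Ctrl-A (\x01) column separators and JSON for non-primitive
# columns. parse_line and the sample line are illustrative, not part of Hive.
import json

FIELD_SEP = "\x01"  # Ctrl-A, usually displayed as ^A


def parse_line(line, json_columns=()):
    """Split one output line into columns; decode the listed column indexes as JSON."""
    fields = line.rstrip("\n").split(FIELD_SEP)
    return [
        json.loads(value) if index in json_columns else value
        for index, value in enumerate(fields)
    ]


sample = "42\x01abc\x01[1,2,3]\n"
parsed = parse_line(sample, json_columns={2})
print(parsed)
```

Primitive columns come back as plain strings, so any conversion to int or float is left to the caller.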

Notes

INSERT OVERWRITE statements to directories, local directories, and tables (or partitions) can all be used together within the same query.
INSERT OVERWRITE statements to HDFS filesystem directories are the best way to extract large amounts of data from Hive. Hive can write to HDFS directories in parallel from within a map-reduce job.
The directory is, as you would expect, OVERWRITten; in other words, if the specified path exists, it is clobbered and replaced with the output.

Common Problems and Solutions

1. Running Hive with a MySQL metastore.

FAILED: Error in metadata: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask

Turn on debug logging:

#hive -hiveconf hive.root.logger=DEBUG,console

The debug output shows that the hive MySQL user was being denied access. Grant it privileges in the MySQL client:

mysql> GRANT ALL PRIVILEGES ON *.* TO 'hive'@'localhost'; FLUSH PRIVILEGES;

If that still does not help, edit hive-site.xml:

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://192.168.1.101:3306/hive?createDatabaseIfNotExist=true</value>
  <description>JDBC connect string for a JDBC metastore</description>
</property>
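When chasing this kind of problem it can help to confirm which connection URL the configuration actually contains. The Python sketch below reads a property from hive-site.xml-style XML using only the standard library. The embedded XML is a trimmed stand-in for a real file, wrapped in a configuration root element as Hadoop-style config files are, and get_property is a hypothetical helper.

```python
# Sketch: reading a property from hive-site.xml-style XML with the standard
# library. The XML string below is a trimmed stand-in for a real file, and
# get_property is a hypothetical helper.
import xml.etree.ElementTree as ET

HIVE_SITE = """
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://192.168.1.101:3306/hive?createDatabaseIfNotExist=true</value>
  </property>
</configuration>
"""


def get_property(xml_text, name):
    """Return the <value> of the <property> whose <name> matches, or None."""
    root = ET.fromstring(xml_text.strip())
    for prop in root.findall("property"):
        if prop.findtext("name") == name:
            return prop.findtext("value")
    return None


print(get_property(HIVE_SITE, "javax.jdo.option.ConnectionURL"))
```

For a real installation, the text would be read from the hive-site.xml file in the Hive configuration directory instead of from an inline string.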
