A First Look at Hive

Source: Internet. Published: PHP-related books. Editor: 程序博客网. Time: 2024/05/09 15:32

Note: this article is translated from CWIKI.apache.org.

Apache Hive
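The Apache Hive™ data warehouse software makes it easier to read, write, and manage large datasets that reside in distributed storage, and lets those datasets be queried with SQL syntax.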

Built on top of Apache Hadoop™, Hive provides the following features:

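Tools for easy access to data via SQL, which make data warehousing tasks such as extract/transform/load (ETL), reporting, and data analysis possible.
A mechanism to impose structure on a variety of data formats.
Access to files stored either directly in Apache HDFS™ or in other data storage systems such as Apache HBase™.
Query execution via Apache Tez™, Apache Spark™, or MapReduce.
A procedural language with HPL-SQL.
Sub-second query retrieval via Hive LLAP, Apache YARN, and Apache Slider.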

Hive provides standard SQL functionality, including many of the later SQL:2003 and SQL:2011 features for analytics.

Hive's SQL can also be extended with user code through user-defined functions (UDFs), user-defined aggregates (UDAFs), and user-defined table functions (UDTFs).
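UDFs, UDAFs, and UDTFs are normally written against Hive's Java API. A lighter way to run user code inside a query is Hive's TRANSFORM clause, which streams each row to an external script as a tab-separated line on standard input and reads rows back from standard output. The sketch below illustrates that streaming approach, not a UDF in the sense described above. The column layout (user_id, url) and the function name are hypothetical.

```python
from urllib.parse import urlparse


def transform(lines):
    """Turn tab-separated (user_id, url) rows into (user_id, host) rows."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank input lines
        user_id, url = line.split("\t")
        # An input without a recognizable host yields an empty host column.
        host = urlparse(url).netloc.lower()
        yield f"{user_id}\t{host}"


# In a real streaming script the driver would read standard input:
#     import sys
#     for row in transform(sys.stdin):
#         print(row)
sample_input = ["42\thttp://Example.com/index.html\n", "7\tnot a url\n"]
sample_output = list(transform(sample_input))
```

A query could then call the script with something like SELECT TRANSFORM(user_id, url) USING 'python3 extract_host.py' AS (user_id, host) FROM page_views, where the script and table names are made up for illustration and the script would first have to be shipped to the cluster.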

There is no single "Hive format" in which data must be stored. Hive comes with built-in connectors for comma- and tab-separated values (CSV/TSV) text files, Apache Parquet™, Apache ORC™, and other formats.

Users can extend Hive with connectors for other formats. For details, see the File Formats and Hive SerDe sections of the Developer Guide.
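The idea that structure is imposed when data is read, rather than dictated by one storage format, can be sketched in plain Python. This is only an analogy for what Hive's format connectors do, not Hive code. The schema, the sample data, and the function name are invented for illustration.

```python
import csv
import io

# Hypothetical table schema: (column name, type applied at read time).
SCHEMA = [("id", int), ("name", str), ("score", float)]


def read_rows(text, delimiter):
    """Parse delimited text and apply SCHEMA to every record as it is read."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    for record in reader:
        yield {name: cast(value) for (name, cast), value in zip(SCHEMA, record)}


csv_text = "1,alice,9.5\n2,bob,7.0\n"
tsv_text = "1\talice\t9.5\n2\tbob\t7.0\n"

# The same schema produces identical typed rows from two file formats.
rows_from_csv = list(read_rows(csv_text, ","))
rows_from_tsv = list(read_rows(tsv_text, "\t"))
```

Only the delimiter changes between the two calls. The schema lives outside the files, which is roughly how one table definition can sit on top of differently formatted data.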

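Hive is not designed for online transaction processing (OLTP) workloads; it is best suited to traditional data warehousing tasks.
Hive is designed to maximize scalability (scaling out as more machines are added dynamically to the Hadoop cluster), performance, extensibility, fault tolerance, and loose coupling with its input formats.
Components of Hive include HCatalog and WebHCat.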

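HCatalog is a component of Hive. It is a table and storage management layer for Hadoop that lets users of different data processing tools, including Pig and MapReduce, read and write data on the grid more easily.
WebHCat provides a service for running Hadoop MapReduce (or YARN), Pig, and Hive jobs, and for performing Hive metadata operations, through an HTTP (REST-style) interface.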
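To make the REST-style interface concrete, here is a minimal sketch that only builds request URLs and sends nothing over the network. It assumes the commonly documented WebHCat defaults: port 50111, a /templeton/v1 path prefix, and a user.name query parameter. Check these against your own deployment. The host name, user name, and helper function are hypothetical.

```python
from urllib.parse import urlencode, urlunsplit


def webhcat_url(host, resource, user, port=50111, **params):
    """Build a WebHCat request URL (assumed defaults; nothing is sent)."""
    query = urlencode({"user.name": user, **params})
    path = f"/templeton/v1/{resource.lstrip('/')}"
    return urlunsplit(("http", f"{host}:{port}", path, query, ""))


# Hypothetical host and user; a GET on this URL would list tables in "default".
tables_url = webhcat_url(
    "hive.example.com", "ddl/database/default/table", user="analyst"
)
# A GET on the status resource is a simple liveness check.
status_url = webhcat_url("hive.example.com", "status", user="analyst")
```

Keeping URL construction in one helper means the host, port, and user can be changed in a single place before the URLs are passed to any HTTP client.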

The original text follows:
Apache Hive
The Apache Hive™ data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage and queried using SQL syntax.

Built on top of Apache Hadoop™, Hive provides the following features:

Tools to enable easy access to data via SQL, thus enabling data warehousing tasks such as extract/transform/load (ETL), reporting, and data analysis.
A mechanism to impose structure on a variety of data formats
Access to files stored either directly in Apache HDFS™ or in other data storage systems such as Apache HBase™
Query execution via Apache Tez™, Apache Spark™, or MapReduce
Procedural language with HPL-SQL
Sub-second query retrieval via Hive LLAP, Apache YARN and Apache Slider.

Hive provides standard SQL functionality, including many of the later SQL:2003 and SQL:2011 features for analytics.

Hive’s SQL can also be extended with user code via user defined functions (UDFs), user defined aggregates (UDAFs), and user defined table functions (UDTFs).

There is not a single “Hive format” in which data must be stored. Hive comes with built in connectors for comma and tab-separated values (CSV/TSV) text files, Apache Parquet™, Apache ORC™, and other formats.

Users can extend Hive with connectors for other formats. Please see File Formats and Hive SerDe in the Developer Guide for details.

Hive is not designed for online transaction processing (OLTP) workloads. It is best used for traditional data warehousing tasks.

Hive is designed to maximize scalability (scale out with more machines added dynamically to the Hadoop cluster), performance, extensibility, fault-tolerance, and loose-coupling with its input formats.

Components of Hive include HCatalog and WebHCat.

HCatalog is a component of Hive. It is a table and storage management layer for Hadoop that enables users with different data processing tools — including Pig and MapReduce — to more easily read and write data on the grid.
WebHCat provides a service that you can use to run Hadoop MapReduce (or YARN), Pig, Hive jobs or perform Hive metadata operations using an HTTP (REST style) interface.

0 0