HDPCD-Java Review Notes (19)


Hive


Apache Hive maintains metadata about its tables in a metastore. A Hive table consists of:

·         A schema stored in the metastore

·         Data stored on HDFS


HiveQL

Hive converts HiveQL commands into MapReduce jobs.

 

Hive and Pig

Pig is designed to move and restructure data, while Hive is built to analyze data. For most use cases:

Pig -- Is a good choice for ETL jobs, where unstructured data is reformatted so that it is easier to define a structure for it.

Hive -- Is a good choice when you want to query data that has a certain known structure.


Comparing Hive to SQL

Hive supports many familiar SQL datatypes and SQL semantics:

SQL Datatypes

·         INT
·         TINYINT / SMALLINT / BIGINT
·         BOOLEAN
·         FLOAT
·         DOUBLE
·         STRING
·         BINARY
·         TIMESTAMP
·         ARRAY, MAP, STRUCT, UNION
·         DECIMAL
·         CHAR
·         VARCHAR
·         DATE

SQL Semantics

·         SELECT, LOAD, INSERT from query
·         Expressions in WHERE and HAVING
·         GROUP BY, ORDER BY, SORT BY
·         CLUSTER BY, DISTRIBUTE BY
·         Sub-queries in the FROM clause
·         ROLLUP
·         CUBE
·         UNION
·         LEFT, RIGHT and FULL INNER/OUTER JOIN
·         CROSS JOIN, LEFT SEMI JOIN
·         Windowing functions (OVER, RANK, etc.)
·         Sub-queries for IN/NOT IN, HAVING
·         EXISTS / NOT EXISTS
·         INTERSECT, EXCEPT



Hive Architecture



Issuing Commands

A Hive query is submitted to HiveServer using the Hive CLI, a Web interface, or a Hive JDBC/ODBC client (see the JDBC sketch after these steps)

Hive Query Plan

The Hive query is compiled, optimized, and planned as a MapReduce job

MapReduce Job Executes

The corresponding MapReduce job is executed on the Hadoop cluster
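
As a concrete illustration of issuing commands, below is a minimal sketch (not from the original notes) of submitting a HiveQL query from Java through the HiveServer2 JDBC driver. The hostname, port, credentials, and query are assumed placeholders; the orders table matches the example used later in these notes.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcClient {

    public static void main(String[] args) throws Exception {
        // Register the HiveServer2 JDBC driver.
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // Connect to HiveServer2 (assumed host and default port 10000).
        Connection con = DriverManager.getConnection(
                "jdbc:hive2://hiveserver.example.com:10000/default", "hive", "");
        Statement stmt = con.createStatement();

        // Hive compiles and optimizes the HiveQL and, for non-trivial queries,
        // runs it as a MapReduce job on the cluster.
        ResultSet rs = stmt.executeQuery(
                "SELECT description, weight FROM orders LIMIT 10");
        while (rs.next()) {
            System.out.println(rs.getString(1) + "\t" + rs.getDouble(2));
        }

        rs.close();
        stmt.close();
        con.close();
    }
}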


HiveQL

Hive queries are written using the HiveQL language, a SQL-like scripting language that simplifies the creation of MapReduce jobs. With HiveQL, data analysts can focus on answering questions about the data, and let the Hive framework convert the HiveQL into a MapReduce job.


Hive User-Defined Functions

Hive has three different types of User-Defined Functions:

UDF

A single row is input, and a single row is output

UDAF (User-Defined Aggregate Function)

Multiple rows are input, and a single row is output

UDTF (User-Defined Table-generating Function)

A single row is input, and multiple rows (i.e. a table) are output


The Hive API contains parent classes for writing each type of User-Defined Function: the UDF class for UDF functions, the UDAF class for UDAF functions, and the GenericUDTF class for writing UDTF functions.


Writing a Hive UDF

A Hive UDF extends the org.apache.hadoop.hive.ql.exec.UDF class:

package hiveudfs;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.DoubleWritable;

public class ComputeShipping extends UDF {

    public static int originZip = 11344;
    public static double multiplier = 0.00045;

    // Reused output object, set on each call to evaluate().
    DoubleWritable shippingAmt = new DoubleWritable();

    public DoubleWritable evaluate(int zip, double weight) {
        long distance = Math.abs(originZip - zip);
        double amt = (distance * multiplier) + weight;
        shippingAmt.set(amt);
        return shippingAmt;
    }
}

Invoking a Hive UDF

To invoke a UDF from within a Hive script:

·         Register the JAR file that contains the UDF class, and

·         Define an alias for the function using the CREATE TEMPORARY FUNCTION command.


ADD JAR /myapp/lib/myhiveudfs.jar;

CREATE TEMPORARY FUNCTION ComputeShipping
  AS 'hiveudfs.ComputeShipping';

FROM orders
   SELECT address,
          description,
          ComputeShipping(zip, weight);

Overview of GenericUDF

The GenericUDF class provides more features and benefits over UDF, including:

·         The arguments passed in to a GenericUDF can be complex types, including non-writable types like struct, map, and array.

·         The return value can also be a complex type.

·         A variable length of arguments can be passed in.

·         A GenericUDF can perform operations that a UDF cannot support.

·         Better performance, due to lazy evaluation and short-circuiting.


The GenericUDF class declares three abstract methods:

ObjectInspector initialize(ObjectInspector[] arguments)

The ObjectInspector (OI) instances represent the arguments for the function. The initialize method is invoked once per GenericUDF instance and allows you to validate the arguments passed in.

Object evaluate(GenericUDF.DeferredObject[] arguments)

Similar to UDF, this method is passed the arguments and returns the result of the function call.

String getDisplayString(String[] children)

Returns a string that is displayed by the EXPLAIN command.


Example of a GenericUDF

import java.util.List;

import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.exec.UDFArgumentLengthException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.ListObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.StringObjectInspector;

class ComplexUDFExample extends GenericUDF {

    ListObjectInspector listOI;
    StringObjectInspector elementOI;

    public String getDisplayString(String[] arg0) {
        return "arrayContainsExample()";
    }

    public ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException {
        if (arguments.length != 2) {
            throw new UDFArgumentLengthException(
                "method takes 2 arguments: List<T>, T");
        }
        // Verify we received the right object types.
        ObjectInspector a = arguments[0];
        ObjectInspector b = arguments[1];
        if (!(a instanceof ListObjectInspector) || !(b instanceof StringObjectInspector)) {
            throw new UDFArgumentException(
                "first argument must be a list / array, second argument must be a string");
        }
        this.listOI = (ListObjectInspector) a;
        this.elementOI = (StringObjectInspector) b;
        // Verify that the list contains strings.
        if (!(listOI.getListElementObjectInspector() instanceof StringObjectInspector)) {
            throw new UDFArgumentException("first argument must be a list of strings");
        }
        // The return type of our function is a boolean.
        return PrimitiveObjectInspectorFactory.javaBooleanObjectInspector;
    }

    public Object evaluate(DeferredObject[] arguments) throws HiveException {
        // Get the list and string from the deferred objects using the object inspectors.
        List<String> list = (List<String>) this.listOI.getList(arguments[0].get());
        String arg = elementOI.getPrimitiveJavaObject(arguments[1].get());
        // Check for nulls.
        if (list == null || arg == null) {
            return null;
        }
        // See if our list contains the value we are looking for.
        for (String s : list) {
            if (arg.equals(s)) {
                return Boolean.TRUE;
            }
        }
        return Boolean.FALSE;
    }
}
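
The notes above show a UDF and a GenericUDF. For the third type, UDTF, here is a minimal hedged sketch of a table-generating function that extends the GenericUDTF class mentioned earlier; the class name SplitToRows and its comma-splitting behavior are illustrative assumptions, not part of the original notes.

package hiveudfs;

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.StringObjectInspector;

// Illustrative UDTF: one input row (a comma-delimited string) produces multiple output rows.
public class SplitToRows extends GenericUDTF {

    private StringObjectInspector stringOI;

    public StructObjectInspector initialize(ObjectInspector[] args) throws UDFArgumentException {
        if (args.length != 1 || !(args[0] instanceof StringObjectInspector)) {
            throw new UDFArgumentException("SplitToRows takes a single string argument");
        }
        stringOI = (StringObjectInspector) args[0];
        // Declare a single output column named "token" of type string.
        List<String> fieldNames = new ArrayList<String>();
        List<ObjectInspector> fieldOIs = new ArrayList<ObjectInspector>();
        fieldNames.add("token");
        fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
        return ObjectInspectorFactory.getStandardStructObjectInspector(fieldNames, fieldOIs);
    }

    public void process(Object[] record) throws HiveException {
        String value = stringOI.getPrimitiveJavaObject(record[0]);
        if (value == null) {
            return;
        }
        // forward() emits one output row per call.
        for (String token : value.split(",")) {
            forward(new Object[] { token });
        }
    }

    public void close() throws HiveException {
        // No state to flush.
    }
}

A UDTF is registered the same way as a UDF (ADD JAR plus CREATE TEMPORARY FUNCTION) and is typically invoked in a SELECT clause or via LATERAL VIEW.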

Overview of HCatalog

HCatalog, a metadata and table management system, helps enable schema on read in Hadoop. HCatalog has the following features:

·         Makes the Hive metastore available to users of other toolson Hadoop.

·         Provides connectors for MapReduce and Pig so that users of those tools can read data from and write data to Hive’s warehouse.

·         Allows users to share data and metadata across Hive, Pig, and MapReduce.

·         Provides a relational view of data within Hadoop through an SQL-like language (HiveQL).

·         Allows users to write their applications without being concerned how or where the data is stored.

·         Insulates users from schema and storage format changes.


HCatalog in the Ecosystem



HCatalog provides a table abstraction that hides details about the data, such as:

·         How the data is stored.

·         Where the data resides on the filesystem.

·         What format the data is in.

·         What the schema of the data is.



HCatInputFormat and HCatOutputFormat

For Java MapReduce applications, the HCatInputFormat and HCatOutputFormat classes can be used to read and write data using HCatalog schemas and data types.

Here is an example of a Mapper that uses HCatInputFormat:

public static class Map extends Mapper<WritableComparable, HCatRecord, Text, IntWritable> {

    String name;
    int age;
    double gpa;

    @Override
    protected void map(WritableComparable key, HCatRecord value, Context context)
            throws IOException, InterruptedException {
        // HCatRecord fields are accessed by position: name, age, gpa.
        name = (String) value.get(0);
        age = (Integer) value.get(1);
        gpa = (Double) value.get(2);
        context.write(new Text(name), new IntWritable(age));
    }
}
The Job configuration looks like:

String principalID = System.getProperty(HCatConstants.HCAT_METASTORE_PRINCIPAL);
if (principalID != null) {
    conf.set(HCatConstants.HCAT_METASTORE_PRINCIPAL, principalID);
}

Job job = Job.getInstance(conf, "SimpleRead");
HCatInputFormat.setInput(job, InputJobInfo.create(dbName, tableName, null));
job.setInputFormatClass(HCatInputFormat.class);
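
The notes stop at the read side. As a hedged sketch of the write side, the fragment below continues the job configuration above; the table name output_table is an assumption, and the HCatOutputFormat, OutputJobInfo, DefaultHCatRecord, and HCatSchema classes are those shipped in the org.apache.hive.hcatalog packages of recent HCatalog releases (older releases used the org.apache.hcatalog package, and exact method signatures can vary between versions).

// Continues the configuration above; "output_table" is an assumed table name.
job.setOutputKeyClass(WritableComparable.class);
job.setOutputValueClass(DefaultHCatRecord.class);

// Describe which Hive table the job writes to (null = no static partition values).
HCatOutputFormat.setOutput(job, OutputJobInfo.create(dbName, "output_table", null));

// Reuse the output table's schema from the metastore for the records the job emits.
HCatSchema schema = HCatOutputFormat.getTableSchema(job.getConfiguration());
HCatOutputFormat.setSchema(job, schema);
job.setOutputFormatClass(HCatOutputFormat.class);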