HDPCD-Java Review Notes (19)
Hive
Apache Hive maintains metadata about tables in a metastore. A Hive table consists of:
· A schema stored in the metastore
· Data stored on HDFS
HiveQL
Hive converts HiveQL commands into MapReduce jobs.
Hive and Pig
Pig is designed to move and restructure data, while Hive is built to analyze data. For most use cases:
Pig -- Is a good choice for ETL jobs, where unstructured data is reformatted so that it is easier to define a structure to it.
Hive -- Is a good choice when you want to query data that has a certain known structure.
Comparing Hive to SQL
| SQL Datatypes | SQL Semantics |
| --- | --- |
| INT | SELECT, LOAD, INSERT from query |
| TINYINT/SMALLINT/BIGINT | Expressions in WHERE and HAVING |
| BOOLEAN | GROUP BY, ORDER BY, SORT BY |
| FLOAT | CLUSTER BY, DISTRIBUTE BY |
| DOUBLE | Sub-queries in FROM clause |
| STRING | ROLLUP |
| BINARY | CUBE |
| TIMESTAMP | UNION |
| ARRAY, MAP, STRUCT, UNION | LEFT, RIGHT and FULL INNER/OUTER JOIN |
| DECIMAL | CROSS JOIN, LEFT SEMI JOIN |
| CHAR | Windowing functions (OVER, RANK, etc.) |
| VARCHAR | Sub-queries for IN/NOT IN, HAVING |
| DATE | EXISTS / NOT EXISTS |
|  | INTERSECT, EXCEPT |
Hive Architecture
Issuing Commands
Using the Hive CLI, a Web interface, or a Hive JDBC/ODBC client, a Hive query is submitted to the HiveServer.
Hive Query Plan
The Hive query is compiled, optimized, and planned as a MapReduce job.
MapReduce Job Executes
The corresponding MapReduce job is executed on the Hadoop cluster.
HiveQL
Hive queries are written using the HiveQL language, a SQL-like scripting language that simplifies the creation of MapReduce jobs. With HiveQL, data analysts can focus on answering questions about the data, and let the Hive framework convert the HiveQL into a MapReduce job.
Hive User-Defined Functions
Hive has three different types of User-Defined Functions:
UDF
A single row is input, and a single row is output
UDAF (User-Defined Aggregate Function)
Multiple rows are input, and a single row is output
UDTF (User-Defined Table-generating Function)
A single row is input, and multiple rows (i.e. a table) are output
The Hive API contains parent classes for writing each type of User-Defined Function: the UDF class for UDF functions, the UDAF class for UDAF functions, and the GenericUDTF class for writing UDTF functions.
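The three function types above differ only in input/output cardinality. As a plain-Java analogy (not the Hive API), the following sketch illustrates one-row-in/one-row-out, many-rows-in/one-row-out, and one-row-in/many-rows-out; the method names and data are made up for illustration:

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java analogy for the three Hive UDF shapes (not the Hive API).
public class UdfShapes {

    // UDF-like: a single row in, a single row out
    public static String toUpper(String row) {
        return row.toUpperCase();
    }

    // UDAF-like: multiple rows in, a single row out (an aggregate)
    public static int sum(List<Integer> rows) {
        int total = 0;
        for (int r : rows) {
            total += r;
        }
        return total;
    }

    // UDTF-like: a single row in, multiple rows out (e.g. exploding a CSV field)
    public static List<String> explode(String row) {
        return Arrays.asList(row.split(","));
    }
}
```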
Writing a Hive UDF
A Hive UDF extends the org.apache.hadoop.hive.ql.exec.UDF class:
```java
package hiveudfs;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.DoubleWritable;

public class ComputeShipping extends UDF {
    public static int originZip = 11344;
    public static double multiplier = 0.00045;
    DoubleWritable shippingAmt = new DoubleWritable();

    public DoubleWritable evaluate(int zip, double weight) {
        long distance = Math.abs(originZip - zip);
        double amt = (distance * multiplier) + weight;
        shippingAmt.set(amt);
        return shippingAmt;
    }
}
```
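The arithmetic inside evaluate can be checked without a Hive runtime. This standalone sketch repeats the same formula; the zip codes and weights used below are made-up values:

```java
// Standalone check of the ComputeShipping arithmetic, outside Hive.
public class ShippingCheck {
    static final int ORIGIN_ZIP = 11344;
    static final double MULTIPLIER = 0.00045;

    // Same formula as ComputeShipping.evaluate: distance-scaled cost plus weight
    public static double shipping(int zip, double weight) {
        long distance = Math.abs(ORIGIN_ZIP - zip);
        return (distance * MULTIPLIER) + weight;
    }
}
```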
Invoking a Hive UDF
To invoke a UDF from within a Hive script:
· Register the JAR file that contains the UDF class, and
· Define an alias for the function using the CREATE TEMPORARY FUNCTION command.
```sql
ADD JAR /myapp/lib/myhiveudfs.jar;

CREATE TEMPORARY FUNCTION ComputeShipping
  AS 'hiveudfs.ComputeShipping';

FROM orders
SELECT address,
       description,
       ComputeShipping(zip, weight);
```
Overview of GenericUDF
The GenericUDF class provides more features and benefits over UDF, including:
· The arguments passed in to a GenericUDF can be complex types, including non-writable types like struct, map and array.
· The return value can also be a complex type.
· A variable length of arguments can be passed in.
· A GenericUDF can perform operations that a UDF cannot support.
· Better performance, due to lazy evaluation and short-circuiting.
The GenericUDF class declares three abstract methods:
· ObjectInspector initialize(ObjectInspector[] arguments)
The ObjectInspector (OI) instances represent the arguments for the function. The initialize method is invoked once per GenericUDF instance and allows you to validate the arguments passed in.
· Object evaluate(GenericUDF.DeferredObject[] arguments)
Similar to UDF, this method gets passed the arguments and returns the result of the function call.
· String getDisplayString(String[] children)
Returns a string that gets displayed by the EXPLAIN command.
Example of a GenericUDF
```java
import java.util.List;

import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.exec.UDFArgumentLengthException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.ListObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.StringObjectInspector;

class ComplexUDFExample extends GenericUDF {
    ListObjectInspector listOI;
    StringObjectInspector elementOI;

    @Override
    public String getDisplayString(String[] arg0) {
        return "arrayContainsExample()";
    }

    @Override
    public ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException {
        if (arguments.length != 2) {
            throw new UDFArgumentLengthException("method takes 2 arguments: List<T>, T");
        }
        // Verify we received the right object types.
        ObjectInspector a = arguments[0];
        ObjectInspector b = arguments[1];
        if (!(a instanceof ListObjectInspector) || !(b instanceof StringObjectInspector)) {
            throw new UDFArgumentException(
                "first argument must be a list / array, second argument must be a string");
        }
        this.listOI = (ListObjectInspector) a;
        this.elementOI = (StringObjectInspector) b;

        // Verify that the list contains strings
        if (!(listOI.getListElementObjectInspector() instanceof StringObjectInspector)) {
            throw new UDFArgumentException("first argument must be a list of strings");
        }

        // The return type of our function is a boolean
        return PrimitiveObjectInspectorFactory.javaBooleanObjectInspector;
    }

    @Override
    public Object evaluate(DeferredObject[] arguments) throws HiveException {
        // Get the list and string from the deferred objects using the object inspectors
        List<String> list = (List<String>) this.listOI.getList(arguments[0].get());
        String arg = elementOI.getPrimitiveJavaObject(arguments[1].get());

        // Check for nulls
        if (list == null || arg == null) {
            return null;
        }

        // See if our list contains the value we need
        for (String s : list) {
            if (arg.equals(s)) {
                return Boolean.TRUE;
            }
        }
        return Boolean.FALSE;
    }
}
```
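The null-handling and membership logic inside evaluate can be restated without ObjectInspectors or a Hive runtime. This plain-Java sketch mirrors the same control flow (including Hive's NULL-in, NULL-out convention):

```java
import java.util.List;

// The core logic of ComplexUDFExample.evaluate, without the Hive plumbing.
public class ArrayContainsCheck {

    // Returns null when either input is null, mirroring Hive NULL semantics;
    // otherwise reports whether the list contains the element.
    public static Boolean arrayContains(List<String> list, String arg) {
        if (list == null || arg == null) {
            return null;
        }
        for (String s : list) {
            if (arg.equals(s)) {
                return Boolean.TRUE;
            }
        }
        return Boolean.FALSE;
    }
}
```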
Overview of HCatalog
HCatalog, a metadata and table management system, helps enable schema-on-read in Hadoop. HCatalog has the following features:
· Makes the Hive metastore available to users of other tools on Hadoop.
· Provides connectors for MapReduce and Pig so that users of those tools can read data from and write data to Hive's warehouse.
· Allows users to share data and metadata across Hive, Pig, and MapReduce.
· Provides a relational view, through SQL-like language (HiveQL), to data within Hadoop.
· Allows users to write their applications without being concerned how or where the data is stored.
· Insulates users from schema and storage format changes.
HCatalog in the Ecosystem
HCatalog provides table abstraction, which hides details about the data, such as:
· How the data is stored
· Where the data resides on the filesystem
· What format the data is in
· What the schema of the data is
HCatInputFormat and HCatOutputFormat
For Java MapReduce applications, the HCatInputFormat and HCatOutputFormat classes can be used to read and write data using HCatalog schemas and data types.
Here is an example of a Mapper that uses HCatInputFormat:

```java
public static class Map extends Mapper<WritableComparable, HCatRecord, Text, IntWritable> {
    String name;
    int age;
    double gpa;

    protected void map(WritableComparable key, HCatRecord value, Context context)
            throws IOException, InterruptedException {
        name = (String) value.get(0);
        age = (Integer) value.get(1);
        gpa = (Double) value.get(2);
        context.write(new Text(name), new IntWritable(age));
    }
}
```

The Job configuration looks like:

```java
String principalID = System.getProperty(HCatConstants.HCAT_METASTORE_PRINCIPAL);
if (principalID != null) {
    conf.set(HCatConstants.HCAT_METASTORE_PRINCIPAL, principalID);
}
Job job = Job.getInstance(conf, "SimpleRead");
HCatInputFormat.setInput(job, InputJobInfo.create(dbName, tableName, null));
job.setInputFormatClass(HCatInputFormat.class);
```
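HCatRecord hands fields to the Mapper by position, so each value.get(i) result must be cast to the column's Java type. That access pattern can be sketched with a plain List&lt;Object&gt; and no HCatalog dependency; the field values below are made up for illustration:

```java
import java.util.List;

// Mimics positional field access on an HCatRecord using a plain List<Object>.
public class PositionalRecord {

    // Casts each field to its expected Java type, as the Mapper above does.
    public static String describe(List<Object> record) {
        String name = (String) record.get(0);
        int age = (Integer) record.get(1);
        double gpa = (Double) record.get(2);
        return name + "/" + age + "/" + gpa;
    }
}
```

If a column's actual type differs from the cast (say the table stores age as a long), this fails at runtime with a ClassCastException, which is why the casts must match the HCatalog schema.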