Overview of SerDe in Hive-0.5


I. Background

1. When processes communicate remotely, they can exchange data of many types, but whatever the type, the data travels over the network as a binary sequence. The sender must convert an object into a byte sequence before it can be transmitted; this is called object serialization. The receiver must restore the byte sequence back into an object; this is called object deserialization.
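
To make the idea concrete, here is a minimal sketch using plain Java serialization (the LogRecord class is hypothetical and exists only for this illustration; Hive's own SerDe mechanism is described below):

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class SerializeDemo {
  // Hypothetical record type, used only to illustrate serialization.
  static class LogRecord implements Serializable {
    long time;
    int userid;
    LogRecord(long time, int userid) { this.time = time; this.userid = userid; }
  }

  public static void main(String[] args) throws Exception {
    // Sender side: serialize the object into a byte sequence.
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    ObjectOutputStream out = new ObjectOutputStream(bytes);
    out.writeObject(new LogRecord(1234567891012L, 123456));
    out.close();

    // Receiver side: deserialize the byte sequence back into an object.
    ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()));
    LogRecord restored = (LogRecord) in.readObject();
    System.out.println(restored.time + " " + restored.userid);
  }
}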

2. Hive's deserialization turns a key/value pair into the values of each column of a Hive table.

3. Hive can load data into a table conveniently without first transforming the data, which saves a great deal of time when processing massive data sets.

II. Technical Details

1. SerDe is short for Serialize/Deserialize; it is used for serialization and deserialization.

2. When creating a table, the user can use a custom SerDe or one of Hive's built-in SerDes. A SerDe defines the table's columns and maps the data onto those columns.

CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name
  [(col_name data_type [COMMENT col_comment], ...)]
  [COMMENT table_comment]
  [PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
  [CLUSTERED BY (col_name, col_name, ...)
    [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
  [ROW FORMAT row_format]
  [STORED AS file_format]
  [LOCATION hdfs_path]

To create a table with a specified SerDe, use the ROW FORMAT row_format clause. For example:

a. Add the jar. In the Hive client run: hive> add jar /run/serde_test.jar; or start Hive from the Linux shell with: ${HIVE_HOME}/bin/hive -auxpath /run/serde_test.jar

b. Create the table: create table serde_table row format serde 'hive.connect.TestDeserializer';

3. Write the deserialization class TestDeserializer, implementing the three methods of the Deserializer interface:

a) Initialization: initialize(Configuration conf, Properties tbl).

b) Deserialize a Writable and return an Object: deserialize(Writable blob).

c) Get the ObjectInspector for the Object returned by deserialize(Writable blob): getObjectInspector().

public interface Deserializer {
  /**
   * Initialize the HiveDeserializer.
   * @param conf System properties
   * @param tbl  table properties
   * @throws SerDeException
   */
  public void initialize(Configuration conf, Properties tbl) throws SerDeException;

  /**
   * Deserialize an object out of a Writable blob.
   * In most cases, the return value of this function will be constant since the function
   * will reuse the returned object.
   * If the client wants to keep a copy of the object, the client needs to clone the
   * returned value by calling ObjectInspectorUtils.getStandardObject().
   * @param blob The Writable object containing a serialized object
   * @return A Java object representing the contents in the blob.
   */
  public Object deserialize(Writable blob) throws SerDeException;

  /**
   * Get the object inspector that can be used to navigate through the internal
   * structure of the Object returned from deserialize(...).
   */
  public ObjectInspector getObjectInspector() throws SerDeException;
}

The following deserialization class splits one row of data into the Hive table's four fields: time, userid, host, and path. For example:

package hive.connect;

import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hive.serde2.Deserializer;
import org.apache.hadoop.hive.serde2.SerDeException;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory.ObjectInspectorOptions;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

public class TestDeserializer implements Deserializer {
  private static List<String> FieldNames = new ArrayList<String>();
  private static List<ObjectInspector> FieldNamesObjectInspectors = new ArrayList<ObjectInspector>();

  // Declare the four output columns and the ObjectInspector for each one.
  static {
    FieldNames.add("time");
    FieldNamesObjectInspectors.add(ObjectInspectorFactory
        .getReflectionObjectInspector(Long.class, ObjectInspectorOptions.JAVA));
    FieldNames.add("userid");
    FieldNamesObjectInspectors.add(ObjectInspectorFactory
        .getReflectionObjectInspector(Integer.class, ObjectInspectorOptions.JAVA));
    FieldNames.add("host");
    FieldNamesObjectInspectors.add(ObjectInspectorFactory
        .getReflectionObjectInspector(String.class, ObjectInspectorOptions.JAVA));
    FieldNames.add("path");
    FieldNamesObjectInspectors.add(ObjectInspectorFactory
        .getReflectionObjectInspector(String.class, ObjectInspectorOptions.JAVA));
  }

  @Override
  public Object deserialize(Writable blob) {
    try {
      if (blob instanceof Text) {
        String line = ((Text) blob).toString();
        if (line == null)
          return null;
        // Each input row is "time<TAB>userid<TAB>url".
        String[] field = line.split("\t");
        if (field.length != 3) {
          return null;
        }
        List<Object> result = new ArrayList<Object>();
        URL url = new URL(field[2]);
        Long time = Long.valueOf(field[0]);
        Integer userid = Integer.valueOf(field[1]);
        result.add(time);
        result.add(userid);
        result.add(url.getHost());
        result.add(url.getPath());
        return result;
      }
    } catch (MalformedURLException e) {
      e.printStackTrace();
    }
    return null;
  }

  @Override
  public ObjectInspector getObjectInspector() throws SerDeException {
    return ObjectInspectorFactory.getStandardStructObjectInspector(
        FieldNames, FieldNamesObjectInspectors);
  }

  @Override
  public void initialize(Configuration arg0, Properties arg1) throws SerDeException {
  }
}

Test the Hive table against data on HDFS; one test row looks like this:

1234567891012 123456 http://wiki.apache.org/hadoop/Hive/LanguageManual/UDF

hive> add jar /run/jar/merg_hua.jar;
Added /run/jar/merg_hua.jar to class path
hive> create table serde_table row format serde 'hive.connect.TestDeserializer';
Found class for hive.connect.TestDeserializer
OK
Time taken: 0.028 seconds
hive> describe serde_table;
OK
time    bigint  from deserializer
userid  int     from deserializer
host    string  from deserializer
path    string  from deserializer
Time taken: 0.042 seconds
hive> select * from serde_table;
OK
1234567891012   123456  wiki.apache.org /hadoop/Hive/LanguageManual/UDF
Time taken: 0.039 seconds
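
The result of the SELECT above can also be checked locally with a small driver that calls TestDeserializer directly (a sketch only; TestDeserializerCheck is a made-up name, and it assumes the three input fields are tab-separated, as expected by the split("\t") in deserialize()):

import java.util.List;

import org.apache.hadoop.io.Text;

import hive.connect.TestDeserializer;

public class TestDeserializerCheck {
  public static void main(String[] args) throws Exception {
    // The sample row from HDFS, with its three fields joined by tabs.
    Text row = new Text("1234567891012\t123456\thttp://wiki.apache.org/hadoop/Hive/LanguageManual/UDF");
    TestDeserializer serde = new TestDeserializer();
    // deserialize() splits the line and returns [time, userid, host, path].
    List<?> columns = (List<?>) serde.deserialize(row);
    System.out.println(columns);
    // Expected: [1234567891012, 123456, wiki.apache.org, /hadoop/Hive/LanguageManual/UDF]
  }
}
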
III. Summary

1. To use custom deserialization with a Hive table, write a class that implements Deserializer and select it with the ROW FORMAT clause of the CREATE TABLE command.

2. When processing massive amounts of data whose format already matches the table structure, Hive's deserialization can be used directly, with no need to transform the data, which saves a great deal of time.
