Programming Hive, Chapters 7-14: Reading Notes

Chapter7 HiveQL : Views

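Hive does not support materialized views (as of the version the book covers).

A view is listed by SHOW TABLES alongside regular tables.

A view cannot be the target of an INSERT or LOAD statement.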
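
The three points above can be seen in a short session. This is only a sketch: the table orders and the view big_orders are hypothetical names made up for illustration.

```sql
-- Hypothetical table and view names, for illustration only.
CREATE TABLE IF NOT EXISTS orders (id INT, customer STRING, amount DOUBLE);

-- A view stores only the query, not its result.
CREATE VIEW IF NOT EXISTS big_orders AS
SELECT id, customer, amount
FROM orders
WHERE amount > 1000;

-- The view appears in the same listing as regular tables.
SHOW TABLES;

-- Views are read like tables ...
SELECT customer, amount FROM big_orders;

-- ... but they are dropped with DROP VIEW, and cannot be the target
-- of INSERT or LOAD.
DROP VIEW IF EXISTS big_orders;
```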

Chapter8 HiveQL: Indexes

Hive has limited indexing capabilities.

Indexing in Hive is still immature; index handlers are pluggable, so indexing can be customized as needed.

CREATE INDEX indexname
ON TABLE tablename (columnname)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler';

Index handler: the class named in the AS clause above is the handler that implements the index.

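Bitmap indexes: created with AS 'BITMAP'; suited to columns with few distinct values.

Show an index: SHOW FORMATTED INDEX ON tablename;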

Drop an index: use DROP INDEX; Hive does not let you drop the index table directly with DROP TABLE.
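
The full index lifecycle might look like the sketch below. The table employees, the column country, and the index name are hypothetical. Index support was removed in Hive 3.0, so this only applies to the older versions the book covers, and WITH DEFERRED REBUILD is added here on the assumption that the index should be built in a separate step.

```sql
-- Hypothetical names; assumes a Hive version that still supports indexes.
CREATE INDEX employees_index
ON TABLE employees (country)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
WITH DEFERRED REBUILD;

-- With DEFERRED REBUILD the index starts empty and is built explicitly.
ALTER INDEX employees_index ON employees REBUILD;

SHOW FORMATTED INDEX ON employees;

DROP INDEX IF EXISTS employees_index ON employees;
```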

Chapter9 Schema Design

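Anti-pattern:

For example table-by-day: a new table is created for each day as data accumulates over time, such as supply_2013_1_1, supply_2013_1_2 (a scheme often used in traditional databases).

Selecting data across several of these tables then requires UNION ALL.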

The equivalent in Hive should be a partitioned table: CREATE TABLE tablename (columnname INT, columnname2 STRING) PARTITIONED BY (day INT);

Then, instead of adding a table for each new day, add a partition: ALTER TABLE tablename ADD PARTITION (day=20130101);
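
As a sketch of how the partitioned design replaces the per-day tables (the table supply and its columns are hypothetical, chosen to echo the supply_2013_1_1 example above):

```sql
-- Hypothetical table: one table partitioned by day replaces
-- supply_2013_1_1, supply_2013_1_2, ...
CREATE TABLE IF NOT EXISTS supply (id INT, item STRING, quantity INT)
PARTITIONED BY (day INT);

ALTER TABLE supply ADD IF NOT EXISTS PARTITION (day = 20130101);
ALTER TABLE supply ADD IF NOT EXISTS PARTITION (day = 20130102);

-- Filtering on the partition column reads only the matching partition
-- directories, so no UNION ALL is needed.
SELECT item, quantity
FROM supply
WHERE day >= 20130101 AND day <= 20130102;
```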

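But the partitioning should not be too fine-grained. For example, if queries select on N columns, making every one of them a partition column causes problems: Hive is designed to process a small number of large files rather than a large number of small ones, and with overly fine partitioning every partition becomes its own directory holding at least one file, which is bad.

(Too many small files can exhaust the capacity of the NameNode to manage the filesystem metadata.)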

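In addition, by default each task starts its own JVM, and with many small files the JVM startup overhead can exceed the actual processing time.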

The files in each partition directory should be large (some multiple of the filesystem block size).

A good strategy is time-range partitioning: divide the data along the time dimension.

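If the table is partitioned by day and state but the query filters only on day (WHERE day = 201211), the work may be skewed: the states hold different amounts of data, so the per-state partitions differ widely in size.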

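The ARRAY, MAP, and STRUCT column types can already hold one-to-many relationships inside a single row.

Star-schema designs are suboptimal in Hive.

The purpose of denormalization is to minimize disk seeks: data is read from contiguous storage as far as possible, which optimizes I/O performance.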
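
A denormalized table using the collection types could look like the following sketch. The table and column names are hypothetical; the point is only that the one-to-many data lives in the row itself, so reading it needs no join.

```sql
-- Hypothetical denormalized table: one row per employee, with the
-- one-to-many data held in collection columns instead of child tables.
CREATE TABLE IF NOT EXISTS employees (
  name         STRING,
  salary       FLOAT,
  subordinates ARRAY<STRING>,
  deductions   MAP<STRING, FLOAT>,
  address      STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>
);

-- Elements are read in place, with no join.
SELECT name,
       subordinates[0]             AS first_subordinate,
       deductions['Federal Taxes'] AS federal_taxes,
       address.city                AS city
FROM employees;
```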

Making multiple passes over the same data:

Several queries read the same table in a single scan; this feels a bit like what we are working on at the moment.
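
Hive's multi-insert syntax is one way to avoid repeated passes. In this sketch the tables history, sales, and credits are hypothetical, and sales and credits are assumed to have the same columns as history.

```sql
-- Hypothetical tables: history is scanned once and feeds both targets.
FROM history
INSERT OVERWRITE TABLE sales
  SELECT * WHERE action = 'purchased'
INSERT OVERWRITE TABLE credits
  SELECT * WHERE action = 'returned';
```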

Many ETL processes involve multiple processing steps. Each step may produce one or more temporary tables that are only needed until the end of the next job.

Adding a column, step by step:

1. CREATE TABLE tablename (...) PARTITIONED BY (columnname columntype)

2. LOAD DATA LOCAL INPATH 'xxx' INTO TABLE tablename PARTITION (columnname=value1): loads a local file into a partition, so that the partition actually holds data

3. ALTER TABLE tablename ADD COLUMNS (columnname columntype)

4. LOAD DATA LOCAL INPATH 'xxx' INTO TABLE tablename PARTITION (columnname=value2)

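By the time of the second load the table definition has changed, for example because a column was added, so the data loaded earlier shows NULL for the new column. (SerDes are very forgiving: missing fields are filled with NULL, and extra fields are discarded.)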
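
The four steps above can be sketched as follows. The table weblogs, its columns, and the files log1.txt and log2.txt are hypothetical stand-ins for the placeholders in the steps.

```sql
-- Hypothetical table and file names, modeled on the four steps above.
CREATE TABLE IF NOT EXISTS weblogs (version BIGINT, url STRING)
PARTITIONED BY (hit_date INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- log1.txt is assumed to have two tab-separated fields per line.
LOAD DATA LOCAL INPATH 'log1.txt'
INTO TABLE weblogs PARTITION (hit_date = 20110101);

ALTER TABLE weblogs ADD COLUMNS (user_id STRING);

-- log2.txt is assumed to have three fields per line, including user_id.
LOAD DATA LOCAL INPATH 'log2.txt'
INTO TABLE weblogs PARTITION (hit_date = 20110102);

-- Rows from the first partition are expected to show NULL for user_id.
SELECT hit_date, url, user_id FROM weblogs;
```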

By default Hive uses row-oriented storage, but a table can instead be stored in a hybrid row-column oriented form by using a columnar SerDe.

(Almost) always use compression.

Compression reduces the amount of I/O.

Chapter10 Tuning

Chapter11 Other File Formats and Compression

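Hadoop jobs are usually I/O-bound rather than CPU-bound, so compression generally improves performance. (Conversely, if a job is CPU-bound, compression lowers performance, because compressing and decompressing consume CPU.)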

bzip2: smallest compressed size, highest CPU overhead

gzip: in between the other two

LZO: larger files, but much faster

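Another consideration is whether the compression format is splittable. (If it is not, the whole file has to be kept together and read by a single task.)

Each split is sent to a separate map process.

bzip2 and LZO compress each block independently, so block boundaries can be found and the files can be split.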
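
Compression is switched on per session with SET. The sketch below uses the older mapred.* property name from the Hadoop versions the book covers (newer Hadoop releases use a different name for it), and the tables source_table and compressed_copy are hypothetical.

```sql
-- Compress the intermediate data passed between stages of a query.
SET hive.exec.compress.intermediate=true;

-- Compress the final query output; BZip2 is splittable but CPU-heavy.
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.BZip2Codec;

-- Hypothetical tables: the files written for compressed_copy are
-- now compressed with the codec chosen above.
CREATE TABLE compressed_copy AS SELECT * FROM source_table;
```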

Chapter12 Developing

Chapter13 Functions

User-defined functions run in the same task process as the Hive query.

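SHOW FUNCTIONS: lists all functions currently loaded in the Hive session.

DESCRIBE FUNCTION: shows a short description of a function.

Calling a function: for example, SELECT concat(column1, column2) AS x FROM tablename; a function call is written in a query just like a plain column reference.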
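
A short session covering the three commands (the table people and its columns are hypothetical):

```sql
-- List every function available in the current session.
SHOW FUNCTIONS;

-- Short and extended descriptions of one function.
DESCRIBE FUNCTION concat;
DESCRIBE FUNCTION EXTENDED concat;

-- Hypothetical table and columns: the call sits where a column would.
SELECT concat(first_name, ' ', last_name) AS full_name
FROM people;
```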

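Function categories:

Aggregate functions: for example avg(), sum().

Table-generating functions: for example, SELECT array(1,2,3) FROM dual; returns an array in a single row, whereas calling explode() on that array moves its elements onto separate rows.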
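
The difference between array() and explode() can be sketched like this. It assumes a one-row helper table named dual exists, as in the example above; the employees table in the last query is hypothetical.

```sql
-- array() returns one row holding the whole array.
SELECT array(1, 2, 3) FROM dual;

-- explode() returns one row per element.
SELECT explode(array(1, 2, 3)) AS element FROM dual;

-- To select other columns alongside the exploded values, use LATERAL VIEW.
-- Hypothetical table: employees(name STRING, subordinates ARRAY<STRING>).
SELECT name, sub
FROM employees
LATERAL VIEW explode(subordinates) sub_view AS sub;
```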

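Defining your own function: class MyUDFFunction extends UDF {}

Extend the UDF class and implement its evaluate() method.

@Description(...) is an optional Java annotation; its text is shown by DESCRIBE FUNCTION.

Compile the function into a JAR, add the JAR's path with ADD JAR, and finally register it with CREATE FUNCTION.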
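
On the Hive side, registering and calling the compiled function might look like the sketch below. The JAR path, the package com.example, and the function name my_udf are all hypothetical; only the class name MyUDFFunction comes from the note above.

```sql
-- Hypothetical JAR path, package, and function name.
ADD JAR /path/to/my-udfs.jar;

CREATE TEMPORARY FUNCTION my_udf AS 'com.example.MyUDFFunction';

-- The description comes from the class's @Description annotation, if present.
DESCRIBE FUNCTION my_udf;

SELECT my_udf(column1) FROM tablename;

-- Temporary functions last for the session; this removes one earlier.
DROP TEMPORARY FUNCTION IF EXISTS my_udf;
```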

GenericUDF: a more abstract kind of UDF, used for constructs such as CASE ... WHEN ...

It is likewise made available with ADD JAR and CREATE TEMPORARY FUNCTION.

UDAF: a user-defined aggregate function; watch memory usage when using one.

Hive already has a UDAF called collect_set, which adds all input values to a java.util.Set collection.

The iterate() and terminatePartial() methods are used on the map side.

The terminate() and merge() methods are used on the reduce side.
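
A minimal use of the built-in collect_set mentioned above (the table page_views is hypothetical). Each group's set is held in memory while it is built, which is where the memory-usage warning applies.

```sql
-- Hypothetical table: page_views(user_id STRING, url STRING).
-- collect_set gathers the distinct urls of each user into one array.
SELECT user_id, collect_set(url) AS urls
FROM page_views
GROUP BY user_id;
```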

A common use of Hive is to analyze web logs.

MaxMind provides a tool for converting an IP address to its geographic location.

ADD FILE is used to cache the necessary data files with Hive.

ADD JAR is required to add JAR files to the cache.

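Annotations for use with functions:

boolean deterministic(); the function returns the same result every time for the same input

boolean stateful(); related to the above: marks a function that keeps state across the rows it has processed (Hive treats a stateful UDF as non-deterministic)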

boolean distinctLike();

Creating a macro: CREATE TEMPORARY MACRO
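
A macro is a named expression defined directly in HiveQL, with no Java involved. This sketch uses a hypothetical macro named sigmoid and assumes a one-row helper table named dual.

```sql
-- Hypothetical macro; the body is a single expression over the parameters.
CREATE TEMPORARY MACRO sigmoid (x DOUBLE) 1.0 / (1.0 + exp(-x));

-- Assumes a one-row helper table named dual.
SELECT sigmoid(2) FROM dual;

DROP TEMPORARY MACRO IF EXISTS sigmoid;
```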

Chapter14 Streaming

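An interface to programming languages other than Java.

Pipeline computing model: rows are piped to an external process through its standard input, and the results are read back from its standard output.

Streaming is used to integrate non-Java code into Hive.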
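
The simplest streaming query passes rows through an existing Unix program. In this sketch the table a and its columns are hypothetical, and /bin/cat stands in for any script or executable that reads lines and writes lines.

```sql
-- Hypothetical table a(col1 INT, col2 INT).
-- Each row is written to the process's standard input as a tab-separated
-- line; each line the process prints becomes an output row.
SELECT TRANSFORM (col1, col2)
  USING '/bin/cat'
  AS (new_a, new_b)
FROM a;
```

Because /bin/cat echoes its input, the output rows match the input rows; replacing it with a script in another language is what lets non-Java code take part in the query.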