HBase Key Points Summary

Source: Internet | Posted by: 我知女人心 | Editor: 程序博客网 | Time: 2024/05/01 16:02

HBase

Official site: http://hbase.apache.org/

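1. How should HBase rowkeys be designed, and how should column families be designed?

1. Three dimensions

HBase's ordered storage is three-dimensional: data is sorted by rowkey (the row's primary key), column key (columnFamily + qualifier), and timestamp.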

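2. Loading large amounts of data into an HBase table

files/data ---> HFile ---> bulk load into HBase table

If a large volume of data is written into a single region, the RegionServer hosting it may run into trouble, so pre-split the table so that data is inserted into several regions at the same time.

In essence, pre-splitting means dividing the table into regions ahead of time.

Default: a new table has a single region whose startkey and endkey are both empty.

Pre-splitting: create several regions up front and assign each one a rowkey range.

How it works: a rowkey is placed by comparing its leading bytes (its prefix) against the split points.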

Three ways to create a pre-split table:

create 'ns1:t1', 'info', SPLITS => ['10', '20', '30', '40', '50']   (5 split points give 6 regions)

create 'ns1:t1', 'info', SPLITS_FILE => 'split.txt'

create 'ns1:t1', 'info', {NUMREGIONS => 6, SPLITALGO => 'HexStringSplit'}
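To make the "5 split points, 6 regions" point concrete, here is a minimal sketch of how a rowkey maps to a pre-split region. It is not HBase code: the class and method names are made up for illustration, and it compares Java strings, which matches HBase's byte-wise ordering only for plain ASCII keys.

```java
import java.util.Arrays;

/** Sketch: which pre-split region does a rowkey land in? (illustrative names, not an HBase API) */
final class RegionLocatorSketch {
    // Split points from the shell example; they must be sorted.
    static final String[] SPLITS = {"10", "20", "30", "40", "50"};

    /**
     * Returns the 0-based region index for a rowkey.
     * Region 0 has an empty start key, region i starts at SPLITS[i - 1],
     * and the last region has an empty end key.
     */
    static int regionIndex(String rowkey) {
        int pos = Arrays.binarySearch(SPLITS, rowkey);
        // Exact match: a split point is the start key of the following region.
        // No match: binarySearch returns -(insertionPoint) - 1.
        return pos >= 0 ? pos + 1 : -pos - 1;
    }

    /** Number of regions produced by the split points. */
    static int regionCount() {
        return SPLITS.length + 1;
    }
}
```

For example, `regionIndex("1599")` falls into the region that starts at '10', while any key below '10' goes to the first region and any key from '50' upward goes to the last one.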

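3. Rowkey design:

By default, the rowkey is the only index available for lookups.

Goals: use as little disk space as possible, speed up index lookups, and avoid hotspots.

Basic principle: design the rowkey around the business requirements.

Uniqueness principle: every rowkey must be unique.

Length principle: keep the rowkey under 100 bytes; the shorter the better, ideally 8-16 bytes.

Hashing principle: avoid hotspots by generating the key from a random value plus an algorithm (for example a hash).

Reversing the string: 123 ---> 321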
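The reversal and hashing ideas can be sketched in a few lines of plain Java. This is only one possible approach: the helper names are hypothetical, and the salt here is derived from the key's hash rather than a truly random value, an assumption made so that a reader can recompute the prefix when looking a row up.

```java
/** Sketch of two hotspot-avoidance tricks for rowkeys (hypothetical helper names). */
final class RowKeySketch {

    /** Reverses the key so that the fast-changing trailing characters come first. */
    static String reverse(String key) {
        return new StringBuilder(key).reverse().toString();
    }

    /**
     * Prefixes the key with a two-digit bucket number in [0, buckets).
     * The bucket is computed from the key itself, so the same key always
     * gets the same prefix. Assumes buckets <= 100.
     */
    static String salted(String key, int buckets) {
        int bucket = Math.floorMod(key.hashCode(), buckets);
        return (bucket < 10 ? "0" : "") + bucket + "_" + key;
    }
}
```

With a salt like this, the bucket prefixes line up naturally with split points such as '10', '20', and so on, so consecutive keys spread across the pre-split regions. The trade-off is that a range scan over the original keys has to be issued once per bucket.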

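4. Example:

Main table: rowkey = phone + time

1. Index table: rowkey = the other party's number + time

Column family: info

Column: the rowkey of the main table

Keeping the main table and the index table in sync --------> requires transactions

https://phoenix.apache.org builds a SQL layer on top of the NoSQL store (client side, via JDBC) and keeps the index in sync

A second way to create an index table

Solr

Automatic index creation: Cloudera Search

String concatenation: random value + call record + time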

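2. How HBase filters work

All filters take effect on the server side, which is known as predicate push down. This ensures that filtered-out data is never sent to the client.

Notes:

String-based comparators such as RegexStringComparator and SubstringComparator are slower and more resource-hungry than byte-based comparators, because every comparison has to convert the given value to a String. Substring extraction and regular-expression processing also cost extra time.
The original purpose of a filter is to screen out useless data, so every CompareFilter-based filter works by returning the values that match.

filter ==> WHERE in SQL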

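3. How HBase reads and writes data

Read path: ZooKeeper (address of the RegionServer hosting the meta table) --> meta table (RegionServer and region that hold the rowkey) --> region (perform the operation)

Detailed steps:

1. The rowkey in the request determines the target region --> connect to ZooKeeper (to look up the address of the RegionServer hosting the meta table) ---> locate the HBase RegionServer and the region it manages -->

2. Read a row ---> first check the memstore, which holds edits waiting to be flushed ---> then check the blockcache to see whether the block containing the row was accessed recently ---> finally read the corresponding HFile on disk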
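The lookup order in step 2 can be modelled with three maps. This is a toy sketch under heavy simplifying assumptions, not HBase code: every name is invented, one value per rowkey stands in for cells, and the real read path merges results from the memstore and the HFiles by timestamp instead of stopping at the first hit.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

/** Toy model of the read order: memstore, then blockcache, then HFile. */
final class ReadPathSketch {
    // Recent writes that have not been flushed yet, kept sorted by rowkey.
    final TreeMap<String, String> memstore = new TreeMap<>();
    // Recently read data; access-ordered so the map remembers recency.
    final Map<String, String> blockCache = new LinkedHashMap<>(16, 0.75f, true);
    // Stands in for the HFiles on HDFS.
    final Map<String, String> hfile = new HashMap<>();
    // Counts how often the "disk" had to be touched.
    int diskReads = 0;

    String get(String rowkey) {
        String value = memstore.get(rowkey);
        if (value != null) {
            return value;                   // newest data wins
        }
        value = blockCache.get(rowkey);
        if (value != null) {
            return value;                   // served from cache, no disk access
        }
        diskReads++;
        value = hfile.get(rowkey);
        if (value != null) {
            blockCache.put(rowkey, value);  // cache it for the next read
        }
        return value;
    }
}
```

Reading the same row twice touches the simulated disk only once, which is the point of the blockcache step.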

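Write path: put 'tname','rowkey','cf:c','value'

The edit is first written to the HLog, then to the memstore; once the memstore reaches a threshold, it is written out to an HFile.

Detailed steps:

1. The request identifies the target: rowkey + cf + col -> region

2. Connect to ZooKeeper (to look up the address of the RegionServer hosting the meta table) ---> locate the HBase RegionServer -->

Region --> store (write to the memstore; when the memstore reaches a threshold it is flushed into a storefile; when the number of storefiles grows past a threshold, a compact (merge) operation is triggered; when a single storefile exceeds a threshold, a split is triggered that divides the current region into two, the original region goes offline, and the HMaster assigns the two new regions to RegionServers) --> storefile (HFile) --> HDFS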
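The ordering of the write path (log first, then memstore, then flush at a threshold) can be sketched as follows. This is a simplified model with invented names: the threshold is a number of rows rather than a size in bytes, and compaction and splitting are left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Toy model of the write order: WAL (HLog), memstore, flush to a storefile. */
final class WritePathSketch {
    final List<String> wal = new ArrayList<>();                   // append-only log
    TreeMap<String, String> memstore = new TreeMap<>();           // sorted in-memory buffer
    final List<TreeMap<String, String>> storeFiles = new ArrayList<>();
    final int flushThreshold;                                     // rows, not bytes (simplification)

    WritePathSketch(int flushThreshold) {
        this.flushThreshold = flushThreshold;
    }

    void put(String rowkey, String value) {
        wal.add(rowkey + "=" + value);   // 1. log the edit so it can be replayed after a crash
        memstore.put(rowkey, value);     // 2. apply it to the memstore
        if (memstore.size() >= flushThreshold) {
            flush();                     // 3. threshold reached: write a new storefile
        }
    }

    void flush() {
        if (memstore.isEmpty()) {
            return;
        }
        storeFiles.add(memstore);        // the sorted snapshot becomes an immutable file
        memstore = new TreeMap<>();
    }
}
```

Because the memstore is already sorted, each flush produces a sorted file, which is what later makes merging storefiles during compaction cheap.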

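4. How to handle an HBase outage

5. How to pre-split in HBase

Splitting:

By default a table has one region:

Rowkey: 0000--1000

Region1: 0000-0500   startkey: 0000

Region2: 0501-1000   startkey: 0500

Later data: 1001 ---> region2

1002 ---> region2

....................................

1500 --> region2, which then splits --> region3: 0501-1000

region4: 1001-1500   Because the rowkey keeps increasing, every new write lands in the last region.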

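Pre-splitting: divide the table into regions ahead of time

Default: the startkey and endkey are both empty

Pre-splitting: create several regions up front and assign each one a rowkey range

Implementation:

Three ways to create a pre-split table:

create 'ns1:t1', 'info', SPLITS => ['10', '20', '30', '40', '50']   (5 split points give 6 regions)

create 'ns1:t1', 'info', SPLITS_FILE => 'split.txt'

create 'ns1:t1', 'info', {NUMREGIONS => 6, SPLITALGO => 'HexStringSplit'}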

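6. How HBase handles concurrency

Locks and the MVCC mechanism

For details, see this blog post: http://www.cnblogs.com/leetieniu2014/p/5393755.html

HBase synchronization mechanisms

HBase provides two synchronization mechanisms. One is a mutex built on CountDownLatch; its typical use is the row lock held while a row is being updated. The other is a read-write lock built on ReentrantReadWriteLock, which can place either a read-lock or a write-lock on a critical resource. The read-lock allows concurrent reads, while the write-lock is fully exclusive.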

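CountDownLatch

In Java, CountDownLatch is a synchronization aid that lets one or more threads block until a set of operations running in other threads has completed. A CountDownLatch is initialized with a given count. Its two core methods are countDown() and await(): the former decrements the count by one, and the latter waits for the count to reach 0, blocking for as long as it has not. Combined with a thread-safe map and a test-and-set scheme, CountDownLatch can implement a basic mutex as follows:

1. Initialization: create a CountDownLatch with a count of 1.

2. Test: the thread first tries to insert an entry into the thread-safe map, with the critical resource as the key and the latch as the value. If the insert fails, another thread already holds the lock, so the thread calls await on the latch that is already in the map and blocks until the holder releases the lock.

3. Set: if the insert succeeds, the thread holds the lock and any other thread's insert is bound to fail. The holder does its work and then releases the lock: it first removes the corresponding KeyValue from the map and then calls countDown on the latch, which decrements the count by 1; once the count reaches 0, the blocked threads are woken up.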
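The three steps above can be sketched with ConcurrentHashMap.putIfAbsent as the test-and-set primitive. This is a minimal illustration of the technique, not HBase's actual row-lock class; the names are invented, and the lock is neither reentrant nor checked for ownership on release.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;

/** Sketch of a per-row mutex built from a thread-safe map and CountDownLatch. */
final class LatchRowLock {
    private final ConcurrentHashMap<String, CountDownLatch> locks = new ConcurrentHashMap<>();

    void lock(String row) throws InterruptedException {
        CountDownLatch mine = new CountDownLatch(1);      // step 1: count of 1
        while (true) {
            CountDownLatch existing = locks.putIfAbsent(row, mine);
            if (existing == null) {
                return;                                   // step 3: insert succeeded, lock held
            }
            existing.await();                             // step 2: wait for the holder, then retry
        }
    }

    void unlock(String row) {
        CountDownLatch latch = locks.remove(row);         // remove the entry first
        if (latch != null) {
            latch.countDown();                            // then wake up the waiting threads
        }
    }

    /** Demo: several threads increment a shared counter under the same row lock. */
    static int contendedIncrements(int threads, int perThread) {
        LatchRowLock lock = new LatchRowLock();
        int[] counter = {0};
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    try {
                        lock.lock("row1");
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        return;
                    }
                    try {
                        counter[0]++;                     // critical section
                    } finally {
                        lock.unlock("row1");
                    }
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return counter[0];
    }
}
```

Woken threads do not inherit the lock; they loop and race on putIfAbsent again, so exactly one of them wins and the rest wait on the new holder's latch.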

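ReentrantReadWriteLock

A read-write lock consists of a read lock and a write lock, and it allows more parallelism than a mutex. The read lock can be held by many threads at once in read mode, whereas the write lock can be held by only one thread in write mode. While the lock is held for writing, every thread that tries to acquire it is blocked until it is released. While it is held for reading, all other read requests proceed in parallel, but write requests are blocked. Read-write locks therefore suit read-heavy, write-light workloads. Because the read lock is shared and the write lock is exclusive to one thread, a read-write lock is also called a shared-exclusive lock, the familiar S lock and X lock.

In Java, ReentrantReadWriteLock is the implementation class for read-write locks. Its two methods readLock() and writeLock() return the read lock and the write lock respectively.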
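A short sketch of the shared-exclusive behaviour, using only java.util.concurrent. The class and method names are illustrative; the probe method simply checks, from a second thread, what can be acquired while the first thread holds the read lock.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Sketch: a map guarded by a read-write lock, plus a probe of the S/X semantics. */
final class ReadWriteLockSketch {
    private final ReentrantReadWriteLock rw = new ReentrantReadWriteLock();
    private final Map<String, String> data = new HashMap<>();

    String read(String key) {
        rw.readLock().lock();            // shared: many readers may hold it at once
        try {
            return data.get(key);
        } finally {
            rw.readLock().unlock();
        }
    }

    void write(String key, String value) {
        rw.writeLock().lock();           // exclusive: blocks readers and other writers
        try {
            data.put(key, value);
        } finally {
            rw.writeLock().unlock();
        }
    }

    /**
     * While this thread holds the read lock, a second thread tries both locks.
     * result[0]: could the second thread take the read lock?
     * result[1]: could the second thread take the write lock?
     */
    static boolean[] probeWhileReadLocked() {
        ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
        boolean[] result = new boolean[2];
        lock.readLock().lock();
        try {
            Thread other = new Thread(() -> {
                result[0] = lock.readLock().tryLock();
                if (result[0]) {
                    lock.readLock().unlock();
                }
                result[1] = lock.writeLock().tryLock();
                if (result[1]) {
                    lock.writeLock().unlock();
                }
            });
            other.start();
            try {
                other.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        } finally {
            lock.readLock().unlock();
        }
        return result;
    }
}
```

The probe shows the second reader getting in while the writer is refused, which is the read-lock/write-lock asymmetry described above.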

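How row locks are implemented in HBase

HBase uses row locks to make updates atomic: an update either succeeds completely or fails. Every update to row-level data in HBase must first acquire the lock for that row and release it once the update completes, so that other threads can acquire it. As a result, updates to the same row are serialized.

How MVCC is implemented in HBase

As described above, HBase provides row locks and read-write locks for concurrency control at the row, Store, and Region levels. In addition, HBase provides an MVCC mechanism to control concurrent reads and writes. MVCC, multi-version concurrency control, means the transaction engine no longer relies on row locks alone to coordinate reads and writes. Instead it combines row locks with multiple versions of each row, so that a simple algorithm is enough to provide non-locking reads, which greatly improves concurrency. HBase uses exactly this combination of row locks and MVCC to provide efficient concurrent reads and writes together with read/write consistency.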
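The idea of a non-locking read can be sketched with a write number and a read point. This is a deliberately simplified model of one cell, with invented names: it assumes writes complete in the order they began, whereas a real implementation has to hold the read point back until all earlier writes have finished.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.AtomicLong;

/** Toy MVCC for a single cell: readers only see versions at or below the read point. */
final class MvccSketch {
    private static final class Version {
        final long writeNumber;
        final String value;

        Version(long writeNumber, String value) {
            this.writeNumber = writeNumber;
            this.value = value;
        }
    }

    private final AtomicLong nextWriteNumber = new AtomicLong();
    private final AtomicLong readPoint = new AtomicLong();
    private final List<Version> versions = new CopyOnWriteArrayList<>();

    /** Starts a write: the new version is stored but not yet visible to readers. */
    long beginWrite(String value) {
        long writeNumber = nextWriteNumber.incrementAndGet();
        versions.add(new Version(writeNumber, value));
        return writeNumber;
    }

    /** Completes a write by advancing the read point (assumes in-order completion). */
    void completeWrite(long writeNumber) {
        readPoint.accumulateAndGet(writeNumber, Math::max);
    }

    /** Reads without taking a lock: newest version whose write number is <= the read point. */
    String read() {
        long visibleUpTo = readPoint.get();
        Version best = null;
        for (Version v : versions) {
            if (v.writeNumber <= visibleUpTo
                    && (best == null || v.writeNumber > best.writeNumber)) {
                best = v;
            }
        }
        return best == null ? null : best.value;
    }
}
```

A reader that arrives between beginWrite and completeWrite still sees the previous version, so it never observes a half-applied update and never has to wait for the writer.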

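7. Differences between Hive and HBase

What they have in common:
Both HBase and Hive are built on top of Hadoop, and both use HDFS as the underlying storage.
Differences:
Hive is a batch-processing system built on Hadoop to reduce the work of writing MapReduce jobs, whereas HBase is a project that makes up for Hadoop's weakness in real-time operations.
A Hive query is a MapReduce job and can take anywhere from 5 minutes to several hours. HBase serves requests against its own store in real time rather than running MapReduce jobs.
Hive itself neither stores nor computes data. It depends entirely on HDFS and MapReduce, and Hive tables are purely logical.
HBase tables are physical tables, not logical ones. HBase is column-oriented storage and behaves like a very large in-memory hash table; search engines use it to store indexes, which makes lookups convenient.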

8. Java API

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

// Uses the HBase 0.98-era client API (HTable), matching the version used in these notes.
public class HBaseClient {

    public static void main(String[] args) throws Exception {
        HBaseClient hc = new HBaseClient();
        HTable table = getTable("test:test1");
        try {
            hc.getData(table);
        } finally {
            table.close();
        }
    }

    public static HTable getTable(String tbname) throws Exception {
        // get the HBase configuration
        Configuration conf = HBaseConfiguration.create();
        // create the HBase table instance
        return new HTable(conf, tbname);
    }

    /** put 'tname','rowkey','cf1:c','value' */
    public void putData(HTable table) throws Exception {
        Put put = new Put(Bytes.toBytes("2016_10004"));
        put.add(Bytes.toBytes("cf1"), Bytes.toBytes("name"), Bytes.toBytes("rainbow"));
        // write the row to the table
        table.put(put);
    }

    /** get 'tname','rowkey','cf1' */
    public void getData(HTable table) throws Exception {
        Get get = new Get(Bytes.toBytes("2016_10001"));
        get.addFamily(Bytes.toBytes("cf1"));
        // read the row from the table
        Result rs = table.get(get);
        for (Cell cell : rs.rawCells()) {
            printCell(cell);
        }
    }

    /** delete 'tname','rowkey','cf1:name' */
    public void deleteData(HTable table) throws Exception {
        Delete delete = new Delete(Bytes.toBytes("2016_10001"));
        delete.deleteColumn(Bytes.toBytes("cf1"), Bytes.toBytes("name"));
        // delete the column from the table
        table.delete(delete);
    }

    /** scan 'tname', {STARTROW => '2016_10001', STOPROW => '2016_10004'} */
    public void scanData(HTable table) throws Exception {
        // the stop row is exclusive
        Scan scan = new Scan(Bytes.toBytes("2016_10001"), Bytes.toBytes("2016_10004"));
        ResultScanner rScanner = table.getScanner(scan);
        try {
            for (Result result : rScanner) {
                System.out.println("-------rowkey-------" + Bytes.toString(result.getRow()));
                for (Cell cell : result.rawCells()) {
                    printCell(cell);
                }
            }
        } finally {
            rScanner.close();
        }
    }

    /** Prints the column family, qualifier, rowkey, and value of one cell. */
    private static void printCell(Cell cell) {
        System.out.println("family-->" + Bytes.toString(CellUtil.cloneFamily(cell))
                + " qualifier-->" + Bytes.toString(CellUtil.cloneQualifier(cell))
                + " rowkey-->" + Bytes.toString(CellUtil.cloneRow(cell))
                + " value-->" + Bytes.toString(CellUtil.cloneValue(cell)));
        System.out.println("------------------------------");
    }
}

9. How to export HBase data to MySQL

1. Sqoop import:

MySQL --> HDFS, Hive

Hive: step 1: import the data into HDFS

step 2: load the data into the Hive table (load data, which moves the files)

Sqoop export:

HDFS --> RDBMS

Integrating HBase with Sqoop:

1. Configuration:

2. HBase

MySQL ---> HBase

./sqoop import \
--connect jdbc:mysql://rainbow.com.cn:3306/db \
--username root \
--password root \
--table mysqltable \
--hbase-create-table \
--hbase-table hbasetable \
--hbase-row-key id \
--column-family cf1

Use the relevant parameters to filter the data.

HBase ---> MySQL

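10. HBase bottlenecks

Bottlenecks of a distributed framework: disk I/O throughput and network bandwidth.

11. How HBase stores a row of data

Write-ahead log

1. The request identifies the target: rowkey + cf + col -> region

2. Connect to ZooKeeper (to look up the address of the RegionServer hosting the meta table) ---> locate the HBase RegionServer -->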

12. HBase storage principles

Physical storage

Logical storage: KeyValue

key: rowkey + cf + col + timestamp

value: value

The region is the most basic unit of storage.

Underlying data format: bytes
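The composite key implies a sort order: rowkey, then column family, then qualifier, all ascending, and then timestamp descending so the newest version comes first. A minimal sketch of that ordering follows; the class is hypothetical and uses strings where HBase compares raw bytes.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Sketch of the KeyValue sort order (illustrative class, strings instead of bytes). */
final class CellKeySketch {
    final String row;
    final String family;
    final String qualifier;
    final long timestamp;

    CellKeySketch(String row, String family, String qualifier, long timestamp) {
        this.row = row;
        this.family = family;
        this.qualifier = qualifier;
        this.timestamp = timestamp;
    }

    /** rowkey, family, qualifier ascending; timestamp descending (newest first). */
    static final Comparator<CellKeySketch> ORDER = (a, b) -> {
        int c = a.row.compareTo(b.row);
        if (c == 0) {
            c = a.family.compareTo(b.family);
        }
        if (c == 0) {
            c = a.qualifier.compareTo(b.qualifier);
        }
        if (c == 0) {
            c = Long.compare(b.timestamp, a.timestamp);
        }
        return c;
    };

    /** Returns the keys in storage order, rendered as row/family:qualifier/timestamp. */
    static List<String> sorted(List<CellKeySketch> keys) {
        List<CellKeySketch> copy = new ArrayList<>(keys);
        copy.sort(ORDER);
        List<String> out = new ArrayList<>();
        for (CellKeySketch k : copy) {
            out.add(k.row + "/" + k.family + ":" + k.qualifier + "/" + k.timestamp);
        }
        return out;
    }
}
```

Putting the newest timestamp first is what lets a read stop at the first matching cell when only the latest version is wanted.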

13. Integrating HBase with MapReduce

1. HBase as input (map)

TableMapReduceUtil.initTableMapperJob(scanList, NewInstallUserMapper.class, StatsUserDimension.class, Text.class, job, false);

2. HBase as output

TableMapReduceUtil.initTableReducerJob(table, null, job);

3. HBase as both input and output

// Cell, CellUtil, and HBaseConfiguration need no import because this class
// is declared in the org.apache.hadoop.hbase package.
package org.apache.hadoop.hbase;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class HbaseMr extends Configured implements Tool {

    /**
     * mapper
     * Mapper<ImmutableBytesWritable, Result, KEYOUT, VALUEOUT>
     */
    public static class readMap extends TableMapper<ImmutableBytesWritable, Put> {

        @Override
        protected void map(ImmutableBytesWritable key, Result value, Context context)
                throws IOException, InterruptedException {
            Put put = new Put(key.get());
            // copy only the info:name cell into the Put
            for (Cell cell : value.rawCells()) {
                if ("info".equals(Bytes.toString(CellUtil.cloneFamily(cell)))
                        && "name".equals(Bytes.toString(CellUtil.cloneQualifier(cell)))) {
                    put.add(cell);
                }
            }
            // an empty Put cannot be written, so skip rows that have no info:name
            if (!put.isEmpty()) {
                context.write(key, put);
            }
        }
    }

    /**
     * reducer
     * Reducer<KEYIN, VALUEIN, KEYOUT, Mutation>
     */
    public static class writeReduce
            extends TableReducer<ImmutableBytesWritable, Put, ImmutableBytesWritable> {

        @Override
        protected void reduce(ImmutableBytesWritable key, Iterable<Put> values, Context context)
                throws IOException, InterruptedException {
            // this only works when the output table has the same structure as the input table
            for (Put put : values) {
                context.write(key, put);
            }
        }
    }

    /**
     * driver
     */
    public int run(String[] args) throws Exception {
        Configuration conf = super.getConf();
        // create the job
        Job job = Job.getInstance(conf, "mr-hbase");
        job.setJarByClass(HbaseMr.class);
        // map and reduce
        // define the scan instance
        Scan scan = new Scan();
        TableMapReduceUtil.initTableMapperJob(
                "stu_info",                    // input table
                scan,                          // Scan instance to control CF and attribute selection
                readMap.class,                 // mapper class
                ImmutableBytesWritable.class,  // mapper output key
                Put.class,                     // mapper output value
                job);
        TableMapReduceUtil.initTableReducerJob(
                "exp",                         // output table
                writeReduce.class,             // reducer class
                job);
        // submit
        boolean isSuccess = job.waitForCompletion(true);
        return isSuccess ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        // get the conf
        Configuration conf = HBaseConfiguration.create();
        // run job
        int status = ToolRunner.run(conf, new HbaseMr(), args);
        // exit
        System.exit(status);
    }
}

14. Integrating HBase with Hive

Hive and HBase integration

-》Make the HBase jars available in Hive's lib directory (here via symlinks)

export HBASE_HOME=/opt/modules/hbase-0.98.6-hadoop2

export HIVE_HOME=/opt/modules/hive-0.13.1

ln -s $HBASE_HOME/lib/hbase-common-0.98.6-hadoop2.jar $HIVE_HOME/lib/hbase-common-0.98.6-hadoop2.jar

ln -s $HBASE_HOME/lib/hbase-server-0.98.6-hadoop2.jar $HIVE_HOME/lib/hbase-server-0.98.6-hadoop2.jar

ln -s $HBASE_HOME/lib/hbase-client-0.98.6-hadoop2.jar $HIVE_HOME/lib/hbase-client-0.98.6-hadoop2.jar

ln -s $HBASE_HOME/lib/hbase-protocol-0.98.6-hadoop2.jar $HIVE_HOME/lib/hbase-protocol-0.98.6-hadoop2.jar

ln -s $HBASE_HOME/lib/hbase-it-0.98.6-hadoop2.jar $HIVE_HOME/lib/hbase-it-0.98.6-hadoop2.jar

ln -s $HBASE_HOME/lib/htrace-core-2.04.jar $HIVE_HOME/lib/htrace-core-2.04.jar

ln -s $HBASE_HOME/lib/hbase-hadoop2-compat-0.98.6-hadoop2.jar $HIVE_HOME/lib/hbase-hadoop2-compat-0.98.6-hadoop2.jar

ln -s $HBASE_HOME/lib/hbase-hadoop-compat-0.98.6-hadoop2.jar $HIVE_HOME/lib/hbase-hadoop-compat-0.98.6-hadoop2.jar

ln -s $HBASE_HOME/lib/high-scale-lib-1.1.1.jar $HIVE_HOME/lib/high-scale-lib-1.1.1.jar

-》Edit Hive's configuration file (hive-site.xml)

<property>

<name>hbase.zookeeper.quorum</name>

<value>rainbow.com.cn</value>

</property>

-》Create a Hive managed table

CREATE TABLE hbase_table_1(

empno int ,

ename string ,

job string ,

mgr int,

hiredate string ,

sal double ,

comm double ,

deptno int

)

STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'

WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,info:name,info:job,info:mgr,info:hiredate,info:sal,info:comm,info:deptno")

TBLPROPERTIES ("hbase.table.name" = "hbase_table_1");

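Creating the table in Hive automatically creates the corresponding table in HBase.

Data can only be loaded into a Hive-HBase integrated table through a MapReduce job:

insert into table hbase_table_1 select * from emp;

-》Create a Hive external table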

CREATE EXTERNAL TABLE event_logs20151220(key string, pl string, ver string, en string, u_ud string, u_sd string, s_time bigint)
ROW FORMAT SERDE 'org.apache.hadoop.hive.hbase.HBaseSerDe' STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES('hbase.columns.mapping' = ':key,info:pl,info:ver,info:en,info:u_ud,info:u_sd,info:s_time') TBLPROPERTIES('hbase.table.name' = 'event_logs20151220')

15. Integrating HBase with Hue

1. Start the HBase Thrift service (default port 9090)

bin/hbase-daemon.sh start thrift

Check with jps: ThriftServer

2. Hue configuration in hue.ini

hbase_clusters=(Cluster|hadoop-senior01.ibeifeng.com:9090)

# HBase configuration directory, where hbase-site.xml is located.

hbase_conf_dir=/opt/modules/hbase-0.98.6-hadoop2/conf

16. Importing HDFS data into an HBase table

Method 1: insert the data file directly into the HBase table

${HADOOP_HOME}/bin/yarn jar lib/hbase-server-0.98.6-hadoop2.jar importtsv -Dimporttsv.columns=HBASE_ROW_KEY,info:name,info:age stu_info /tsv/testtsv

Use case: batch inserts

Method 2:

By default importtsv will load data directly into HBase. To instead generate HFiles of data to prepare for a bulk data load, pass the option:

-Dimporttsv.bulk.output=/path/for/output

Step 1: convert the data file into HFiles

${HADOOP_HOME}/bin/yarn jar lib/hbase-server-0.98.6-hadoop2.jar importtsv -Dimporttsv.bulk.output=/outputtsv2 -Dimporttsv.columns=HBASE_ROW_KEY,info:name,info:age test:test3 /tsv/testtsv

Step 2: load the HFiles into the table

Running completebulkload without arguments prints its usage:

${HADOOP_HOME}/bin/yarn jar lib/hbase-server-0.98.6-hadoop2.jar completebulkload

usage: completebulkload /path/to/hfileoutputformat-output tablename

${HADOOP_HOME}/bin/yarn jar lib/hbase-server-0.98.6-hadoop2.jar completebulkload /outputtsv2 stu_info

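17. HBase tuning

1. Pre-split tables

2. Column family and rowkey design

Keep column family names and rowkeys as short as possible

3. Automatic compaction is enabled (true) by default

Switch it to manual

4. split: manual

5. flush: manual

6. WAL: skip writing the HLog (faster, at the risk of losing data if a RegionServer fails)

flume + hbase

7. Enable the blockcache

8. Tune the thresholds for compact, split, flush, cache, and memstore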

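18. HBase architecture

The client connects to ZooKeeper -----> obtains the RegionServer location ------> the client then connects to the RegionServer (reads and writes)

(1) ZooKeeper

Guarantees that at any time the cluster has only one active HMaster (several may run, but only one is active, so there is no single point of failure)

Master: also connects to ZooKeeper (the /hbase/rs znode holds the locations of the RegionServers; the master needs to know which RegionServers are alive)

Stores the addressing entry point for all HRegions

Monitors HRegionServers coming online and going offline in real time and notifies the HMaster (the notification comes from ZooKeeper)

Stores HBase schema and table metadata (the /hbase/table znode)

The ZooKeeper quorum stores the location of meta and the address of the Master

(2) HMaster

The cluster has only one active HMaster (several may run, but only one is active, so there is no single point of failure)

Manages users' create, delete, read, and update operations on tables;

Manages load balancing across HRegionServers and adjusts the distribution of regions;

After a region split, assigns the new regions;

After an HRegionServer goes down, migrates the regions of the failed HRegionServer.

(3) HRegionServer

Maintains HRegions, handles I/O requests for them, and reads and writes data in the HDFS file system;

Splits HRegions that grow too large at runtime;

A client accesses data in HBase without involving the master (addressing goes through ZooKeeper and the HRegionServer; reads and writes go to the HRegionServer). The HMaster only maintains table and region metadata, so its load is very low.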

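19. HBase table properties and administration

-》memstore and blockcache

The BlockCache has three priority areas: single (1/4; rarely used blocks, evicted first), multi (1/2; blocks accessed several times and therefore more important, evicted next), and in_memory (1/4; kept in memory preferentially, for user tables marked in-memory)

Block caching can be controlled from code:

scan.setCacheBlocks(false)

-》compaction

->minor compaction

Merges several of the smaller, oldest storefiles into one larger storefile.

It does not remove data marked as deleted or data that has expired, and after one minor compaction there are still multiple storefiles.

->major compaction

Merges all storefiles into a single storefile. During the merge the system removes data carrying a delete marker and data that has expired.

It also blocks all client requests to the region being compacted until the merge finishes, and finally deletes the storefiles that were merged.

-》flush

Manually spills the in-memory data to disk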

flush 'tbname'

flush 'REGIONNAME'

-》compact

Compact all regions in a table:

hbase> compact 'ns1:t1'

hbase> compact 't1'

Compact an entire region:

hbase> compact 'r1'

Compact only a column family within a region:

hbase> compact 'r1', 'c1'

Compact a column family within a table:

hbase> compact 't1', 'c1'

-》split

split 'tableName'

split 'namespace:tableName'

split 'regionName' # format: 'tableName,startKey,id'

split 'tableName', 'splitKey'

split 'regionName', 'splitKey'

-》balancer

-》move

-》the hbase command hbck (run outside the shell)

bin/hbase hbck    checks the tables in the cluster

bin/hbase hbck --help

-fixMeta Try to fix meta problems. This assumes HDFS region info is good.

.regioninfo

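The actual deletion happens during compaction.

On a read, the data is placed in the BlockCache so that the next read can be served from it.

Minor compaction: rewrites many small storefiles into a smaller number of larger files (a multi-way merge, limited by disk I/O performance).

Major compaction: rewrites the HFiles of a region into a single HFile. A major compaction scans every key-value pair and rewrites all the data sequentially, skipping data that carries a delete marker. Predicate deletes take effect at this point. While the compaction runs, all client requests to the region are blocked until the merge finishes, and finally the merged storefiles are deleted. (If you keep writing to the region being compacted, it can fall into a vicious cycle of splitting and compacting.)

Three kinds of data are not written back to disk: 1. data beyond the version limit

2. data whose time-to-live has expired

3. data carrying a delete marker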
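The three drop rules can be sketched as a merge over several sorted files. This is a simplified illustration with invented names: a delete marker masks every version of the same key at or below its timestamp, timestamps double as wall-clock milliseconds for the TTL check, and everything is held in memory instead of being streamed.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy major compaction: merge store files and drop deleted, expired, and excess versions. */
final class CompactionSketch {
    static final class Cell {
        final String key;
        final long timestamp;
        final String value;
        final boolean deleteMarker;

        private Cell(String key, long timestamp, String value, boolean deleteMarker) {
            this.key = key;
            this.timestamp = timestamp;
            this.value = value;
            this.deleteMarker = deleteMarker;
        }

        static Cell put(String key, long timestamp, String value) {
            return new Cell(key, timestamp, value, false);
        }

        static Cell delete(String key, long timestamp) {
            return new Cell(key, timestamp, null, true);
        }
    }

    static List<Cell> majorCompact(List<List<Cell>> storeFiles,
                                   int maxVersions, long now, long ttlMillis) {
        List<Cell> all = new ArrayList<>();
        for (List<Cell> file : storeFiles) {
            all.addAll(file);
        }
        // Order by key, newest first; on a tie the delete marker sorts before the put.
        all.sort((a, b) -> {
            int c = a.key.compareTo(b.key);
            if (c == 0) {
                c = Long.compare(b.timestamp, a.timestamp);
            }
            if (c == 0) {
                c = Boolean.compare(b.deleteMarker, a.deleteMarker);
            }
            return c;
        });

        List<Cell> out = new ArrayList<>();
        String currentKey = null;
        int kept = 0;
        long deletedUpTo = Long.MIN_VALUE;
        for (Cell cell : all) {
            if (!cell.key.equals(currentKey)) {   // new key: reset the per-key state
                currentKey = cell.key;
                kept = 0;
                deletedUpTo = Long.MIN_VALUE;
            }
            if (cell.deleteMarker) {              // the marker itself is not rewritten
                deletedUpTo = Math.max(deletedUpTo, cell.timestamp);
                continue;
            }
            if (cell.timestamp <= deletedUpTo) {
                continue;                         // rule 3: masked by a delete marker
            }
            if (now - cell.timestamp > ttlMillis) {
                continue;                         // rule 2: time-to-live expired
            }
            if (kept >= maxVersions) {
                continue;                         // rule 1: beyond the version limit
            }
            out.add(cell);
            kept++;
        }
        return out;
    }
}
```

Until such a pass runs, deleted data is still physically present in the old files, which is why the actual deletion only happens at compaction time.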

0 0