HBase Explained

Source: Internet | Publisher: 河北云狐网络 | Editor: 程序博客网 | Time: 2024/06/05 10:27

Understanding the HBase Architecture Diagram

[Figure: 18.png, HBase architecture diagram]

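HMaster connects to ZooKeeper so that it knows which HRegionServers are alive and where each one is located; with that information it manages the HRegionServers.
Internally, HBase writes its data to HDFS through the DFS client.
Each HRegionServer hosts multiple HRegions, each HRegion contains multiple Stores, and each Store corresponds to one column family.
HFile is the storage format for KeyValue data in HBase; it is a binary file format on Hadoop. A StoreFile is a thin wrapper around an HFile and is what actually stores the data.
An HStore consists of a MemStore and StoreFiles.
The HLog records every change to the data and can be used for data recovery.
The corresponding directory structure on HDFS is
namespace -> table -> region -> column family -> HFile (columns and cells are stored inside the HFiles, not as directories)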

[Figure: 17.png, HDFS directory structure]
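Since HFiles hold sorted KeyValue data, it helps to see what "sorted" means for a cell. The following is a minimal sketch, not HBase's own classes: `ToyCell` and its fields are hypothetical names, and strings stand in for the raw byte arrays HBase really compares. It orders cells by row, column family, and column name ascending, then by timestamp descending, so the newest version of a cell comes first.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical toy cell; real HBase KeyValues are byte arrays and also carry a type byte. */
final class ToyCell {
    final String row;
    final String family;
    final String qualifier;
    final long timestamp;
    final String value;

    ToyCell(String row, String family, String qualifier, long timestamp, String value) {
        this.row = row;
        this.family = family;
        this.qualifier = qualifier;
        this.timestamp = timestamp;
        this.value = value;
    }

    /** Row, family, qualifier ascending; timestamp descending (newest version first). */
    static final Comparator<ToyCell> ORDER =
            Comparator.comparing((ToyCell c) -> c.row)
                    .thenComparing(c -> c.family)
                    .thenComparing(c -> c.qualifier)
                    .thenComparing((a, b) -> Long.compare(b.timestamp, a.timestamp));

    /** Returns a sorted copy and leaves the input list untouched. */
    static List<ToyCell> sorted(List<ToyCell> cells) {
        List<ToyCell> copy = new ArrayList<>(cells);
        copy.sort(ORDER);
        return copy;
    }

    @Override
    public String toString() {
        return row + "/" + family + ":" + qualifier + "/" + timestamp + "=" + value;
    }

    public static void main(String[] args) {
        List<ToyCell> cells = new ArrayList<>();
        cells.add(new ToyCell("10002", "info", "age", 1L, "22"));
        cells.add(new ToyCell("10001", "info", "age", 1L, "20"));
        cells.add(new ToyCell("10001", "info", "age", 5L, "21"));
        for (ToyCell c : sorted(cells)) {
            System.out.println(c);
        }
    }
}
```

Because the newest version sorts first, a reader that wants only the latest value of a column can stop at the first matching cell.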

Write Path

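ZooKeeper stores the location of the meta table's region, so the client first gets that location from ZooKeeper and then reads the meta table's data.
Using the namespace, table name, and rowkey, it looks up in the meta table the region the data should be written to.
It locates the RegionServer that hosts that region.
The data is written to both the HLog and the MemStore.
When the MemStore reaches a threshold, its data is flushed into a StoreFile. If data in the MemStore is lost, it can be recovered from the HLog.
Once enough StoreFiles have accumulated, a Compact operation is triggered and merges them into a single StoreFile; version merging and data deletion are carried out at the same time.
As compactions produce ever larger StoreFiles, a Split operation is eventually triggered: the current StoreFile is divided in two, which amounts to splitting one large region into two regions, as shown in the figure below: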

[Figure: 19.png, region split]
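The write steps above can be sketched as a toy model. This is only an illustration under simplifying assumptions, not how HBase is implemented: `ToyStore` and both thresholds are hypothetical, the thresholds count rows and files (HBase flushes by MemStore size), and versions, deletes, and splits are left out. A `get` method is included so the effect of flushing and compacting can be checked.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Hypothetical toy model of one Store: WAL append, MemStore, flush, compaction. */
final class ToyStore {
    private final int flushThreshold;   // assumption: flush when the MemStore holds this many rows
    private final int compactThreshold; // assumption: compact when this many StoreFiles exist
    final List<String> wal = new ArrayList<>();                          // stands in for the HLog
    private TreeMap<String, String> memStore = new TreeMap<>();          // sorted in-memory buffer
    final List<TreeMap<String, String>> storeFiles = new ArrayList<>();  // oldest first

    ToyStore(int flushThreshold, int compactThreshold) {
        this.flushThreshold = flushThreshold;
        this.compactThreshold = compactThreshold;
    }

    void put(String rowKey, String value) {
        wal.add(rowKey + "=" + value); // 1. log the change first so it could be replayed
        memStore.put(rowKey, value);   // 2. then update the MemStore
        if (memStore.size() >= flushThreshold) {
            flush();
        }
    }

    private void flush() {
        storeFiles.add(memStore);      // the sorted buffer becomes a new StoreFile
        memStore = new TreeMap<>();
        if (storeFiles.size() >= compactThreshold) {
            compact();
        }
    }

    private void compact() {
        TreeMap<String, String> merged = new TreeMap<>();
        for (TreeMap<String, String> file : storeFiles) {
            merged.putAll(file);       // files are oldest first, so newer values win
        }
        storeFiles.clear();
        storeFiles.add(merged);
    }

    /** Looks in the MemStore first, then in StoreFiles from newest to oldest. */
    String get(String rowKey) {
        if (memStore.containsKey(rowKey)) {
            return memStore.get(rowKey);
        }
        for (int i = storeFiles.size() - 1; i >= 0; i--) {
            String value = storeFiles.get(i).get(rowKey);
            if (value != null) {
                return value;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        ToyStore store = new ToyStore(2, 2);
        store.put("10001", "zhangsan");
        store.put("10002", "lisi");      // second row triggers a flush
        store.put("10001", "zhangsan2");
        store.put("10003", "wangwu");    // second flush, then a compaction
        System.out.println(store.get("10001") + " in " + store.storeFiles.size() + " StoreFile(s)");
        // prints "zhangsan2 in 1 StoreFile(s)"
    }
}
```

Writing to the log before the in-memory buffer is the design choice that makes recovery possible: anything held only in the MemStore can be rebuilt by replaying the log.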

Read Path

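ZooKeeper stores the location of the meta table's region, so the client first finds that location in ZooKeeper and then reads the meta table. The meta table in turn stores the region information of the user tables.
Using the namespace, table name, and rowkey, the client finds the matching region in the meta table.
It locates the RegionServer that hosts this region.
It finds the target region on that RegionServer.
It looks for the data in the MemStore first and reads from the StoreFiles only if the data is not there (this order is for read efficiency).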
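The region lookup in the steps above comes down to a "greatest start key not larger than the rowkey" search, because each region covers a contiguous rowkey range. Here is a minimal sketch of that idea for a single table; `ToyMeta`, the server names, and the use of strings instead of byte arrays are all assumptions made for illustration.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical stand-in for the meta entries of one table: region start key -> hosting server. */
final class ToyMeta {
    private final TreeMap<String, String> serverByStartKey = new TreeMap<>();

    void addRegion(String startKey, String server) {
        serverByStartKey.put(startKey, server);
    }

    /** The region holding a row is the one with the greatest start key that is <= the rowkey. */
    String locate(String rowKey) {
        Map.Entry<String, String> entry = serverByStartKey.floorEntry(rowKey);
        if (entry == null) {
            throw new IllegalStateException("no region covers rowkey " + rowKey);
        }
        return entry.getValue();
    }

    public static void main(String[] args) {
        ToyMeta meta = new ToyMeta();
        meta.addRegion("", "rs1");               // the first region starts at the empty key
        meta.addRegion("20161119_10002", "rs2"); // the second region starts at this rowkey
        System.out.println(meta.locate("20161119_10003")); // prints "rs2"
    }
}
```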

Basic Usage of the HBase Java API

package org.apache.hadoop.hbase;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.Filter;
import org.apache.hadoop.hbase.filter.PrefixFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class HbaseClientTest {

    /**
     * Get a table instance by table name.
     */
    public static HTable getTable(String name) throws Exception {
        // get the hbase conf instance
        Configuration conf = HBaseConfiguration.create();
        // get the hbase table instance
        HTable table = new HTable(conf, name);
        return table;
    }

    /**
     * Format one cell as: column family -> column name -> value -> timestamp
     */
    private static String format(Cell cell) {
        return Bytes.toString(CellUtil.cloneFamily(cell))
                + "->" + Bytes.toString(CellUtil.cloneQualifier(cell))
                + "->" + Bytes.toString(CellUtil.cloneValue(cell))
                + "->" + cell.getTimestamp();
    }

    /**
     * Get data from the hbase table.
     *
     * get 'tbname','rowkey','cf:col'
     */
    public static void getData(HTable table) throws Exception {
        Get get = new Get(Bytes.toBytes("20161119_10003"));
        // configure the get
        //get.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"));
        get.addFamily(Bytes.toBytes("info"));
        // run the get
        Result rs = table.get(get);
        // print the data
        for (Cell cell : rs.rawCells()) {
            System.out.println(format(cell));
            System.out.println("------------------------------");
        }
    }

    /**
     * Put data into the hbase table.
     *
     * put 'tbname','rowkey','cf:col','value'
     */
    public static void putData(HTable table) throws Exception {
        // get the put instance
        Put put = new Put(Bytes.toBytes("20161119_10003"));
        // configure the put
        put.add(
                Bytes.toBytes("info"),
                Bytes.toBytes("age"),
                Bytes.toBytes("20")
                );
        // run the put
        table.put(put);
        // print
        getData(table);
    }

    /**
     * Delete data from the hbase table.
     *
     * delete 'tbname','rowkey','cf:col'
     */
    public static void deleteData(HTable table) throws Exception {
        // get the delete instance
        Delete del = new Delete(Bytes.toBytes("20161119_10003"));
        // configure the delete: deleteColumn removes only the latest version,
        // deleteColumns removes all versions of the column
        //del.deleteColumn(Bytes.toBytes("info"),Bytes.toBytes("age"));
        del.deleteColumns(Bytes.toBytes("info"), Bytes.toBytes("age"));
        // run the delete
        table.delete(del);
        // print
        getData(table);
    }

    /**
     * Print every row returned by a scanner, then close the scanner.
     */
    private static void printRows(ResultScanner rsscan) {
        try {
            for (Result rs : rsscan) {
                System.out.println(Bytes.toString(rs.getRow()));
                for (Cell cell : rs.rawCells()) {
                    System.out.println(format(cell));
                }
                System.out.println("------------------------------");
            }
        } finally {
            rsscan.close();
        }
    }

    /**
     * Scan the whole table.
     *
     * scan 'tbname'
     */
    public static void scanData(HTable table) throws Exception {
        // get the scan instance
        Scan scan = new Scan();
        // run the scan
        printRows(table.getScanner(scan));
    }

    /**
     * Scan part of the table.
     *
     * scan 'tbname',{STARTROW => 'row1',STOPROW => 'row2'}
     */
    public static void rangeData(HTable table) throws Exception {
        // get the scan instance
        Scan scan = new Scan();
        // configure the scan
        //scan.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"));
        //scan.addFamily(family);
        //scan.setStartRow(Bytes.toBytes("20161119_10002"));
        //scan.setStopRow(Bytes.toBytes("20161119_10003"));
        Filter filter = new PrefixFilter(Bytes.toBytes("2016111"));
        scan.setFilter(filter);
        // tuning
        // whether blocks read by this scan are kept in the block cache
        scan.setCacheBlocks(true);
        // number of rows fetched per RPC
        scan.setCaching(100);
        // maximum number of cells returned in each Result
        scan.setBatch(10);
        // caching and batch together determine the number of RPC requests
        // run the scan
        printRows(table.getScanner(scan));
    }

    public static void main(String[] args) throws Exception {
        HTable table = getTable("test:tb1");
        try {
            getData(table);
            putData(table);
            deleteData(table);
            scanData(table);
            rangeData(table);
        } finally {
            table.close();
        }
    }
}
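The `rangeData` method selects rows with a `PrefixFilter` and leaves the start/stop row calls commented out. A prefix can also be expressed as a start/stop row pair, which lets the scan begin at the first matching row instead of relying on the filter alone. The sketch below, using only the JDK, shows one way to derive the stop row from a prefix; `PrefixRange` and `stopRowFor` are hypothetical helper names, not part of the HBase API.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical helper: turns a rowkey prefix into an exclusive stop row. */
final class PrefixRange {

    /**
     * Returns the smallest key that sorts after every key starting with the prefix
     * (unsigned byte order): drop trailing 0xFF bytes, then increment the last byte.
     * An empty result means there is no upper bound, i.e. scan to the end of the table.
     */
    static byte[] stopRowFor(byte[] prefix) {
        int end = prefix.length;
        while (end > 0 && prefix[end - 1] == (byte) 0xFF) {
            end--;
        }
        if (end == 0) {
            return new byte[0];
        }
        byte[] stop = Arrays.copyOf(prefix, end);
        stop[end - 1]++;
        return stop;
    }

    public static void main(String[] args) {
        byte[] prefix = "2016111".getBytes(StandardCharsets.US_ASCII);
        byte[] stop = stopRowFor(prefix);
        System.out.println(new String(stop, StandardCharsets.US_ASCII)); // prints "2016112"
    }
}
```

With the prefix from the example, the equivalent range would be start row `2016111` (inclusive) and stop row `2016112` (exclusive), which could be passed to the commented-out `setStartRow` and `setStopRow` calls.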

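Summary of the Modules in the HBase Architecture

Client
The entry point for accessing the whole HBase cluster;
Communicates with HMaster and HRegionServer through the HBase RPC mechanism;
Talks to HMaster for table-management operations;
Talks to HRegionServer for data read and write operations;
Contains the interfaces for accessing HBase and maintains a cache to speed up access to HBase.
ZooKeeper
Ensures that there is only one active HMaster in the cluster at any time;
Stores the addressing entry point for all HRegions;
Monitors HRegionServers coming online and going offline in real time and notifies HMaster;
Stores HBase schema and table metadata;
The ZooKeeper Quorum stores table addresses and the HMaster address.
HMaster
HMaster is not a single point of failure: several HMasters can be started in HBase, and ZooKeeper's Master Election mechanism ensures that one Master is always running. The Master is mainly responsible for managing Tables and Regions.
Manages user operations on tables such as creation and deletion;
Manages load balancing across HRegionServers and adjusts the distribution of Regions;
Assigns the new Regions after a Region Split;
Migrates the Regions of a failed HRegionServer after that server goes down.
HRegionServer
Maintains HRegions, handles the IO requests sent to them, and reads and writes data on the HDFS file system;
Splits HRegions that grow too large at runtime.
A Client does not need the master to access data in HBase (addressing goes through ZooKeeper and the HRegionServer, data reads and writes go to the HRegionServer). HMaster only maintains the metadata of tables and Regions, so its load is low.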
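To make the HMaster's region-migration duty concrete, here is a deliberately naive sketch of reassigning the regions of a failed server. It is not HBase's balancer or assignment logic: `ToyMaster`, its methods, and the "give each region to the server with the fewest regions" rule are assumptions chosen only to illustrate the idea.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical toy master that tracks which server hosts which regions. */
final class ToyMaster {
    /** Server name -> regions it currently hosts. */
    final Map<String, List<String>> assignments = new TreeMap<>();

    void addServer(String server) {
        assignments.putIfAbsent(server, new ArrayList<>());
    }

    void assign(String region, String server) {
        assignments.get(server).add(region);
    }

    /** Moves each region of the dead server to the live server hosting the fewest regions. */
    void handleServerDown(String deadServer) {
        List<String> orphaned = assignments.remove(deadServer);
        if (orphaned == null || orphaned.isEmpty()) {
            return;
        }
        if (assignments.isEmpty()) {
            throw new IllegalStateException("no live region servers left");
        }
        for (String region : orphaned) {
            String target = null;
            for (Map.Entry<String, List<String>> e : assignments.entrySet()) {
                if (target == null || e.getValue().size() < assignments.get(target).size()) {
                    target = e.getKey();
                }
            }
            assignments.get(target).add(region);
        }
    }

    public static void main(String[] args) {
        ToyMaster master = new ToyMaster();
        master.addServer("rs1");
        master.addServer("rs2");
        master.addServer("rs3");
        master.assign("region-a", "rs1");
        master.assign("region-b", "rs1");
        master.assign("region-c", "rs2");
        master.handleServerDown("rs1");
        System.out.println(master.assignments);
    }
}
```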

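Integrating HBase with MapReduce

Data in an HBase table can serve as the input of the MapReduce framework, and MapReduce results can be written to an HBase table.
We use the MapReduce programs bundled with HBase as an example.

Running them directly fails with missing-jar errors, so set the following environment variables first: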

$ export HBASE_HOME=/opt/modules/hbase-0.98.6-hadoop2
$ export HADOOP_HOME=/opt/modules/hadoop-2.5.0
# `$HBASE_HOME/bin/hbase mapredcp` lists the jars HBase needs to run on YARN
$ export HADOOP_CLASSPATH=`$HBASE_HOME/bin/hbase mapredcp`

Run the example

$ $HADOOP_HOME/bin/yarn jar lib/hbase-server-0.98.6-hadoop2.jar rowcounter  test:tb1

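Migrating Data into HBase with importtsv

Data for HBase usually comes from log files or an RDBMS and has to be migrated into HBase tables. There are three common approaches: (1) the HBase Put API; (2) HBase's bulk load tools; (3) a custom MapReduce job.
importtsv is the MapReduce-based bulk import tool shipped with HBase, and it is also available as an HBase command-line tool. With a single command it imports data files stored on HDFS, separated by a custom delimiter (\t by default), into HBase.
Test

Prepare the data file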

[wulei@bigdata-00 datas]$ cat tb1.tsv
10001   zhangsan        20
10002   lisi    22
10003   wangwu  30

Upload the data file to HDFS

$ bin/hdfs dfs -put /opt/datas/tb1.tsv /

Create the table in HBase
> create 'student','info'
Import the data from HDFS into the HBase table
```
$HADOOP_HOME/bin/yarn jar lib/hbase-server-0.98.6-hadoop2.jar importtsv  -Dimporttsv.separator=\t -Dimporttsv.columns=HBASE_ROW_KEY,info:name,info:age  student  /tb1.tsv
```
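-Dimporttsv.separator specifies the field separator
-Dimporttsv.columns specifies how each column of the data file maps to the rowkey and the columns of the table
/tb1.tsv is the path of the data file on HDFS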
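The column mapping can be illustrated with a small sketch that applies a column specification such as `HBASE_ROW_KEY,info:name,info:age` to one line of the data file. This is not importtsv's own parser: `TsvLineMapper`, `Row`, and the decision to throw on a malformed line are assumptions for illustration only.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

/** Hypothetical sketch of mapping one delimited line to a rowkey and its columns. */
final class TsvLineMapper {
    static final String ROW_KEY_MARKER = "HBASE_ROW_KEY";

    /** Result of mapping one line: the rowkey plus "family:qualifier" -> value pairs. */
    static final class Row {
        String rowKey;
        final Map<String, String> cells = new LinkedHashMap<>();
    }

    static Row map(String columnsSpec, String line, String separator) {
        String[] columns = columnsSpec.split(",");
        // Quote the separator so it is treated literally; -1 keeps trailing empty fields.
        String[] fields = line.split(Pattern.quote(separator), -1);
        if (fields.length != columns.length) {
            throw new IllegalArgumentException(
                    "expected " + columns.length + " fields but found " + fields.length);
        }
        Row row = new Row();
        for (int i = 0; i < columns.length; i++) {
            if (ROW_KEY_MARKER.equals(columns[i])) {
                row.rowKey = fields[i];              // this field becomes the rowkey
            } else {
                row.cells.put(columns[i], fields[i]); // this field becomes a cell value
            }
        }
        if (row.rowKey == null) {
            throw new IllegalArgumentException("column spec has no " + ROW_KEY_MARKER);
        }
        return row;
    }

    public static void main(String[] args) {
        Row row = map("HBASE_ROW_KEY,info:name,info:age", "10001\tzhangsan\t20", "\t");
        System.out.println(row.rowKey + " " + row.cells);
    }
}
```

The position of each name in the specification matters: the first field of a line goes to whatever the first entry names, the second to the second, and so on.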

Check the result

hbase(main):010:0> scan 'student'
ROW                       COLUMN+CELL
 10001                    column=info:age, timestamp=1480123167099, value=20
 10001                    column=info:name, timestamp=1480123167099, value=zhangsan
 10002                    column=info:age, timestamp=1480123167099, value=22
 10002                    column=info:name, timestamp=1480123167099, value=lisi
2 row(s) in 0.8210 seconds
