HBase Programming Guide

Source: Internet. Editor: 程序博客网. Time: 2024/06/04 19:55

@(HBASE)[hbase, big data]

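HBase Programming Guide
I. Overview
- (I) Creating the project
  - 1. pom.xml
  - 2. Notes on running in Eclipse
  - 3. About addResource
II. Best Practices
III. Common APIs
- (I) Creating the Configuration and Connection objects
- (II) Table management
  - 1. Creating a table
  - 2. Checking whether a table exists
  - 3. Deleting a table
- (III) Inserting data
  - 1. Inserting a single record
  - 2. Using a write buffer
- (IV) Reading data: a single record and a batch
  - 1. Iterating over the returned data
- (V) Appending data
- (VI) Scanning a table
- (VII) Changing the table schema
IV. Common exceptions
  - 1. java.io.IOException: No FileSystem for scheme: hdfs
  - 2. UnknownHostException
  - 3. NoSuchMethodError: addFamily
  - 4. SASL authentication failed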

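This article demonstrates how to create a table, delete a table and change a table's schema, as well as the put, get and scan operations.

The complete code is available at: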
https://github.com/lujinhong/hbasecommons

I. Overview

(I) Creating the project

1. pom.xml

Besides hbase-client, pom.xml also needs the Hadoop dependencies:

    <dependency>
        <groupId>org.apache.hbase</groupId>
        <artifactId>hbase-client</artifactId>
        <version>1.0.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-hdfs</artifactId>
        <version>2.5.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>2.5.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>2.5.0</version>
    </dependency>

2. Notes on running in Eclipse

Add the Hadoop and HBase configuration files (such as core-site.xml and hbase-site.xml) to the classpath.

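3. About addResource

(1) In the usual case, the HBase configuration file is loaded like this:

    Configuration configuration = HBaseConfiguration.create();
    // hbaseSiteXml holds the file system path of hbase-site.xml
    configuration.addResource(new Path(hbaseSiteXml));
    Connection connection = ConnectionFactory.createConnection(configuration);

(2) If no resource is added to the Configuration object, only the hbase-site.xml found on the classpath is loaded.
(3) If a String is passed, it is treated as the name of a classpath resource, so this hbase-site.xml must be on the classpath:

    configuration.addResource("hbase-site.xml");

(4) To load an hbase-site.xml that is not on the classpath, pass a Path object:

    configuration.addResource(new Path("/home/hadoop/hbase/hbase-site.xml"));

Note that this Path is Hadoop's org.apache.hadoop.fs.Path, not Java's java.nio.file.Path.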

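II. Best Practices

A Connection is very heavy but thread-safe. In general one connection per application (strictly speaking, per JVM) is enough. If several really are needed, consider a pool, but that is rarely necessary.
Table, Admin and Scanner are all lightweight but not thread-safe.
Always remember to close all of these interfaces, but wrap the connection properly so that it is not closed and recreated frequently. They are all AutoCloseable, so the try-with-resources syntax can be used.
Use BufferedMutator for streaming / batch Puts. BufferedMutator replaces HTable.setAutoFlush(false) and is the supported high-performance streaming writes API.
If the cluster runs CDH, compile against the CDH jars rather than the Apache ones. CDH made some changes for compatibility; see the last section for details.
Specify a compression algorithm when creating a table. It costs a little CPU time, but that is negligible compared with the time saved reading data from disk. For example, in the HBase shell: create 'ljhtest2',{NAME => 'f1', COMPRESSION => 'SNAPPY'}
Always pre-split tables.
For workloads such as MapReduce that have loose latency requirements, or that read very large amounts of data, call setCacheBlocks(false) on the Scan or Get.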
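
The advice about closing can be illustrated without HBase at all. Try-with-resources closes every declared resource even when the block throws, and it closes them in reverse order of declaration, which is the order wanted for a scanner opened from a table. The sketch below only demonstrates that language behavior; `FakeResource` is a hypothetical stand-in for Table and ResultScanner, not an HBase class.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for an HBase Table or ResultScanner; records when it is closed. */
final class FakeResource implements AutoCloseable {
    private final String name;
    private final List<String> closeLog;

    FakeResource(String name, List<String> closeLog) {
        this.name = name;
        this.closeLog = closeLog;
    }

    @Override
    public void close() {
        closeLog.add(name);
    }
}

final class TryWithResourcesDemo {
    /** Opens two resources, fails inside the block, and returns the order in which they were closed. */
    static List<String> closeOrderOnFailure() {
        List<String> closeLog = new ArrayList<>();
        try (FakeResource table = new FakeResource("table", closeLog);
             FakeResource scanner = new FakeResource("scanner", closeLog)) {
            throw new IllegalStateException("simulated failure while scanning");
        } catch (IllegalStateException expected) {
            // Both resources are already closed by the time the catch block runs.
        }
        return closeLog;
    }
}
```

Calling `TryWithResourcesDemo.closeOrderOnFailure()` returns the list `[scanner, table]`: the resource declared last is closed first, and neither is leaked by the exception.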

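III. Common APIs

(I) Creating the Configuration and Connection objects

To connect to HBase from a client, first create a Connection object. The connection is then used to obtain Table, Admin, Scanner and other objects for the actual operations.

    Configuration config = HBaseConfiguration.create();
    Connection connection = ConnectionFactory.createConnection(config);

The basic approach shown above is:
(1) Create a Configuration object, which looks for the Hadoop and HBase configuration files on the classpath.
(2) Create the Connection object.

As noted above, creating a connection is a heavy operation and should be done sparingly. It is best to wrap it in a getConnection() method instead of creating it directly, and to consider the singleton pattern, for example: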

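    // Fields used by the methods below
    private static HBaseHelper helper = null;
    private static Connection connection = null;
    private Configuration configuration;
    private Admin admin;

    private HBaseHelper(Configuration conf) throws IOException {
        configuration = conf;
        connection = ConnectionFactory.createConnection(configuration);
        this.admin = connection.getAdmin();
    }

    /*
     * Entry point for obtaining the HBaseHelper. The caller supplies a Configuration,
     * which mainly points at hbase-site.xml and core-site.xml.
     * A singleton ensures that only one helper is created, because every connection is
     * expensive to create. If several connections are needed, use a pool.
     * The method is synchronized so that two threads cannot both create a helper.
     */
    public static synchronized HBaseHelper getHelper(Configuration configuration) throws IOException {
        if (helper == null) {
            helper = new HBaseHelper(configuration);
        }
        return helper;
    }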

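The connection is then obtained through getConnection(), and the configuration is exposed in the same way:

    public Connection getConnection() {
        return connection;
    }

    public Configuration getConfiguration() {
        return configuration;
    }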

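Because HBaseHelper is a singleton, there is only one copy of each of its member variables.
The helper also provides a close() method for closing the Connection. Closing is kept in one place because, if callers closed the connection at will, it would have to be recreated frequently:

    @Override
    public void close() throws IOException {
        admin.close();
        connection.close();
    }

This method should be called only when the whole application shuts down, for example from a framework's cleanUp() method. As long as the application is still running, it should generally not be called.

HBaseHelper is made a singleton here because it implements quite a lot of functionality. Making only the Connection a singleton also works, and the code is simpler: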

    private static Connection connection = null;

    // synchronized so that concurrent callers cannot create two connections
    public static synchronized Connection getConnection(Configuration config) throws IOException {
        if (connection == null) {
            connection = ConnectionFactory.createConnection(config);
        }
        return connection;
    }
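
The "create once, share everywhere" idea does not depend on HBase, so it can be sketched and exercised with the JDK alone. The class below is a minimal illustration, not part of the original helper: `LazyShared` is a hypothetical name, and its `Supplier` stands in for a call such as `ConnectionFactory.createConnection(config)`. It uses double-checked locking so that the factory runs at most once even when many threads ask for the instance at the same time.

```java
import java.util.function.Supplier;

/**
 * Library-free sketch of a thread-safe lazy singleton holder.
 * The Supplier is a hypothetical stand-in for ConnectionFactory.createConnection(config).
 */
final class LazyShared<T> {
    private final Supplier<T> factory;
    private volatile T instance; // volatile so other threads see the fully built object

    LazyShared(Supplier<T> factory) {
        this.factory = factory;
    }

    /** Returns the shared instance, creating it on first use (double-checked locking). */
    T get() {
        T local = instance;
        if (local == null) {
            synchronized (this) {
                local = instance;
                if (local == null) {
                    local = factory.get();
                    instance = local;
                }
            }
        }
        return local;
    }
}
```

Once the instance exists, callers return it without taking the lock, which matters only if `get()` is on a hot path. A plain synchronized method is simpler and usually sufficient.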

(II) Table management

1. Creating a table

The full method for creating a table is:

    public void createTable(TableName table, int maxVersions, byte[][] splitKeys, String... colfams)
            throws IOException {
        HTableDescriptor desc = new HTableDescriptor(table);
        for (String cf : colfams) {
            HColumnDescriptor coldef = new HColumnDescriptor(cf);
            coldef.setMaxVersions(maxVersions);
            desc.addFamily(coldef);
        }
        if (splitKeys != null) {
            // Pre-split the table at the given keys
            admin.createTable(desc, splitKeys);
        } else {
            admin.createTable(desc);
        }
    }

The parameters are, in order: the table name, the maximum number of versions to keep, the keys used for pre-splitting, and the column family names.

To create a pre-split table, pass split keys of the following form:

    byte[][] splits = new byte[][] {
        Bytes.toBytes("row2000id"),
        Bytes.toBytes("row4000id"),
        Bytes.toBytes("row6000id"),
        Bytes.toBytes("row8000id")
    };
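
Split keys like these are often generated instead of typed by hand. The following is a minimal sketch that uses only the JDK; `SplitKeys` and `evenSplits` are hypothetical names, and the row key layout (prefix, number, suffix) is assumed from the example above. It relies on every generated number having the same digit count, because HBase orders row keys byte by byte, not numerically.

```java
import java.nio.charset.StandardCharsets;

final class SplitKeys {
    /**
     * Builds (regions - 1) evenly spaced split keys of the form prefix + number + suffix.
     * Assumes every generated number has the same digit count, so that the byte-wise
     * (lexicographic) order HBase uses matches the numeric order.
     */
    static byte[][] evenSplits(String prefix, String suffix, int maxId, int regions) {
        if (regions < 2 || maxId < regions) {
            throw new IllegalArgumentException("need regions >= 2 and maxId >= regions");
        }
        int step = maxId / regions;
        byte[][] splits = new byte[regions - 1][];
        for (int i = 1; i < regions; i++) {
            String key = prefix + (i * step) + suffix;
            // Bytes.toBytes(String) in HBase also encodes as UTF-8.
            splits[i - 1] = key.getBytes(StandardCharsets.UTF_8);
        }
        return splits;
    }
}
```

For example, `SplitKeys.evenSplits("row", "id", 10000, 5)` yields the four keys row2000id, row4000id, row6000id and row8000id, which can be passed as the splitKeys argument of createTable.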

The common call patterns should also be wrapped as overloads:

    public void createTable(String table, String... colfams) throws IOException {
        createTable(TableName.valueOf(table), 1, null, colfams);
    }

    public void createTable(TableName table, String... colfams) throws IOException {
        createTable(table, 1, null, colfams);
    }

    public void createTable(String table, int maxVersions, String... colfams) throws IOException {
        createTable(TableName.valueOf(table), maxVersions, null, colfams);
    }

    public void createTable(TableName table, int maxVersions, String... colfams) throws IOException {
        createTable(table, maxVersions, null, colfams);
    }

    public void createTable(String table, byte[][] splitKeys, String... colfams) throws IOException {
        createTable(TableName.valueOf(table), 1, splitKeys, colfams);
    }

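The key steps are:
(1) Obtain an Admin object for managing tables. It is already created in HBaseHelper, so it is not created again here.
(2) Create an HTableDescriptor, which describes a table that does not exist yet (compare this with the Table class used later). Many other properties, such as the compression format and file size, can also be set on this object.
(3) Check whether the table already exists; if it does, disable it and then delete it. The createTable method above leaves this check to the caller.
(4) Create the table.

Admin and HTableDescriptor objects are both lightweight and can be created whenever they are needed.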

2. Checking whether a table exists

    public boolean existsTable(String table) throws IOException {
        return existsTable(TableName.valueOf(table));
    }

    public boolean existsTable(TableName table) throws IOException {
        return admin.tableExists(table);
    }

The code above simply calls the tableExists() method of the HBase API, but it avoids creating a new Admin object on every call.

3. Deleting a table

    public void disableTable(String table) throws IOException {
        disableTable(TableName.valueOf(table));
    }

    public void disableTable(TableName table) throws IOException {
        admin.disableTable(table);
    }

    public void dropTable(String table) throws IOException {
        dropTable(TableName.valueOf(table));
    }

    public void dropTable(TableName table) throws IOException {
        if (existsTable(table)) {
            // A table must be disabled before it can be deleted
            if (admin.isTableEnabled(table)) {
                disableTable(table);
            }
            admin.deleteTable(table);
        }
    }

(III) Inserting data

1. Inserting a single record

The following methods cover the common ways of doing a put. The last one is rarely used in practice.

    public void put(String table, String row, String fam, String qual, String val) throws IOException {
        put(TableName.valueOf(table), row, fam, qual, val);
    }

    public void put(TableName table, String row, String fam, String qual, String val) throws IOException {
        // try-with-resources closes the Table even if put() throws
        try (Table tbl = connection.getTable(table)) {
            Put put = new Put(Bytes.toBytes(row));
            put.addColumn(Bytes.toBytes(fam), Bytes.toBytes(qual), Bytes.toBytes(val));
            tbl.put(put);
        }
    }

    public void put(String table, String row, String fam, String qual, long ts, String val) throws IOException {
        put(TableName.valueOf(table), row, fam, qual, ts, val);
    }

    public void put(TableName table, String row, String fam, String qual, long ts, String val) throws IOException {
        try (Table tbl = connection.getTable(table)) {
            Put put = new Put(Bytes.toBytes(row));
            put.addColumn(Bytes.toBytes(fam), Bytes.toBytes(qual), ts, Bytes.toBytes(val));
            tbl.put(put);
        }
    }

    public void put(String table, String[] rows, String[] fams, String[] quals, long[] ts, String[] vals)
            throws IOException {
        put(TableName.valueOf(table), rows, fams, quals, ts, vals);
    }

    public void put(TableName table, String[] rows, String[] fams, String[] quals, long[] ts, String[] vals)
            throws IOException {
        try (Table tbl = connection.getTable(table)) {
            for (String row : rows) {
                Put put = new Put(Bytes.toBytes(row));
                for (String fam : fams) {
                    int v = 0;
                    for (String qual : quals) {
                        // Reuse the last value and timestamp when there are more qualifiers than values
                        String val = vals[v < vals.length ? v : vals.length - 1];
                        long t = ts[v < ts.length ? v : ts.length - 1];
                        System.out.println("Adding: " + row + " " + fam + " " + qual + " " + t + " " + val);
                        put.addColumn(Bytes.toBytes(fam), Bytes.toBytes(qual), t, Bytes.toBytes(val));
                        v++;
                    }
                }
                tbl.put(put);
            }
        }
    }

Every put here creates a Table object and then closes it. Although the object is lightweight, creating and destroying it repeatedly inside a loop still has a noticeable cost. In that situation, reuse the Table object or use the write buffering described next.

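2. Using a write buffer

Since HBase 1.0.0, buffering is handled by BufferedMutator. Mutations are first held on the client and are sent to HBase through RPC requests only when the buffer is full or flush() is called explicitly.

    /*
     * Puts a series of values into fam:qual of the table. rows and vals define the data
     * to write and must have the same length.
     */
    public void put(TableName table, String[] rows, String fam, String qual, String[] vals) throws IOException {
        if (rows.length != vals.length) {
            LOG.error("rows.length {} is not equal to vals.length {}", rows.length, vals.length);
            throw new IllegalArgumentException("rows and vals must have the same length");
        }
        try (BufferedMutator mutator = connection.getBufferedMutator(table)) {
            for (int i = 0; i < rows.length; i++) {
                Put p = new Put(Bytes.toBytes(rows[i]));
                p.addColumn(Bytes.toBytes(fam), Bytes.toBytes(qual), Bytes.toBytes(vals[i]));
                mutator.mutate(p);
                // System.out.println(mutator.getWriteBufferSize());
            }
            mutator.flush();
        }
    }

    public void put(String table, String[] rows, String fam, String qual, String[] vals) throws IOException {
        put(TableName.valueOf(table), rows, fam, qual, vals);
    }

The commented-out println prints the buffer size. The default is 2 MB, controlled by the hbase.client.write.buffer parameter. The current value can be read with:

    mutator.getWriteBufferSize()

How can the buffer size be set? Either change hbase.client.write.buffer in the client configuration, or pass a BufferedMutatorParams with writeBufferSize(...) set to connection.getBufferedMutator(...).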
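
The buffering behavior described in this section (hold mutations on the client, send them when the buffer fills up or on an explicit flush) can be pictured with a toy model. The class below is only an illustration of that idea and is not how HBase implements BufferedMutator: `ToyWriteBuffer` is a hypothetical name, payloads are plain byte arrays, and the `Consumer` stands in for the RPC that ships a batch to the server.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/**
 * Toy model of client-side write buffering; NOT the HBase implementation.
 * The Consumer is a hypothetical stand-in for the RPC that ships a batch to the server.
 */
final class ToyWriteBuffer {
    private final long maxBytes;
    private final Consumer<List<byte[]>> sender;
    private final List<byte[]> pending = new ArrayList<>();
    private long pendingBytes = 0;

    ToyWriteBuffer(long maxBytes, Consumer<List<byte[]>> sender) {
        this.maxBytes = maxBytes;
        this.sender = sender;
    }

    /** Buffers one payload and flushes automatically once the buffer exceeds maxBytes. */
    void mutate(byte[] payload) {
        pending.add(payload);
        pendingBytes += payload.length;
        if (pendingBytes > maxBytes) {
            flush();
        }
    }

    /** Sends everything buffered so far as one batch; does nothing if the buffer is empty. */
    void flush() {
        if (pending.isEmpty()) {
            return;
        }
        sender.accept(new ArrayList<>(pending));
        pending.clear();
        pendingBytes = 0;
    }
}
```

With a 10-byte limit and 4-byte payloads, the first two calls to `mutate` send nothing, the third triggers one batch of three payloads, and anything added afterwards stays on the client until `flush()` is called. That is why the put method above ends with an explicit flush.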

(IV) Reading data: a single record and a batch

    /*
     * Gets the value of column fam:qual for the given row (or rows) of the table.
     */
    public Result get(String table, String row, String fam, String qual) throws IOException {
        return get(TableName.valueOf(table), new String[]{row}, new String[]{fam}, new String[]{qual})[0];
    }

    public Result get(TableName table, String row, String fam, String qual) throws IOException {
        return get(table, new String[]{row}, new String[]{fam}, new String[]{qual})[0];
    }

    public Result[] get(TableName table, String[] rows, String fam, String qual) throws IOException {
        return get(table, rows, new String[]{fam}, new String[]{qual});
    }

    public Result[] get(String table, String[] rows, String fam, String qual) throws IOException {
        return get(TableName.valueOf(table), rows, new String[]{fam}, new String[]{qual});
    }

    public Result[] get(String table, String[] rows, String[] fams, String[] quals) throws IOException {
        return get(TableName.valueOf(table), rows, fams, quals);
    }

    /*
     * Gets, for every row in rows, all columns defined by fams and quals.
     */
    public Result[] get(TableName table, String[] rows, String[] fams, String[] quals) throws IOException {
        List<Get> gets = new ArrayList<Get>();
        for (String row : rows) {
            Get get = new Get(Bytes.toBytes(row));
            get.setMaxVersions(); // return all versions of each cell
            if (fams != null && quals != null) {
                for (String fam : fams) {
                    for (String qual : quals) {
                        get.addColumn(Bytes.toBytes(fam), Bytes.toBytes(qual));
                    }
                }
            }
            gets.add(get);
        }
        // try-with-resources closes the Table even if get() throws
        try (Table tbl = connection.getTable(table)) {
            return tbl.get(gets);
        }
    }

1. Iterating over the returned data

    for (Result result : results) {
        for (Cell cell : result.rawCells()) {
            System.out.println("Cell: " + cell + ", Value: "
                    + Bytes.toString(cell.getValueArray(), cell.getValueOffset(), cell.getValueLength()));
        }
    }

Calling result.toString() directly returns only the first part, that is the cell description, without the value.
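
The three-argument decode in the loop above matters because the array returned by getValueArray() is the cell's whole backing array, of which the value is only a slice. The JDK-only sketch below shows the same slice decoding. `CellValueDemo` is a hypothetical name, and the backing layout used in the usage note (row, family, qualifier and value simply concatenated) is a simplification for illustration, not the real cell format.

```java
import java.nio.charset.StandardCharsets;

final class CellValueDemo {
    /** Decodes only the [offset, offset + length) slice, like Bytes.toString(array, offset, length). */
    static String decodeSlice(byte[] backingArray, int offset, int length) {
        return new String(backingArray, offset, length, StandardCharsets.UTF_8);
    }
}
```

If the backing array holds the UTF-8 bytes of "row1cfqualvalue1", then `decodeSlice(backing, 10, 6)` returns "value1", whereas decoding the whole array would return the row key, family and qualifier as well.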

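(V) Appending data

An append can either add a new column to a row or append new content to the existing value of a column.

    // table is an open Table and lineCount the number of appends; both come from the surrounding code
    List<Append> appends = new ArrayList<Append>();
    for (int i = 0; i < lineCount; i++) {
        String rowid = "1";
        Append append = new Append(Bytes.toBytes(rowid));
        append.add(Bytes.toBytes("cf"), Bytes.toBytes("qual_"), Bytes.toBytes("test" + i));
        // table.append(append);  // alternative: send each append individually
        appends.add(append);
        System.out.println("appending: " + i);
    }
    // batch(List, Object[]) stores one result per action in the array
    table.batch(appends, new Object[appends.size()]);
    table.close();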

(VI) Scanning a table

Print the whole table:

    public void dump(String table) throws IOException {
        dump(TableName.valueOf(table));
    }

    public void dump(TableName table) throws IOException {
        try (Table t = connection.getTable(table);
             ResultScanner scanner = t.getScanner(new Scan())) {
            for (Result result : scanner) {
                dumpResult(result);
            }
        }
    }

    public void dumpResult(Result result) {
        for (Cell cell : result.rawCells()) {
            System.out.println("Cell: " + cell + ", Value: "
                    + Bytes.toString(cell.getValueArray(), cell.getValueOffset(), cell.getValueLength()));
        }
    }

Another common way to use a scanner:

    Scan scan = new Scan();
    scan.addFamily(Bytes.toBytes(family));
    // Return only rows whose key starts with rowkeyPrefix
    Filter filter = new PrefixFilter(Bytes.toBytes(rowkeyPrefix));
    scan.setFilter(filter);
    ResultScanner scanner = table.getScanner(scan);
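
A prefix filter decides which rows to return, but on its own it does not tell the scan where to start. A common companion technique, not shown in the original code, is to bound the scan as well: use the prefix as the start row and the smallest key beyond all keys with that prefix as the exclusive stop row (in the 1.0 client, for instance, through Scan's setStartRow and setStopRow). Computing that stop row is pure byte arithmetic, sketched here with the JDK only. `PrefixRange` and `stopRowForPrefix` are hypothetical names.

```java
import java.util.Arrays;

final class PrefixRange {
    /**
     * Returns the smallest row key that is greater than every key starting with prefix,
     * usable as an exclusive stop row. An empty array means "scan to the end of the table".
     */
    static byte[] stopRowForPrefix(byte[] prefix) {
        // Find the last byte that can be incremented without overflowing (i.e. is not 0xFF).
        int i = prefix.length - 1;
        while (i >= 0 && prefix[i] == (byte) 0xFF) {
            i--;
        }
        if (i < 0) {
            return new byte[0]; // prefix is empty or all 0xFF: no upper bound exists
        }
        byte[] stop = Arrays.copyOf(prefix, i + 1); // drop trailing 0xFF bytes
        stop[i]++;
        return stop;
    }
}
```

For the prefix "row1" the stop row is "row2", so a scan over [row1, row2) touches only the matching key range.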

(VII) Changing the table schema

    // This example is known to have problems, and changing the table schema
    // from application code is generally discouraged.
    public static void modifySchema(Connection connection) throws IOException {
        try (Admin admin = connection.getAdmin()) {
            TableName tableName = TableName.valueOf(TABLE_NAME);
            if (!admin.tableExists(tableName)) {
                System.out.println("Table does not exist.");
                System.exit(-1);
            }

            // Add a new column family to the existing table
            HColumnDescriptor newColumn = new HColumnDescriptor("NEWCF");
            newColumn.setCompactionCompressionType(Algorithm.GZ);
            newColumn.setMaxVersions(HConstants.ALL_VERSIONS);
            admin.addColumn(tableName, newColumn);

            // Update an existing column family. Start from the table's current descriptor:
            // a fresh HTableDescriptor would describe a table with no families at all.
            HTableDescriptor table = admin.getTableDescriptor(tableName);
            HColumnDescriptor existingColumn = new HColumnDescriptor(FAMILY);
            existingColumn.setCompactionCompressionType(Algorithm.GZ);
            existingColumn.setMaxVersions(HConstants.ALL_VERSIONS);
            table.modifyFamily(existingColumn);
            admin.modifyTable(tableName, table);

            // Disable the table
            admin.disableTable(tableName);

            // Delete an existing column family
            admin.deleteColumn(tableName, FAMILY.getBytes("UTF-8"));

            // Delete the table (it must be disabled first)
            admin.deleteTable(tableName);
        }
    }

IV. Common exceptions

1. java.io.IOException: No FileSystem for scheme: hdfs

Solution: add the Hadoop jars to the classpath.

2. UnknownHostException

    Caused by: java.net.UnknownHostException: logstreaming

Here logstreaming is the HDFS cluster URL, not a host name, so the error means that the Hadoop configuration was not loaded correctly. Solution:

    export HADOOP_CONF_DIR=/home/hadoop/conf_loghbase

3. NoSuchMethodError: addFamily

    Exception in thread "main" java.lang.NoSuchMethodError: org.apache.hadoop.hbase.HTableDescriptor.addFamily(Lorg/apache/hadoop/hbase/HColumnDescriptor;)Lorg/apache/hadoop/hbase/HTableDescriptor;
        at co.cask.hbasetest.HBaseTest.createTable(HBaseTest.java:34)
        at co.cask.hbasetest.HBaseTest.doMain(HBaseTest.java:49)
        at co.cask.hbasetest.HBaseTest.main(HBaseTest.java:67)

Since 1.0.0, Apache HBase has changed the return type of addFamily from void to HTableDescriptor, but CDH did not. Compiling against one and running on the other therefore produces the error above.
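
The reason a changed return type is enough to break linking is that the JVM looks a method up by its name plus its full descriptor, and the descriptor includes the return type. The JDK can print descriptors directly, as in this small sketch. `DescriptorDemo` is a hypothetical name, and `String` and `StringBuilder` in the usage note are arbitrary stand-ins for HColumnDescriptor and HTableDescriptor.

```java
import java.lang.invoke.MethodType;

final class DescriptorDemo {
    /** JVM descriptor of a method taking one argument of type param and returning type ret. */
    static String descriptor(Class<?> ret, Class<?> param) {
        return MethodType.methodType(ret, param).toMethodDescriptorString();
    }
}
```

`descriptor(void.class, String.class)` gives `(Ljava/lang/String;)V`, while `descriptor(StringBuilder.class, String.class)` gives `(Ljava/lang/String;)Ljava/lang/StringBuilder;`. The two differ only after the closing parenthesis, just as the descriptor in the stack trace above ends in the HTableDescriptor return type that the runtime class does not provide.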
Solution:
Compile against the same distribution and version that is used at runtime.
If CDH is used, add the Cloudera repository:

    <repositories>
        <repository>
            <id>cloudera</id>
            <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
        </repository>
    </repositories>

Then specify the CDH version:

    <dependency>
        <groupId>org.apache.hbase</groupId>
        <artifactId>hbase-client</artifactId>
        <version>1.0.0-cdh5.4.5</version>
    </dependency>

4. SASL authentication failed.

The error shown at the end of this section appears and suggests that kinit has not been run, even though it has. One cause is the way the job is launched. It must be run with

    java -cp `hbase classpath`:yourjar.jar Main

and not with

    java -cp `/home/hadoop/hbase/lib`:yourjar.jar Main

The output of hbase classpath includes the HBase configuration directory in addition to the jars, which the second command leaves out.

The error looks like this:

    Caused by: java.io.IOException: Could not set up IO Streams to gdc-dn152-formal.i.nease.net/10.160.254.123:60020
        at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.setupIOstreams(RpcClientImpl.java:773)
        at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.writeRequest(RpcClientImpl.java:890)
        at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.tracedWriteRequest(RpcClientImpl.java:859)
        at org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1193)
        at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:216)
        at org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:300)
        at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$BlockingStub.get(ClientProtos.java:32627)
        at org.apache.hadoop.hbase.protobuf.ProtobufUtil.getRowOrBefore(ProtobufUtil.java:1583)
        at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegionInMeta(ConnectionManager.java:1293)
        at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1125)
        at org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.getRegionLocations(RpcRetryingCallerWithReadReplicas.java:299)
        ... 9 more
    Caused by: java.lang.RuntimeException: SASL authentication failed. The most likely cause is missing or invalid credentials. Consider 'kinit'.
        at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection$1.run(RpcClientImpl.java:673)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1707)
        at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.handleSaslConnectionFailure(RpcClientImpl.java:631)
        at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.setupIOstreams(RpcClientImpl.java:739)
        ... 19 more
    Caused by: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
        at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:212)
        at org.apache.hadoop.hbase.security.HBaseSaslRpcClient.saslConnect(HBaseSaslRpcClient.java:179)
        at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.setupSaslConnection(RpcClientImpl.java:605)
        at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.access$600(RpcClientImpl.java:154)
        at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection$2.run(RpcClientImpl.java:731)
        at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection$2.run(RpcClientImpl.java:728)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1707)
        at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.setupIOstreams(RpcClientImpl.java:728)
        ... 19 more
