Basic Usage of Lily HBase Indexer in CDH

Source: Internet | Published by: 创新发展知乎 | Editor: 程序博客网 | Time: 2024/05/17 00:03

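1. Introduction

The Key-Value Store Indexer service in CDH is built on Lily HBase Indexer. Lily HBase Indexer is a flexible, scalable, fault-tolerant tool that indexes HBase column data in near real time. It is part of the Lily system developed by NGDATA; it is open source, and the source code is hosted on GitHub. Lily HBase Indexer relies on HBase replication: when a write, update, or delete happens in HBase, the indexer receives the event and applies the corresponding change to Solr. The index data is stored in SolrCloud. HBase Indexer supports user-defined extraction and transformation rules for indexing HBase column data. Solr search results contain the fields that the user mapped from columnfamily:qualifier columns, so applications can retrieve HBase column data directly through Solr. Because indexing and search run separately from HBase and asynchronously, they do not affect the stability of HBase or its write throughput.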
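The asynchronous flow described above can be pictured with a small conceptual model. This is only a sketch of the idea, not how Lily HBase Indexer is implemented: the Table and Indexer classes, the in-memory log, and the drain() method are all hypothetical names invented for illustration.

```python
from collections import deque


class Table:
    """Toy key-value table that records every mutation in a log (hypothetical model)."""

    def __init__(self):
        self.rows = {}
        self.log = deque()  # mutation events waiting to be shipped to the indexer

    def put(self, row, column, value):
        # The write completes immediately; indexing happens later.
        self.rows.setdefault(row, {})[column] = value
        self.log.append(("put", row, column, value))

    def delete(self, row):
        self.rows.pop(row, None)
        self.log.append(("delete", row, None, None))


class Indexer:
    """Toy indexer that consumes the log later and updates a search index."""

    def __init__(self, table, mappings):
        self.table = table
        self.mappings = mappings  # column name -> index field name
        self.index = {}           # row key -> indexed document

    def drain(self):
        # Apply all pending events in order; the writer never waits for this.
        while self.table.log:
            op, row, column, value = self.table.log.popleft()
            if op == "delete":
                self.index.pop(row, None)
            elif column in self.mappings:
                doc = self.index.setdefault(row, {"id": row})
                doc[self.mappings[column]] = value


table = Table()
indexer = Indexer(table, {"data:name": "name"})
table.put("row1", "data:name", "alice")  # index is still empty at this point
indexer.drain()                          # index now holds the document for row1
```

The point of the model is that put and delete return before the index is touched, which is why indexing does not slow down writes, and why search results lag slightly behind the table.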

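2. Usage

CDH 5.4 already integrates the Lily HBase Indexer service. After adding the Key-Value Store Indexer service in the Cloudera Manager admin UI, you can start testing the hbase-indexer features.

In CDH 5.4.2, the Key-Value Store Indexer is backed by Lily HBase Indexer. To run in CDH 5, Lily HBase Indexer requires the HBase, SolrCloud, and ZooKeeper services.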

2.1 Enable replication on HBase column families

For an existing HBase table, set REPLICATION_SCOPE to 1 on each column family that needs to be indexed:

$ hbase shell
hbase shell> disable 'record'
hbase shell> alter 'record', {NAME => 'data', REPLICATION_SCOPE => 1}
hbase shell> enable 'record'

For a new table, set REPLICATION_SCOPE to 1 on the column families to be indexed when creating it:

$ hbase shell
hbase shell> create 'record', {NAME => 'data', REPLICATION_SCOPE => 1}

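2.2 Create the corresponding SolrCloud collection

The SolrCloud collection must define a field for every HBase column that needs to be indexed. Use the following commands to generate the configuration template, edit the schema, upload the instance directory, and create the collection: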

$ solrctl instancedir --generate $HOME/hbase-collection1
$ edit $HOME/hbase-collection1/conf/schema.xml
$ solrctl instancedir --create hbase-collection1 $HOME/hbase-collection1
$ solrctl collection --create hbase-collection1

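[Notes]
(1) Each HBase column to be indexed corresponds to one <field> in the schema.
(2) In schema.xml, uniqueKey must be the field that holds the HBase table's rowkey. The rowkey is mapped to the id field by default, so the <field> definitions must include an id field.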
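Before creating the collection, it can help to check that the edited schema.xml really defines every field you plan to index. The following is a minimal sketch using only the Python standard library; the function name missing_fields and the sample field names are hypothetical, and the check deliberately ignores dynamicField definitions.

```python
import xml.etree.ElementTree as ET


def missing_fields(schema_xml, required):
    """Return the required field names that have no <field> in the schema."""
    root = ET.fromstring(schema_xml)
    # iter() finds <field> elements whether or not they sit inside <fields>.
    defined = {field.get("name") for field in root.iter("field")}
    return [name for name in required if name not in defined]


# Hypothetical, heavily trimmed schema used only to show the call.
SAMPLE_SCHEMA = """
<schema name="example" version="1.5">
  <fields>
    <field name="id" type="string" indexed="true" stored="true"/>
    <field name="title" type="string" indexed="true" stored="true"/>
  </fields>
  <uniqueKey>id</uniqueKey>
</schema>
"""

print(missing_fields(SAMPLE_SCHEMA, ["id", "title", "author"]))
```

In practice you would read the real file, for example with open() on $HOME/hbase-collection1/conf/schema.xml, and pass its text to the function.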

2.3 Create the Lily HBase Indexer configuration

$ cat $HOME/morphline-hbase-mapper.xml
<?xml version="1.0"?>
<indexer table="record" mapper="com.ngdata.hbaseindexer.morphline.MorphlineResultToSolrMapper">

  <!-- When managed by Cloudera Manager, use the relative path "morphlines.conf" -->
  <param name="morphlineFile" value="/etc/hbase-solr/conf/morphlines.conf"/>

  <!-- The optional morphlineId identifies a morphline if there are multiple morphlines in morphlines.conf; the value matches the id attribute in morphlines.conf -->
  <!-- <param name="morphlineId" value="morphline1"/> -->

</indexer>

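[Notes] table names the HBase table to index; the configuration above uses the record table. mapper names the class that loads and applies the specified Morphline configuration file; it is always MorphlineResultToSolrMapper. The morphlineFile parameter gives the path to the morphlines.conf file. If morphlines.conf is managed by Cloudera Manager, set the value to "morphlines.conf"; otherwise, use the absolute path of the morphlines.conf file. The morphlineId parameter refers to the id attribute of a morphline in morphlines.conf.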

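In morphline-hbase-mapper.xml, the unique-key-field attribute of the <indexer> element sets the name of the Solr field that the HBase rowkey is mapped to. The default is id. To map the rowkey to a different field, set unique-key-field as follows: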

<indexer table="record" unique-key-field="rowkey" ...> ... </indexer>

[Note] The value of unique-key-field must match the uniqueKey field name in the SolrCloud schema.xml.
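A mismatch between the two files is easy to overlook, so a quick consistency check may be useful. This is a sketch under the assumption that both files are available as text; unique_key_matches is a hypothetical helper name, and the sample documents are trimmed stand-ins for the real files.

```python
import xml.etree.ElementTree as ET


def unique_key_matches(indexer_xml, schema_xml):
    """Check that the indexer's unique-key-field equals the schema's uniqueKey."""
    indexer = ET.fromstring(indexer_xml)
    schema = ET.fromstring(schema_xml)
    # The rowkey is mapped to "id" when unique-key-field is not set.
    mapped_field = indexer.get("unique-key-field", "id")
    unique_key = (schema.findtext("uniqueKey") or "").strip()
    return mapped_field == unique_key


# Trimmed, hypothetical inputs used only to show the call.
SAMPLE_INDEXER = '<indexer table="record" unique-key-field="rowkey"/>'
SAMPLE_ROWKEY_SCHEMA = """
<schema name="example" version="1.5">
  <fields>
    <field name="rowkey" type="string" indexed="true" stored="true"/>
  </fields>
  <uniqueKey>rowkey</uniqueKey>
</schema>
"""

print(unique_key_matches(SAMPLE_INDEXER, SAMPLE_ROWKEY_SCHEMA))
```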

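2.4 Create the Morphline configuration file

Morphlines is an open-source framework that reduces the time needed to build Hadoop ETL data flows. It can replace the traditional approach of extracting, transforming, and loading data with MapReduce, and it provides a set of commands. For HBase Indexer, it provides the extractHBaseCells command to read HBase column data. Here we use Cloudera Manager to manage the morphlines.conf file.

Managing morphlines.conf through CM has a practical benefit when you need to add indexed columns: with a local file path you have to re-register the Lily HBase Indexer configuration, whereas with CM you only need to edit morphlines.conf and restart the Key-Value Store Indexer service.

To do this, go to the Key-Value Store Indexer panel -> Configuration -> Service-Wide -> Morphlines -> Morphlines File, and add the following configuration: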

morphlines : [
  {
    id : morphline1
    importCommands : ["org.kitesdk.morphline.**", "com.ngdata.**"]

    commands : [
      {
        extractHBaseCells {
          mappings : [
            {
              inputColumn : "data:id"
              outputField : "id"
              type : string
              source : value
            }
          ]
        }
      }

      { logTrace { format : "output record: {}", args : ["@{}"] } }
    ]
  }
]

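[Notes]

id : corresponds to the morphlineId parameter configured in morphline-hbase-mapper.xml.
importCommands : the packages from which commands are imported.
extractHBaseCells : reads HBase column data and writes it into a SolrInputDocument object. The command takes a mappings list containing zero or more mapping objects.
mappings : defines the mapping between HBase columns and Solr fields.
inputColumn : the HBase column to write to Solr. The value consists of the column family and the column qualifier, separated by a colon (:). The qualifier may end with the wildcard *; for example, data:* indexes every column in the data column family, and data:my* indexes the columns in data whose qualifier starts with my.
outputField : the field name (<field>) in the Solr schema.xml that inputColumn maps to. It must match an existing field, otherwise the data is not written correctly.
type : the data type of the HBase column value. HBase stores all data as byte[], while all content is indexed in Solr as text, so the byte[] has to be converted to the actual data type; the type parameter selects that conversion. The supported types are byte[] (copies the HBase byte[] unchanged), int, long, string, boolean, float, double, short, and bigdecimal. You can also specify a custom type by implementing the com.ngdata.hbaseindexer.parse.ByteArrayValueMapper interface.
source : which part of the HBase KeyValue is used as indexing input, either value or qualifier. With value, the HBase column value is used as the indexing input; with qualifier, the HBase column qualifier is used.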
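To make the roles of inputColumn, type, and source more concrete, here is a rough Python model of what a mapping does to one row. It is not the extractHBaseCells implementation: extract_cells, column_matches, and DECODERS are hypothetical names, bigdecimal is left out, and the decoders assume the values were written with HBase's Bytes utility (big-endian numbers, UTF-8 strings).

```python
import struct

# Decoders for values written with HBase's Bytes utility (assumed encoding).
DECODERS = {
    "byte[]": lambda raw: raw,
    "string": lambda raw: raw.decode("utf-8"),
    "boolean": lambda raw: raw[0] != 0,
    "short": lambda raw: struct.unpack(">h", raw)[0],
    "int": lambda raw: struct.unpack(">i", raw)[0],
    "long": lambda raw: struct.unpack(">q", raw)[0],
    "float": lambda raw: struct.unpack(">f", raw)[0],
    "double": lambda raw: struct.unpack(">d", raw)[0],
}


def column_matches(column, pattern):
    """Match 'family:qualifier' against a pattern with an optional trailing *."""
    if pattern.endswith("*"):
        return column.startswith(pattern[:-1])
    return column == pattern


def extract_cells(cells, mappings):
    """Turn {'family:qualifier': bytes} into {output field: [typed values]}."""
    doc = {}
    for mapping in mappings:
        decode = DECODERS[mapping["type"]]
        for column, raw_value in cells.items():
            if not column_matches(column, mapping["inputColumn"]):
                continue
            if mapping.get("source", "value") == "qualifier":
                # Index the qualifier itself instead of the cell value.
                raw = column.split(":", 1)[1].encode("utf-8")
            else:
                raw = raw_value
            doc.setdefault(mapping["outputField"], []).append(decode(raw))
    return doc


row = {"data:id": struct.pack(">i", 7), "data:myname": b"alice"}
mappings = [
    {"inputColumn": "data:id", "outputField": "id", "type": "int", "source": "value"},
    {"inputColumn": "data:my*", "outputField": "names", "type": "string", "source": "value"},
]
print(extract_cells(row, mappings))
```

Note that a value stored with the shell command put 'record', 'row1', 'data:id', '1' is the text "1", so it would need type string rather than int.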

2.5 Register the Lily HBase Indexer configuration

Once all the previous steps are complete, register the Lily HBase Indexer configuration file in ZooKeeper with the following command:

hbase-indexer add-indexer -n myIndexer \
  -c $HOME/morphline-hbase-mapper.xml \
  -cp solr.zk=Node03:2181,Node04:2181,Node05:2181/solr \
  -cp solr.collection=hbase-collection1 \
  -z Node03:2181,Node04:2181,Node05:2181

-n : --name
-c : --indexer-conf
-cp : --connection-param
-z : --zookeeper

For more details, run:

hbase-indexer add-indexer --help

After registering, verify that the indexer was registered successfully:

$ hbase-indexer list-indexers

2.6 Verify that indexing works

Write data to HBase:

$ hbase shell
hbase(main):001:0> put 'record', 'row1', 'data:id', '1'
hbase(main):002:0> put 'record', 'row2', 'data:id', '2'

Open the Solr web UI to check that the data has been synchronized.
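As an alternative to the web UI, the same check can be scripted against Solr's select handler over HTTP. The sketch below assumes a Solr node reachable at http://localhost:8983/solr and a collection named hbase-collection1; both are placeholders to adjust for your cluster, and solr_select_url and fetch_docs are hypothetical helper names.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def solr_select_url(base_url, collection, query="*:*", rows=10):
    """Build a URL for the collection's /select handler with a JSON response."""
    params = urlencode({"q": query, "wt": "json", "rows": rows})
    return f"{base_url.rstrip('/')}/{collection}/select?{params}"


def fetch_docs(url):
    """Run the query and return the matching documents (needs a live Solr)."""
    with urlopen(url) as response:
        body = json.load(response)
    return body["response"]["docs"]


# Assumed host, port, and collection name; replace with your own.
url = solr_select_url("http://localhost:8983/solr", "hbase-collection1")
print(url)
# docs = fetch_docs(url)  # uncomment when a Solr node is reachable
```

Because indexing is asynchronous, rows written a moment ago may take a short time to appear in the results.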

3. References

Using the Lily HBase NRT Indexer Service
Using the Lily HBase Batch Indexer for Indexing

0 0