Advanced HBase Features: Building a Solr Cloud Secondary Index with Coprocessors

Source: Internet. Editor: 程序博客网. Time: 2024/05/02 03:08

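1. Concepts

Coprocessors can be loaded in two ways. A system coprocessor is loaded globally and applies to every table on a region server; a table coprocessor is attached by the user to one specific table.

HBase coprocessors fall into two categories: Observer and Endpoint. An Observer is comparable to a trigger, and an Endpoint is comparable to a stored procedure. Observer code is deployed on the server side and acts like a proxy around API calls.

The other category is the endpoint; a dynamic endpoint behaves somewhat like a stored procedure.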

Observer

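Observers are designed to let users insert code by overriding the upcall methods of the coprocessor framework, while the HBase core code invokes the callbacks when the corresponding events fire. The framework handles all the details of the callback invocation; the coprocessor itself only has to supply the behavior it adds or changes. Taking HBase 0.92 as an example, three observer interfaces are provided:

RegionObserver: hooks for client data manipulation events such as Get, Put, Delete, and Scan.

WALObserver: hooks for WAL-related operations.

MasterObserver: hooks for DDL-style operations such as creating, deleting, and altering tables.

These interfaces can be used together in the same place and are executed in priority order. Users can build arbitrarily complex HBase functionality layers on top of coprocessors. Many HBase events can trigger observer methods, and from HBase 0.92 onward these events and methods are part of the HBase API. These APIs may still change for various reasons, and the interfaces differ considerably between versions.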
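To make "executed in priority order" concrete, here is a minimal plain-Java model of a hook chain. It is not the HBase API: `PutHook`, `HookChain`, `register`, and `firePostPut` are hypothetical names invented for this sketch, and the rule that a smaller number runs first is an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical hook interface: called after a row has been written. */
interface PutHook {
    void postPut(String rowKey, List<String> trace);
}

/** Hypothetical chain that invokes registered hooks in ascending priority order. */
class HookChain {
    private static final class Entry {
        final int priority;
        final PutHook hook;

        Entry(int priority, PutHook hook) {
            this.priority = priority;
            this.hook = hook;
        }
    }

    private final List<Entry> entries = new ArrayList<>();

    /** Registers a hook; in this sketch a smaller number runs earlier. */
    void register(int priority, PutHook hook) {
        entries.add(new Entry(priority, hook));
        // List.sort is stable, so hooks with equal priority keep registration order.
        entries.sort(Comparator.comparingInt(e -> e.priority));
    }

    /** Fires every hook for one write and returns what the hooks recorded. */
    List<String> firePostPut(String rowKey) {
        List<String> trace = new ArrayList<>();
        for (Entry e : entries) {
            e.hook.postPut(rowKey, trace);
        }
        return trace;
    }
}
```

Registering an indexing hook with priority 200 and an auditing hook with priority 100 makes the auditing hook run first for each write, regardless of registration order.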

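2. Implementation Approach

Combining HBase with Solr is fairly simple in principle; the difficulty lies in a few implementation details. The key to writing HBase records into Solr is the Coprocessor mechanism that HBase provides. Coprocessors come in two forms, endpoint and observer: an endpoint is comparable to a stored procedure in a relational database, and an observer is comparable to a trigger. As that suggests, the observer is the one we need. An observer lets us run our own logic before and after a record is put, and we use postPut to write the record to Solr as part of the write path (see the HBase documentation for details on Coprocessors).

Writing to Solr is the simpler part: use ConcurrentUpdateSolrServer for a single node and CloudSolrServer for a cluster. Note, however, that CloudSolrServer has no built-in buffer the way ConcurrentUpdateSolrServer does, so by default every record HBase writes triggers one submit to Solr. That is very slow (Solr may still be submitting long after HBase has finished writing), so you need to implement your own buffer pool, size it according to the HBase write rate, and submit to Solr in batches.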
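The buffering idea described above can be sketched independently of HBase and Solr. The class below is a hypothetical illustration, not the article's implementation: `BatchBuffer`, its `sink` callback, and `flush` are names made up for this sketch, and the sink stands in for whatever call actually sends a batch of documents to Solr.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical size-bounded buffer that hands full batches to a sink. */
class BatchBuffer<T> {
    private final int maxSize;
    private final Consumer<List<T>> sink; // stand-in for "send this batch to Solr"
    private List<T> pending = new ArrayList<>();

    BatchBuffer(int maxSize, Consumer<List<T>> sink) {
        if (maxSize <= 0) {
            throw new IllegalArgumentException("maxSize must be positive");
        }
        this.maxSize = maxSize;
        this.sink = sink;
    }

    /** Adds one item and submits a batch once the buffer reaches maxSize. */
    void add(T item) {
        List<T> batch = null;
        synchronized (this) {
            pending.add(item);
            if (pending.size() >= maxSize) {
                batch = swap();
            }
        }
        // Call the sink outside the lock so slow I/O does not block other writers.
        if (batch != null) {
            sink.accept(batch);
        }
    }

    /** Submits whatever is buffered; meant to be called by a periodic task. */
    void flush() {
        List<T> batch;
        synchronized (this) {
            if (pending.isEmpty()) {
                return;
            }
            batch = swap();
        }
        sink.accept(batch);
    }

    // Caller must hold the lock: detaches the current batch and starts a new one.
    private List<T> swap() {
        List<T> full = pending;
        pending = new ArrayList<>();
        return full;
    }
}
```

Swapping the list under the lock and calling the sink afterwards is a design choice of this sketch: writers only wait for a list append, never for the network round trip. The trade-off is that a failed batch is no longer in the buffer, so the sink has to handle its own retries.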

3. Code

First, the Coprocessor code:

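import java.io.IOException;
import java.util.List;

import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.client.Durability;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.regionserver.wal.WALEdit;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.log4j.Logger;

// Uses the BaseRegionObserver/Durability form of postPut (HBase 0.96 to 1.x).
public class SolrIndexCoprocessorObserver extends BaseRegionObserver {

    private static final Logger log = Logger.getLogger(SolrIndexCoprocessorObserver.class);

    // Column family that holds the fields to index
    private static final byte[] FAMILY = Bytes.toBytes("data");

    @Override
    public void postPut(ObserverContext<RegionCoprocessorEnvironment> e,
                        Put put, WALEdit edit, Durability durability) throws IOException {
        String rowKey = Bytes.toString(put.getRow());
        try {
            // Read the columns of the Put that are used as the Solr index
            VmMoney vm = new VmMoney();
            vm.setMsg_id(value(put, "msg_id"));                   // message identifier
            vm.setDest_phone(value(put, "dest_phone"));           // MSISDN that receives the SMS
            vm.setMsg_Content_tmp(value(put, "msg_Content_tmp")); // SMS content
            vm.setTimestr(value(put, "timestr"));                 // time the SMS was sent

            // Write to the buffer (addDocToCache is static, so no instance is needed)
            SolrWriter.addDocToCache(vm);
        } catch (Exception ex) {
            // Indexing failures are logged rather than propagated to the HBase client
            log.error("write " + rowKey + " to solr fail: " + ex.getMessage(), ex);
        }
    }

    // Returns the first value of data:<qualifier> in the Put, decoded as UTF-8.
    // Throws if the Put does not carry that column; postPut logs the failure.
    private static String value(Put put, String qualifier) {
        List<Cell> cells = put.get(FAMILY, Bytes.toBytes(qualifier));
        return Bytes.toString(CellUtil.cloneValue(cells.get(0)));
    }
}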

The following class, SolrWriter, processes each record after HBase has written it:

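import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Timer;
import java.util.TimerTask;
import java.util.Vector;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.log4j.Logger;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class SolrWriter {
    private static final Logger log = Logger.getLogger(SolrWriter.class);

    // The settings below are loaded from the HBase configuration in the static block
    public static String urlSolr;                 // ZooKeeper address list of the Solr Cloud cluster
    private static String defaultCollection;      // default collection
    private static int zkClientTimeOut;           // ZooKeeper client request timeout, ms
    private static int zkConnectTimeOut;          // ZooKeeper client connect timeout, ms
    private static CloudSolrServer solrserver = null;

    private static int maxCacheCount;             // buffer size; reaching it triggers a submit
    private static Vector<VmMoney> cache = null;  // buffer
    public static Lock commitLock = new ReentrantLock();  // held while adding to the buffer or submitting

    private static int maxCommitTime;             // maximum interval between submits, s

    static {
        Configuration conf = HBaseConfiguration.create();
        urlSolr = conf.get("hbase.solr.zklist", "main:2181");
        defaultCollection = conf.get("hbase.solr.collection", "record");
        zkClientTimeOut = conf.getInt("hbase.solr.zkClientTimeOut", 10000);
        zkConnectTimeOut = conf.getInt("hbase.solr.zkConnectTimeOut", 10000);
        maxCacheCount = conf.getInt("hbase.solr.maxCacheCount", 10000);
        maxCommitTime = conf.getInt("hbase.solr.maxCommitTime", 60 * 5);

        log.info("solr init param " + urlSolr + "  " + defaultCollection + "  " + zkClientTimeOut
                + "  " + zkConnectTimeOut + "  " + maxCacheCount + "  " + maxCommitTime);
        try {
            cache = new Vector<VmMoney>(maxCacheCount);

            solrserver = new CloudSolrServer(urlSolr);
            solrserver.setDefaultCollection(defaultCollection);
            solrserver.setZkClientTimeout(zkClientTimeOut);
            solrserver.setZkConnectTimeout(zkConnectTimeOut);

            // Start the timer task: first run after a 10 s delay, then once every maxCommitTime seconds.
            // A daemon timer thread does not keep the region server JVM alive on shutdown.
            Timer timer = new Timer("solr-commit-timer", true);
            timer.schedule(new CommitTimer(), 10 * 1000L, maxCommitTime * 1000L);
        } catch (Exception ex) {
            log.error("solr init fail: " + ex.getMessage(), ex);
        }
    }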

    /**
     * Batch submit.
     * add() only sends the documents; when they become searchable depends on the
     * commit settings of the collection (for example autoCommit in solrconfig.xml).
     */
    public void inputDoc(List<VmMoney> vmMoneyList) throws IOException, SolrServerException {
        if (vmMoneyList == null || vmMoneyList.isEmpty()) {
            return;
        }
        List<SolrInputDocument> doclist = new ArrayList<SolrInputDocument>(vmMoneyList.size());
        for (VmMoney vm : vmMoneyList) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("msg_id", vm.getMsg_id());
            doc.addField("dest_phone", vm.getDest_phone());
            doc.addField("msg_Content_tmp", vm.getMsg_Content_tmp());
            doc.addField("timestr", vm.getTimestr());

            doclist.add(doc);
        }
        solrserver.add(doclist);
    }

    /**
     * Single-record submit.
     */
    public void inputDoc(VmMoney vmMoney) throws IOException, SolrServerException {
        if (vmMoney == null) {
            return;
        }
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("msg_id", vmMoney.getMsg_id());
        doc.addField("dest_phone", vmMoney.getDest_phone());
        doc.addField("msg_Content_tmp", vmMoney.getMsg_Content_tmp());
        doc.addField("timestr", vmMoney.getTimestr());

        solrserver.add(doc);
    }

    /** Deletes the documents whose ids are the given row keys. */
    public void deleteDoc(List<String> rowkeys) throws IOException, SolrServerException {
        if (rowkeys == null || rowkeys.isEmpty()) {
            return;
        }
        solrserver.deleteById(rowkeys);
    }

    /** Deletes the document whose id is the given row key. */
    public void deleteDoc(String rowkey) throws IOException, SolrServerException {
        solrserver.deleteById(rowkey);
    }

    /**
     * Adds a record to the cache; submits the cache once it reaches maxCacheCount.
     */
    public static void addDocToCache(VmMoney vmMoney) {
        commitLock.lock();
        try {
            cache.add(vmMoney);
            // Logged at debug level because this line runs on every put
            log.debug("cache commit maxCacheCount:" + maxCacheCount);
            if (cache.size() >= maxCacheCount) {
                log.info("cache commit count:" + cache.size());
                new SolrWriter().inputDoc(cache);
                // Only reached on success; on failure the records stay cached and are retried
                cache.clear();
            }
        } catch (Exception ex) {
            log.error("cache commit fail: " + ex.getMessage(), ex);
        } finally {
            commitLock.unlock();
        }
    }

    /**
     * Submit timer.
     */
    static class CommitTimer extends TimerTask {
        @Override
        public void run() {
            commitLock.lock();
            try {
                if (cache.size() > 0) { // submit only when the cache is not empty
                    log.info("timer commit count:" + cache.size());
                    new SolrWriter().inputDoc(cache);
                    cache.clear();
                }
            } catch (Exception ex) {
                // Catch everything: an exception escaping run() would stop the Timer thread
                log.error("timer commit fail: " + ex.getMessage(), ex);
            } finally {
                commitLock.unlock();
            }
        }
    }
}

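The core of SolrWriter is the addDocToCache method and the CommitTimer timer. addDocToCache places the record into the cache each time HBase inserts data and checks whether the cache has reached its limit; if it has, all data in the cache is submitted to Solr. In addition, CommitTimer submits the cache at a fixed interval, which ensures that everything in the cache is eventually written to Solr.

Finally, the helper class below defines the record that is held in the buffer: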
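As an alternative to java.util.Timer for the periodic submit, the same role can be filled by a ScheduledExecutorService. The sketch below is an assumption-laden illustration, not part of the original code: `PeriodicFlusher` and `flushAction` are hypothetical names, and the flush action stands in for "submit the cache to Solr". It wraps the action in a try/catch because scheduleWithFixedDelay suppresses all later runs once one run throws.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper that runs a flush action at a fixed delay on a daemon thread. */
class PeriodicFlusher implements AutoCloseable {
    private final ScheduledExecutorService scheduler;

    PeriodicFlusher(Runnable flushAction, long periodMillis) {
        scheduler = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "periodic-flusher");
            t.setDaemon(true); // do not keep the JVM alive just for flushing
            return t;
        });
        scheduler.scheduleWithFixedDelay(() -> {
            try {
                flushAction.run();
            } catch (RuntimeException ex) {
                // Report and continue: an exception escaping here would cancel all future runs.
                System.err.println("flush failed: " + ex);
            }
        }, periodMillis, periodMillis, TimeUnit.MILLISECONDS);
    }

    /** Stops the periodic task; buffered data is not flushed by this call. */
    @Override
    public void close() {
        scheduler.shutdownNow();
    }
}
```

A caller would construct it once with the submit action and the desired interval, and close it when the component shuts down.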

import java.io.Serializable;

public class VmMoney implements Serializable {
    private static final long serialVersionUID = 1L;

    private String msg_id;           // message identifier
    private String dest_phone;       // MSISDN that receives the SMS
    private String msg_Content_tmp;  // SMS content
    private String timestr;          // time the SMS was sent

    public String getMsg_id() {
        return msg_id;
    }

    public void setMsg_id(String msg_id) {
        this.msg_id = msg_id;
    }

    public String getDest_phone() {
        return dest_phone;
    }

    public void setDest_phone(String dest_phone) {
        this.dest_phone = dest_phone;
    }

    public String getMsg_Content_tmp() {
        return msg_Content_tmp;
    }

    public void setMsg_Content_tmp(String msg_Content_tmp) {
        this.msg_Content_tmp = msg_Content_tmp;
    }

    public String getTimestr() {
        return timestr;
    }

    public void setTimestr(String timestr) {
        this.timestr = timestr;
    }
}