Building a SolrCloud index with MapReduce
Source: Internet  Editor: 程序博客网  Time: 2024/04/29 14:30
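Schematic diagram (taken from the web):
1. datasource
Scan the data from HDFS or via TableMapReduceUtil (the HFileInputFormat approach is not recommended, because it easily loses data).
2. map
In setup(), create a CloudSolrServer from zkHost. Its purpose is to compute, from a docId, which shardId that docId should be routed to. The key code is: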
// copy from org.apache.solr.common.cloud.HashBasedRouter
private int sliceHash(String id) {
    return Hash.murmurhash3_x86_32(id, 0, id.length(), 0);
}

// copy from org.apache.solr.common.cloud.HashBasedRouter
private Slice hashToSlice(int hash, DocCollection collection) {
    for (Slice slice : collection.getSlices()) {
        Range range = slice.getRange();
        if (range != null && range.includes(hash))
            return slice;
    }
    throw new SolrException(SolrException.ErrorCode.BAD_REQUEST,
            "No slice servicing hash code " + Integer.toHexString(hash) + " in " + collection);
}
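To make the range lookup concrete without a Solr cluster, here is a minimal, self-contained sketch of the same idea: each shard owns one contiguous slice of the 32-bit hash space, and a document goes to the shard whose range contains its hash. `RangeRouterSketch`, `ShardRange` and `partitionRange` are hypothetical names made up for illustration, `String.hashCode()` is only a stand-in for `Hash.murmurhash3_x86_32`, and the even split shown here is an assumption, so the boundaries Solr actually assigns may differ.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of hash-range routing: each shard owns one contiguous slice of the 32-bit hash space. */
class RangeRouterSketch {

    /** Hypothetical stand-in for a Solr Slice: a shard name plus an inclusive hash range. */
    static final class ShardRange {
        final String name;
        final int min;
        final int max;

        ShardRange(String name, int min, int max) {
            this.name = name;
            this.min = min;
            this.max = max;
        }

        boolean includes(int hash) {
            return hash >= min && hash <= max;
        }
    }

    /** Splits [Integer.MIN_VALUE, Integer.MAX_VALUE] into numShards contiguous ranges (assumed even split). */
    static List<ShardRange> partitionRange(int numShards) {
        List<ShardRange> ranges = new ArrayList<>();
        long total = 1L << 32;               // size of the full 32-bit hash space
        long step = total / numShards;
        long start = Integer.MIN_VALUE;
        for (int i = 0; i < numShards; i++) {
            // The last range absorbs any remainder so the whole space is covered.
            long end = (i == numShards - 1) ? Integer.MAX_VALUE : start + step - 1;
            ranges.add(new ShardRange("shard" + (i + 1), (int) start, (int) end));
            start = end + 1;
        }
        return ranges;
    }

    /** Stand-in hash for illustration only; the article's code uses Hash.murmurhash3_x86_32. */
    static int sliceHash(String id) {
        return id.hashCode();
    }

    /** Linear scan over the ranges, mirroring hashToSlice in the article. */
    static ShardRange hashToSlice(int hash, List<ShardRange> ranges) {
        for (ShardRange r : ranges) {
            if (r.includes(hash)) {
                return r;
            }
        }
        throw new IllegalStateException("No slice servicing hash code " + Integer.toHexString(hash));
    }
}
```

Calling `RangeRouterSketch.hashToSlice(RangeRouterSketch.sliceHash(id), ranges).name` gives the shard name for a row key. Because the stand-in hash differs from murmurhash3, the shard it picks will not match what a real SolrCloud collection would pick.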
map() takes (ImmutableBytesWritable, Result) as input and emits (Text, Result); that is, it computes the shardId (for example shard1, shard2, ...) from the row key and uses it as the output key.
Key code:
@Override
protected void map(ImmutableBytesWritable key, Result columns, Context context) throws IOException, InterruptedException {
    String id = Bytes.toString(columns.getRow());
    int sliceHash = sliceHash(id);
    Slice slice = hashToSlice(sliceHash, cstate.getCollection(defau_collection));
    String shardid = slice.getName(); // shard1, shard2 ...
    context.write(new Text(shardid), columns);
}
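3. Partitioner
Using the shardId, spread the records at random across several reducers, so that one shard's data is indexed by multiple reducers, which is more efficient.
Key code: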
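/**
 * reducerTimes is the configured number of reducers per shard.
 * numPartitions (the total number of reduce tasks, supplied by Hadoop) is not used here,
 * so the job must run with numShards * reducerTimes reducers.
 */
@Override
public int getPartition(Text key, Result result, int numPartitions) {
    initReducerTimes();
    // "High-order digit": zero-based shard index parsed from "shardN"
    int shardId = Integer.parseInt(key.toString().substring(5)) - 1;
    // "Low-order digit": random reducer slot in [0, reducerTimes), uniformly distributed
    // (Math.round(Math.random() * (reducerTimes - 1)) would under-weight the first and last slot)
    int part2 = (int) (Math.random() * reducerTimes);
    return reducerTimes * shardId + part2;
}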
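The partition arithmetic above can be checked in isolation. The sketch below assumes 10 reducers per shard (a made-up value; the article does not state what reducerTimes is set to) and uses the hypothetical names `ShardPartitionSketch`, `shardIndex` and `partitionFor`. It shows that every record keyed "shardN" lands in the block of partitions reserved for that shard.

```java
import java.util.concurrent.ThreadLocalRandom;

/** Sketch of the "shard block + random slot" partition formula. */
class ShardPartitionSketch {

    /** Assumed setting for illustration: how many reducers serve each shard. */
    static final int REDUCERS_PER_SHARD = 10;

    /** Parses the zero-based shard index from a name like "shard3" (returns 2). */
    static int shardIndex(String shardName) {
        return Integer.parseInt(shardName.substring("shard".length())) - 1;
    }

    /** Picks a random partition inside the block of reducers that belongs to the shard. */
    static int partitionFor(String shardName, int reducersPerShard) {
        int slot = ThreadLocalRandom.current().nextInt(reducersPerShard); // 0 .. reducersPerShard-1
        return reducersPerShard * shardIndex(shardName) + slot;
    }
}
```

With these assumed numbers, "shard1" maps to partitions 0 to 9, "shard3" to 20 to 29, and 8 shards would need 80 reduce tasks in total, which would fit the `0~79` range mentioned in the reducer code.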
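4. reduce
In setup(), read the task's partition, compute the shardid from it, obtain a cloudServer from zkHost, and finally create a SolrServer that points at that shard's leader and indexes only into that shard. (It was originally a ConcurrentUpdateSolrServer; the code below uses HttpSolrServer, for the reason given in the note at the end.)
Key code: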
@Override
protected void setup(Context context) throws IOException, InterruptedException {
    Configuration conf = context.getConfiguration();
    zk_host = conf.get(IndexingConfig.ZK_HOST, zk_host);
    defau_collection = conf.get(IndexingConfig.DEFAULT_COLLECTION, defau_collection);
    // max_segments = conf.getInt(IndexingConfig.MAX_SEGMENTS, max_segments);
    document_buffer_size = conf.getInt(IndexingConfig.DOCUMENT_BUFFER_SIZE, document_buffer_size);
    solr_thread_count = conf.getInt(IndexingConfig.SOLR_THREAD_COUNT, solr_thread_count);
    id_field = conf.get(IndexingConfig.ID_FIELD, id_field);
    logger.info("Initializing server...");
    try {
        cloudServer = new CloudSolrServer(zk_host);
        cloudServer.setDefaultCollection(defau_collection);
        SolrPingResponse response = cloudServer.ping();
        logger.info("status:" + response.getStatus());
        if (response.getStatus() != 0) {
            logger.error("Solr service is unhealthy");
            throw new RuntimeException("Solr service returned an abnormal response");
        }
        logger.info(cloudServer.ping().toString());
    } catch (Exception e) {
        logger.error("", e);
        // Keep the original exception as the cause so the failure can be diagnosed.
        throw new RuntimeException("Check the Solr service configuration", e);
    }
    logger.info("Initialization finished");
    // Get this reduce task's partition number
    partition = conf.get("mapred.task.partition"); // 0~79
    // Choose the target shard, and therefore the solrServer, from the partition
    shardid = getShardId(partition);
    ZkStateReader reader = cloudServer.getZkStateReader();
    try {
        String url = reader.getLeaderUrl(defau_collection, "shard" + shardid, 3000);
        // solrServer = new ConcurrentUpdateSolrServer(url, document_buffer_size, solr_thread_count);
        solrServer = new HttpSolrServer(url);
        System.out.println("###url: " + url);
    } catch (KeeperException e) {
        // Fail the task instead of continuing with a null solrServer.
        // InterruptedException is no longer caught here; it propagates via the throws clause.
        throw new IOException("Cannot resolve the leader of shard" + shardid, e);
    }
    initTransformers();
    initFieldBoost();
    // Initialize the transformers
    for (TransformerMR t : transformers) {
        t.init(conf);
    }
}
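The article does not show getShardId(). Given the partitioner formula `reducerTimes * shardId + part2`, one plausible inverse is an integer division, sketched below. The class name `ShardIdSketch`, the extra `reducersPerShard` parameter and the value 10 used in the usage note are all assumptions for illustration, not the author's code.

```java
/** Hypothetical inverse of the partitioner: partition number -> 1-based shard number. */
class ShardIdSketch {

    /**
     * @param partition        value of "mapred.task.partition", e.g. "27"
     * @param reducersPerShard assumed number of reducers serving each shard
     * @return the shard number N such that the shard name is "shard" + N
     */
    static int getShardId(String partition, int reducersPerShard) {
        int p = Integer.parseInt(partition.trim());
        // Partitions [k * reducersPerShard, (k + 1) * reducersPerShard) all belong to shard k + 1.
        return p / reducersPerShard + 1;
    }
}
```

With 10 reducers per shard, partitions 0 to 9 resolve to shard 1 and partition 79 resolves to shard 8, so every reducer in a block sends its documents to the same shard leader.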
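Summary: the key point is to route each doc directly to the correct shard instead of going through SolrCloud's internal router (the routing code is in fact copied from its source). In addition, each shard is served by several reducers chosen at random, which adds another layer of concurrency.
Note:
ConcurrentUpdateSolrServer uses final BlockingQueue<UpdateRequest> queue;
Under high concurrency (building the index with MapReduce) this causes blocking and lost data.
There are two possible causes:
1. The queue is full, nothing more can be written to it, and the request connection is dropped.
2. optimize() is executed while the queue still holds data that has not been written to disk.
After switching to HttpSolrServer the problem went away: solrServer = new HttpSolrServer(url);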
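The two suspected causes can be illustrated with a plain java.util.concurrent bounded queue. This is only a sketch of the general failure modes and does not reproduce the internals of ConcurrentUpdateSolrServer; `BoundedQueueSketch`, `offerAll` and `drainThenFinish` are hypothetical names. The first method shows that a non-blocking write to a full queue is rejected, and the second shows the "wait until the queue is drained before finishing" step whose absence would leave documents unwritten.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Sketch of how a bounded buffer can drop or strand documents. */
class BoundedQueueSketch {

    /** Tries to enqueue count documents without waiting; returns how many were rejected. */
    static int offerAll(BlockingQueue<String> queue, int count) {
        int rejected = 0;
        for (int i = 0; i < count; i++) {
            // offer() returns false immediately when the queue is full (cause 1).
            if (!queue.offer("doc" + i)) {
                rejected++;
            }
        }
        return rejected;
    }

    /** Lets a writer thread empty the queue and waits for it; returns how many documents it wrote. */
    static int drainThenFinish(BlockingQueue<String> queue) {
        final int[] written = {0};
        Thread writer = new Thread(() -> {
            while (queue.poll() != null) {
                written[0]++; // stands in for sending one document to the index
            }
        });
        writer.start();
        try {
            // Finishing (for example optimizing) before this join returns would strand queued docs (cause 2).
            writer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while draining the queue", e);
        }
        return written[0];
    }

    /** Convenience factory for a queue with the given capacity. */
    static BlockingQueue<String> newQueue(int capacity) {
        return new ArrayBlockingQueue<>(capacity);
    }
}
```

A synchronous client such as HttpSolrServer avoids both situations because each add call returns only after the request has been sent, at the cost of the buffering and background threads that the queue-based client provides.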