Comparing Two Ways to Write MapReduce Results into HBase

Source: Internet | Published by: 网络调试助手 | Edited by: 程序博客网 | Date: 2024/06/05 11:45

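My experience here is limited, so the performance remarks below are subjective impressions rather than measurements. Please bear with me.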

Method 1: Use the write interface HBase provides

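In setup, configure the HBase connection, check whether the table exists, and create it if it does not. In the reduce function, call table.put(put1) to write each result into HBase.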

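    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class hbaseStatisticsReducer extends Reducer<Text, Text, Text, Text> {
        public static String tablename = "statistics";
        public static String[] cfs = { "data" };
        public static Configuration conf = new Configuration();

        @Override
        protected void setup(Context context) throws IOException {
            conf = context.getConfiguration();
            conf.set("hbase.rootdir", "hdfs://localhost:9000/hbase");
            conf.set("hbase.zookeeper.quorum", "localhost");
            conf.set("hbase.zookeeper.property.clientPort", "2181");

            // Create the table with its column families if it does not exist yet.
            HBaseAdmin admin = new HBaseAdmin(conf);
            try {
                if (!admin.tableExists(tablename)) {
                    HTableDescriptor tableDesc = new HTableDescriptor(tablename);
                    for (int i = 0; i < cfs.length; i++) {
                        tableDesc.addFamily(new HColumnDescriptor(cfs[i]));
                    }
                    admin.createTable(tableDesc);
                }
            } finally {
                admin.close();
            }
        }

        @Override
        protected void reduce(Text item, Iterable<Text> input, Context context)
                throws IOException, InterruptedException {
            // A table handle is opened for every key, as in the original version;
            // this is probably one reason the method feels slow.
            HTable table = new HTable(conf, tablename);
            try {
                int times = 0; // accumulated, but not written out in this version
                long sum = 0L;
                for (Text tmp : input) {
                    String[] tmpstr = tmp.toString().split(StatisticsMapper.separator);
                    sum += Long.parseLong(tmpstr[0]);
                    times += Integer.parseInt(tmpstr[1]);
                }
                Put put1 = new Put(Bytes.toBytes(item.toString()));
                put1.add(Bytes.toBytes(cfs[0]), Bytes.toBytes("sum"), Bytes.toBytes("" + sum));
                table.put(put1);
            } finally {
                table.close(); // release the handle instead of leaking one per key
            }
        }
    }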

In my runs this approach was slow.
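The aggregation inside reduce can be checked without a cluster. The following is a minimal sketch, assuming each value has the shape "sum<separator>times" as the split call above suggests. The class name SumTimesAggregator and the tab separator are hypothetical, since the post never shows StatisticsMapper.separator, and the count is kept as a long here rather than the int used above.

```java
import java.util.regex.Pattern;

/** Cluster-free sketch of the per-key aggregation performed in reduce(). */
final class SumTimesAggregator {
    // Hypothetical separator; the real StatisticsMapper.separator is not shown in the post.
    static final String SEPARATOR = "\t";

    /** Returns {sum, times} accumulated over values shaped "sum<SEPARATOR>times". */
    static long[] aggregate(Iterable<String> values) {
        // String.split takes a regex, so quote the separator to match it literally.
        String regex = Pattern.quote(SEPARATOR);
        long sum = 0L;
        long times = 0L;
        for (String value : values) {
            String[] parts = value.split(regex);
            sum += Long.parseLong(parts[0]);
            times += Long.parseLong(parts[1]);
        }
        return new long[] { sum, times };
    }
}
```

Quoting the separator matters only if it contains regex metacharacters such as "|" or ".", but it costs nothing and avoids a subtle parsing bug if the separator ever changes.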

Method 2: Use the reducer interface HBase provides

Set the HBase properties in the driver program:

    conf.set("hbase.rootdir", "hdfs://172.17.238.152:9000/hbase");
    conf.set("hbase.zookeeper.quorum", "172.17.238.151");
    conf.set("hbase.zookeeper.property.clientPort", "2181");

and then call

TableMapReduceUtil.initTableReducerJob(AnalysisMain.TableName, AnalysisIntoHBaseReducer.class, job);

to set the table name, the Reducer class, and the job object.

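The Reducer extends TableReducer, whose output defaults to HBase Put objects that are inserted into the configured table. This program requires the table and its column family to be created in advance, which could be improved. Its efficiency is not great either!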

    import java.io.IOException;

    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableReducer;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.Text;

    public class AnalysisIntoHBaseReducer
            extends TableReducer<Text, Text, ImmutableBytesWritable> {

        @Override
        public void reduce(Text item, Iterable<Text> input, Context context)
                throws InterruptedException, IOException {
            int times = 0;
            long sum = 0L;
            for (Text tmp : input) {
                String[] tmpstr = tmp.toString().split(AnalysisMain.separator);
                sum += Long.parseLong(tmpstr[0]);
                times += Integer.parseInt(tmpstr[1]);
            }

            // The reduce key doubles as the HBase row key.
            byte[] row = Bytes.toBytes(item.toString());

            // Two Puts are emitted per key: one for data:sum, one for data:times.
            Put put1 = new Put(row);
            put1.add(Bytes.toBytes("data"), Bytes.toBytes("sum"), Bytes.toBytes("" + sum));
            context.write(new ImmutableBytesWritable(row), put1);

            Put put2 = new Put(row);
            put2.add(Bytes.toBytes("data"), Bytes.toBytes("times"), Bytes.toBytes("" + times));
            context.write(new ImmutableBytesWritable(row), put2);
        }
    }

Some of the counters for this job:

    Counter                  Map          Reduce       Total
    Map input records        10,000,000   0            10,000,000
    Reduce input records     0            7,521,263    7,521,263
    Reduce input groups      0            999,962      999,962
    Reduce output records    0            1,999,924    1,999,924

The reduce phase took about 11 minutes (12 reduce tasks in total, 6 running at a time).
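The counters can be cross-checked with a little arithmetic. The sketch below hard-codes the figures reported above; the class name JobCounterCheck is hypothetical, and the throughput figure assumes the roughly 11 minutes is wall-clock time for the whole reduce phase, which the post does not state explicitly.

```java
/** Back-of-the-envelope checks on the job counters reported in the post. */
final class JobCounterCheck {
    static final long REDUCE_INPUT_RECORDS = 7_521_263L;
    static final long REDUCE_INPUT_GROUPS = 999_962L;
    static final long REDUCE_OUTPUT_RECORDS = 1_999_924L;
    // Assumption: "about 11 minutes" treated as exactly 660 seconds of wall-clock time.
    static final long REDUCE_SECONDS = 11L * 60L;

    /** Output records per key; the reducer writes one Put for sum and one for times. */
    static long putsPerGroup() {
        return REDUCE_OUTPUT_RECORDS / REDUCE_INPUT_GROUPS;
    }

    /** Average number of map output values merged into each key. */
    static double avgValuesPerGroup() {
        return (double) REDUCE_INPUT_RECORDS / REDUCE_INPUT_GROUPS;
    }

    /** Rough aggregate write rate across all reduce tasks. */
    static double putsPerSecond() {
        return (double) REDUCE_OUTPUT_RECORDS / REDUCE_SECONDS;
    }
}
```

The output count is exactly twice the group count, which matches the two context.write calls per key. Under the stated assumption the job wrote on the order of 3,000 Puts per second in aggregate, with about 7.5 input values merged per key.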

Even so, it is better than Method 1!

Correspondingly, for map-side access to HBase (reading data from it) there is a TableMapper class to extend, and it is fairly fast.

TableMapReduceUtil.initTableMapperJob("analysis", scan, hbaseMapper.class, Text.class, Text.class, job);

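    import java.io.IOException;

    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.Text;

    public class hbaseMapper extends TableMapper<Text, Text> {

        @Override
        protected void map(ImmutableBytesWritable key, Result value, Context context)
                throws InterruptedException, IOException {
            // Read the two stored columns back as strings.
            String sum = Bytes.toString(value.getValue(testMain.family, testMain.sum));
            String times = Bytes.toString(value.getValue(testMain.family, testMain.times));

            // The map key carries the row key; keep fields 0, 1 and 4 as the new key.
            String row = Bytes.toString(key.get());
            String[] tokens = row.split(testMain.separator);
            String newRow = tokens[0] + testMain.separator + tokens[1]
                    + testMain.separator + tokens[4];

            context.write(new Text(newRow), new Text(sum + testMain.separator + times));
        }
    }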

The row key can be read from the map key.
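The row key handling in the mapper can be tried in isolation. This is a minimal sketch, assuming a row key of at least five separator-delimited fields as the tokens[4] access implies. The class name RowKeyProjection and the "|" separator are hypothetical, since testMain.separator is not shown in the post.

```java
import java.util.regex.Pattern;

/** Sketch of rebuilding a shorter key from fields 0, 1 and 4 of an HBase row key. */
final class RowKeyProjection {
    // Hypothetical separator; the real testMain.separator is not shown in the post.
    static final String SEPARATOR = "|";

    static String project(String rowKey) {
        // Quote the separator because String.split interprets its argument as a regex.
        String[] tokens = rowKey.split(Pattern.quote(SEPARATOR));
        if (tokens.length < 5) {
            // Fail with a clear message instead of an ArrayIndexOutOfBoundsException.
            throw new IllegalArgumentException("expected at least 5 fields: " + rowKey);
        }
        return String.join(SEPARATOR, tokens[0], tokens[1], tokens[4]);
    }
}
```

In a real job a malformed row key would more likely be counted and skipped than allowed to fail the task; the exception here just makes the assumption about the key layout explicit.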

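There are two ways to filter the data held in HBase.

1. Configure the scan so that only the needed data is fetched

2. Read everything and filter inside map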

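I have not benchmarked the two approaches (the program above reads everything), so their relative performance is unknown to me. Here is a quote I have not verified: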

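The former is more efficient than the latter when the scan covers a fairly small number of records, but once the number of scanned records grows, it tends to time out without returning and then exit. One last thought: the latter is still not efficient, and even where the former can be used it is not efficient either, because the default TableMapper puts the scan of one region into a single mapper. Each of my regions is over 2 GB, and the data I query spans only seven or eight regions. So I wonder whether something other than a region could be the unit of work for a mapper. If that cannot be changed, the only option left is to have MR operate directly on the HDFS files underneath HBase. That remains to be studied.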
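The region-per-mapper behaviour described in the quote can be modelled in a few lines. This is a simplified sketch, not HBase code: it assumes, as the quote says and as I understand the default table input format, that one map task is created per region overlapping the scan's row range. The names RegionSplitModel, Region and splitsFor are hypothetical, and keys are compared as Java strings rather than as raw bytes.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified model: one map task per region that overlaps the scan's row range. */
final class RegionSplitModel {
    /** A region covering [startKey, endKey); an empty string means unbounded. */
    static final class Region {
        final String startKey;
        final String endKey;

        Region(String startKey, String endKey) {
            this.startKey = startKey;
            this.endKey = endKey;
        }
    }

    /** Returns the regions overlapping [scanStart, scanStop); "" means unbounded. */
    static List<Region> splitsFor(List<Region> regions, String scanStart, String scanStop) {
        List<Region> splits = new ArrayList<>();
        for (Region r : regions) {
            // The region must begin before the scan ends...
            boolean startsBeforeScanEnds =
                    scanStop.isEmpty() || r.startKey.compareTo(scanStop) < 0;
            // ...and must end after the scan begins.
            boolean endsAfterScanStarts = r.endKey.isEmpty()
                    || scanStart.isEmpty()
                    || r.endKey.compareTo(scanStart) > 0;
            if (startsBeforeScanEnds && endsAfterScanStarts) {
                splits.add(r);
            }
        }
        return splits;
    }
}
```

Under this model, narrowing the scan's row range reduces the number of map tasks but never makes an individual task smaller than one region, so a query touching seven or eight 2 GB regions still gets only seven or eight mappers.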