Querying HBase Data with MR (MapReduce), Using TableMapper and Scan


7.2. HBase MapReduce Examples

7.2.1. HBase MapReduce Read Example

The following is an example of using HBase as a MapReduce source in read-only manner. Specifically, there is a Mapper instance but no Reducer, and nothing is being emitted from the Mapper. The job would be defined as follows...

Configuration config = HBaseConfiguration.create();
Job job = new Job(config, "ExampleRead");
job.setJarByClass(MyReadJob.class);     // class that contains mapper

Scan scan = new Scan();
scan.setCaching(500);        // 1 is the default in Scan, which will be bad for MapReduce jobs
scan.setCacheBlocks(false);  // don't set to true for MR jobs
// set other scan attrs...

TableMapReduceUtil.initTableMapperJob(
    tableName,        // input HBase table name
    scan,             // Scan instance to control CF and attribute selection
    MyMapper.class,   // mapper
    null,             // mapper output key
    null,             // mapper output value
    job);
job.setOutputFormatClass(NullOutputFormat.class);   // because we aren't emitting anything from mapper

boolean b = job.waitForCompletion(true);
if (!b) {
    throw new IOException("error with job!");
}

...and the mapper instance would extend TableMapper...

public static class MyMapper extends TableMapper<Text, Text> {

    public void map(ImmutableBytesWritable row, Result value, Context context) throws InterruptedException, IOException {
        // process data for the row from the Result instance.
    }
}

7.2.2. HBase MapReduce Read/Write Example

The following is an example of using HBase both as a source and as a sink with MapReduce. This example will simply copy data from one table to another.

Configuration config = HBaseConfiguration.create();
Job job = new Job(config, "ExampleReadWrite");
job.setJarByClass(MyReadWriteJob.class);    // class that contains mapper

Scan scan = new Scan();
scan.setCaching(500);        // 1 is the default in Scan, which will be bad for MapReduce jobs
scan.setCacheBlocks(false);  // don't set to true for MR jobs
// set other scan attrs

TableMapReduceUtil.initTableMapperJob(
    sourceTable,      // input table
    scan,             // Scan instance to control CF and attribute selection
    MyMapper.class,   // mapper class
    null,             // mapper output key
    null,             // mapper output value
    job);
TableMapReduceUtil.initTableReducerJob(
    targetTable,      // output table
    null,             // reducer class
    job);
job.setNumReduceTasks(0);

boolean b = job.waitForCompletion(true);
if (!b) {
    throw new IOException("error with job!");
}

An explanation is required of what TableMapReduceUtil is doing, especially with the reducer. TableOutputFormat is being used as the outputFormat class, and several parameters are being set on the config (e.g., TableOutputFormat.OUTPUT_TABLE), as well as setting the reducer output key to ImmutableBytesWritable and reducer value to Writable. These could be set by the programmer on the job and conf, but TableMapReduceUtil tries to make things easier.
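Roughly, what that reducer-side call boils down to if done by hand is the sketch below; treat it as an illustration of the settings involved rather than the utility's actual source, since details vary between HBase versions.

// approximately what TableMapReduceUtil.initTableReducerJob(targetTable, null, job) sets up
job.setOutputFormatClass(TableOutputFormat.class);
job.getConfiguration().set(TableOutputFormat.OUTPUT_TABLE, targetTable);
job.setOutputKeyClass(ImmutableBytesWritable.class);
job.setOutputValueClass(Writable.class);
// it also ships the HBase and ZooKeeper jars with the job
TableMapReduceUtil.addDependencyJars(job);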

The following is the example mapper, which creates a Put matching the input Result and emits it. Note: this is what the CopyTable utility does.

public static class MyMapper extends TableMapper<ImmutableBytesWritable, Put> {

    public void map(ImmutableBytesWritable row, Result value, Context context) throws IOException, InterruptedException {
        // this example is just copying the data from the source table...
        context.write(row, resultToPut(row, value));
    }

    private static Put resultToPut(ImmutableBytesWritable key, Result result) throws IOException {
        Put put = new Put(key.get());
        for (KeyValue kv : result.raw()) {
            put.add(kv);
        }
        return put;
    }
}

There isn't actually a reducer step, so TableOutputFormat takes care of sending the Put to the target table.

This is just an example; developers could choose not to use TableOutputFormat and connect to the target table themselves, as in the sketch below.
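A minimal sketch of that alternative, assuming the older HTable client API used throughout these examples: the mapper opens the target table itself in setup(), writes each Put directly, and closes the table in cleanup(), so the job can use NullOutputFormat. The class and table names here are invented for illustration.

public static class MyDirectWriteMapper extends TableMapper<ImmutableBytesWritable, Put> {

    private HTable targetTable;

    protected void setup(Context context) throws IOException {
        // connect to the target table ourselves instead of relying on TableOutputFormat
        targetTable = new HTable(context.getConfiguration(), "targetTable");
    }

    public void map(ImmutableBytesWritable row, Result value, Context context) throws IOException, InterruptedException {
        // copy the row into a Put and write it straight to the target table
        Put put = new Put(row.get());
        for (KeyValue kv : value.raw()) {
            put.add(kv);
        }
        targetTable.put(put);
    }

    protected void cleanup(Context context) throws IOException {
        targetTable.close();
    }
}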

7.2.3. HBase MapReduce Read/Write Example With Multi-Table Output

TODO: example for MultiTableOutputFormat.
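Until that example is filled in, a hedged sketch of the general idea: with MultiTableOutputFormat the output key names the destination table (an ImmutableBytesWritable wrapping the table name) and the value is the Put, so a single job can write to several tables. The table names and Put variables below are invented for illustration.

// in job setup, instead of TableMapReduceUtil.initTableReducerJob(...):
job.setOutputFormatClass(MultiTableOutputFormat.class);

// in the mapper or reducer, the output key selects the destination table:
ImmutableBytesWritable summaryTable = new ImmutableBytesWritable(Bytes.toBytes("summaryTable"));
ImmutableBytesWritable detailTable = new ImmutableBytesWritable(Bytes.toBytes("detailTable"));
context.write(summaryTable, summaryPut);  // this Put goes to summaryTable
context.write(detailTable, detailPut);    // this Put goes to detailTable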

7.2.4. HBase MapReduce Summary to HBase Example

The following example uses HBase as a MapReduce source and sink with a summarization step. This example will count the number of distinct instances of a value in a table and write those summarized counts in another table.

Configuration config = HBaseConfiguration.create();
Job job = new Job(config, "ExampleSummary");
job.setJarByClass(MySummaryJob.class);     // class that contains mapper and reducer

Scan scan = new Scan();
scan.setCaching(500);        // 1 is the default in Scan, which will be bad for MapReduce jobs
scan.setCacheBlocks(false);  // don't set to true for MR jobs
// set other scan attrs

TableMapReduceUtil.initTableMapperJob(
    sourceTable,        // input table
    scan,               // Scan instance to control CF and attribute selection
    MyMapper.class,     // mapper class
    Text.class,         // mapper output key
    IntWritable.class,  // mapper output value
    job);
TableMapReduceUtil.initTableReducerJob(
    targetTable,             // output table
    MyTableReducer.class,    // reducer class
    job);
job.setNumReduceTasks(1);   // at least one, adjust as required

boolean b = job.waitForCompletion(true);
if (!b) {
    throw new IOException("error with job!");
}

In this example mapper, a column with a String value is chosen as the value to summarize upon. This value is used as the key to emit from the mapper, and an IntWritable represents an instance counter.

public static class MyMapper extends TableMapper<Text, IntWritable> {

    public static final byte[] CF = "cf".getBytes();
    public static final byte[] ATTR1 = "attr1".getBytes();

    private final IntWritable ONE = new IntWritable(1);
    private Text text = new Text();

    public void map(ImmutableBytesWritable row, Result value, Context context) throws IOException, InterruptedException {
        String val = new String(value.getValue(CF, ATTR1));
        text.set(val);     // we can only emit Writables...

        context.write(text, ONE);
    }
}

In the reducer, the "ones" are counted (just like in any other MR example that does this), and then a Put is emitted.

public static class MyTableReducer extends TableReducer<Text, IntWritable, ImmutableBytesWritable> {

    public static final byte[] CF = "cf".getBytes();
    public static final byte[] COUNT = "count".getBytes();

    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        int i = 0;
        for (IntWritable val : values) {
            i += val.get();
        }
        Put put = new Put(Bytes.toBytes(key.toString()));
        put.add(CF, COUNT, Bytes.toBytes(i));

        context.write(null, put);
    }
}

7.2.5. HBase MapReduce Summary to File Example

This is very similar to the summary example above, with the exception that HBase is used as the MapReduce source but HDFS is used as the sink. The differences are in the job setup and in the reducer; the mapper remains the same.

Configuration config = HBaseConfiguration.create();
Job job = new Job(config, "ExampleSummaryToFile");
job.setJarByClass(MySummaryFileJob.class);     // class that contains mapper and reducer

Scan scan = new Scan();
scan.setCaching(500);        // 1 is the default in Scan, which will be bad for MapReduce jobs
scan.setCacheBlocks(false);  // don't set to true for MR jobs
// set other scan attrs

TableMapReduceUtil.initTableMapperJob(
    sourceTable,        // input table
    scan,               // Scan instance to control CF and attribute selection
    MyMapper.class,     // mapper class
    Text.class,         // mapper output key
    IntWritable.class,  // mapper output value
    job);
job.setReducerClass(MyReducer.class);    // reducer class
job.setNumReduceTasks(1);    // at least one, adjust as required
FileOutputFormat.setOutputPath(job, new Path("/tmp/mr/mySummaryFile"));  // adjust directories as required

boolean b = job.waitForCompletion(true);
if (!b) {
    throw new IOException("error with job!");
}
As stated above, the previous Mapper can run unchanged with this example. As for the Reducer, it is a "generic" Reducer instead of extending TableReducer and emitting Puts.
public static class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        int i = 0;
        for (IntWritable val : values) {
            i += val.get();
        }
        context.write(key, new IntWritable(i));
    }
}

7.2.6. HBase MapReduce Summary to HBase Without Reducer

It is also possible to perform summaries without a reducer - if you use HBase as the reducer.

An HBase target table would need to exist for the job summary. The HTable method incrementColumnValue would be used to atomically increment values. From a performance perspective, it might make sense to keep a Map of values with their counts to be incremented for each map task, and make one update per key during the cleanup method of the mapper (see the sketch below). However, your mileage may vary depending on the number of rows to be processed and the number of unique keys.
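A minimal sketch of that idea, assuming a pre-existing summary table named "summary" with family "cf" and qualifier "count" (names invented for illustration): per-key counts are buffered in a Map during map() and flushed with one incrementColumnValue() per distinct key in cleanup().

public static class MyIncrementingMapper extends TableMapper<Text, IntWritable> {

    public static final byte[] CF = "cf".getBytes();
    public static final byte[] ATTR1 = "attr1".getBytes();
    public static final byte[] COUNT = "count".getBytes();

    // per-task buffer: value -> number of occurrences seen by this map task
    private Map<String, Long> counts = new HashMap<String, Long>();

    public void map(ImmutableBytesWritable row, Result value, Context context) throws IOException, InterruptedException {
        // nothing is emitted; we only accumulate counts locally
        String val = new String(value.getValue(CF, ATTR1));
        Long current = counts.get(val);
        counts.put(val, current == null ? 1L : current + 1);
    }

    protected void cleanup(Context context) throws IOException {
        // one atomic increment per distinct key, instead of one per row
        HTable summaryTable = new HTable(context.getConfiguration(), "summary");
        for (Map.Entry<String, Long> e : counts.entrySet()) {
            summaryTable.incrementColumnValue(Bytes.toBytes(e.getKey()), CF, COUNT, e.getValue());
        }
        summaryTable.close();
    }
}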

In the end, the summary results are in HBase.

7.2.7. HBase MapReduce Summary to RDBMS

Sometimes it is more appropriate to generate summaries to an RDBMS. For these cases, it is possible to generate summaries directly to an RDBMS via a custom reducer. The setup method can connect to an RDBMS (the connection information can be passed via custom parameters in the context) and the cleanup method can close the connection.

It is critical to understand that the number of reducers for the job affects the summarization implementation, and you'll have to design this into your reducer: specifically, whether it is designed to run as a singleton (one reducer) or with multiple reducers. Neither is right or wrong; it depends on your use case. Recognize that the more reducers are assigned to the job, the more simultaneous connections to the RDBMS will be created - this will scale, but only to a point.

public static class MyRdbmsReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private Connection c = null;

    public void setup(Context context) {
        // create DB connection...
    }

    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        // do summarization
        // in this example the keys are Text, but this is just an example
    }

    public void cleanup(Context context) {
        // close db connection
    }
}
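For illustration only, one way the placeholder bodies above might be filled in with plain JDBC; the configuration key, connection URL, and summary table schema are made up, and in practice they would be passed in via the job configuration as described above.

public void setup(Context context) {
    try {
        // read connection info passed in via the job configuration
        String url = context.getConfiguration().get("summary.jdbc.url");
        c = DriverManager.getConnection(url);
    } catch (SQLException e) {
        throw new RuntimeException("could not open RDBMS connection", e);
    }
}

public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
    int i = 0;
    for (IntWritable val : values) {
        i += val.get();
    }
    try {
        // one row per summarized key
        PreparedStatement ps = c.prepareStatement("INSERT INTO summary (k, cnt) VALUES (?, ?)");
        ps.setString(1, key.toString());
        ps.setInt(2, i);
        ps.executeUpdate();
        ps.close();
    } catch (SQLException e) {
        throw new IOException(e);
    }
}

public void cleanup(Context context) {
    try {
        if (c != null) c.close();
    } catch (SQLException e) {
        // ignore errors on close
    }
}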

In the end, the summary results are written to your RDBMS table/s.


First of all, you can set attributes such as startRow, stopRow, and filter on the Scan. That gives two approaches:

1. Set a filter on the Scan, run the mapper, and then reduce everything into one result.

2. Skip the Scan filter and push the filtering work into the mapper itself.

In my tests, the first approach is more efficient than the second when the scan covers relatively few records, but once the scan covers many records it tends to time out without returning anything and the job exits. To implement the second approach I had to learn how to pass parameters into the mapper tasks, which took a small detour.
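For reference, approach 1 would put the conditions on the Scan itself, roughly as in the sketch below. It reuses the FAMILY_NAME and QUALIFIER_NAME constants and the sample plate-number value from the code further down; treat it as an outline, not the exact filter setup I ran.

FilterList filterList = new FilterList(FilterList.Operator.MUST_PASS_ALL);
SingleColumnValueFilter licenseFilter = new SingleColumnValueFilter(
        FAMILY_NAME, QUALIFIER_NAME[0], CompareOp.EQUAL, Bytes.toBytes("新C87310"));
licenseFilter.setFilterIfMissing(true);  // drop rows that do not have the column at all
filterList.addFilter(licenseFilter);
scan.setFilter(filterList);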

A final thought: the second approach is still not very efficient, and even the first is not efficient when it is usable, because the default TableMapper assigns the scan of one whole region to a single mapper; each of my regions is over 2 GB, while the data I am querying spans only seven or eight regions. So I wonder whether the mapper could be split at some granularity other than a region; if that cannot be changed, the only remaining option is to use MR directly against HBase's underlying HDFS files, which... remains to be studied.

Now for the code (for confidentiality I have renamed the table, column family, and column names; if I missed any, please pretend you did not see them. The main point is the approach: using MR to query a large amount of HBase data, and how to pass parameters to the mapper):

package mapreduce.hbase;

import java.io.IOException;

import mapreduce.HDFS_File;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.Filter;
import org.apache.hadoop.hbase.filter.FilterList;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper.Context;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
 * Query HBase with MR. The Scan carries conditions such as startkey/endkey, and filters are
 * meant to drop records that do not match. A RowKey in LicenseTable looks like
 * 201101010000000095\xE5\xAE\x81WDTLBZ
 *
 * @author Wallace
 *
 */
@SuppressWarnings("unused")
public class MRSearchAuto {
    private static final Log LOG = LogFactory.getLog(MRSearchAuto.class);

    private static String TABLE_NAME = "tablename";
    private static byte[] FAMILY_NAME = Bytes.toBytes("cfname");
    private static byte[][] QUALIFIER_NAME = { Bytes.toBytes("col1"),
            Bytes.toBytes("col2"), Bytes.toBytes("col3") };

    public static class SearchMapper extends
            TableMapper<ImmutableBytesWritable, Text> {
        private int numOfFilter = 0;

        private Text word = new Text();
        String[] strConditionStrings = new String[] { "", "", "" }/* { "新C87310", "10", "2" } */;

        /*
         * private void init(Configuration conf) throws IOException,
         * InterruptedException { strConditionStrings[0] =
         * conf.get("search.license").trim(); strConditionStrings[1] =
         * conf.get("search.carColor").trim(); strConditionStrings[2] =
         * conf.get("search.direction").trim(); LOG.info("license: " +
         * strConditionStrings[0]); }
         */
        protected void setup(Context context) throws IOException,
                InterruptedException {
            strConditionStrings[0] = context.getConfiguration().get("search.license").trim();
            strConditionStrings[1] = context.getConfiguration().get("search.color").trim();
            strConditionStrings[2] = context.getConfiguration().get("search.direction").trim();
        }

        protected void map(ImmutableBytesWritable key, Result value,
                Context context) throws InterruptedException, IOException {
            String string = "";
            String tempString;

            for (int i = 0; i < 1; i++) {
                // do the filter's work inside this map()
                tempString = Text.decode(value.getValue(FAMILY_NAME,
                        QUALIFIER_NAME[i]));
                if (tempString.equals(/* strConditionStrings[i] */"新C87310")) {
                    LOG.info("新C87310. conf: " + strConditionStrings[0]);
                    if (tempString.equals(strConditionStrings[i])) {
                        string = string + tempString + " ";
                    } else {
                        return;
                    }
                } else {
                    return;
                }
            }

            word.set(string);
            context.write(null, word);
        }
    }

    public void searchHBase(int numOfDays) throws IOException,
            InterruptedException, ClassNotFoundException {
        long startTime;
        long endTime;

        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "node2,node3,node4");
        conf.set("fs.default.name", "hdfs://node1");
        conf.set("mapred.job.tracker", "node1:54311");
        /*
         * pass parameters to the map tasks
         */
        conf.set("search.license", "新C87310");
        conf.set("search.color", "10");
        conf.set("search.direction", "2");

        Job job = new Job(conf, "MRSearchHBase");
        System.out.println("search.license: " + conf.get("search.license"));
        job.setNumReduceTasks(0);
        job.setJarByClass(MRSearchAuto.class);
        Scan scan = new Scan();
        scan.addFamily(FAMILY_NAME);
        byte[] startRow = Bytes.toBytes("2011010100000");
        byte[] stopRow;
        switch (numOfDays) {
        case 1:
            stopRow = Bytes.toBytes("2011010200000");
            break;
        case 10:
            stopRow = Bytes.toBytes("2011011100000");
            break;
        case 30:
            stopRow = Bytes.toBytes("2011020100000");
            break;
        case 365:
            stopRow = Bytes.toBytes("2012010100000");
            break;
        default:
            stopRow = Bytes.toBytes("2011010101000");
        }
        // set the start and stop row keys
        scan.setStartRow(startRow);
        scan.setStopRow(stopRow);

        TableMapReduceUtil.initTableMapperJob(TABLE_NAME, scan,
                SearchMapper.class, ImmutableBytesWritable.class, Text.class,
                job);
        Path outPath = new Path("searchresult");
        HDFS_File file = new HDFS_File();
        file.DelFile(conf, outPath.getName(), true); // delete the output dir first if it already exists
        FileOutputFormat.setOutputPath(job, outPath); // output path for the results

        startTime = System.currentTimeMillis();
        job.waitForCompletion(true);
        endTime = System.currentTimeMillis();
        System.out.println("Time used: " + (endTime - startTime));
        System.out.println("startRow:" + Text.decode(startRow));
        System.out.println("stopRow: " + Text.decode(stopRow));
    }

    public static void main(String args[]) throws IOException,
            InterruptedException, ClassNotFoundException {
        MRSearchAuto mrSearchAuto = new MRSearchAuto();
        int numOfDays = 1;
        if (args.length == 1)
            numOfDays = Integer.valueOf(args[0]);
        System.out.println("Num of days: " + numOfDays);
        mrSearchAuto.searchHBase(numOfDays);
    }
}

At first, I called conf.set for the parameters on the outside, and read them back inside the mapper's init(Configuration) method, assigning them to the mapper object's fields.

Passing the parameters to the map this way gave the wrong results at runtime:
for (int i = 0; i < 1; i++) {
    // do the filter's work inside this map()
    tempString = Text.decode(value.getValue(FAMILY_NAME,
      QUALIFIER_NAME[i]));
    if (tempString.equals(/*strConditionStrings[i]*/"新C87310"))
     string = string + tempString + " ";
    else {
     return;
    }
   }
If I used the mapper's init() below to fetch the parameters from the conf and then referenced them in the map() above, the results came out wrong: with the value hard-coded, the output contained one record; with the same value passed in as a parameter, it contained zero.
  private void init(Configuration conf) throws IOException,
    InterruptedException {
   strConditionStrings[0] = conf.get("search.licenseNumber").trim();
   strConditionStrings[1] = conf.get("search.carColor").trim();
   strConditionStrings[2] = conf.get("search.direction").trim();
  }
So I added some logging:
private static final Log LOG = LogFactory.getLog(MRSearchAuto.class);
In init():
LOG.info("license: " + strConditionStrings[0]);
In map():
 if (tempString.equals(/* strConditionStrings[i] */"新C87310")) {
  LOG.info("新C87310. conf: " + strConditionStrings[0]);
Then I checked the job on the web UI at namenode:50030, tracked down which machine ran that map task, and looked at its log:
mapreduce.hbase.TestMRHBase: 新C87310. conf: null
I had also printed the value right after conf.set, and it was fine there, yet inside map() it was null, and the line logged in the mapper's init() never showed up at all.
So the problem had to be:
the mapper's init() method was never being called!
Hence the code in init() that reads the parameter values from the conf and assigns them to the mapper's fields never ran, and neither did its log statement.
OK, let's see how to fix it.
Fetch the values in setup() instead:
  protected void setup(Context context) throws IOException,
    InterruptedException {
  // strConditionStrings[0] = context.getConfiguration().get("search.license").trim();
  // strConditionStrings[1] = context.getConfiguration().get("search.color").trim();
  // strConditionStrings[2] = context.getConfiguration().get("search.direction").trim();
  }
That failed with an error:
12/01/12 11:21:56 INFO mapred.JobClient:  map 0% reduce 0%
12/01/12 11:22:03 INFO mapred.JobClient: Task Id : attempt_201201100941_0071_m_000000_0, Status : FAILED
java.lang.NullPointerException
 at mapreduce.hbase.MRSearchAuto$SearchMapper.setup(MRSearchAuto.java:66)
 at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
 at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:656)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:325)
 at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:396)
 at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
 at org.apache.hadoop.mapred.Child.main(Child.java:264)

attempt_201201100941_0071_m_000000_0: log4j:WARN No appenders could be found for logger (org.apache.hadoop.hdfs.DFSClient).
attempt_201201100941_0071_m_000000_0: log4j:WARN Please initialize the log4j system properly.
12/01/12 11:22:09 INFO mapred.JobClient: Task Id : attempt_201201100941_0071_m_000000_1, Status : FAILED
java.lang.NullPointerException
 at mapreduce.hbase.MRSearchAuto$SearchMapper.setup(MRSearchAuto.java:66)
 at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
 at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:656)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:325)
 at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:396)
 at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
 at org.apache.hadoop.mapred.Child.main(Child.java:264)
Then I commented out the body of setup() and the error disappeared, so the problem had to involve the context. To confirm, I assigned the values directly inside setup() without touching the context, and results came out. Good.
So it was the context's doing: the NullPointerException meant that one of the context.getConfiguration().get("search.license") calls and its siblings was returning null.
Then it hit me: I had changed the property names on the get side but not on the set side, so they no longer matched. As a result, context.getConfiguration().get("search.color") and the one below it were both null, and calling trim() on null threw the exception.
  conf.set("search.license", "新C87310");
  conf.set("search.color", "10");
  conf.set("search.direction", "2");
After making the property names match on both sides, the problem was solved.
Passing parameters into the map now works.



