Querying HBase Data with MR (MapReduce): Using TableMapper and Scan
Source: Internet | Editor: 程序博客网 | Time: 2024/06/07 02:07
7.2. HBase MapReduce Examples
7.2.1. HBase MapReduce Read Example
The following is an example of using HBase as a MapReduce source in a read-only manner. Specifically, there is a Mapper instance but no Reducer, and nothing is emitted from the Mapper. The job would be defined as follows:
Configuration config = HBaseConfiguration.create();
Job job = new Job(config, "ExampleRead");
job.setJarByClass(MyReadJob.class);     // class that contains mapper

Scan scan = new Scan();
scan.setCaching(500);        // 1 is the default in Scan, which will be bad for MapReduce jobs
scan.setCacheBlocks(false);  // don't set to true for MR jobs
// set other scan attrs...

TableMapReduceUtil.initTableMapperJob(
  tableName,        // input HBase table name
  scan,             // Scan instance to control CF and attribute selection
  MyMapper.class,   // mapper
  null,             // mapper output key
  null,             // mapper output value
  job);
job.setOutputFormatClass(NullOutputFormat.class);   // because we aren't emitting anything from mapper

boolean b = job.waitForCompletion(true);
if (!b) {
  throw new IOException("error with job!");
}
...and the mapper instance would extend TableMapper...
public static class MyMapper extends TableMapper<Text, Text> {

  public void map(ImmutableBytesWritable row, Result value, Context context) throws InterruptedException, IOException {
    // process data for the row from the Result instance.
  }
}
7.2.2. HBase MapReduce Read/Write Example
The following is an example of using HBase both as a source and as a sink with MapReduce. This example will simply copy data from one table to another.
Configuration config = HBaseConfiguration.create();
Job job = new Job(config, "ExampleReadWrite");
job.setJarByClass(MyReadWriteJob.class);    // class that contains mapper

Scan scan = new Scan();
scan.setCaching(500);        // 1 is the default in Scan, which will be bad for MapReduce jobs
scan.setCacheBlocks(false);  // don't set to true for MR jobs
// set other scan attrs

TableMapReduceUtil.initTableMapperJob(
  sourceTable,      // input table
  scan,             // Scan instance to control CF and attribute selection
  MyMapper.class,   // mapper class
  null,             // mapper output key
  null,             // mapper output value
  job);
TableMapReduceUtil.initTableReducerJob(
  targetTable,      // output table
  null,             // reducer class
  job);
job.setNumReduceTasks(0);

boolean b = job.waitForCompletion(true);
if (!b) {
  throw new IOException("error with job!");
}
What TableMapReduceUtil is doing here deserves an explanation, especially for the reducer. TableOutputFormat is used as the outputFormat class, and several parameters are set on the config (e.g., TableOutputFormat.OUTPUT_TABLE), along with the reducer output key (ImmutableBytesWritable) and the reducer output value (Writable). The programmer could set these on the job and conf directly, but TableMapReduceUtil tries to make things easier.
The following is the example mapper, which creates a Put matching the input Result and emits it. Note: this is what the CopyTable utility does.
public static class MyMapper extends TableMapper<ImmutableBytesWritable, Put> {

  public void map(ImmutableBytesWritable row, Result value, Context context) throws IOException, InterruptedException {
    // this example is just copying the data from the source table...
    context.write(row, resultToPut(row, value));
  }

  private static Put resultToPut(ImmutableBytesWritable key, Result result) throws IOException {
    Put put = new Put(key.get());
    for (KeyValue kv : result.raw()) {
      put.add(kv);
    }
    return put;
  }
}
There isn't actually a reducer step, so TableOutputFormat takes care of sending the Put to the target table.
This is just an example; developers could choose not to use TableOutputFormat and connect to the target table themselves.
7.2.3. HBase MapReduce Read/Write Example With Multi-Table Output
TODO: example for MultiTableOutputFormat.
7.2.4. HBase MapReduce Summary to HBase Example
The following example uses HBase as a MapReduce source and sink with a summarization step. This example will count the number of distinct instances of a value in a table and write those summarized counts in another table.
Configuration config = HBaseConfiguration.create();
Job job = new Job(config, "ExampleSummary");
job.setJarByClass(MySummaryJob.class);     // class that contains mapper and reducer

Scan scan = new Scan();
scan.setCaching(500);        // 1 is the default in Scan, which will be bad for MapReduce jobs
scan.setCacheBlocks(false);  // don't set to true for MR jobs
// set other scan attrs

TableMapReduceUtil.initTableMapperJob(
  sourceTable,        // input table
  scan,               // Scan instance to control CF and attribute selection
  MyMapper.class,     // mapper class
  Text.class,         // mapper output key
  IntWritable.class,  // mapper output value
  job);
TableMapReduceUtil.initTableReducerJob(
  targetTable,          // output table
  MyTableReducer.class, // reducer class
  job);
job.setNumReduceTasks(1);   // at least one, adjust as required

boolean b = job.waitForCompletion(true);
if (!b) {
  throw new IOException("error with job!");
}
In this example mapper, a column with a String value is chosen as the value to summarize on. This value is used as the key emitted from the mapper, and an IntWritable represents an instance counter.
public static class MyMapper extends TableMapper<Text, IntWritable> {
  public static final byte[] CF = "cf".getBytes();
  public static final byte[] ATTR1 = "attr1".getBytes();

  private final IntWritable ONE = new IntWritable(1);
  private Text text = new Text();

  public void map(ImmutableBytesWritable row, Result value, Context context) throws IOException, InterruptedException {
    String val = new String(value.getValue(CF, ATTR1));
    text.set(val);     // we can only emit Writables...
    context.write(text, ONE);
  }
}
In the reducer, the "ones" are counted (just like any other MR example that does this), and then a Put is emitted.
public static class MyTableReducer extends TableReducer<Text, IntWritable, ImmutableBytesWritable> {
  public static final byte[] CF = "cf".getBytes();
  public static final byte[] COUNT = "count".getBytes();

  public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
    int i = 0;
    for (IntWritable val : values) {
      i += val.get();
    }
    Put put = new Put(Bytes.toBytes(key.toString()));
    put.add(CF, COUNT, Bytes.toBytes(i));

    context.write(null, put);
  }
}
7.2.5. HBase MapReduce Summary to File Example
This is very similar to the summary example above, except that it uses HBase as a MapReduce source but HDFS as the sink. The differences are in the job setup and in the reducer. The mapper remains the same.
Configuration config = HBaseConfiguration.create();
Job job = new Job(config, "ExampleSummaryToFile");
job.setJarByClass(MySummaryFileJob.class);     // class that contains mapper and reducer

Scan scan = new Scan();
scan.setCaching(500);        // 1 is the default in Scan, which will be bad for MapReduce jobs
scan.setCacheBlocks(false);  // don't set to true for MR jobs
// set other scan attrs

TableMapReduceUtil.initTableMapperJob(
  sourceTable,        // input table
  scan,               // Scan instance to control CF and attribute selection
  MyMapper.class,     // mapper class
  Text.class,         // mapper output key
  IntWritable.class,  // mapper output value
  job);
job.setReducerClass(MyReducer.class);    // reducer class
job.setNumReduceTasks(1);    // at least one, adjust as required
FileOutputFormat.setOutputPath(job, new Path("/tmp/mr/mySummaryFile"));  // adjust directories as required

boolean b = job.waitForCompletion(true);
if (!b) {
  throw new IOException("error with job!");
}
As stated above, the previous Mapper can run unchanged with this example. As for the Reducer, it is a "generic" Reducer instead of one that extends TableReducer and emits Puts.
public static class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
    int i = 0;
    for (IntWritable val : values) {
      i += val.get();
    }
    context.write(key, new IntWritable(i));
  }
}
7.2.6. HBase MapReduce Summary to HBase Without Reducer
It is also possible to perform summaries without a reducer - if you use HBase as the reducer.
An HBase target table would need to exist for the job summary. The HTable method incrementColumnValue would be used to atomically increment values. From a performance perspective, it might make sense to keep, in each map task, a Map from each key to the amount by which it should be incremented, and to make one update per key during the cleanup method of the mapper. However, your mileage may vary depending on the number of rows to be processed and the number of unique keys.
In the end, the summary results are in HBase.
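The per-task buffering idea described above can be sketched in plain Java. This is only an illustration under assumptions: `CounterSink` is a hypothetical stand-in for whatever performs the atomic increment (in a real job that role would be played by the incrementColumnValue call on the target table), and `BufferedCounter` is not an HBase class.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for the call that atomically increments a counter in the target table. */
interface CounterSink {
    void increment(String key, long amount);
}

/** Buffers counts per key in memory and sends one increment per key when flushed. */
class BufferedCounter {
    private final Map<String, Long> pending = new HashMap<>();

    /** Called once per input row, as map() would do. */
    void add(String key) {
        pending.merge(key, 1L, Long::sum);
    }

    /** Called once per task, as cleanup() would do. Returns the number of updates sent. */
    int flush(CounterSink sink) {
        int sent = 0;
        for (Map.Entry<String, Long> e : pending.entrySet()) {
            sink.increment(e.getKey(), e.getValue());
            sent++;
        }
        pending.clear();
        return sent;
    }
}
```

The trade-off is memory: the buffer grows with the number of distinct keys seen by one map task, which is why the benefit depends on the ratio of rows to unique keys.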
7.2.7. HBase MapReduce Summary to RDBMS
Sometimes it is more appropriate to generate summaries to an RDBMS. For these cases, it is possible to generate summaries directly to an RDBMS via a custom reducer. The setup method can connect to an RDBMS (the connection information can be passed via custom parameters in the context) and the cleanup method can close the connection.
It is critical to understand that the number of reducers for the job affects the summarization implementation, and you'll have to design this into your reducer: specifically, whether it is designed to run as a singleton (one reducer) or as multiple reducers. Neither is right or wrong; it depends on your use case. Recognize that the more reducers are assigned to the job, the more simultaneous connections to the RDBMS will be created. This will scale, but only to a point.
public static class MyRdbmsReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

  private Connection c = null;

  public void setup(Context context) {
    // create DB connection...
  }

  public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
    // do summarization
    // in this example the keys are Text, but this is just an example
  }

  public void cleanup(Context context) {
    // close db connection
  }
}
In the end, the summary results are written to your RDBMS table/s.
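To make the singleton-versus-multiple-reducers point more concrete: with hash partitioning, each key is routed to exactly one reducer, so per-key totals are complete inside a single reducer while the overall result is written through as many connections as there are reducers. A minimal sketch of that routing rule in plain Java follows. The class name `ReducerRouting` is made up for illustration, and `String.hashCode()` stands in for the hash of the real key type, so the reducer numbers it prints will not match those of an actual job.

```java
class ReducerRouting {
    /** Hash-partitioning rule: a non-negative hash modulo the reducer count. */
    static int reducerFor(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        String[] keys = {"red", "green", "blue"};
        for (String k : keys) {
            // The same key always maps to the same reducer for a fixed reducer count.
            System.out.println(k + " -> reducer " + reducerFor(k, 4));
        }
    }
}
```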
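First, a Scan can be given attributes such as startRow, stopRow, and a filter. That leaves two approaches:
1. Set a filter on the scan, run the mapper, and then reduce to a single result.
2. Skip the filter and have the mapper do the filter's work.
I tested both. The former is more efficient than the latter when the scan covers relatively few records, but once the scan covers many records it tends to time out and exit without returning anything. Implementing the latter meant learning how to pass parameters into a mapper task, which took a few detours.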
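The second approach amounts to a plain equality check per row: a row is kept only if every configured condition matches. A rough sketch of that check in plain Java, assuming the row has already been decoded into a column-to-value map (`RowMatcher` and the map-based row are illustrative simplifications, not HBase types):

```java
import java.util.Map;

class RowMatcher {
    /** Returns true only if every expected column value is present and equal in the row. */
    static boolean matches(Map<String, String> row, Map<String, String> expected) {
        for (Map.Entry<String, String> cond : expected.entrySet()) {
            String actual = row.get(cond.getKey());
            if (actual == null || !actual.equals(cond.getValue())) {
                return false; // skip this row, like an early return in map()
            }
        }
        return true;
    }
}
```

Unlike a server-side filter, this check runs after the row has already been shipped to the mapper, so it trades network and scan cost for simpler control over the matching logic.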
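One last thought: the latter is still not very efficient, and neither is the former even when it works, because the default TableMapper puts the scan of an entire region into a single mapper. Each of my regions is over 2 GB, while the data I query spans only seven or eight regions. So I wondered whether mappers could be split by something other than region. If that cannot be changed, the only option is to have MR operate directly on HBase's underlying HDFS files, which is still to be investigated.
Here is the code (for confidentiality the table name, column names, and column family name have been changed; if I missed any, please pretend you did not see them. It is mainly meant as a reference for the approach, namely using MR to query massive HBase data and passing parameters to a mapper):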
package mapreduce.hbase;
import java.io.IOException;
import mapreduce.HDFS_File;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.Filter;
import org.apache.hadoop.hbase.filter.FilterList;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper.Context;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
/**
 * Search HBase with MR, given Scan conditions such as startkey and endkey,
 * plus filters to drop records that do not match. RowKey of LicenseTable:
 * 201101010000000095\xE5\xAE\x81WDTLBZ
 *
 * @author Wallace
 *
 */
@SuppressWarnings("unused")
public class MRSearchAuto {
    private static final Log LOG = LogFactory.getLog(MRSearchAuto.class);
    private static String TABLE_NAME = "tablename";
    private static byte[] FAMILY_NAME = Bytes.toBytes("cfname");
    private static byte[][] QUALIFIER_NAME = { Bytes.toBytes("col1"),
            Bytes.toBytes("col2"), Bytes.toBytes("col3") };
    public static class SearchMapper extends
            TableMapper<ImmutableBytesWritable, Text> {
        private int numOfFilter = 0;
        private Text word = new Text();
        String[] strConditionStrings = new String[]{"","",""}/* { "新C87310", "10", "2" } */;
        /*
         * private void init(Configuration conf) throws IOException,
         * InterruptedException { strConditionStrings[0] =
         * conf.get("search.license").trim(); strConditionStrings[1] =
         * conf.get("search.carColor").trim(); strConditionStrings[2] =
         * conf.get("search.direction").trim(); LOG.info("license: " +
         * strConditionStrings[0]); }
         */
        protected void setup(Context context) throws IOException,
                InterruptedException {
            strConditionStrings[0] = context.getConfiguration().get("search.license").trim();
            strConditionStrings[1] = context.getConfiguration().get("search.color").trim();
            strConditionStrings[2] = context.getConfiguration().get("search.direction").trim();
        }
        protected void map(ImmutableBytesWritable key, Result value,
                Context context) throws InterruptedException, IOException {
            String string = "";
            String tempString;
            /**/
            for (int i = 0; i < 1; i++) {
                // do the filter's work here in map
                tempString = Text.decode(value.getValue(FAMILY_NAME,
                        QUALIFIER_NAME[i]));
                if (tempString.equals(/* strConditionStrings[i] */"新C87310")) {
                    LOG.info("新C87310. conf: " + strConditionStrings[0]);
                    if (tempString.equals(strConditionStrings[i])) {
                        string = string + tempString + " ";
                    } else {
                        return;
                    }
                }
                else {
                    return;
                }
            }
            word.set(string);
            context.write(null, word);
        }
    }
    public void searchHBase(int numOfDays) throws IOException,
            InterruptedException, ClassNotFoundException {
        long startTime;
        long endTime;
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "node2,node3,node4");
        conf.set("fs.default.name", "hdfs://node1");
        conf.set("mapred.job.tracker", "node1:54311");
        /*
         * Pass parameters to map
         */
        conf.set("search.license", "新C87310");
        conf.set("search.color", "10");
        conf.set("search.direction", "2");
        Job job = new Job(conf, "MRSearchHBase");
        System.out.println("search.license: " + conf.get("search.license"));
        job.setNumReduceTasks(0);
        job.setJarByClass(MRSearchAuto.class);
        Scan scan = new Scan();
        scan.addFamily(FAMILY_NAME);
        byte[] startRow = Bytes.toBytes("2011010100000");
        byte[] stopRow;
        switch (numOfDays) {
        case 1:
            stopRow = Bytes.toBytes("2011010200000");
            break;
        case 10:
            stopRow = Bytes.toBytes("2011011100000");
            break;
        case 30:
            stopRow = Bytes.toBytes("2011020100000");
            break;
        case 365:
            stopRow = Bytes.toBytes("2012010100000");
            break;
        default:
            stopRow = Bytes.toBytes("2011010101000");
        }
        // set the start and stop keys
        scan.setStartRow(startRow);
        scan.setStopRow(stopRow);
        TableMapReduceUtil.initTableMapperJob(TABLE_NAME, scan,
                SearchMapper.class, ImmutableBytesWritable.class, Text.class,
                job);
        Path outPath = new Path("searchresult");
        HDFS_File file = new HDFS_File();
        file.DelFile(conf, outPath.getName(), true); // delete it first if it already exists
        FileOutputFormat.setOutputPath(job, outPath); // output result
        startTime = System.currentTimeMillis();
        job.waitForCompletion(true);
        endTime = System.currentTimeMillis();
        System.out.println("Time used: " + (endTime - startTime));
        System.out.println("startRow:" + Text.decode(startRow));
        System.out.println("stopRow: " + Text.decode(stopRow));
    }
    public static void main(String args[]) throws IOException,
            InterruptedException, ClassNotFoundException {
        MRSearchAuto mrSearchAuto = new MRSearchAuto();
        int numOfDays = 1;
        if (args.length == 1)
            numOfDays = Integer.valueOf(args[0]);
        System.out.println("Num of days: " + numOfDays);
        mrSearchAuto.searchHBase(numOfDays);
    }
}
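The hard-coded switch over numOfDays could instead derive the stop row from the start date. The following is a sketch under assumptions, not the author's code: it assumes the row key begins with a yyyyMMdd date followed by five zero characters, as the literals in the listing suggest, and `StopRows` is a hypothetical helper. Note that it counts exact days, so 30 days from 20110101 yields 2011013100000, whereas the listing's `case 30` stops at 2011020100000 (the whole of January), and the listing's default case (2011010101000) is not covered.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

class StopRows {
    // BASIC_ISO_DATE parses and formats dates as yyyyMMdd.
    private static final DateTimeFormatter DAY = DateTimeFormatter.BASIC_ISO_DATE;

    /** Builds the exclusive stop-row prefix for a scan covering numOfDays days from startDay. */
    static String stopRowPrefix(String startDay, int numOfDays) {
        LocalDate start = LocalDate.parse(startDay, DAY);
        return start.plusDays(numOfDays).format(DAY) + "00000";
    }
}
```

The returned string would then be converted to bytes and passed to the scan as the stop row, in the same way the listing does with its literals.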
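At first, I called conf.set in the driver to pass in the parameters, and read them in the mapper's init(Configuration), assigning them to fields of the mapper object.
With the parameters passed to map this way, the results were wrong.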
for (int i = 0; i < 1; i++) {
    // do the filter's work here in map
    tempString = Text.decode(value.getValue(FAMILY_NAME,
            QUALIFIER_NAME[i]));
    if (tempString.equals(/*strConditionStrings[i]*/"新C87310"))
        string = string + tempString + " ";
    else {
        return;
    }
}
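If the mapper's init below is used to read the parameters from conf, and the map function above then uses them, the results are wrong.
With the value hard-coded, the output is 1 record; with the same value passed in as a parameter, the output is 0 records.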
private void init(Configuration conf) throws IOException,
InterruptedException {
strConditionStrings[0] = conf.get("search.licenseNumber").trim();
strConditionStrings[1] = conf.get("search.carColor").trim();
strConditionStrings[2] = conf.get("search.direction").trim();
}
I added some logging:
private static final Log LOG = LogFactory.getLog(MRSearchAuto.class);
In the init() function:
LOG.info("license: " + strConditionStrings[0]);
In map:
if (tempString.equals(/* strConditionStrings[i] */"新C87310")) {
LOG.info("新C87310. conf: " + strConditionStrings[0]);
Then I looked at the job on the web page at namenode:50030, tracked down which machine had run that map, and read its log:
mapreduce.hbase.TestMRHBase: 新C87310. conf: null
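I also printed the value right after conf.set, and it was fine there, but inside map it was null, and the log statement in the mapper class's init function never printed anything.
So the problem must be:
the mapper class's init() function is never executed!
As a result, the code in init() that reads the parameter values from conf and assigns them to the mapper's fields never ran, and neither did the log statement.
OK, now for the fix.
Read the parameters in setup instead: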
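This follows from the mapper lifecycle: the framework's run() method calls setup() once, then map() for each record, then cleanup(), and it knows nothing about other methods such as a private init(). A toy model in plain Java shows the effect. The `MiniMapper` and `MiniSearchMapper` classes are invented for illustration and only mimic the call order; they are not Hadoop classes.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy base class that mimics the setup -> map -> cleanup call order. */
abstract class MiniMapper {
    final List<String> calls = new ArrayList<>();

    protected void setup() { }
    protected abstract void map(String record);
    protected void cleanup() { }

    /** The framework only ever calls the three hooks above. */
    final void run(List<String> records) {
        setup();
        for (String r : records) {
            map(r);
        }
        cleanup();
    }
}

class MiniSearchMapper extends MiniMapper {
    String condition; // stays null unless one of the hooks assigns it

    // Never invoked by run(): nothing in the base class knows about it.
    private void init() {
        calls.add("init");
        condition = "from init";
    }

    @Override
    protected void setup() {
        calls.add("setup");
        condition = "from setup";
    }

    @Override
    protected void map(String record) {
        calls.add("map:" + record);
    }
}
```

Running `new MiniSearchMapper().run(...)` records only the setup and map calls, which is why per-task initialization belongs in setup().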
protected void setup(Context context) throws IOException,
InterruptedException {
// strConditionStrings[0] = context.getConfiguration().get("search.license").trim();
// strConditionStrings[1] = context.getConfiguration().get("search.color").trim();
// strConditionStrings[2] = context.getConfiguration().get("search.direction").trim();
}
This fails with an error:
12/01/12 11:21:56 INFO mapred.JobClient: map 0% reduce 0%
12/01/12 11:22:03 INFO mapred.JobClient: Task Id : attempt_201201100941_0071_m_000000_0, Status : FAILED
java.lang.NullPointerException
at mapreduce.hbase.MRSearchAuto$SearchMapper.setup(MRSearchAuto.java:66)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:656)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:325)
at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
at org.apache.hadoop.mapred.Child.main(Child.java:264)
attempt_201201100941_0071_m_000000_0: log4j:WARN No appenders could be found for logger (org.apache.hadoop.hdfs.DFSClient).
attempt_201201100941_0071_m_000000_0: log4j:WARN Please initialize the log4j system properly.
12/01/12 11:22:09 INFO mapred.JobClient: Task Id : attempt_201201100941_0071_m_000000_1, Status : FAILED
java.lang.NullPointerException
at mapreduce.hbase.MRSearchAuto$SearchMapper.setup(MRSearchAuto.java:66)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:656)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:325)
at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
at org.apache.hadoop.mapred.Child.main(Child.java:264)
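Then I commented out the contents of setup and the error went away, so the error had to involve context. To confirm, I assigned the values directly in setup without using context, and got results. Good!
So it was context after all: the NullPointerException means that one of the context.getConfiguration().get("search.license") style calls returned null.
Then I remembered: I had changed the property names in the get calls but not in the set calls, so they no longer matched. context.getConfiguration().get("search.color") and the call after it returned null, and null.trim() threw the exception.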
conf.set("search.license", "新C87310");
conf.set("search.color", "10");
conf.set("search.direction", "2");
After making the names match, the problem was solved.
Passing parameters to map now works.
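One way to avoid this kind of mismatch is to define each property name exactly once and to fail with a readable message when a value is missing, rather than letting trim() throw a NullPointerException. A small sketch in plain Java follows; a `Map` stands in for the job configuration so the example runs on its own, and `SearchParams` is a hypothetical helper, not part of the code above.

```java
import java.util.Map;

class SearchParams {
    // Single source of truth for the property names used by both driver and mapper.
    static final String LICENSE = "search.license";
    static final String COLOR = "search.color";
    static final String DIRECTION = "search.direction";

    /** Returns the trimmed value, or throws an exception that names the missing key. */
    static String require(Map<String, String> conf, String key) {
        String value = conf.get(key);
        if (value == null) {
            throw new IllegalArgumentException("Missing job parameter: " + key);
        }
        return value.trim();
    }
}
```

With the names shared as constants, a renamed property changes in one place, and a forgotten conf.set shows up in the task log as a named missing parameter.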