HBase Source Code Analysis (17): The MapReduce Process

Source: Internet. Editor: 程序博客网. Time: 2024/05/23 18:37

This chapter covers the MapReduce jobs that ship inside HBase:

1) HBase as a data source,

2) HBase as an output sink,

3) moving data between HBase tables.

1) HBase as a data source: Export.java

  public static Job createSubmittableJob(Configuration conf, String[] args)
  throws IOException {
    String tableName = args[0];
    Path outputDir = new Path(args[1]);
    Job job = new Job(conf, NAME + "_" + tableName);
    job.setJobName(NAME + "_" + tableName);
    job.setJarByClass(Export.class);
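    // Build the Scan: depending on the arguments, set a filter, start key, end key, etc.
    // In the simplest case this is just: Scan s = new Scan();
    Scan s = getConfiguredScanForJob(conf, args);
    // One map task is created per region, so the number of maps equals the number of regions.
    // The mapper does essentially nothing: it writes out what it reads, unchanged.
    // This call also sets the map input format to TableInputFormat.class.
    IdentityTableMapper.initJob(tableName, s, IdentityTableMapper.class, job);
    // No reducers.  Just write straight to output files.
    // The map output is saved directly.
    job.setNumReduceTasks(0);
    // Output file format and key/value types.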
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    job.setOutputKeyClass(ImmutableBytesWritable.class);
    job.setOutputValueClass(Result.class);
    FileOutputFormat.setOutputPath(job, outputDir); // job conf doesn't contain the conf so doesn't have a default fs.
    return job;
  }

The most important question in this MapReduce job is how each region ends up with exactly one map task. That is decided by TableInputFormat:

  @Override
  public List<InputSplit> getSplits(JobContext context) throws IOException {
    List<InputSplit> splits = super.getSplits(context);
    if ((conf.get(SHUFFLE_MAPS) != null) && "true".equals(conf.get(SHUFFLE_MAPS).toLowerCase())) {
      Collections.shuffle(splits);
    }
    return splits;
  }

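Here getSplits builds the splits from the region location information, and the split boundaries are the region start keys. The number of map tasks follows the number of regions, so each region gets one map. The map count is never set explicitly in the job because there is no need: it falls out of the number of splits.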
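The one-split-per-region idea can be sketched without any HBase dependency. The code below is an illustration under simplifying assumptions, not the TableInputFormatBase implementation: RegionSplitSketch, Split and splitsFor are hypothetical names, row keys are plain strings, and an empty string stands for an open start or end key. The optional shuffle mirrors the SHUFFLE_MAPS branch in getSplits above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Illustrative only: models "one input split per region" with string row keys.
class RegionSplitSketch {

  // A split covers the row range [startRow, endRow); "" means unbounded.
  static final class Split {
    final String startRow;
    final String endRow;

    Split(String startRow, String endRow) {
      this.startRow = startRow;
      this.endRow = endRow;
    }

    @Override
    public String toString() {
      return "[" + startRow + "," + endRow + ")";
    }
  }

  // regionStartKeys must be sorted and begin with "" (the first region's start key).
  static List<Split> splitsFor(List<String> regionStartKeys, boolean shuffle, long seed) {
    List<Split> splits = new ArrayList<>();
    for (int i = 0; i < regionStartKeys.size(); i++) {
      String start = regionStartKeys.get(i);
      // A region ends where the next one starts; the last region is open-ended.
      String end = (i + 1 < regionStartKeys.size()) ? regionStartKeys.get(i + 1) : "";
      splits.add(new Split(start, end));
    }
    if (shuffle) {
      // Randomize the order in which map tasks are scheduled, not their contents.
      Collections.shuffle(splits, new Random(seed));
    }
    return splits;
  }

  public static void main(String[] args) {
    List<String> startKeys = Arrays.asList("", "g", "p");
    for (Split s : splitsFor(startKeys, false, 0L)) {
      System.out.println(s);
    }
  }
}
```

Three region start keys yield three splits, and therefore three map tasks, without the job ever naming a map count.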

When the mapper job is initialized, the combiner class is set to PutCombiner (this applies when the map output values are Puts):

@Override
  protected void reduce(K row, Iterable<Put> vals, Context context)
      throws IOException, InterruptedException {
  
    long threshold = context.getConfiguration().getLong(
        "putcombiner.row.threshold", 1L * (1<<30));
    int cnt = 0;
    long curSize = 0;
    Put put = null;
    Map<byte[], List<Cell>> familyMap = null;
    for (Put p : vals) {
      cnt++;
      if (put == null) {
        put = p;
        familyMap = put.getFamilyCellMap();
      } else {
        for (Entry<byte[], List<Cell>> entry : p.getFamilyCellMap()
            .entrySet()) {
          List<Cell> cells = familyMap.get(entry.getKey());
          List<Cell> kvs = (cells != null) ? (List<Cell>) cells : null;
          for (Cell cell : entry.getValue()) {
            KeyValue kv = KeyValueUtil.ensureKeyValueTypeForMR(cell);
            curSize += kv.heapSize();
            if (kvs != null) {
              kvs.add(kv);
            }
          }
          if (cells == null) {
            familyMap.put(entry.getKey(), entry.getValue());
          }
        }
        if (cnt % 10 == 0) context.setStatus("Combine " + cnt);
        if (curSize > threshold) {
          if (LOG.isDebugEnabled()) {
            LOG.debug(String.format("Combined %d Put(s) into %d.", cnt, 1));
          }
          context.write(row, put);
          put = null;
          curSize = 0;
          cnt = 0;
        }
      }
    }
    if (put != null) {
      if (LOG.isDebugEnabled()) {
        LOG.debug(String.format("Combined %d Put(s) into %d.", cnt, 1));
      }
      context.write(row, put);
    }
  }

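Because HBase emits data one cell at a time, a row that contains several columns needs this combiner: it gathers the cells that share a row key into a single Put.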
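The merging that PutCombiner performs can be illustrated with plain collections. This is a simplified sketch, not the HBase class: CombinerSketch, SimplePut and combine are hypothetical names, a "Put" is reduced to a map from column family to cell values, and size is counted in characters rather than heap bytes. Unlike the original, it copies cells into a fresh object and counts the first Put toward the size.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: merges several per-cell "puts" of one row into fewer, larger ones.
class CombinerSketch {

  // Stand-in for a Put: column family -> cell values for one row.
  static final class SimplePut {
    final Map<String, List<String>> familyCells = new LinkedHashMap<>();

    SimplePut add(String family, String value) {
      familyCells.computeIfAbsent(family, f -> new ArrayList<>()).add(value);
      return this;
    }

    int cellCount() {
      int n = 0;
      for (List<String> cells : familyCells.values()) {
        n += cells.size();
      }
      return n;
    }
  }

  // Merges all puts of one row; flushes early once the merged size exceeds the threshold.
  static List<SimplePut> combine(List<SimplePut> puts, long threshold) {
    List<SimplePut> out = new ArrayList<>();
    SimplePut current = null;
    long curSize = 0;
    for (SimplePut p : puts) {
      if (current == null) {
        current = new SimplePut();
      }
      for (Map.Entry<String, List<String>> e : p.familyCells.entrySet()) {
        for (String value : e.getValue()) {
          current.add(e.getKey(), value);
          curSize += value.length();
        }
      }
      if (curSize > threshold) {
        // Emit what has been collected so far and start a new merged put.
        out.add(current);
        current = null;
        curSize = 0;
      }
    }
    if (current != null) {
      out.add(current);
    }
    return out;
  }

  public static void main(String[] args) {
    List<SimplePut> puts = new ArrayList<>();
    puts.add(new SimplePut().add("cf", "a"));
    puts.add(new SimplePut().add("cf", "bb"));
    puts.add(new SimplePut().add("cf2", "c"));
    List<SimplePut> merged = combine(puts, 1L << 30);
    System.out.println(merged.size() + " put(s), " + merged.get(0).cellCount() + " cells");
  }
}
```

The threshold plays the same role as putcombiner.row.threshold in the code above: it bounds how much one merged Put may hold before it is written out.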

No reduce phase is needed at all. It is enough to specify the output format and then the output location.

That completes the Export process.

2) Import.java

This is the exact opposite of Export, and here the reduce phase is what deserves attention:

public static Job createSubmittableJob(Configuration conf, String[] args)
  throws IOException {
    TableName tableName = TableName.valueOf(args[0]);
    conf.set(TABLE_NAME, tableName.getNameAsString());
    Path inputDir = new Path(args[1]);
    Job job = new Job(conf, NAME + "_" + tableName);
    job.setJarByClass(Importer.class);
    FileInputFormat.setInputPaths(job, inputDir);
    job.setInputFormatClass(SequenceFileInputFormat.class);
    String hfileOutPath = conf.get(BULK_OUTPUT_CONF_KEY);
    // make sure we get the filter in the jars
    try {
      Class<? extends Filter> filter = conf.getClass(FILTER_CLASS_CONF_KEY, null, Filter.class);
      if (filter != null) {
        TableMapReduceUtil.addDependencyJarsForClasses(conf, filter);
      }
    } catch (Exception e) {
      throw new IOException(e);
    }
    // Large data volume: write KeyValue (HFile) files directly, partitioned by region.
    if (hfileOutPath != null && conf.getBoolean(HAS_LARGE_RESULT, false)) {
      LOG.info("Use Large Result!!");
      try (Connection conn = ConnectionFactory.createConnection(conf); 
          Table table = conn.getTable(tableName);
          RegionLocator regionLocator = conn.getRegionLocator(tableName)) {
        HFileOutputFormat2.configureIncrementalLoad(job, table.getTableDescriptor(), regionLocator);
        job.setMapperClass(KeyValueSortImporter.class);
        job.setReducerClass(KeyValueReducer.class);
        Path outputDir = new Path(hfileOutPath);
        FileOutputFormat.setOutputPath(job, outputDir);
        job.setMapOutputKeyClass(KeyValueWritableComparable.class);
        job.setMapOutputValueClass(KeyValue.class);
        job.getConfiguration().setClass("mapreduce.job.output.key.comparator.class", 
            KeyValueWritableComparable.KeyValueWritableComparator.class,
            RawComparator.class);
        Path partitionsPath = 
            new Path(TotalOrderPartitioner.getPartitionFile(job.getConfiguration()));
        FileSystem fs = FileSystem.get(job.getConfiguration());
        fs.deleteOnExit(partitionsPath);
        job.setPartitionerClass(KeyValueWritableComparablePartitioner.class);
        job.setNumReduceTasks(regionLocator.getStartKeys().length);
        TableMapReduceUtil.addDependencyJarsForClasses(job.getConfiguration(),
            com.google.common.base.Preconditions.class);
      }
    // No custom partitioner is set in this branch; the job relies on what configureIncrementalLoad sets up.
    } else if (hfileOutPath != null) {
      job.setMapperClass(KeyValueImporter.class);
      try (Connection conn = ConnectionFactory.createConnection(conf); 
          Table table = conn.getTable(tableName);
          RegionLocator regionLocator = conn.getRegionLocator(tableName)){
        job.setReducerClass(KeyValueSortReducer.class);
        Path outputDir = new Path(hfileOutPath);
        FileOutputFormat.setOutputPath(job, outputDir);
        job.setMapOutputKeyClass(ImmutableBytesWritable.class);
        job.setMapOutputValueClass(KeyValue.class);
        HFileOutputFormat2.configureIncrementalLoad(job, table.getTableDescriptor(), regionLocator);
        TableMapReduceUtil.addDependencyJarsForClasses(job.getConfiguration(),
            com.google.common.base.Preconditions.class);
      }
    } else {
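      // This branch goes through TableOutputFormat, so it issues Puts directly against the table.
      // That is fine for small amounts of data but not for large volumes.
      // The write() method it ends up calling is shown further below.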
      // No reducers.  Just write straight to table.  Call initTableReducerJob
      // because it sets up the TableOutputFormat.
      job.setMapperClass(Importer.class);
      TableMapReduceUtil.initTableReducerJob(tableName.getNameAsString(), null, job);
      job.setNumReduceTasks(0);
    }
    return job;
  }

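The key lines (highlighted in red in the original post) are job.setPartitionerClass(...) and job.setNumReduceTasks(...): they set the number of reducers to the number of regions and partition the reduce output by region. With that in place, each reducer writes the files for one region.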
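The three branches of createSubmittableJob above can be summarized as a small decision function. This is only a restatement of the if/else structure in plain Java: ImportModeSketch, Mode and chooseMode are hypothetical names, and the two parameters stand in for the values read from BULK_OUTPUT_CONF_KEY and HAS_LARGE_RESULT.

```java
// Illustrative only: which output strategy Import picks for a given configuration.
class ImportModeSketch {

  enum Mode {
    SORTED_HFILES_PER_REGION, // bulk output path set and the large-result flag on
    HFILES,                   // bulk output path set, large-result flag off
    DIRECT_PUTS               // no bulk output path: write through TableOutputFormat
  }

  // hfileOutPath is null when no bulk output directory was configured.
  static Mode chooseMode(String hfileOutPath, boolean hasLargeResult) {
    if (hfileOutPath != null && hasLargeResult) {
      return Mode.SORTED_HFILES_PER_REGION;
    }
    if (hfileOutPath != null) {
      return Mode.HFILES;
    }
    return Mode.DIRECT_PUTS;
  }

  public static void main(String[] args) {
    // "/tmp/hfiles" is a made-up path used only for the demonstration.
    System.out.println(chooseMode("/tmp/hfiles", true));
    System.out.println(chooseMode("/tmp/hfiles", false));
    System.out.println(chooseMode(null, false));
  }
}
```

Note that the large-result flag has no effect on its own: without a bulk output path the job always falls through to direct Puts.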

The partitioner here also assigns keys by region start key:

   private static KeyValueWritableComparable[] START_KEYS = null;
    @Override
    public int getPartition(KeyValueWritableComparable key, KeyValue value,
        int numPartitions) {
      for (int i = 0; i < START_KEYS.length; ++i) {
        if (key.compareTo(START_KEYS[i]) <= 0) {
          return i;
        }
      }
      return START_KEYS.length;
    }
  
  }
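To make start-key partitioning concrete, here is a dependency-free sketch. It is a simplification, not the HBase code: StartKeyPartitionSketch, compareRows, partitionFor and bytes are hypothetical names, rows are compared as unsigned bytes, and the array is assumed to hold the start keys of every region except the first, whose start key is empty. It compares bare row keys with a strict "less than", so a row equal to a start key lands in the region that starts there.

```java
import java.nio.charset.StandardCharsets;

// Illustrative only: maps a row key to the index of the region that contains it.
class StartKeyPartitionSketch {

  static byte[] bytes(String s) {
    return s.getBytes(StandardCharsets.UTF_8);
  }

  // Lexicographic comparison that treats bytes as unsigned values.
  static int compareRows(byte[] a, byte[] b) {
    int n = Math.min(a.length, b.length);
    for (int i = 0; i < n; i++) {
      int diff = (a[i] & 0xff) - (b[i] & 0xff);
      if (diff != 0) {
        return diff;
      }
    }
    return a.length - b.length;
  }

  // splitKeys: sorted start keys of regions 1..n-1 (region 0 starts at the empty key).
  static int partitionFor(byte[] row, byte[][] splitKeys) {
    for (int i = 0; i < splitKeys.length; i++) {
      if (compareRows(row, splitKeys[i]) < 0) {
        return i; // row sorts before the start of region i+1, so it belongs to region i
      }
    }
    return splitKeys.length; // row is at or beyond the last start key: last region
  }

  public static void main(String[] args) {
    byte[][] splitKeys = { bytes("g"), bytes("p") };
    for (String row : new String[] { "a", "g", "m", "p", "z" }) {
      System.out.println(row + " -> partition " + partitionFor(bytes(row), splitKeys));
    }
  }
}
```

With two split keys there are three partitions, which matches a reducer count equal to the number of regions.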

 @Override
    public void write(KEY key, Mutation value)
    throws IOException {
      if (!(value instanceof Put) && !(value instanceof Delete)) {
        throw new IOException("Pass a Delete or a Put");
      }
      mutator.mutate(value);
    }

How do the generated HFiles get loaded into HBase? That takes another class: LoadIncrementalHFiles. It is worth saying twice because it matters: LoadIncrementalHFiles.

3) For the third case, copying data between tables, use CopyTable.

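That is all. The text below is copied from elsewhere and serves as a summary.

When importing a large amount of data into HBase, inserting it row by row takes too long, so you can first generate HFiles with MapReduce and then load them into HBase with BulkLoad. Quote:

I. This approach has many advantages:

1. Loading a huge amount of data into HBase in one go is slow and also ties up Region resources. A more efficient and convenient way is "Bulk Loading", that is, the HFileOutputFormat class that HBase provides.

2. It relies on the fact that HBase stores its data in HDFS in a specific format: the job generates files in that storage format directly and then moves them to the right location, which completes the fast import of a huge data set. Combined with MapReduce it is efficient and convenient, and it does not occupy region resources or add load.

II. This approach also has significant limitations:

1. It is only suitable for an initial import, that is, when the table is empty, or when the table is empty every time data is loaded.

2. The HBase cluster and the Hadoop cluster must be the same cluster, that is, the HDFS that HBase runs on must belong to the cluster running the MR job that generates the HFiles.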
