The Hadoop MapReduce Configuration Loading Mechanism


Preface

Before running a Hadoop MapReduce program, we always configure a Job object. The program entry point is typically written as follows:

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length < 2) {
      System.err.println("Usage: wordcount <in> [<in>...] <out>");
      System.exit(2);
    }
    Job job = new Job(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    // ... the remaining setup (combiner, reducer, output types, I/O paths) is omitted

Two classes are involved here: the Configuration class and the Job class. Starting from these two classes, let's take a close look at the Hadoop MapReduce configuration loading mechanism.

Inheritance Hierarchy

The inheritance relationships among the Job class, the JobContextImpl class, the JobContext interface, and the MRJobConfig interface are as follows:

 class Job extends JobContextImpl implements JobContext
 class JobContextImpl implements JobContext
 interface JobContext extends MRJobConfig

The MRJobConfig interface holds the key names of every configurable parameter of a MapReduce program.

public interface MRJobConfig {
  // Put all of the attribute names in here so that Job and JobContext are
  // consistent.
  public static final String INPUT_FORMAT_CLASS_ATTR = "mapreduce.job.inputformat.class";

  public static final String MAP_CLASS_ATTR = "mapreduce.job.map.class";

  public static final String MAP_OUTPUT_COLLECTOR_CLASS_ATTR
                                  = "mapreduce.job.map.output.collector.class";

JobContext extends MRJobConfig. Its methods are all getters, exposing the parameters that can be read while a job is running.

/**
 * A read-only view of the job that is provided to the tasks while they
 * are running.
 */
@InterfaceAudience.Public
@InterfaceStability.Evolving
public interface JobContext extends MRJobConfig {

  /**
   * Return the configuration for the job.
   * @return the shared configuration object
   */
  public Configuration getConfiguration();

  /**
   * Get credentials for the job.
   * ...
   */
  public Credentials getCredentials();
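Because a task's runtime context (for example, Mapper.Context) ultimately implements JobContext, a running task can call these getters directly. A minimal sketch, reusing the WordCount mapper from above; the key wordcount.case.sensitive is a made-up property for illustration, not a built-in Hadoop one:

// Imports from org.apache.hadoop.conf, org.apache.hadoop.io and
// org.apache.hadoop.mapreduce are assumed.
public static class TokenizerMapper
    extends Mapper<Object, Text, Text, IntWritable> {

  private boolean caseSensitive;

  @Override
  protected void setup(Context context) {
    // Context implements JobContext, so every read-only getter is available here.
    Configuration conf = context.getConfiguration();
    // "wordcount.case.sensitive" is a hypothetical key used only for illustration.
    caseSensitive = conf.getBoolean("wordcount.case.sensitive", true);
  }
}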

JobContextImpl implements all of the get methods declared in the JobContext interface, while Job extends JobContextImpl and adds the various set methods.
So how do we obtain a Job object? The recommended way is the following factory method:

  /**
   * Creates a new {@link Job} with no particular {@link Cluster} and a
   * given {@link Configuration}.
   *
   * The <code>Job</code> makes a copy of the <code>Configuration</code> so
   * that any necessary internal modifications do not reflect on the incoming
   * parameter.
   *
   * A Cluster will be created from the conf parameter only when it's needed.
   *
   * @param conf the configuration
   * @return the {@link Job} , with no connection to a cluster yet.
   * @throws IOException
   */
  public static Job getInstance(Configuration conf) throws IOException {
    // create with a null Cluster
    JobConf jobConf = new JobConf(conf);
    return new Job(jobConf);
  }
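A minimal usage sketch of this factory method, reusing the WordCount names from the opening example:

Configuration conf = new Configuration();
Job job = Job.getInstance(conf);   // conf is copied, not shared with the Job
job.setJobName("word count");
job.setJarByClass(WordCount.class);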

This brings us to the Configuration class.

The Configuration Class

public class Configuration implements Iterable<Map.Entry<String,String>>,
                                      Writable {

The Configuration class likewise consists of a series of set and get methods; you can think of it as one large dictionary recording the user's configuration entries.
Every Configuration object loads two files by default: core-default.xml and core-site.xml. Neither file name carries a path prefix, which means they are resolved from the classpath.

  static {
    // print deprecation warning if hadoop-site.xml is found in classpath
    ClassLoader cL = Thread.currentThread().getContextClassLoader();
    if (cL == null) {
      cL = Configuration.class.getClassLoader();
    }
    if (cL.getResource("hadoop-site.xml") != null) {
      LOG.warn("DEPRECATED: hadoop-site.xml found in the classpath. " +
          "Usage of hadoop-site.xml is deprecated. Instead use core-site.xml, "
          + "mapred-site.xml and hdfs-site.xml to override properties of " +
          "core-default.xml, mapred-default.xml and hdfs-default.xml " +
          "respectively");
    }
    addDefaultResource("core-default.xml");
    addDefaultResource("core-site.xml");
  }

As with any dictionary, when duplicate keys are put into a Configuration, the value added later overrides the one added earlier. Consequently, if core-default.xml and core-site.xml configure the same property, the value from core-site.xml wins, since it is loaded second.
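The same last-write-wins rule applies to resources added by hand. A small sketch; my-defaults.xml and my-site.xml are hypothetical classpath resources that both define the same key:

Configuration conf = new Configuration();  // core-default.xml and core-site.xml already loaded
conf.addResource("my-defaults.xml");       // hypothetical resource, loaded first
conf.addResource("my-site.xml");           // loaded second; on conflict its values win

// Programmatic set() follows the same rule:
conf.set("my.example.key", "a");
conf.set("my.example.key", "b");
System.out.println(conf.get("my.example.key")); // prints "b"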

JobConf

A Configuration object, conf, is then passed as a constructor argument to build a JobConf object.

public class JobConf extends Configuration {

JobConf defines a number of get and set methods of its own, along with the constant values those methods rely on. (In Hadoop 2.x its static initializer also registers mapred-default.xml and mapred-site.xml as default resources, so the override rule above extends to the MapReduce-specific files.) A user can obtain a JobConf object first and then create the Job from it, as in this usage example from the JobConf Javadoc:

     // Create a new JobConf
     JobConf job = new JobConf(new Configuration(), MyJob.class);

     // Specify various job-specific parameters
     job.setJobName("myjob");

     FileInputFormat.setInputPaths(job, new Path("in"));
     FileOutputFormat.setOutputPath(job, new Path("out"));

     job.setMapperClass(MyJob.MyMapper.class);
     job.setCombinerClass(MyJob.MyReducer.class);
     job.setReducerClass(MyJob.MyReducer.class);

     job.setInputFormat(SequenceFileInputFormat.class);
     job.setOutputFormat(SequenceFileOutputFormat.class);

How Do We Add a User-Defined Configuration Option?

Requirement:

Customize the number of times the combiner fires. The combiner runs between map and reduce, and for each key the framework may invoke it any number of times. Modify the code so that on the map side the combiner is invoked exactly once per key, and make sure the change holds regardless of the size of the input data. The intended usage:

 <property>
    <name>mapreduce.combiner.run.only.once</name>
    <value>true</value>
 </property>

or:

 hadoop jar xxx.jar -D mapreduce.combiner.run.only.once=true <main_class>

The configuration property mapreduce.combiner.run.only.once does not exist in Hadoop.
We first add its key to MRJobConfig:

public static final String COMBINER_RUN_ONLY_ONCE = "mapreduce.combiner.run.only.once";

and then write the corresponding get and set methods along the JobContext hierarchy.
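A minimal sketch of what those methods might look like, following the pattern of the existing getters in JobContextImpl and setters in Job; the method names here are our own invention, not part of the Hadoop API:

// Read side: declared in the JobContext interface, implemented in JobContextImpl.
public boolean getCombinerRunOnlyOnce() {
  return conf.getBoolean(COMBINER_RUN_ONLY_ONCE, false);
}

// Write side: added to Job, mirroring its other setters.
public void setCombinerRunOnlyOnce(boolean value) {
  ensureState(JobState.DEFINE);   // setters are only legal before job submission
  conf.setBoolean(COMBINER_RUN_ONLY_ONCE, value);
}

The map-side combiner logic would then presumably consult getCombinerRunOnlyOnce() at runtime to decide whether to enforce the single invocation per key.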
