An Analysis of Hadoop's Configuration Class

Source: Internet | Editor: 程序博客网 | Date: 2024/06/03 17:42

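When studying the Hadoop Common module, it makes sense to start with the simplest and most fundamental part, so I chose the conf (configuration) package. Its overall class structure is very simple.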

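Any class that implements the Configurable interface is configurable, meaning it can be handed a Configuration and configured through it. The configuration logic itself, however, is concentrated in the Configuration class. This class defines quite a few collection fields: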

/**
 * List of configuration resources.
 */
private ArrayList<Object> resources = new ArrayList<Object>();

/**
 * List of configuration parameters marked <b>final</b>.
 * The finalParameters set holds the names of parameters declared final,
 * which cannot be overridden.
 */
private Set<String> finalParameters = new HashSet<String>();

/**
 * Whether to load the default resources.
 */
private boolean loadDefaults = true;

/**
 * Configuration objects
 */
private static final WeakHashMap<Configuration,Object> REGISTRY =
  new WeakHashMap<Configuration,Object>();

/**
 * List of default Resources. Resources are loaded in the order of the list
 * entries
 */
private static final CopyOnWriteArrayList<String> defaultResources =
  new CopyOnWriteArrayList<String>();

These are only some of the fields; most of them exist to hold resource-related data. One more field is especially important:

// Properties read from the configuration resource files are loaded into this object
private Properties properties;

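All property values are stored in a java.util.Properties object, which makes later reads and writes straightforward. Properties is in fact a subclass of Hashtable. Let's follow the order in which Configuration is loaded and walk through the whole process. The first thing to run is, of course, the static initializer block: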

static{
  //print deprecation warning if hadoop-site.xml is found in classpath
  ClassLoader cL = Thread.currentThread().getContextClassLoader();
  if (cL == null) {
    cL = Configuration.class.getClassLoader();
  }
  if(cL.getResource("hadoop-site.xml")!=null) {
    LOG.warn("DEPRECATED: hadoop-site.xml found in the classpath. " +
        "Usage of hadoop-site.xml is deprecated. Instead use core-site.xml, "
        + "mapred-site.xml and hdfs-site.xml to override properties of " +
        "core-default.xml, mapred-default.xml and hdfs-default.xml " +
        "respectively");
  }
  // Load the default configuration files; core-site.xml holds the user's settings.
  // If both files define the same property, the later file overrides the earlier one.
  addDefaultResource("core-default.xml");
  addDefaultResource("core-site.xml");
}
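The ordering that matters here, a static initializer running before any constructor, can be checked with a tiny standalone class. This is only an illustrative sketch: InitOrderDemo and its EVENTS list are made-up names and are not part of Hadoop.

```java
import java.util.ArrayList;
import java.util.List;

class InitOrderDemo {
    // Records the order in which initialization steps run (hypothetical helper).
    static final List<String> EVENTS = new ArrayList<>();

    static {
        // Runs once, when the class is first initialized.
        EVENTS.add("static block");
    }

    InitOrderDemo() {
        // Runs for every new instance, always after the static block.
        EVENTS.add("constructor");
    }

    public static void main(String[] args) {
        new InitOrderDemo();
        new InitOrderDemo();
        System.out.println(EVENTS); // [static block, constructor, constructor]
    }
}
```

The static block runs exactly once however many instances are created, which is why Configuration can register its default resources there.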

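Anyone who has studied Java's initialization order knows that a static initializer block runs when the class is initialized, before any constructor. So the code above runs first, which brings us to addDefaultResource():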

/**
 * Add a default resource. Resources are loaded in the order of the resources
 * added.
 * @param name file name. File should be present in the classpath.
 */
public static synchronized void addDefaultResource(String name) {
  if(!defaultResources.contains(name)) {
    defaultResources.add(name);
    // iterate over every registered Configuration and have it reload
    for(Configuration conf : REGISTRY.keySet()) {
      if(conf.loadDefaults) {
        conf.reloadConfiguration();
      }
    }
  }
}

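The resource name is added to the list of default resources, and then every registered Configuration instance is told to reload, since the list of default resources has changed. A brief note: every Configuration instance adds itself to REGISTRY when it is constructed. REGISTRY is a static field, so there is a single registry shared by all instances. Now let's turn to reloadConfiguration():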

/**
 * Reload configuration from previously added resources.
 *
 * This method will clear all the configuration read from the added
 * resources, and final parameters. This will make the resources to
 * be read again before accessing the values. Values that are added
 * via set methods will overlay values read from the resources.
 */
public synchronized void reloadConfiguration() { // reloading simply clears the cached properties
  properties = null;                            // trigger reload
  finalParameters.clear();                      // clear site-limits
}

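The operation is very simple: it just clears some state. You may wonder at this point whether the new resources should be loaded right away. This is a deliberate design choice by the authors, and the answer comes later. With that, the static initializer block is done, and the constructor runs next: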

/** A new configuration. */
public Configuration() { // a default-constructed instance loads the default resources
  this(true);
}

It then delegates to the overloaded constructor:

/** A new configuration where the behavior of reading from the default
 * resources can be turned off.
 *
 * If the parameter {@code loadDefaults} is false, the new instance
 * will not load resources from the default files.
 * @param loadDefaults specifies whether to load from the default files
 */
public Configuration(boolean loadDefaults) {
  this.loadDefaults = loadDefaults;
  if (LOG.isDebugEnabled()) {
    LOG.debug(StringUtils.stringifyException(new IOException("config()")));
  }
  synchronized(Configuration.class) {
    // every Configuration instance is added to REGISTRY as it is constructed
    REGISTRY.put(this, null);
  }
  this.storeResource = false;
}

The point to notice is that the Configuration instance being constructed is added to the global REGISTRY.
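To see how the registry and addDefaultResource() cooperate, here is a stripped-down sketch of the same pattern. MiniRegistryConf, its loaded flag, and the resource names are invented for illustration; only the use of a static WeakHashMap as a registry and the invalidate-on-change loop mirror the code shown above.

```java
import java.util.List;
import java.util.Map;
import java.util.WeakHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

class MiniRegistryConf {
    // Weak keys: an instance no longer referenced elsewhere can still be garbage-collected.
    private static final Map<MiniRegistryConf, Object> REGISTRY = new WeakHashMap<>();
    private static final List<String> DEFAULTS = new CopyOnWriteArrayList<>();

    private boolean loaded = false; // stands in for "properties != null"

    MiniRegistryConf() {
        synchronized (MiniRegistryConf.class) {
            REGISTRY.put(this, null); // self-registration, as in the constructor above
        }
    }

    static synchronized void addDefaultResource(String name) {
        if (!DEFAULTS.contains(name)) {
            DEFAULTS.add(name);
            // The default list changed, so every live instance must reload later.
            for (MiniRegistryConf conf : REGISTRY.keySet()) {
                conf.loaded = false;
            }
        }
    }

    void load() { loaded = true; }

    boolean isLoaded() { return loaded; }

    public static void main(String[] args) {
        MiniRegistryConf a = new MiniRegistryConf();
        a.load();
        addDefaultResource("extra-site.xml");
        System.out.println(a.isLoaded()); // false: the new resource invalidated it
        a.load();
        addDefaultResource("extra-site.xml");
        System.out.println(a.isLoaded()); // true: a duplicate name changes nothing
    }
}
```

Using a WeakHashMap means the registry does not keep otherwise unused instances alive, which matters for a class that is instantiated as often as Configuration.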

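The code analyzed so far is all preparatory. How are the set/get methods, which deal with properties directly, implemented? To answer that, we first need to know the format in which Hadoop stores configuration in files. Take the HDFS configuration file hdfs-site.xml as an example: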

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<!-- file system properties -->
  <property>
    <name>dfs.name.dir</name>
    <value>/var/local/hadoop/hdfs/name</value>
    <description>Determines where on the local filesystem the DFS name node
      should store the name table.  If this is a comma-delimited list
      of directories then the name table is replicated in all of the
      directories, for redundancy. </description>
    <final>true</final>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/var/local/hadoop/hdfs/data</value>
    <description>Determines where on the local filesystem an DFS data node
       should store its blocks.  If this is a comma-delimited
       list of directories, then data will be stored in all named
       directories, typically on different devices.
       Directories that do not exist are ignored.
    </description>
    <final>true</final>
  </property>
.......

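The node hierarchy is not complicated. Each property node contains a name, a value, a description of the property, and a final tag that says whether the property can be changed; true means it cannot be overridden by resources loaded later, much like the final keyword in Java. With the file structure in mind, we can continue. Suppose I want to set a property; the small set method looks like this: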

/**
 * Set the <code>value</code> of the <code>name</code> property.
 *
 * @param name property name.
 * @param value property value.
 * Sets the value for the given name; the key-value pair is stored in properties.
 */
public void set(String name, String value) {
  getOverlay().setProperty(name, value);
  getProps().setProperty(name, value);
}

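The setProperty call at the end is the standard JDK method on Properties, so the key part is getProps(), which comes before it: how do the properties in the files get loaded into the Properties field?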

/**
 * Properties are loaded lazily.
 * @return the properties, loading them first if necessary
 */
private synchronized Properties getProps() {
  if (properties == null) {
    properties = new Properties();
    // read the property data from the resources again
    loadResources(properties, resources, quietmode);
    if (overlay!= null) {
      properties.putAll(overlay);
      if (storeResource) {
        for (Map.Entry<Object,Object> item: overlay.entrySet()) {
          updatingResource.put((String) item.getKey(), "Unknown");
        }
      }
    }
  }
  return properties;
}
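The interplay between set(), the overlay, and the null check in getProps() is easier to see in a reduced model. The sketch below is an assumption-laden simplification: LazyConf is a made-up class, a plain Properties object stands in for the XML resources, and loadCount exists only so the example can show when a load actually happens.

```java
import java.util.Properties;

class LazyConf {
    private final Properties resource;                   // stands in for the parsed XML files
    private final Properties overlay = new Properties(); // values set programmatically
    private Properties properties;                       // null means "not loaded yet"
    int loadCount = 0;                                   // instrumentation for the demo only

    LazyConf(Properties resource) {
        this.resource = resource;
    }

    private synchronized Properties getProps() {
        if (properties == null) {        // load only on first access after a reload
            properties = new Properties();
            properties.putAll(resource); // "read the files"
            properties.putAll(overlay);  // programmatic values win over file values
            loadCount++;
        }
        return properties;
    }

    synchronized void reload() {
        properties = null;               // cheap: the real work is deferred to getProps()
    }

    void set(String name, String value) {
        overlay.setProperty(name, value);    // remembered across reloads
        getProps().setProperty(name, value); // visible immediately
    }

    String get(String name) {
        return getProps().getProperty(name);
    }

    public static void main(String[] args) {
        Properties file = new Properties();
        file.setProperty("a", "1");
        LazyConf conf = new LazyConf(file);
        System.out.println(conf.loadCount); // 0: nothing has been read yet
        conf.set("b", "2");
        conf.reload();
        file.setProperty("a", "10");        // simulate an edited resource
        System.out.println(conf.get("a"));  // 10
        System.out.println(conf.get("b"));  // 2: restored from the overlay
        System.out.println(conf.loadCount); // 2
    }
}
```

Because set() also writes to the overlay, values set in code are not lost when a reload discards the cached properties.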

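Seeing the null check above, you can probably tell why the reload operation earlier was so simple and did nothing but clear state: the properties are only loaded when they are actually needed. This is lazy loading, similar to the lazy-initialization variant of the singleton pattern. The loading itself is done by loadResources, which calls loadResource on each resource, so loadResource is the key to the implementation: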

private void loadResource(Properties properties, Object name, boolean quiet) {
  try {
    // obtain the XML parser through a factory; DOM parsing is used here
    DocumentBuilderFactory docBuilderFactory
      = DocumentBuilderFactory.newInstance();
    //ignore all comments inside the xml file
    docBuilderFactory.setIgnoringComments(true);

    //allow includes in the xml file
    docBuilderFactory.setNamespaceAware(true);
    try {
        docBuilderFactory.setXIncludeAware(true);
    } catch (UnsupportedOperationException e) {
      LOG.error("Failed to set setXIncludeAware(true) for parser "
              + docBuilderFactory
              + ":" + e,
              e);
    }
    DocumentBuilder builder = docBuilderFactory.newDocumentBuilder();
    .....
    if (root == null) {
      // start from the root element of the document
      root = doc.getDocumentElement();
    }
    if (!"configuration".equals(root.getTagName()))
      LOG.fatal("bad conf file: top-level element not <configuration>");
    NodeList props = root.getChildNodes();
    for (int i = 0; i < props.getLength(); i++) {
      Node propNode = props.item(i);
      if (!(propNode instanceof Element))
        continue;
      Element prop = (Element)propNode;
      if ("configuration".equals(prop.getTagName())) {
        // a nested <configuration> element is loaded by a recursive call to loadResource()
        loadResource(properties, prop, quiet);
        continue;
      }
      if (!"property".equals(prop.getTagName()))
        LOG.warn("bad conf file: element not <property>");
      NodeList fields = prop.getChildNodes();
      String attr = null;
      String value = null;
      boolean finalParameter = false;
      for (int j = 0; j < fields.getLength(); j++) {
        Node fieldNode = fields.item(j);
        if (!(fieldNode instanceof Element))
          continue;
        // three kinds of child element are handled: name, value, final
        Element field = (Element)fieldNode;
        if ("name".equals(field.getTagName()) && field.hasChildNodes())
          attr = ((Text)field.getFirstChild()).getData().trim();
        if ("value".equals(field.getTagName()) && field.hasChildNodes())
          value = ((Text)field.getFirstChild()).getData();
        if ("final".equals(field.getTagName()) && field.hasChildNodes())
          // final parameters are additionally recorded in the finalParameters set
          finalParameter = "true".equals(((Text)field.getFirstChild()).getData());
      }

      // Ignore this parameter if it has already been marked as 'final'
      if (attr != null) {
        if (value != null) {
          if (!finalParameters.contains(attr)) {
            // store the value read above in properties
            properties.setProperty(attr, value);
            if (storeResource) {
              updatingResource.put(attr, name.toString());
            }
          } else if (!value.equals(properties.getProperty(attr))) {
            LOG.warn(name+":a attempt to override final parameter: "+attr
                   +";  Ignoring.");
          }
        }
        if (finalParameter) {
          finalParameters.add(attr);
        }
      }
    }

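Compared against the actual configuration file above, this is not hard to follow: it is plain DOM parsing of an XML file, with some extra handling such as the additional bookkeeping for final parameters. Once loading finishes, the property data is in the Properties object, and the goal is achieved.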
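The effect of the final flag across two resources can be reproduced with a much smaller parser. The following is a sketch under several assumptions: FinalParamDemo is a hypothetical class, the XML is passed in as a string rather than located on the classpath, values are trimmed for simplicity (the original does not trim value), and XInclude, nested configuration elements, and logging are left out.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Properties;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class FinalParamDemo {
    private final Properties props = new Properties();
    private final Set<String> finalParams = new HashSet<>();

    // Parses one <configuration> document; earlier final values survive later loads.
    void load(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setIgnoringComments(true);
            Document doc = factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = doc.getDocumentElement().getChildNodes();
            for (int i = 0; i < nodes.getLength(); i++) {
                if (!(nodes.item(i) instanceof Element)) continue; // skip whitespace text
                Element prop = (Element) nodes.item(i);
                if (!"property".equals(prop.getTagName())) continue;
                String name = text(prop, "name");
                String value = text(prop, "value");
                if (name == null) continue;
                if (value != null && !finalParams.contains(name)) {
                    props.setProperty(name, value); // ignored once the name is final
                }
                if ("true".equals(text(prop, "final"))) {
                    finalParams.add(name);
                }
            }
        } catch (Exception e) {
            throw new IllegalStateException("could not parse configuration", e);
        }
    }

    // Returns the trimmed text of the first child element with this tag, or null.
    private static String text(Element parent, String tag) {
        NodeList list = parent.getElementsByTagName(tag);
        return list.getLength() == 0 ? null : list.item(0).getTextContent().trim();
    }

    String get(String name) {
        return props.getProperty(name);
    }

    public static void main(String[] args) {
        FinalParamDemo conf = new FinalParamDemo();
        conf.load("<configuration>"
            + "<property><name>dfs.name.dir</name><value>/a</value><final>true</final></property>"
            + "<property><name>x</name><value>1</value></property>"
            + "</configuration>");
        conf.load("<configuration>"
            + "<property><name>dfs.name.dir</name><value>/b</value></property>"
            + "<property><name>x</name><value>2</value></property>"
            + "</configuration>");
        System.out.println(conf.get("dfs.name.dir")); // /a: final, so the override is ignored
        System.out.println(conf.get("x"));            // 2: the later resource wins
    }
}
```

The order of the two checks matters: the value is stored before the name is marked final, so the resource that declares a property final still gets to set it.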

Now for get. It too has a twist in its design: it is not simply getProps().get(name), because sometimes that alone does not give the value you actually want. Consider a property like this:

<property>
  <name>dfs.secondary.namenode.kerberos.principal</name>
  <value>hdfs/_HOST@${local.realm}</value>
  <description>
    Kerberos principal name for the secondary NameNode.
  </description>
</property>

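You might look up the name dfs.secondary.namenode.kerberos.principal directly and get back hdfs/_HOST@${local.realm}. That is clearly not the value we need, because it still contains ${local.realm}, which stands for the value of another property, or quite often a Java system property. So looking up a value also requires substituting these variables.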

/**
 * Get the value of the <code>name</code> property, <code>null</code> if
 * no such property exists.
 *
 * Values are processed for <a href="#VariableExpansion">variable expansion</a>
 * before being returned.
 *
 * @param name the property name.
 * @return the value of the <code>name</code> property,
 *         or null if no such property exists.
 */
public String get(String name) {
  return substituteVars(getProps().getProperty(name));
}

So after fetching the value, Hadoop performs a substitution step, which uses a regular expression.

// The pattern to match is \$\{[^\}\$ ]+\}; the extra backslashes in the string literal are Java escaping.
// $, { and } are regex metacharacters, so each is escaped with \. The pattern breaks down as follows:
// '\$\{' matches the opening ${
// the trailing '\}' matches the closing }, which gives the overall ${....} shape
// the middle [^\}\$ ] matches any character other than }, $ and space
// + requires at least one such character, so the variable name cannot be empty
private static Pattern varPat = Pattern.compile("\\$\\{[^\\}\\$\u0020]+\\}");
private static int MAX_SUBST = 20;

private String substituteVars(String expr) {
  // return immediately if the input value is null
  if (expr == null) {
    return null;
  }
  Matcher match = varPat.matcher("");
  String eval = expr;
  // to avoid an infinite loop, at most MAX_SUBST (20) substitutions are performed
  for(int s=0; s<MAX_SUBST; s++) {
    match.reset(eval);
    // look for the next match
    if (!match.find()) {
      return eval;
    }
    String var = match.group();
    // strip the leading ${ and the trailing } to get the variable name
    var = var.substring(2, var.length()-1); // remove ${ .. }
    String val = null;
    try {
      // first try a Java system property
      val = System.getProperty(var);
    } catch(SecurityException se) {
      LOG.warn("Unexpected SecurityException in Configuration", se);
    }
    if (val == null) {
      val = getRaw(var);
    }
    if (val == null) {
      return eval; // return literal ${var}: var is unbound
    }
    // substitute
    // splice in the value, then loop to look for further ${..} occurrences
    eval = eval.substring(0, match.start())+val+eval.substring(match.end());
  }
  throw new IllegalStateException("Variable substitution depth too large: "
                                  + MAX_SUBST + " " + expr);
}
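The substitution loop can be tried out in isolation. In this sketch, VarSubstDemo is a hypothetical class and a plain Map replaces the System.getProperty and getRaw lookups of the original; the pattern and the 20-step limit are the same as above, and EXAMPLE.COM is just a sample value.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class VarSubstDemo {
    // Same pattern as above, with the space written literally instead of \u0020.
    private static final Pattern VAR = Pattern.compile("\\$\\{[^\\}\\$ ]+\\}");
    private static final int MAX_SUBST = 20;

    // Expands ${name} references using vars (assumption: a Map instead of Hadoop's lookups).
    static String substitute(String expr, Map<String, String> vars) {
        if (expr == null) {
            return null;
        }
        String eval = expr;
        for (int s = 0; s < MAX_SUBST; s++) {
            Matcher m = VAR.matcher(eval);
            if (!m.find()) {
                return eval;                 // nothing left to expand
            }
            String group = m.group();
            String name = group.substring(2, group.length() - 1); // drop ${ and }
            String val = vars.get(name);
            if (val == null) {
                return eval;                 // unbound variable: keep the literal ${name}
            }
            eval = eval.substring(0, m.start()) + val + eval.substring(m.end());
        }
        throw new IllegalStateException(
            "Variable substitution depth too large: " + MAX_SUBST + " " + expr);
    }

    public static void main(String[] args) {
        Map<String, String> vars = new HashMap<>();
        vars.put("local.realm", "EXAMPLE.COM");
        vars.put("a", "${b}");
        vars.put("b", "x");
        System.out.println(substitute("hdfs/_HOST@${local.realm}", vars)); // hdfs/_HOST@EXAMPLE.COM
        System.out.println(substitute("${a}", vars));                      // x
        System.out.println(substitute("${unbound}", vars));                // ${unbound}
    }
}
```

A value that refers to itself, such as a variable loop defined as ${loop}, never stops matching, so the loop runs out of iterations and the IllegalStateException is thrown instead of hanging.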

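The tricky part is constructing the pattern for ${...}; for someone like me, whose first instinct with regular expressions is to search the web, it is not easy to come up with. The other special case is the cap on the number of substitutions, which prevents an infinite loop when a substituted value itself contains ${...}. That is how get works. Finally, here are the two flow diagrams I drew for Configuration, one for each scenario:

[Figure: flow diagrams of Configuration for the two scenarios]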

The Configuration code is short and well crafted, and the same ideas can be borrowed when building large systems.

1 0