How fsimage and edits Are Merged in Hadoop 1.x and 2.x

Source: Internet  Editor: 程序博客网  Time: 2024/05/29 04:26

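Parts of this article are reproduced from "How fsimage and edits are merged in Hadoop 1.x"
Parts of this article are reproduced from "How fsimage and edits are merged in Hadoop 2.x"
Parts of this article are reproduced from "hadoop 2.2.0 configuration related to fsimage & edit log"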

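1. Background: why fsimage and edits need to be merged
2. Merging fsimage and edits in Hadoop 1.x
- 2.1 Introduction to SecondaryNamenode
- 2.2 How SecondaryNamenode works
- 2.3 SecondaryNamenode configuration
3. Merging edits and fsimage in Hadoop 2.x
- 3.1 How edits and fsimage are kept in sync
- 3.2 How the merge of edits and fsimage is implemented
4. Hadoop 2.x log merging: handling logic and configuration summary
- 4.1 Handling logic for fsimage and edit logs
- 4.2 Configuration summary for fsimage and edit logs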

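1. Background: why fsimage and edits need to be merged

The article "An Analysis of the Hadoop NameNode Metadata Files and Directories" introduced the concepts and roles of fsimage and edits. As noted there, while the NameNode is running, every HDFS update is written directly to edits, so over time the edits file grows very large. This has no real impact on a running NameNode. When the NameNode restarts, however, it first loads the entire fsimage into memory and then replays the records in edits one by one. With a very large edits file, startup becomes very slow, and HDFS stays in safe mode for the whole time, which is clearly not what users want. Can the edits file be kept small while the NameNode is running?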

2. Merging fsimage and edits in Hadoop 1.x

It can. This section describes how Hadoop 1.x merges the edits and fsimage files.

2.1 Introduction to SecondaryNamenode

Anyone who has used Hadoop knows that it runs a SecondaryNamenode process, and from the name it is easy to mistake it for a hot standby of the NameNode. That is not the case.

SecondaryNamenode is a component of the HDFS architecture. It keeps a backup of the NameNode's HDFS metadata and exists to shorten the NameNode's restart time.

2.2 How SecondaryNamenode works

SecondaryNamenode usually runs on a machine of its own. How does it shorten the NameNode's restart time? Its workflow is as follows:

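Step 1: SecondaryNamenode periodically contacts the NameNode and asks it to stop using the edits file and to write new operations to a new file, edits.new, for the time being. The switch is instantaneous, and the upper-level logging functions do not notice any difference.
Step 2: SecondaryNamenode fetches the fsimage and edits files from the NameNode via HTTP GET and stores them in its local directory.
Step 3: SecondaryNamenode loads the downloaded fsimage into memory and then applies the updates recorded in edits one by one, so that the in-memory fsimage is up to date. This is the actual merge of edits and fsimage.
Step 4: When the merge is done, SecondaryNamenode sends the new fsimage file to the NameNode via HTTP POST.
Step 5: The NameNode replaces its old fsimage with the new one received from SecondaryNamenode and replaces the edits file with edits.new. As a result, edits becomes small again.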
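
The core of Step 3, loading an image and replaying edits on top of it, can be illustrated with a toy model. This is only a sketch: the class ToyCheckpoint, the Op enum, and the Edit type are hypothetical names invented for this example, and a real fsimage holds far more than a set of paths.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Toy model only, not Hadoop code: the "namespace" is a sorted set of paths,
// and each edit either creates or deletes one path.
class ToyCheckpoint {
    enum Op { CREATE, DELETE }

    static final class Edit {
        final Op op;
        final String path;

        Edit(Op op, String path) {
            this.op = op;
            this.path = path;
        }
    }

    // Copy the image into memory, then replay the edits one by one.
    // The result plays the role of the new, merged fsimage.
    static TreeSet<String> merge(TreeSet<String> fsimage, List<Edit> edits) {
        TreeSet<String> merged = new TreeSet<>(fsimage);
        for (Edit e : edits) {
            if (e.op == Op.CREATE) {
                merged.add(e.path);
            } else {
                merged.remove(e.path);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        TreeSet<String> fsimage = new TreeSet<>();
        fsimage.add("/a");
        fsimage.add("/b");

        List<Edit> edits = new ArrayList<>();
        edits.add(new Edit(Op.CREATE, "/c"));
        edits.add(new Edit(Op.DELETE, "/a"));

        System.out.println(merge(fsimage, edits)); // prints [/b, /c]
    }
}
```

Once the merged image exists, the replayed edits are no longer needed for recovery, which is why the NameNode can discard them in Step 5.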

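The whole process can be illustrated by the following figure:

[Figure: SecondaryNamenode checkpoint workflow; image not preserved]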

2.3 SecondaryNamenode configuration

The first step above noted that SecondaryNamenode contacts the NameNode periodically. The interval is configurable in core-site.xml; the default setting for the checkpoint interval is:

<property>
  <name>fs.checkpoint.period</name>
  <value>3600</value>
  <description>The number of seconds between two periodic checkpoints.
  </description>
</property>

Even before fs.checkpoint.period has elapsed, a merge can also be triggered by the current size of the edits file, which is controlled by the following setting:

<property>
  <name>fs.checkpoint.size</name>
  <value>67108864</value>
  <description>The size of the current edit log (in bytes) that triggers
      a periodic checkpoint even if the fs.checkpoint.period hasn't expired.
  </description>
</property>

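Once the edits file exceeds this size, a merge is performed even if fs.checkpoint.period has not yet elapsed. Incidentally, the directories where SecondaryNamenode temporarily stores the downloaded fsimage and edits can be set with the following properties: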
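
The interplay of the two thresholds can be sketched as a single predicate. The class CheckpointTrigger and its method isDue are hypothetical names used only for illustration, and treating a value exactly equal to a threshold as "due" is an assumption of this sketch.

```java
// Hypothetical helper, not Hadoop code: decides whether a Hadoop 1.x style
// checkpoint is due, given the two thresholds discussed above.
class CheckpointTrigger {
    private final long periodSeconds; // plays the role of fs.checkpoint.period
    private final long sizeBytes;     // plays the role of fs.checkpoint.size

    CheckpointTrigger(long periodSeconds, long sizeBytes) {
        this.periodSeconds = periodSeconds;
        this.sizeBytes = sizeBytes;
    }

    // A checkpoint is due when EITHER threshold is reached
    // (boundary handling with >= is an assumption of this sketch).
    boolean isDue(long secondsSinceLastCheckpoint, long editsSizeBytes) {
        return secondsSinceLastCheckpoint >= periodSeconds
            || editsSizeBytes >= sizeBytes;
    }

    public static void main(String[] args) {
        // Default values quoted in the text: 3600 seconds and 64 MB.
        CheckpointTrigger trigger = new CheckpointTrigger(3600L, 67108864L);

        // Only 10 minutes have passed, but edits has grown past 64 MB.
        System.out.println(trigger.isDue(600L, 70000000L));
        // Only 10 minutes have passed and edits is still tiny.
        System.out.println(trigger.isDue(600L, 1024L));
    }
}
```

With the defaults, a quiet cluster checkpoints roughly once an hour, while a busy one checkpoints as soon as edits reaches 64 MB.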

<property>
  <name>fs.checkpoint.dir</name>
  <value>${hadoop.tmp.dir}/dfs/namesecondary</value>
  <description>Determines where on the local filesystem the DFS secondary
      name node should store the temporary images to merge.
      If this is a comma-delimited list of directories then the image is
      replicated in all of the directories for redundancy.
  </description>
</property>
<property>
  <name>fs.checkpoint.edits.dir</name>
  <value>${fs.checkpoint.dir}</value>
  <description>Determines where on the local filesystem the DFS secondary
      name node should store the temporary edits to merge.
      If this is a comma-delimited list of directories then the edits is
      replicated in all of the directories for redundancy.
      Default value is same as fs.checkpoint.dir
  </description>
</property>

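From the description above it is clear that SecondaryNamenode is not a hot standby of the NameNode at all; it only merges fsimage and edits. The fsimage it holds is not the latest, because while it is downloading fsimage and edits from the NameNode, new updates are already being written to edits.new, and those updates never reach SecondaryNamenode. Still, if the fsimage on the NameNode is really damaged, the fsimage on SecondaryNamenode can be used to replace it. It is no longer the latest fsimage, but it keeps the loss to a minimum.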

3. Merging edits and fsimage in Hadoop 2.x

Hadoop 2.x resolves the NameNode single point of failure, and in an HA deployment the SecondaryNameNode is no longer used. In Hadoop 1.x, SecondaryNameNode merged fsimage and edits to keep the edits file small and thereby shorten the NameNode's restart time. Without SecondaryNameNode, how does Hadoop 2.x merge fsimage and edits?

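3.1 How edits and fsimage are kept in sync

Hadoop 2.x provides an HA mechanism that removes the NameNode single point of failure. It is enabled by configuring an odd number of JournalNodes (how to configure them is outside the scope of this article). HA runs two NameNodes in the same cluster, an Active NN and a Standby NN. At any time only one of them is in the Active state, and the other is in the Standby state. The Active NN handles all client operations in the cluster. The Standby NN serves as a backup: it maintains enough state to provide a fast failover when necessary. Because the fsimage and edits on both NameNodes are kept up to date, the cluster still has a current fsimage if either NameNode goes down.

To keep the Standby NN in sync with the Active NN, that is, to keep their metadata consistent, both communicate with the JournalNode daemons. Whenever the Active NN performs a namespace modification, it must persist the change to more than half of the JournalNodes (in the form of edits log entries). The Standby NN watches the edits log for changes, reads the edits from the JNs, and applies them to its own in-memory namespace. If the Active NN fails, the Standby NN makes sure it has read all edits from the JNs before switching to the Active state. Reading all edits first guarantees that, before the failover takes effect, its namespace is fully in sync with that of the failed Active NN.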
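
The "more than half of the JournalNodes" rule also explains why an odd number of JournalNodes is recommended. The arithmetic can be sketched as follows; the class JournalQuorum and its methods are hypothetical names for this illustration, not part of Hadoop.

```java
// Hypothetical helper, not Hadoop code: majority arithmetic for a set of
// JournalNodes, as implied by the "more than half" rule above.
class JournalQuorum {
    // Smallest number of JournalNodes that forms a strict majority.
    static int majority(int journalNodes) {
        return journalNodes / 2 + 1;
    }

    // An edit counts as persisted once a majority has acknowledged it.
    static boolean isDurable(int acks, int journalNodes) {
        return acks >= majority(journalNodes);
    }

    // Number of JournalNodes that may fail while writes still succeed.
    static int toleratedFailures(int journalNodes) {
        return journalNodes - majority(journalNodes);
    }

    public static void main(String[] args) {
        for (int n = 3; n <= 5; n++) {
            System.out.println(n + " JournalNodes: majority=" + majority(n)
                + ", tolerated failures=" + toleratedFailures(n));
        }
    }
}
```

Three and four JournalNodes both tolerate a single failure, so the fourth node adds cost without adding fault tolerance; five nodes are needed to tolerate two failures.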

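3.2 How the merge of edits and fsimage is implemented

So how does this mechanism merge fsimage and edits? The Standby NameNode continuously runs a thread called CheckpointerThread, which calls the doWork() function of the StandbyCheckpointer class. doWork() wakes up every Math.min(checkpointCheckPeriod, checkpointPeriod) seconds to check whether a merge is needed. The relevant code is: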

// Excerpt from doWork(): sleep between two checks
try {
  Thread.sleep(1000 * checkpointConf.getCheckPeriod());
} catch (InterruptedException ie) {
}

// The check interval is the smaller of the two configured periods
public long getCheckPeriod() {
  return Math.min(checkpointCheckPeriod, checkpointPeriod);
}

// Both periods are read from the configuration
checkpointCheckPeriod = conf.getLong(
    DFS_NAMENODE_CHECKPOINT_CHECK_PERIOD_KEY,
    DFS_NAMENODE_CHECKPOINT_CHECK_PERIOD_DEFAULT);
checkpointPeriod = conf.getLong(DFS_NAMENODE_CHECKPOINT_PERIOD_KEY,
                                DFS_NAMENODE_CHECKPOINT_PERIOD_DEFAULT);

The checkpointCheckPeriod and checkpointPeriod variables above are read from the following two properties in hdfs-site.xml:

<property>
  <name>dfs.namenode.checkpoint.period</name>
  <value>3600</value>
  <description>The number of seconds between two periodic checkpoints.
  </description>
</property>
<property>
  <name>dfs.namenode.checkpoint.check.period</name>
  <value>60</value>
  <description>The SecondaryNameNode and CheckpointNode will poll the NameNode
  every 'dfs.namenode.checkpoint.check.period' seconds to query the number
  of uncheckpointed transactions.
  </description>
</property>

A checkpoint is performed when either of the following two conditions is met:

boolean needCheckpoint = false;
if (uncheckpointed >= checkpointConf.getTxnCount()) {
  LOG.info("Triggering checkpoint because there have been " +
      uncheckpointed + " txns since the last checkpoint, which " +
      "exceeds the configured threshold " +
      checkpointConf.getTxnCount());
  needCheckpoint = true;
} else if (secsSinceLast >= checkpointConf.getPeriod()) {
  LOG.info("Triggering checkpoint because it has been " +
      secsSinceLast + " seconds since the last checkpoint, which " +
      "exceeds the configured interval " + checkpointConf.getPeriod());
  needCheckpoint = true;
}

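When needCheckpoint is set to true, the doWork() function of the StandbyCheckpointer class calls doCheckpoint() to carry out the actual checkpoint. Once fsimage and edits have been merged, the merged fsimage is uploaded to the Active NameNode. After the Active NameNode has finished downloading the merged fsimage, it deletes its old fsimage and clears its old edits files. The steps can be summarized as follows:

Step 1: Once HA is configured, all client updates are written to the shared directory on the JournalNodes, which is configured as follows: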

<property>
  <name>dfs.namenode.shared.edits.dir</name>
  <value>qjournal://XXXX/mycluster</value>
</property>
<property>
  <name>dfs.journalnode.edits.dir</name>
  <value>/export1/hadoop2x/dfs/journal</value>
</property>

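Step 2: The Active NameNode persists its edits to the JournalNodes, and the Standby NameNode reads them from the JournalNodes' shared edits directory to keep its own namespace in sync.

Step 3: The StandbyCheckpointer class on the Standby NameNode periodically checks whether the merge conditions are met and, if they are, merges the fsimage and edits files.

Step 4: After the merge, the StandbyCheckpointer class on the Standby NameNode uploads the merged fsimage to the corresponding directory on the Active NameNode.

Step 5: After receiving the latest fsimage file, the Active NameNode cleans up the old fsimage and edits files.

Step 6: With the steps above the merge of fsimage and edits is complete. Thanks to the HA mechanism, both the Standby NameNode and the Active NameNode hold the latest fsimage and edits files (in Hadoop 1.x, by contrast, the fsimage and edits on SecondaryNameNode were not the latest).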

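4. Hadoop 2.x log merging: handling logic and configuration summary

4.1 Handling logic for fsimage and edit logs

The purgeOldStorage() method of the class org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager describes how fsimage and edit logs are handled:

1. Find the smallest txid among the fsimage files that are retained, and delete every fsimage older than it.

2. That smallest txid minus dfs.namenode.num.extra.edits.retained gives the boundary below which edit log transactions can be deleted.

3. If the number of extra edit log segments still retained exceeds dfs.namenode.max.extra.edits.segments.retained, the oldest of those segments are deleted as well.

The dfs.namenode.max.extra.edits.segments.retained value used in this logic is set with the following property: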

<property>
  <name>dfs.namenode.max.extra.edits.segments.retained</name>
  <value>10000</value>
  <description>The maximum number of extra edit log segments which should be retained
  beyond what is minimally necessary for a NN restart. When used in conjunction with
  dfs.namenode.num.extra.edits.retained, this configuration property serves to cap
  the number of extra edits files to a reasonable value.
  </description>
</property>
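
The retention rule described in this section can be sketched as a small calculation. This is a simplified model built from the description above, not the code of NNStorageRetentionManager: the class EditRetention, the method purgeLogsFrom, and the representation of segments as {firstTxId, lastTxId} pairs are all assumptions made for this illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Simplified model, not Hadoop code: computes the first txid that must be
// kept, so that every edit log segment ending below it may be purged.
class EditRetention {
    // segments: {firstTxId, lastTxId} pairs, sorted by ascending txid.
    static long purgeLogsFrom(long minImageTxId, long numExtraEdits,
                              int maxExtraSegments, long[][] segments) {
        // Everything after the oldest retained image is needed for a restart.
        long minimumRequiredTxId = minImageTxId + 1;
        // Keep numExtraEdits additional transactions before that point.
        long purgeFrom = Math.max(0, minimumRequiredTxId - numExtraEdits);

        // Collect the "extra" segments: kept only because of numExtraEdits.
        Deque<long[]> extra = new ArrayDeque<>();
        for (long[] s : segments) {
            if (s[1] >= purgeFrom && s[0] < minimumRequiredTxId) {
                extra.addLast(s);
            }
        }
        // Cap the number of extra segments by dropping the oldest ones.
        while (extra.size() > maxExtraSegments) {
            purgeFrom = extra.removeFirst()[1] + 1;
        }
        return purgeFrom;
    }

    public static void main(String[] args) {
        long[][] segments = {
            {1, 20}, {21, 40}, {41, 60}, {61, 80}, {81, 100}, {101, 120}
        };
        // Oldest retained image at txid 100, 50 extra edits, at most 2 extra segments.
        System.out.println(purgeLogsFrom(100, 50, 2, segments)); // prints 61
    }
}
```

In this example, 50 extra edits alone would keep transactions from txid 51 onward, which touches three segments; the cap of two extra segments moves the boundary up to txid 61.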

4.2 Configuration summary for fsimage and edit logs

1. Directories where the secondary namenode stores temporary images

<property>
  <name>dfs.namenode.checkpoint.dir</name>
  <value>file://${hadoop.tmp.dir}/dfs/namesecondary</value>
  <description>Determines where on the local filesystem the DFS secondary
      name node should store the temporary images to merge.
      If this is a comma-delimited list of directories then the image is
      replicated in all of the directories for redundancy.
  </description>
</property>
<property>
  <name>dfs.namenode.checkpoint.edits.dir</name>
  <value>${dfs.namenode.checkpoint.dir}</value>
  <description>Determines where on the local filesystem the DFS secondary
      name node should store the temporary edits to merge.
      If this is a comma-delimited list of directories then the edits is
      replicated in all of the directories for redundancy.
      Default value is same as dfs.namenode.checkpoint.dir
  </description>
</property>

2. The checkpoint period and the checkpoint check period

<property>
  <name>dfs.namenode.checkpoint.period</name>
  <value>3600</value>
  <description>The number of seconds between two periodic checkpoints.
  </description>
</property>
<property>
  <name>dfs.namenode.checkpoint.check.period</name>
  <value>60</value>
  <description>The SecondaryNameNode and CheckpointNode will poll the NameNode
  every 'dfs.namenode.checkpoint.check.period' seconds to query the number
  of uncheckpointed transactions.
  </description>
</property>

3. The maximum number of txns; exceeding it triggers a merge

<property>
  <name>dfs.namenode.checkpoint.txns</name>
  <value>1000000</value>
  <description>The Secondary NameNode or CheckpointNode will create a checkpoint
  of the namespace every 'dfs.namenode.checkpoint.txns' transactions, regardless
  of whether 'dfs.namenode.checkpoint.period' has expired.
  </description>
</property>

4. The maximum number of checkpoint retries

<property>
  <name>dfs.namenode.checkpoint.max-retries</name>
  <value>3</value>
  <description>The SecondaryNameNode retries failed checkpointing. If the
  failure occurs while loading fsimage or replaying edits, the number of
  retries is limited by this variable.
  </description>
</property>

5. How many fsimage files and edit log transactions to retain; by default 2 fsimage files and 1,000,000 extra edit log transactions are kept

<property>
  <name>dfs.namenode.num.checkpoints.retained</name>
  <value>2</value>
  <description>The number of image checkpoint files that will be retained by
  the NameNode and Secondary NameNode in their storage directories. All edit
  logs necessary to recover an up-to-date namespace from the oldest retained
  checkpoint will also be retained.
  </description>
</property>
<property>
  <name>dfs.namenode.num.extra.edits.retained</name>
  <value>1000000</value>
  <description>The number of extra transactions which should be retained
  beyond what is minimally necessary for a NN restart. This can be useful for
  audit purposes or for an HA setup where a remote Standby Node may have
  been offline for some time and need to have a longer backlog of retained
  edits in order to start again.
  Typically each edit is on the order of a few hundred bytes, so the default
  of 1 million edits should be on the order of hundreds of MBs or low GBs.
  NOTE: Fewer extra edits may be retained than value specified for this setting
  if doing so would mean that more segments would be retained than the number
  configured by dfs.namenode.max.extra.edits.segments.retained.
  </description>
</property>

0 0