On an HDFS NameNode crash caused by a machine-room switch failure

In our HDFS HA setup there is one active NN, one standby NN and three JNs, spread over three machines: 146.66, 146.67 and 146.68. 66 runs one JN; 67 runs one JN and the active NN; 68 runs one JN and the standby NN. 67 and 68 sit in the same machine room; 66 is in a different one.


A switch in one of the machine rooms failed, so 67 and 68 could no longer communicate with 66. As a result, the active NN on 67 could only write to the JNs on 67 and 68, and the standby NN on 68 could only read from the JNs on 67 and 68.
According to the official Hadoop documentation (http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html), the NNs and the JN quorum should keep working in this situation: "There must be at least 3 JournalNode daemons, since edit log modifications must be written to a majority of JNs. This will allow the system to tolerate the failure of a single machine. You may also run more than 3 JournalNodes, but in order to actually increase the number of failures the system can tolerate, you should run an odd number of JNs, (i.e. 3, 5, 7, etc.). Note that when running with N JournalNodes, the system can tolerate at most (N - 1) / 2 failures and continue to function normally."


But in reality the NN went down, even though with N = 3 the system should tolerate one failure and only the JN on 66 had become unreachable from the active NN. Excerpts from the relevant log:
2015-11-05 03:01:37,135 FATAL namenode.FSEditLog (JournalSet.java:mapJournalsAndReportErrors(364)) - Error: flush failed for required journal (JournalAndStream(mgr=QJM to [192.168.146.66:8485, 192.168.146.67:8485, 192.168.146.68:8485], stream=QuorumOutputStream starting at txid 4354654650))
java.io.IOException: Timed out waiting 20000ms for a quorum of nodes to respond.
at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:137)
at org.apache.hadoop.hdfs.qjournal.client.QuorumOutputStream.flushAndSync(QuorumOutputStream.java:107)
at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:113)
at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:107)
at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream$8.apply(JournalSet.java:499)
at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:359)
at org.apache.hadoop.hdfs.server.namenode.JournalSet.access$100(JournalSet.java:57)
at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream.flush(JournalSet.java:495)
at org.apache.hadoop.hdfs.server.namenode.FSEditLog.logSync(FSEditLog.java:623)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2748)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:590)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:440)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1594)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
2015-11-05 03:01:37,135 WARN  client.QuorumJournalManager (QuorumOutputStream.java:abort(72)) - Aborting QuorumOutputStream starting at txid 4354654650
2015-11-05 03:01:37,139 INFO  util.ExitUtil (ExitUtil.java:terminate(124)) - Exiting with status 1
2015-11-05 03:01:37,145 INFO  namenode.NameNode (StringUtils.java:run(640)) - SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at storm14667/192.168.146.67


Reading the HDFS source code, its behavior is consistent with what we saw. So the question is: is the official documentation wrong ("at most"!?), or is our understanding, and hence our configuration, wrong?


Looking at the call stack, the NN was allocating a new block to serve a client request. FSNamesystem.java:
  /**
   * The client would like to obtain an additional block for the indicated
   * filename (which is being written-to).  Return an array that consists
   * of the block, plus a set of machines.  The first on this list should
   * be where the client writes data.  Subsequent items in the list must
   * be provided in the connection to the first datanode.
   *
   * Make sure the previous blocks have been reported by datanodes and
   * are replicated.  Will return an empty 2-elt array if we want the
   * client to "try again later".
   */
  LocatedBlock getAdditionalBlock(String src, long fileId, String clientName,
      ExtendedBlock previous, Set<Node> excludedNodes,
      List<String> favoredNodes) throws IOException {
    LocatedBlock[] onRetryBlock = new LocatedBlock[1];
    DatanodeStorageInfo targets[] = getNewBlockTargets(src, fileId,
        clientName, previous, excludedNodes, favoredNodes, onRetryBlock);
    if (targets == null) {
      assert onRetryBlock[0] != null : "Retry block is null";
      // This is a retry. Just return the last block.
      return onRetryBlock[0];
    }
    LocatedBlock newBlock = storeAllocatedBlock(
        src, fileId, clientName, previous, targets);
    return newBlock;
  }


  /**
   * Part II of getAdditionalBlock().
   * Should repeat the same analysis of the file state as in Part 1,
   * but under the write lock.
   * If the conditions still hold, then allocate a new block with
   * the new targets, add it to the INode and to the BlocksMap.
   */
  LocatedBlock storeAllocatedBlock(String src, long fileId, String clientName,
      ExtendedBlock previous, DatanodeStorageInfo[] targets) throws IOException {
    Block newBlock = null;
    long offset;
    checkOperation(OperationCategory.WRITE);
    waitForLoadingFSImage();
    writeLock();
    try {
        // a few dozen lines omitted here
    } finally {
      writeUnlock();
    }
    getEditLog().logSync(); // Note: this syncs the edit log, to the NN's local dirs and/or the JNs, depending on configuration


    // Return located block
    return makeLocatedBlock(newBlock, targets, offset);
  }


In FSNamesystem.storeAllocatedBlock() the block is allocated and the corresponding edit recorded under the write lock, and logSync() then flushes that edit to durable storage before the new block is returned to the client. In other words, HDFS uses a write-ahead log (WAL) for write operations.
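
To make that ordering concrete, here is a minimal, hypothetical sketch of the write-ahead pattern (illustration only, not HDFS code; all names are made up): the in-memory change and the edit record are produced under the write lock, and the log must be synced to durable storage before the result goes back to the client.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Minimal sketch of the WAL ordering used for getAdditionalBlock()-style operations.
public class WalSketch {
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  private final List<String> namespace = new ArrayList<>();   // in-memory metadata
  private final List<String> editBuffer = new ArrayList<>();  // in-memory edit log buffer

  public String allocateBlock(String file) {
    String blockId;
    lock.writeLock().lock();
    try {
      blockId = "blk_" + (namespace.size() + 1);
      namespace.add(file + " -> " + blockId);                 // 1. apply the change in memory
      editBuffer.add("OP_ADD_BLOCK " + file + " " + blockId); // 2. record the edit
    } finally {
      lock.writeLock().unlock();
    }
    logSync();       // 3. persist the edit (local dirs and/or JNs) before answering
    return blockId;  // 4. only now does the client see the new block
  }

  private void logSync() {
    // In HDFS this is FSEditLog.logSync(); if the flush to a required journal fails,
    // the NameNode terminates instead of acknowledging the operation.
    editBuffer.clear(); // pretend the buffered edits were written out
  }

  public static void main(String[] args) {
    System.out.println(new WalSketch().allocateBlock("/user/test/file1"));
  }
}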




FSEditLog.java:
  /**
   * Sync all modifications done by this thread.
   *
   * The internal concurrency design of this class is as follows:
   *   - Log items are written synchronized into an in-memory buffer,
   *     and each assigned a transaction ID.
   *   - When a thread (client) would like to sync all of its edits, logSync()
   *     uses a ThreadLocal transaction ID to determine what edit number must
   *     be synced to.
   *   - The isSyncRunning volatile boolean tracks whether a sync is currently
   *     under progress.
   *
   * The data is double-buffered within each edit log implementation so that
   * in-memory writing can occur in parallel with the on-disk writing.
   *
   * Each sync occurs in three steps:
   *   1. synchronized, it swaps the double buffer and sets the isSyncRunning
   *      flag.
   *   2. unsynchronized, it flushes the data to storage
   *   3. synchronized, it resets the flag and notifies anyone waiting on the
   *      sync.
   *
   * The lack of synchronization on step 2 allows other threads to continue
   * to write into the memory buffer while the sync is in progress.
   * Because this step is unsynchronized, actions that need to avoid
   * concurrency with sync() should be synchronized and also call
   * waitForSyncToFinish() before assuming they are running alone.
   */
  public void logSync() {
    long syncStart = 0;


    // Fetch the transactionId of this thread.
    long mytxid = myTransactionId.get().txid;


    boolean sync = false;
    try {
      EditLogOutputStream logStream = null;
      //
      // a hundred or so lines omitted here
      //
      // do the sync
      long start = monotonicNow();
      try {
        if (logStream != null) {
          logStream.flush(); // Note: this writes the edit log; depending on configuration, to local dirs and/or the JournalNodes
        }
      } catch (IOException ex) {
        synchronized (this) {
          final String msg =
              "Could not sync enough journals to persistent storage. "
              + "Unsynced transactions: " + (txid - synctxid);
          LOG.fatal(msg, new Exception());
          synchronized(journalSetLock) {
            IOUtils.cleanup(LOG, journalSet);
          }
          terminate(1, msg);
        }
      }
    } finally {
      // a dozen or so lines omitted here
    }
  }
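
To illustrate the three-step, double-buffered sync described in the javadoc above, here is a simplified, hypothetical sketch (class and field names are invented; the real FSEditLog is far more involved):

import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the double-buffered sync; not the real FSEditLog.
public class DoubleBufferSketch {
  private List<String> bufCurrent = new ArrayList<>(); // writers append here
  private List<String> bufReady = new ArrayList<>();   // buffer being flushed
  private boolean isSyncRunning = false;

  public synchronized void logEdit(String edit) {
    bufCurrent.add(edit); // writers keep appending, even while a flush is in progress
  }

  public void logSync() throws InterruptedException {
    synchronized (this) {
      while (isSyncRunning) {
        wait();                    // someone else is flushing; wait for them
      }
      List<String> tmp = bufReady; // step 1 (synchronized): swap the double buffer
      bufReady = bufCurrent;
      bufCurrent = tmp;
      isSyncRunning = true;        // ... and mark the sync as running
    }
    flushToStorage(bufReady);      // step 2 (unsynchronized): flush to storage, so other
                                   // threads can still write into bufCurrent meanwhile
    synchronized (this) {
      bufReady.clear();
      isSyncRunning = false;       // step 3 (synchronized): reset the flag
      notifyAll();                 // ... and wake up anyone waiting on the sync
    }
  }

  private void flushToStorage(List<String> edits) {
    // In HDFS this is EditLogOutputStream.flush(); with QJM the edits must reach a
    // quorum of JournalNodes, otherwise an IOException is thrown and the NN exits.
  }

  public static void main(String[] args) throws InterruptedException {
    DoubleBufferSketch log = new DoubleBufferSketch();
    log.logEdit("OP_ADD_BLOCK txid=1");
    log.logSync();
    System.out.println("synced");
  }
}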




Back to JournalSet.java:


  /**
   * An implementation of EditLogOutputStream that applies a requested method on
   * all the journals that are currently active.
   */
  private class JournalSetOutputStream extends EditLogOutputStream {
    //
    // a hundred or so lines omitted here
    //
    @Override
    public void flush() throws IOException {
      mapJournalsAndReportErrors(new JournalClosure() {
        @Override
        public void apply(JournalAndStream jas) throws IOException {
          if (jas.isActive()) {
            jas.getCurrentStream().flush();
          }
        }
      }, "flush");
    }
    //
    // a hundred or so lines omitted here
    //
  }
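
For the QJM journal, the jas.getCurrentStream().flush() call above lands in QuorumOutputStream.flushAndSync(), which, as the stack trace in our log shows, calls AsyncLoggerSet.waitForWriteQuorum() and throws an IOException if a majority of JNs does not acknowledge within the write timeout (dfs.qjournal.write-txns.timeout.ms, 20000 ms by default). That is exactly the "Timed out waiting 20000ms" message we saw. Below is a simplified, hypothetical sketch of the quorum-wait pattern, not the actual HDFS implementation:

import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of "wait for a write quorum"; the real logic lives in
// AsyncLoggerSet.waitForWriteQuorum(), which produced the timeout in our log.
public class QuorumFlushSketch {
  static void flushToQuorum(List<CompletableFuture<Void>> journalAcks, long timeoutMs)
      throws IOException {
    int majority = journalAcks.size() / 2 + 1;
    AtomicInteger successes = new AtomicInteger();
    CompletableFuture<Void> quorumReached = new CompletableFuture<>();

    for (CompletableFuture<Void> ack : journalAcks) {
      ack.thenRun(() -> {
        if (successes.incrementAndGet() >= majority) {
          quorumReached.complete(null); // a majority of JNs acked the flush
        }
      });
    }
    try {
      quorumReached.get(timeoutMs, TimeUnit.MILLISECONDS);
    } catch (Exception e) {
      // This mirrors the message we saw in the NN log.
      throw new IOException(
          "Timed out waiting " + timeoutMs + "ms for a quorum of nodes to respond.", e);
    }
  }

  public static void main(String[] args) throws IOException {
    // Two of three "JournalNodes" ack the flush, one (the unreachable 66) never does:
    // the quorum still succeeds, which is why we expected the NN to survive.
    List<CompletableFuture<Void>> acks = Arrays.asList(
        CompletableFuture.completedFuture(null),  // JN on 67
        CompletableFuture.completedFuture(null),  // JN on 68
        new CompletableFuture<Void>());           // JN on 66, no response
    flushToQuorum(acks, 20000);
    System.out.println("quorum reached");
  }
}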


/**
 * Manages a collection of Journals. None of the methods are synchronized, it is
 * assumed that FSEditLog methods, that use this class, use proper
 * synchronization.
 */
public class JournalSet implements JournalManager {
  //
  // several hundred lines omitted here
  //
  private final List<JournalAndStream> journals =
      new CopyOnWriteArrayList<JournalSet.JournalAndStream>();
  //
  // nearly a hundred lines omitted here
  //
  /**
   * Apply the given operation across all of the journal managers, disabling
   * any for which the closure throws an IOException.
   * @param closure {@link JournalClosure} object encapsulating the operation.
   * @param status message used for logging errors (e.g. "opening journal")
   * @throws IOException If the operation fails on all the journals.
   */
  // Note: what the javadoc above says (the operation fails only if it fails "on all the journals") does not match the wording of the official Hadoop documentation; then again, maybe our understanding is incomplete or inaccurate
  private void mapJournalsAndReportErrors(JournalClosure closure, String status)
      throws IOException {


    List<JournalAndStream> badJAS = Lists.newLinkedList();
    for (JournalAndStream jas : journals) {
      try {
        closure.apply(jas);
      } catch (Throwable t) {
        if (jas.isRequired()) {
          final String msg = "Error: " + status + " failed for required journal ("
            + jas + ")";
          LOG.fatal(msg, t); // Note: this is the FATAL entry that shows up in our log
          // If we fail on *any* of the required journals, then we must not
          // continue on any of the other journals. Abort them to ensure that
          // retry behavior doesn't allow them to keep going in any way.
          abortAllJournals();
          // the current policy is to shutdown the NN on errors to shared edits
          // dir. There are many code paths to shared edits failures - syncs,
          // roll of edits etc. All of them go through this common function
          // where the isRequired() check is made. Applying exit policy here
          // to catch all code paths.
          terminate(1, msg);
        } else {
          // a few dozen lines omitted here
        }
      }
    }
  }
  //
  // several hundred lines omitted here
  //
}




Back again to FSEditLog.java, which is where JournalSet.journals gets initialized:


/**
 * FSEditLog maintains a log of the namespace modifications.
 *
 */
@InterfaceAudience.Private
@InterfaceStability.Evolving
public class FSEditLog implements LogsPurgeable {
  //
  // several hundred lines omitted here
  //
  private final List<URI> editsDirs; // Note: this field is closely tied to our configuration (hdfs-site.xml)


  /**
   * The edit directories that are shared between primary and secondary.
   */
  private final List<URI> sharedEditsDirs; // Note: this field is closely tied to our configuration (hdfs-site.xml)
  //
  // several hundred lines omitted here
  //
  /**
   * Constructor for FSEditLog. Underlying journals are constructed, but
   * no streams are opened until open() is called.
   *
   * @param conf The namenode configuration
   * @param storage Storage object used by namenode
   * @param editsDirs List of journals to use
   */
  FSEditLog(Configuration conf, NNStorage storage, List<URI> editsDirs) {
    isSyncRunning = false;
    this.conf = conf;
    this.storage = storage;
    metrics = NameNode.getNameNodeMetrics();
    lastPrintTime = monotonicNow();


    // If this list is empty, an error will be thrown on first use
    // of the editlog, as no journals will exist
    this.editsDirs = Lists.newArrayList(editsDirs);


    this.sharedEditsDirs = FSNamesystem.getSharedEditsDirs(conf);
  }


  public synchronized void initJournalsForWrite() {
    Preconditions.checkState(state == State.UNINITIALIZED ||
        state == State.CLOSED, "Unexpected state: %s", state);


    initJournals(this.editsDirs);
    state = State.BETWEEN_LOG_SEGMENTS;
  }


  public synchronized void initSharedJournalsForRead() {
    if (state == State.OPEN_FOR_READING) {
      LOG.warn("Initializing shared journals for READ, already open for READ",
          new Exception());
      return;
    }
    Preconditions.checkState(state == State.UNINITIALIZED ||
        state == State.CLOSED);


    initJournals(this.sharedEditsDirs);
    state = State.OPEN_FOR_READING;
  }


  private synchronized void initJournals(List<URI> dirs) {
    int minimumRedundantJournals = conf.getInt(
        DFSConfigKeys.DFS_NAMENODE_EDITS_DIR_MINIMUM_KEY,
        DFSConfigKeys.DFS_NAMENODE_EDITS_DIR_MINIMUM_DEFAULT);


    synchronized(journalSetLock) {
      journalSet = new JournalSet(minimumRedundantJournals);


      for (URI u : dirs) {
        // Note: this boolean is crucial; it determines whether the active NN can tolerate a failed write to this journal
        boolean required = FSNamesystem.getRequiredNamespaceEditsDirs(conf)
            .contains(u);
        // Note: our configuration (hdfs-site.xml) does not explicitly configure any local-scheme (file://) directories, so it is the else branch below that actually runs
        if (u.getScheme().equals(NNStorage.LOCAL_URI_SCHEME)) {
          StorageDirectory sd = storage.getStorageDirectory(u);
          if (sd != null) {
            journalSet.add(new FileJournalManager(conf, sd, storage),
                required, sharedEditsDirs.contains(u));
          }
        } else {
          journalSet.add(createJournal(u), required,
              sharedEditsDirs.contains(u));
        }
      }
    }


    if (journalSet.isEmpty()) {
      LOG.error("No edits directories configured!");
    }
  }
  //
  // several hundred lines omitted here
  //
}




On to FSNamesystem.java:


  /**
   * Get all edits dirs which are required. If any shared edits dirs are
   * configured, these are also included in the set of required dirs.
   *
   * @param conf the HDFS configuration.
   * @return all required dirs.
   */
  public static Collection<URI> getRequiredNamespaceEditsDirs(Configuration conf) {
    Set<URI> ret = new HashSet<URI>();
    ret.addAll(getStorageDirs(conf, DFS_NAMENODE_EDITS_DIR_REQUIRED_KEY)); // Note: we did not configure this key, so this first addAll adds nothing
    ret.addAll(getSharedEditsDirs(conf));
    return ret;
  }


  private static Collection<URI> getStorageDirs(Configuration conf,
                                                String propertyName) {
    Collection<String> dirNames = conf.getTrimmedStringCollection(propertyName);
    StartupOption startOpt = NameNode.getStartupOption(conf);
    if(startOpt == StartupOption.IMPORT) {
      // In case of IMPORT this will get rid of default directories
      // but will retain directories specified in hdfs-site.xml
      // When importing image from a checkpoint, the name-node can
      // start with empty set of storage directories.
      Configuration cE = new HdfsConfiguration(false);
      cE.addResource("core-default.xml");
      cE.addResource("core-site.xml");
      cE.addResource("hdfs-default.xml");
      Collection<String> dirNames2 = cE.getTrimmedStringCollection(propertyName);
      dirNames.removeAll(dirNames2);
      if(dirNames.isEmpty())
        LOG.warn("!!! WARNING !!!" +
          "\n\tThe NameNode currently runs without persistent storage." +
          "\n\tAny changes to the file system meta-data may be lost." +
          "\n\tRecommended actions:" +
          "\n\t\t- shutdown and restart NameNode with configured \""
          + propertyName + "\" in hdfs-site.xml;" +
          "\n\t\t- use Backup Node as a persistent and up-to-date storage " +
          "of the file system meta-data.");
    } else if (dirNames.isEmpty()) {
      dirNames = Collections.singletonList(
          DFSConfigKeys.DFS_NAMENODE_EDITS_DIR_DEFAULT);
    }
    return Util.stringCollectionAsURIs(dirNames);
  }


  /**
   * Returns edit directories that are shared between primary and secondary.
   * @param conf configuration
   * @return collection of edit directories from {@code conf}
   */
  public static List<URI> getSharedEditsDirs(Configuration conf) {
    // don't use getStorageDirs here, because we want an empty default
    // rather than the dir in /tmp
    Collection<String> dirNames = conf.getTrimmedStringCollection(
        DFS_NAMENODE_SHARED_EDITS_DIR_KEY);
    return Util.stringCollectionAsURIs(dirNames);
  }
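
To see which journals end up required under a given configuration, getRequiredNamespaceEditsDirs() can be called directly. The following is a hypothetical check (it assumes the HDFS jars are on the classpath; the class name and the expectation in the comment are ours, not taken from the incident):

import java.net.URI;
import java.util.Collection;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hdfs.HdfsConfiguration;
import org.apache.hadoop.hdfs.server.namenode.FSNamesystem;

// Hypothetical check of which edits dirs are treated as "required" for our settings.
public class RequiredEditsDirsCheck {
  public static void main(String[] args) {
    Configuration conf = new HdfsConfiguration();
    conf.set("dfs.namenode.shared.edits.dir",
        "qjournal://hadoop146066.ysc.com:8485;storm14667:8485;storm14668:8485/hadoopcluster");
    // dfs.namenode.edits.dir.required is deliberately left unset, as in our hdfs-site.xml.
    Collection<URI> required = FSNamesystem.getRequiredNamespaceEditsDirs(conf);
    // The returned set contains the qjournal URI (shared edits dirs are always added),
    // so in initJournals() the QJM journal is constructed with required == true.
    System.out.println(required);
  }
}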




Finally, back to our hdfs-site.xml. The relevant configuration options are:
dfs.namenode.shared.edits.dir
dfs.journalnode.edits.dir
dfs.namenode.name.dir
dfs.namenode.edits.dir
dfs.namenode.edits.dir.required
dfs.namenode.edits.dir.minimum
See http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml for their exact meanings.


Of these, we did not configure the last three at all, so they take their defaults: dfs.namenode.edits.dir falls back to the value of dfs.namenode.name.dir, dfs.namenode.edits.dir.required defaults to empty, and dfs.namenode.edits.dir.minimum defaults to 1.


    <property>
      <name>dfs.namenode.shared.edits.dir</name>
      <value>qjournal://hadoop146066.ysc.com:8485;storm14667:8485;storm14668:8485/hadoopcluster</value>
    </property>


In our configuration all three JNs are listed explicitly in dfs.namenode.shared.edits.dir, so all three are required (the crucial boolean flagged in the code above). According to the code, the effect of configuring things this way is that if any one JN cannot be reached, the active NN will shut down.


Here are a few discussions related to these configuration options:
https://issues.apache.org/jira/browse/HDFS-4342
https://groups.google.com/a/cloudera.org/forum/#!topic/cdh-user/I4YRcmiVcBY