A Hive Job Failure Causes the NameNode on yhbd01 to Shut Down


Reposted from: http://rocketeer.leanote.com/post/Hive-Job-%E6%95%85%E9%9A%9C%EF%BC%8C%E5%AF%BC%E8%87%B4yhbd01-Namenode-Shutdown

This is a write-up from when I had just joined 有货 and started tuning the cluster. I think the line of reasoning is useful for anyone new to cluster tuning, so I am recording it here.

Summary:

When you hit this kind of failure, work on it from three angles:
(1) the JournalNode timeouts;
(2) JVM parameter tuning;
(3) tuning the yarn-site.xml and mapred-site.xml configuration files.

 

 

1: The job:

/datacenter/shell/user/full/bdl_social_test1.sh
While running the full (historical) data load, the NameNode went down!

2: Error log:
/var/log/hadoop-hdfs/hadoop-hdfs-namenode-yhbd01.log
Contents:
2015-08-27 10:23:04,392 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: yhbd02/10.171.81.96:8485. Already tried 7 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-08-27 10:23:05,393 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: yhbd02/10.171.81.96:8485. Already tried 8 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-08-27 10:23:06,393 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: yhbd02/10.171.81.96:8485. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-08-27 10:23:15,483 INFO org.apache.hadoop.hdfs.server.namenode.FileJournalManager: Finalizing edits file /Data/cloudera/hadoop/dfs/name/current/edits_inprogress_0000000000029706041 -> /Data/cloudera/hadoop/dfs/name/current/edits_0000000000029706041-0000000000029706137
2015-08-27 10:23:15,484 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Starting log segment at 29706138
2015-08-27 10:23:16,485 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: yhbd02/10.171.81.96:8485. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-08-27 10:23:17,486 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: yhbd02/10.171.81.96:8485. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-08-27 10:23:18,486 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: yhbd02/10.171.81.96:8485. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-08-27 10:23:19,487 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: yhbd02/10.171.81.96:8485. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-08-27 10:23:20,488 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: yhbd02/10.171.81.96:8485. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-08-27 10:23:21,486 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 6001 ms (timeout=20000 ms) for a response for startLogSegment(29706138). Succeeded so far: [10.171.82.67:8485]
2015-08-27 10:23:21,489 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: yhbd02/10.171.81.96:8485. Already tried 5 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-08-27 10:23:22,486 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 7002 ms (timeout=20000 ms) for a response for startLogSegment(29706138). Succeeded so far: [10.171.82.67:8485]
2015-08-27 10:23:22,490 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: yhbd02/10.171.81.96:8485. Already tried 6 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-08-27 10:23:23,337 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Rescanning after 30000 milliseconds
2015-08-27 10:23:23,487 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 8003 ms (timeout=20000 ms) for a response for startLogSegment(29706138). Succeeded so far: [10.171.82.67:8485]
2015-08-27 10:23:23,491 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: yhbd02/10.171.81.96:8485. Already tried 7 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-08-27 10:23:24,489 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 9005 ms (timeout=20000 ms) for a response for startLogSegment(29706138). Succeeded so far: [10.171.82.67:8485]
2015-08-27 10:23:24,492 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: yhbd02/10.171.81.96:8485. Already tried 8 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-08-27 10:23:25,490 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 10006 ms (timeout=20000 ms) for a response for startLogSegment(29706138). Succeeded so far: [10.171.82.67:8485]
2015-08-27 10:23:25,493 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: yhbd02/10.171.81.96:8485. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-08-27 10:23:26,491 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 11007 ms (timeout=20000 ms) for a response for startLogSegment(29706138). Succeeded so far: [10.171.82.67:8485]. Exceptions so far: [10.171.81.96:8485: Call From yhbd01/10.171.82.67 to yhbd02:8485 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused]
2015-08-27 10:23:27,492 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 12008 ms (timeout=20000 ms) for a response for startLogSegment(29706138). Succeeded so far: [10.171.82.67:8485]. Exceptions so far: [10.171.81.96:8485: Call From yhbd01/10.171.82.67 to yhbd02:8485 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused]
2015-08-27 10:23:28,493 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 13009 ms (timeout=20000 ms) for a response for startLogSegment(29706138). Succeeded so far: [10.171.82.67:8485]. Exceptions so far: [10.171.81.96:8485: Call From yhbd01/10.171.82.67 to yhbd02:8485 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused]
2015-08-27 10:23:29,494 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 14010 ms (timeout=20000 ms) for a response for startLogSegment(29706138). Succeeded so far: [10.171.82.67:8485]. Exceptions so far: [10.171.81.96:8485: Call From yhbd01/10.171.82.67 to yhbd02:8485 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused]
2015-08-27 10:23:30,495 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 15011 ms (timeout=20000 ms) for a response for startLogSegment(29706138). Succeeded so far: [10.171.82.67:8485]. Exceptions so far: [10.171.81.96:8485: Call From yhbd01/10.171.82.67 to yhbd02:8485 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused]
2015-08-27 10:23:31,496 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 16012 ms (timeout=20000 ms) for a response for startLogSegment(29706138). Succeeded so far: [10.171.82.67:8485]. Exceptions so far: [10.171.81.96:8485: Call From yhbd01/10.171.82.67 to yhbd02:8485 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused]
2015-08-27 10:23:32,497 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 17013 ms (timeout=20000 ms) for a response for startLogSegment(29706138). Succeeded so far: [10.171.82.67:8485]. Exceptions so far: [10.171.81.96:8485: Call From yhbd01/10.171.82.67 to yhbd02:8485 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused]
2015-08-27 10:23:33,499 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 18015 ms (timeout=20000 ms) for a response for startLogSegment(29706138). Succeeded so far: [10.171.82.67:8485]. Exceptions so far: [10.171.81.96:8485: Call From yhbd01/10.171.82.67 to yhbd02:8485 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused]
2015-08-27 10:23:34,500 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 19016 ms (timeout=20000 ms) for a response for startLogSegment(29706138). Succeeded so far: [10.171.82.67:8485]. Exceptions so far: [10.171.81.96:8485: Call From yhbd01/10.171.82.67 to yhbd02:8485 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused]
2015-08-27 10:23:35,484 FATAL org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: starting log segment 29706138 failed for required journal (JournalAndStream(mgr=QJM to [10.171.82.67:8485, 10.171.81.96:8485, 10.170.255.96:8485], stream=null))
java.io.IOException: Timed out waiting 20000ms for a quorum of nodes to respond.
        at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:137)
        at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.startLogSegment(QuorumJournalManager.java:403)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalAndStream.startLogSegment(JournalSet.java:107)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet$3.apply(JournalSet.java:222)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:393)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet.startLogSegment(JournalSet.java:219)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLog.startLogSegment(FSEditLog.java:1181)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLog.rollEditLog(FSEditLog.java:1150)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.rollEditLog(FSImage.java:1231)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.rollEditLog(FSNamesystem.java:6119)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.rollEditLog(NameNodeRpcServer.java:908)
        at org.apache.hadoop.hdfs.protocolPB.NamenodeProtocolServerSideTranslatorPB.rollEditLog(NamenodeProtocolServerSideTranslatorPB.java:139)
        at org.apache.hadoop.hdfs.protocol.proto.NamenodeProtocolProtos$NamenodeProtocolService$2.callBlockingMethod(NamenodeProtocolProtos.java:11214)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
2015-08-27 10:23:35,486 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1
2015-08-27 10:23:35,490 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at yhbd01/10.171.82.67
************************************************************/
2015-08-27 10:28:48,352 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = yhbd01/10.171.82.67
STARTUP_MSG:   args = []
STARTUP_MSG:   version = 2.5.0-cdh5.3.3


3: Root-cause analysis:
(1)
The JournalNode process on yhbd02 was not running.
When the NameNode merges the fsimage and edits files, it submits the resulting edit segments to the JournalNode cluster; since port 8485 on yhbd02 was unreachable the whole time, the NameNode's calls to the JournalNodes timed out and the NameNode shut down. (A quick check for this case is sketched after item (2) below.)
(2)
However, after the NameNode failover completed and all of the JournalNodes were healthy again, the job was resumed and the NameNode on yhbd02 went down as well.
Checking the logs, this was again a JournalNode access timeout, which is usually caused by a full GC: the NameNode ran a fairly long full GC at that moment, the write to the JournalNodes exceeded the timeout (20 s by default), and the NameNode process exited. So the mapred-site.xml parameters need to be adjusted.
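A quick way to confirm case (1) is to check the JournalNode on yhbd02 directly. A minimal sketch, assuming the CDH-packaged init script and standard tools are available:

# On yhbd02: is the JournalNode process alive?
sudo service hadoop-hdfs-journalnode status
jps | grep JournalNode

# Is anything listening on the JournalNode RPC port 8485?
netstat -lnpt | grep 8485

# From yhbd01: is the port reachable over the network?
telnet yhbd02 8485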

Some ideas for JVM tuning, based on what the JVM crash dump file showed:
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (malloc) failed to allocate 350748672 bytes for committing reserved memory.
# Possible reasons:
#   The system is out of physical RAM or swap space
#   In 32 bit mode, the process size limit was hit
# Possible solutions:
#   Reduce memory load on the system
#   Increase physical memory or swap space
#   Check if swap backing store is full
#   Use 64 bit Java on a 64 bit OS
#   Decrease Java heap size (-Xmx/-Xms)
#   Decrease number of Java threads
#   Decrease Java thread stack sizes (-Xss)
#   Set larger code cache with -XX:ReservedCodeCacheSize=
# This output file may be truncated or incomplete.
#
#  Out of Memory Error (os_linux.cpp:2809), pid=16666, tid=139882867275520
#
# JRE version:  (7.0_79-b14) (build )
# Java VM: OpenJDK 64-Bit Server VM (24.79-b02 mixed mode linux-amd64 compressed oops)
# Derivative: IcedTea 2.5.5
# Distribution: Built on CentOS release 6.6 (Final) (Wed Apr 15 00:02:21 UTC 2015)
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
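The dump's own suggestions translate into a few routine checks on the NameNode host; a sketch of what to run (nothing here is specific to this cluster):

# How much physical memory and swap are actually free?
free -m
swapon -s

# Enable core dumps before the next JVM start, as the dump output suggests
ulimit -c unlimited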


4: Fixes:
(1)
Start the JournalNode on yhbd02:
service hadoop-hdfs-journalnode start
and then start the NameNode:
service hadoop-hdfs-namenode start
(2)
Adjust the mapred-site.xml parameters:
  <property>
        <name>mapreduce.reduce.shuffle.memory.limit.percent</name>
        <value>0.13</value>
   </property>
  This was originally 0.25 and is now 0.13, reducing the shuffle phase's in-memory footprint so that more of the shuffle goes to local disk.
http://www.sqlparty.com/yarn%E5%9C%A8shuffle%E9%98%B6%E6%AE%B5%E5%86%85%E5%AD%98%E4%B8%8D%E8%B6%B3%E9%97%AE%E9%A2%98error-in-shuffle-in-fetcher/ 

Further digging showed that the amount of data assigned to each reducer was set too high here, so these lines were commented out and the job rerun:
#set hive.exec.reducers.bytes.per.reducer=1000000000;
#set mapred.min.split.size.per.node=300000000;
#set mapred.min.split.size.per.rack=300000000;

Tune the map and reduce memory parameters in mapred-site.xml:
mapreduce.map.memory.mb      // controls map task memory
mapreduce.reduce.memory.mb   // controls reduce task memory
(3)
http://caiguangguang.blog.51cto.com/1652935/1418895/ 

Follow-up fixes:

1) Increase the JournalNode write timeout

dfs.qjournal.write-txns.timeout.ms

2) Adjust the NameNode's Java parameters so that full GC is triggered earlier; that way each full GC is shorter.

3) By default the NameNode's full GC uses the parallel collector, which is fully stop-the-world; switch it to CMS. Adjust the NameNode's startup parameters:

-XX:+UseCompressedOops 
-XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSClassUnloadingEnabled 
-XX:+UseCMSCompactAtFullCollection -XX:CMSFullGCsBeforeCompaction=0 
-XX:+CMSParallelRemarkEnabled -XX:+DisableExplicitGC 
-XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=75
 -XX:SoftRefLRUPolicyMSPerMB=0
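In practice these flags are applied through the NameNode's JVM options. A minimal sketch of how they could go into hadoop-env.sh (the heap sizes are placeholders chosen for illustration, not this cluster's values):

# hadoop-env.sh -- NameNode JVM options; heap sizes below are illustrative only
export HADOOP_NAMENODE_OPTS="-Xms4g -Xmx4g \
  -XX:+UseCompressedOops \
  -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSClassUnloadingEnabled \
  -XX:+UseCMSCompactAtFullCollection -XX:CMSFullGCsBeforeCompaction=0 \
  -XX:+CMSParallelRemarkEnabled -XX:+DisableExplicitGC \
  -XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=75 \
  -XX:SoftRefLRUPolicyMSPerMB=0 \
  $HADOOP_NAMENODE_OPTS"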
And the JournalNode timeout settings that go with point 1), in hdfs-site.xml:

  <property>
    <name>dfs.qjournal.start-segment.timeout.ms</name>
    <value>600000000</value>
  </property>

  <property>
    <name>dfs.qjournal.prepare-recovery.timeout.ms</name>
    <value>600000000</value>
  </property>

  <property>
    <name>dfs.qjournal.accept-recovery.timeout.ms</name>
    <value>600000000</value>
  </property>

  <property>
    <name>dfs.qjournal.finalize-segment.timeout.ms</name>
    <value>600000000</value>
  </property>

  <property>
    <name>dfs.qjournal.select-input-streams.timeout.ms</name>
    <value>600000000</value>
  </property>

  <property>
    <name>dfs.qjournal.get-journal-state.timeout.ms</name>
    <value>600000000</value>
  </property>

  <property>
    <name>dfs.qjournal.new-epoch.timeout.ms</name>
    <value>600000000</value>
  </property>

  <property>
    <name>dfs.qjournal.write-txns.timeout.ms</name>
    <value>600000000</value>
  </property>

NameNode full GCs are too frequent, and each GC pauses the system for 3-4 minutes

  Because the NameNode holds all of its metadata in memory, pressure on the old generation is heavy, and every full GC freezes the system for several minutes. Fixes:
  (1) Use the CMS GC algorithm.
  (2) Change the young-to-old generation ratio to 1:2; at 1:4, large objects get promoted straight into the old generation during young GC, old-generation usage grows quickly, and full GCs become even more frequent. (A sketch follows below.)
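A minimal sketch of point (2), again via the NameNode's JVM options in hadoop-env.sh; -XX:NewRatio=2 means old = 2 x young, i.e. a 1:2 split (the flag placement and values are assumptions, not this cluster's settings; some JVMs size the young generation differently under CMS, so an explicit -Xmn is the more predictable knob):

# Young:old = 1:2
export HADOOP_NAMENODE_OPTS="-XX:NewRatio=2 $HADOOP_NAMENODE_OPTS"
# Or pin the young generation explicitly, e.g. -Xmn1g on a 4g heap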



5: Further reading:
  In Hadoop, edits and fsimage are two critically important files. edits records the namespace changes made since the latest checkpoint and acts as a journal, while fsimage (the HDFS metadata image file) holds the latest checkpoint itself. Neither file can be read with an ordinary text editor, but fortunately Hadoop ships dedicated tools for inspecting them, oev and oiv, which are invoked through the hdfs command.
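For example (a sketch; the edits file name below is simply the one that appears in the log above, and the fsimage name is a placeholder to replace with a real checkpoint file):

# Dump an fsimage checkpoint to readable XML
hdfs oiv -p XML -i /Data/cloudera/hadoop/dfs/name/current/fsimage_<txid> -o /tmp/fsimage.xml

# Dump a finalized edits segment to XML
hdfs oev -p xml -i /Data/cloudera/hadoop/dfs/name/current/edits_0000000000029706041-0000000000029706137 -o /tmp/edits.xml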

So the following can be used to study how Hadoop 2 implements HA through edits and fsimage:
http://www.iteblog.com/archives/974 

The steps can be summarized as follows:
  (1) Once HA is configured, every update from clients is written to the JournalNodes' shared directory, which is configured with:

<property>
  <name>dfs.namenode.shared.edits.dir</name>
  <value>qjournal://XXXX/mycluster</value>
</property>

<property>
  <name>dfs.journalnode.edits.dir</name>
  <value>/export1/hadoop2x/dfs/journal</value>
</property>

  (2) The Active NameNode and the Standby NameNode both sync edits from the JournalNodes' shared edits directory into their own edits directories;
  (3) The StandbyCheckpointer class in the Standby NameNode periodically checks whether the checkpoint conditions are met and, if so, merges the fsimage and edits files;
  (4) After the merge, the StandbyCheckpointer uploads the new fsimage to the corresponding directory on the Active NameNode;
  (5) Once the Active NameNode has received the latest fsimage, it cleans up the old fsimage and edits files;
  (6) Through these steps the fsimage/edits merge is complete, and thanks to the HA mechanism both the Standby and the Active NameNode end up holding the latest fsimage and edits (unlike the SecondaryNameNode in Hadoop 1.x, whose fsimage and edits were not up to date).


Because the Active NameNode has moved to yhbd02 while the active ResourceManager is still on yhbd01, the NameNode and the ResourceManager now sit on separate machines, and performance is noticeably better than before, since both processes need a lot of resources.
For example, submitted jobs get more sensible resource allocations, and containers are distributed more evenly across the cluster.




Start of the tuning work:
1: modified mapred-site.xml
2: modified hadoop-env.sh

For the full GC problem, the approach is:
Break the user's full-load statement into per-year/per-month chunks and run the data in pieces.
Lengthen the JournalNode timeouts.

Increase the JournalNode write timeout:

dfs.qjournal.write-txns.timeout.ms
http://blog.csdn.net/qq1010885678/article/details/46673079 
After adjusting it, restart the JournalNode cluster (sketched below).
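A restart sketch, assuming the CDH init scripts and that the three JournalNodes are yhbd01, yhbd02, and the host at 10.170.255.96 (whose hostname does not appear in the log):

for host in yhbd01 yhbd02 10.170.255.96; do
  ssh "$host" "sudo service hadoop-hdfs-journalnode restart"
done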
After that, the NameNode no longer fails because it cannot reach the JournalNode cluster; instead it fails with the following:
IPC Server handler 6 on 8020, call org.apache.hadoop.hdfs.server.protocol.NamenodeProtocol.rollEditLog from 10.171.81.96:15527 Call#19 Retry#0: output error
After failing over to the second NameNode it runs stably!
Continue tuning the JVM.


Errors seen after the JournalNode timeouts were increased:
2015-09-02 00:13:03,408 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Starting log segment at 31631554
2015-09-02 00:13:04,166 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Rescanning after 30000 milliseconds
2015-09-02 00:13:04,166 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Scanned 0 directive(s) and 0 block(s) in 0 millisecond(s).
2015-09-02 00:13:34,166 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Rescanning after 30000 milliseconds
2015-09-02 00:13:34,167 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Scanned 0 directive(s) and 0 block(s) in 1 millisecond(s).
2015-09-02 00:14:04,166 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Rescanning after 30000 milliseconds
2015-09-02 00:14:04,167 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Scanned 0 directive(s) and 0 block(s) in 1 millisecond(s).
2015-09-02 00:14:34,166 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Rescanning after 30000 milliseconds
2015-09-02 00:14:34,167 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Scanned 0 directive(s) and 0 block(s) in 1 millisecond(s).
2015-09-02 00:15:04,049 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Roll Edit Log from 10.171.81.96
2015-09-02 00:15:04,050 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Rolling edit logs
2015-09-02 00:15:04,050 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Ending log segment 31631554
2015-09-02 00:15:04,050 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions: 76 Total time for transactions(ms): 6 Number of transactions batched in Syncs: 0 Number of syncs: 1 SyncTimes(ms): 4 54
2015-09-02 00:15:04,121 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions: 76 Total time for transactions(ms): 6 Number of transactions batched in Syncs: 0 Number of syncs: 2 SyncTimes(ms): 73 56
2015-09-02 00:15:04,166 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Rescanning after 30000 milliseconds
2015-09-02 00:15:04,167 INFO org.apache.hadoop.hdfs.server.namenode.FileJournalManager: Finalizing edits file /Data/cloudera/hadoop/dfs/name/current/edits_inprogress_0000000000031631554 -> /Data/cloudera/hadoop/dfs/name/current/edits_0000000000031631554-0000000000031631629
2015-09-02 00:15:04,167 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Starting log segment at 31631630
2015-09-02 00:16:28,744 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions: 1 Total time for transactions(ms): 0 Number of transactions batched in Syncs: 0 Number of syncs: 0 SyncTimes(ms): 0 30
2015-09-02 00:16:28,944 WARN org.apache.hadoop.ipc.Server: IPC Server handler 6 on 8020, call org.apache.hadoop.hdfs.server.protocol.NamenodeProtocol.rollEditLog from 10.171.81.96:15527 Call#19 Retry#0: output error
2015-09-02 00:16:28,944 WARN org.apache.hadoop.ipc.Server: IPC Server handler 7 on 8020, call org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 10.172.7.24:36833 Call#486892 Retry#0: output error
2015-09-02 00:16:28,944 INFO org.apache.hadoop.ipc.Server: IPC Server handler 7 on 8020 caught an exception
java.nio.channels.ClosedChannelException
     at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:265)
     at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:474)
     at org.apache.hadoop.ipc.Server.channelWrite(Server.java:2534)
     at org.apache.hadoop.ipc.Server.access$1900(Server.java:130)
     at org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:965)
     at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:1030)
     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2068)
2015-09-02 00:16:28,944 INFO org.apache.hadoop.ipc.Server: IPC Server handler 7 on 8020: skipped org.apache.hadoop.ha.HAServiceProtocol.getServiceStatus from 10.171.82.67:48509 Call#1099647 Retry#0
2015-09-02 00:16:28,944 WARN org.apache.hadoop.ipc.Server: IPC Server handler 5 on 8020, call org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 10.165.115.190:56256 Call#464826 Retry#0: output error
2015-09-02 00:16:28,945 INFO org.apache.hadoop.ipc.Server: IPC Server handler 5 on 8020 caught an exception
java.nio.channels.ClosedChannelException
     at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:265)
     at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:474)
     at org.apache.hadoop.ipc.Server.channelWrite(Server.java:2534)
     at org.apache.hadoop.ipc.Server.access$1900(Server.java:130)
     at org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:965)
     at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:1030)
     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2068)
2015-09-02 00:16:28,944 INFO org.apache.hadoop.ipc.Server: IPC Server handler 6 on 8020 caught an exception
java.nio.channels.ClosedChannelException
     at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:265)
     at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:474)
     at org.apache.hadoop.ipc.Server.channelWrite(Server.java:2534)
     at org.apache.hadoop.ipc.Server.access$1900(Server.java:130)
     at org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:965)
     at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:1030)
     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2068)
2015-09-02 00:16:28,946 INFO org.apache.hadoop.ipc.Server: IPC Server handler 2 on 8020: skipped org.apache.hadoop.ha.HAServiceProtocol.transitionToStandby from 10.171.81.96:15538 Call#1238456 Retry#0
2015-09-02 00:16:28,946 INFO org.apache.hadoop.ipc.Server: IPC Server handler 7 on 8020: skipped org.apache.hadoop.ha.HAServiceProtocol.transitionToStandby from 10.171.81.96:15554 Call#1238459 Retry#0
2015-09-02 00:16:28,944 WARN org.apache.hadoop.ipc.Server: IPC Server handler 9 on 8020, call org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 10.170.255.188:58665 Call#495767 Retry#0: output error
2015-09-02 00:16:28,946 INFO org.apache.hadoop.ipc.Server: IPC Server handler 9 on 8020 caught an exception
java.nio.channels.ClosedChannelException
     at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:265)
     at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:474)
     at org.apache.hadoop.ipc.Server.channelWrite(Server.java:2534)
     at org.apache.hadoop.ipc.Server.access$1900(Server.java:130)
     at org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:965)
     at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:1030)
     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2068)
2015-09-02 00:16:28,947 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Stopping services started for active state
2015-09-02 00:16:28,946 INFO org.apache.hadoop.ipc.Server: IPC Server handler 0 on 8020: skipped org.apache.hadoop.ha.HAServiceProtocol.transitionToStandby from 10.171.81.96:15561 Call#1238462 Retry#0
2015-09-02 00:16:28,946 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Scanned 0 directive(s) and 0 block(s) in 84780 millisecond(s).
2015-09-02 00:16:28,944 WARN org.apache.hadoop.ipc.Server: IPC Server handler 3 on 8020, call org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 10.170.255.96:15952 Call#452032 Retry#0: output error
2015-09-02 00:16:28,948 INFO org.apache.hadoop.ipc.Server: IPC Server handler 3 on 8020 caught an exception


Tracking down the cause:
step 1:
 org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Rescanning after 30000 milliseconds
Source location:
the org.apache.hadoop.hdfs.server.blockmanagement package
Description:
Scans the namesystem, scheduling blocks to be cached as appropriate.
The CacheReplicationMonitor does a full scan when the NameNode first starts up, and at configurable intervals afterwards.
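The rescan interval mentioned above is configurable. To the best of my knowledge it is controlled by the following hdfs-site.xml property (the property name and its 30-second default are from memory, so treat this as an assumption and check hdfs-default.xml for your version):

<!-- Assumed property controlling the CacheReplicationMonitor rescan interval -->
<property>
  <name>dfs.namenode.path.based.cache.refresh.interval.ms</name>
  <value>30000</value>
</property>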

step 2: this is where the errors start!
org.apache.hadoop.ipc.Server: IPC Server handler 4 on 8020, call org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 10.170.255.188:53949 Call#525896 Retry#0: output error


After the yarn-site.xml and mapred-site.xml parameters were changed, the NameNode stopped going down!


Modified yarn-site.xml:
      <property>
               <description>The minimum allocation for every container request at the RM,
                                                    in MBs. Memory requests lower than this won't take effect,
                   and the specified value will get allocated at minimum.</description>
               <name>yarn.scheduler.minimum-allocation-mb</name>
               <value>1500</value>
        </property>

        <property>
               <description>The maximum allocation for every container request at the RM,
                  in MBs. Memory requests higher than this won't take effect,
                   and will get capped to this value.</description>
               <name>yarn.scheduler.maximum-allocation-mb</name>
               <value>8192</value>
        </property>

mapred-site.xml
         <property>
        <name>mapreduce.map.memory.mb</name>
        <value>1500</value>
        </property>

        <property>
        <name>mapreduce.map.java.opts</name>
        <value>-Xmx1024M</value>
        <description>Larger heap-size for child jvms of maps.</description>
        </property>

        <property>
        <name>mapreduce.reduce.memory.mb</name>
        <value>2100</value>
        </property>

        <property>
        <name>mapreduce.reduce.java.opts</name>
        <value>-Xmx1900M</value>
        <description>Larger heap-size for child jvms of reduces.</description>
        </property>

        <property>
                <name>mapreduce.job.counters.limit</name>
                <value>12000</value>
        </property>

Final summary:
When you hit this kind of failure, work on it from three angles:
(1) the JournalNode timeouts;
(2) JVM parameter tuning;
(3) tuning the yarn-site.xml and mapred-site.xml configuration files.