A Hive Job Failure Causes the NameNode on yhbd01 to Shut Down


Reposted from: http://rocketeer.leanote.com/post/Hive-Job-%E6%95%85%E9%9A%9C%EF%BC%8C%E5%AF%BC%E8%87%B4yhbd01-Namenode-Shutdown

This is a write-up from when I had just joined 有货 and started tuning the cluster. I think the line of reasoning is useful for anyone new to cluster tuning, so I am recording it here.

Summary:

When you hit this kind of failure, work on it from three angles:
(1) the JournalNode timeouts;
(2) JVM parameter tuning;
(3) tuning the yarn-site.xml and mapred-site.xml configuration files.

 

 

1: The job:

/datacenter/shell/user/full/bdl_social_test1.sh
While running the full (historical) data load, the NameNode went down!

2: Error log:
/var/log/hadoop-hdfs/hadoop-hdfs-namenode-yhbd01.log
Contents:
2015-08-27 10:23:04,392 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: yhbd02/10.171.81.96:8485. Already tried 7 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-08-27 10:23:05,393 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: yhbd02/10.171.81.96:8485. Already tried 8 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-08-27 10:23:06,393 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: yhbd02/10.171.81.96:8485. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-08-27 10:23:15,483 INFO org.apache.hadoop.hdfs.server.namenode.FileJournalManager: Finalizing edits file /Data/cloudera/hadoop/dfs/name/current/edits_inprogress_0000000000029706041 -> /Data/cloudera/hadoop/dfs/name/current/edits_0000000000029706041-0000000000029706137
2015-08-27 10:23:15,484 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Starting log segment at 29706138
2015-08-27 10:23:16,485 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: yhbd02/10.171.81.96:8485. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-08-27 10:23:17,486 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: yhbd02/10.171.81.96:8485. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-08-27 10:23:18,486 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: yhbd02/10.171.81.96:8485. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-08-27 10:23:19,487 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: yhbd02/10.171.81.96:8485. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-08-27 10:23:20,488 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: yhbd02/10.171.81.96:8485. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-08-27 10:23:21,486 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 6001 ms (timeout=20000 ms) for a response for startLogSegment(29706138). Succeeded so far: [10.171.82.67:8485]
2015-08-27 10:23:21,489 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: yhbd02/10.171.81.96:8485. Already tried 5 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-08-27 10:23:22,486 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 7002 ms (timeout=20000 ms) for a response for startLogSegment(29706138). Succeeded so far: [10.171.82.67:8485]
2015-08-27 10:23:22,490 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: yhbd02/10.171.81.96:8485. Already tried 6 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-08-27 10:23:23,337 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Rescanning after 30000 milliseconds
2015-08-27 10:23:23,487 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 8003 ms (timeout=20000 ms) for a response for startLogSegment(29706138). Succeeded so far: [10.171.82.67:8485]
2015-08-27 10:23:23,491 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: yhbd02/10.171.81.96:8485. Already tried 7 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-08-27 10:23:24,489 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 9005 ms (timeout=20000 ms) for a response for startLogSegment(29706138). Succeeded so far: [10.171.82.67:8485]
2015-08-27 10:23:24,492 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: yhbd02/10.171.81.96:8485. Already tried 8 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-08-27 10:23:25,490 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 10006 ms (timeout=20000 ms) for a response for startLogSegment(29706138). Succeeded so far: [10.171.82.67:8485]
2015-08-27 10:23:25,493 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: yhbd02/10.171.81.96:8485. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-08-27 10:23:26,491 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 11007 ms (timeout=20000 ms) for a response for startLogSegment(29706138). Succeeded so far: [10.171.82.67:8485]. Exceptions so far: [10.171.81.96:8485: Call From yhbd01/10.171.82.67 to yhbd02:8485 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused]
2015-08-27 10:23:27,492 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 12008 ms (timeout=20000 ms) for a response for startLogSegment(29706138). Succeeded so far: [10.171.82.67:8485]. Exceptions so far: [10.171.81.96:8485: Call From yhbd01/10.171.82.67 to yhbd02:8485 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused]
2015-08-27 10:23:28,493 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 13009 ms (timeout=20000 ms) for a response for startLogSegment(29706138). Succeeded so far: [10.171.82.67:8485]. Exceptions so far: [10.171.81.96:8485: Call From yhbd01/10.171.82.67 to yhbd02:8485 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused]
2015-08-27 10:23:29,494 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 14010 ms (timeout=20000 ms) for a response for startLogSegment(29706138). Succeeded so far: [10.171.82.67:8485]. Exceptions so far: [10.171.81.96:8485: Call From yhbd01/10.171.82.67 to yhbd02:8485 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused]
2015-08-27 10:23:30,495 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 15011 ms (timeout=20000 ms) for a response for startLogSegment(29706138). Succeeded so far: [10.171.82.67:8485]. Exceptions so far: [10.171.81.96:8485: Call From yhbd01/10.171.82.67 to yhbd02:8485 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused]
2015-08-27 10:23:31,496 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 16012 ms (timeout=20000 ms) for a response for startLogSegment(29706138). Succeeded so far: [10.171.82.67:8485]. Exceptions so far: [10.171.81.96:8485: Call From yhbd01/10.171.82.67 to yhbd02:8485 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused]
2015-08-27 10:23:32,497 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 17013 ms (timeout=20000 ms) for a response for startLogSegment(29706138). Succeeded so far: [10.171.82.67:8485]. Exceptions so far: [10.171.81.96:8485: Call From yhbd01/10.171.82.67 to yhbd02:8485 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused]
2015-08-27 10:23:33,499 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 18015 ms (timeout=20000 ms) for a response for startLogSegment(29706138). Succeeded so far: [10.171.82.67:8485]. Exceptions so far: [10.171.81.96:8485: Call From yhbd01/10.171.82.67 to yhbd02:8485 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused]
2015-08-27 10:23:34,500 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 19016 ms (timeout=20000 ms) for a response for startLogSegment(29706138). Succeeded so far: [10.171.82.67:8485]. Exceptions so far: [10.171.81.96:8485: Call From yhbd01/10.171.82.67 to yhbd02:8485 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused]
2015-08-27 10:23:35,484 FATAL org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: starting log segment 29706138 failed for required journal (JournalAndStream(mgr=QJM to [10.171.82.67:8485, 10.171.81.96:8485, 10.170.255.96:8485], stream=null))
java.io.IOException: Timed out waiting 20000ms for a quorum of nodes to respond.
        at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:137)
        at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.startLogSegment(QuorumJournalManager.java:403)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalAndStream.startLogSegment(JournalSet.java:107)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet$3.apply(JournalSet.java:222)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:393)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet.startLogSegment(JournalSet.java:219)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLog.startLogSegment(FSEditLog.java:1181)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLog.rollEditLog(FSEditLog.java:1150)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.rollEditLog(FSImage.java:1231)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.rollEditLog(FSNamesystem.java:6119)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.rollEditLog(NameNodeRpcServer.java:908)
        at org.apache.hadoop.hdfs.protocolPB.NamenodeProtocolServerSideTranslatorPB.rollEditLog(NamenodeProtocolServerSideTranslatorPB.java:139)
        at org.apache.hadoop.hdfs.protocol.proto.NamenodeProtocolProtos$NamenodeProtocolService$2.callBlockingMethod(NamenodeProtocolProtos.java:11214)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
2015-08-27 10:23:35,486 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1
2015-08-27 10:23:35,490 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at yhbd01/10.171.82.67
************************************************************/
2015-08-27 10:28:48,352 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = yhbd01/10.171.82.67
STARTUP_MSG:   args = []
STARTUP_MSG:   version = 2.5.0-cdh5.3.3


3: Root-cause analysis:
(1)
The JournalNode process on yhbd02 was not running.
When the NameNode merges the fsimage and edits files, it submits the resulting edit segments to the JournalNode cluster; since port 8485 on yhbd02 was unreachable the whole time, the NameNode's calls to the JournalNodes timed out and the NameNode shut down. (A quick check for this case is sketched after item (2) below.)
(2)
However, after the NameNode failover completed and all of the JournalNodes were healthy again, the job was resumed and the NameNode on yhbd02 went down as well.
Checking the logs, this was again a JournalNode access timeout, which is usually caused by a full GC: the NameNode ran a fairly long full GC at that moment, the write to the JournalNodes exceeded the timeout (20 s by default), and the NameNode process exited. So the mapred-site.xml parameters need to be adjusted.
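A quick way to confirm case (1) is to check the JournalNode on yhbd02 directly. A minimal sketch, assuming the CDH-packaged init script and standard tools are available:

# On yhbd02: is the JournalNode process alive?
sudo service hadoop-hdfs-journalnode status
jps | grep JournalNode

# Is anything listening on the JournalNode RPC port 8485?
netstat -lnpt | grep 8485

# From yhbd01: is the port reachable over the network?
telnet yhbd02 8485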

Some ideas for JVM tuning, based on what the JVM crash dump file showed:
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (malloc) failed to allocate 350748672 bytes for committing reserved memory.
# Possible reasons:
#   The system is out of physical RAM or swap space
#   In 32 bit mode, the process size limit was hit
# Possible solutions:
#   Reduce memory load on the system
#   Increase physical memory or swap space
#   Check if swap backing store is full
#   Use 64 bit Java on a 64 bit OS
#   Decrease Java heap size (-Xmx/-Xms)
#   Decrease number of Java threads
#   Decrease Java thread stack sizes (-Xss)
#   Set larger code cache with -XX:ReservedCodeCacheSize=
# This output file may be truncated or incomplete.
#
#  Out of Memory Error (os_linux.cpp:2809), pid=16666, tid=139882867275520
#
# JRE version:  (7.0_79-b14) (build )
# Java VM: OpenJDK 64-Bit Server VM (24.79-b02 mixed mode linux-amd64 compressed oops)
# Derivative: IcedTea 2.5.5
# Distribution: Built on CentOS release 6.6 (Final) (Wed Apr 15 00:02:21 UTC 2015)
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
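The dump's own suggestions translate into a few routine checks on the NameNode host; a sketch of what to run (nothing here is specific to this cluster):

# How much physical memory and swap are actually free?
free -m
swapon -s

# Enable core dumps before the next JVM start, as the dump output suggests
ulimit -c unlimited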


4: Fixes:
(1)
Start the JournalNode on yhbd02:
service hadoop-hdfs-journalnode start
and then start the NameNode:
service hadoop-hdfs-namenode start
(2)
Adjust the mapred-site.xml parameters:
  <property>
        <name>mapreduce.reduce.shuffle.memory.limit.percent</name>
        <value>0.13</value>
   </property>
  This was originally 0.25 and is now 0.13, reducing the shuffle phase's in-memory footprint so that more of the shuffle goes to local disk.
http://www.sqlparty.com/yarn%E5%9C%A8shuffle%E9%98%B6%E6%AE%B5%E5%86%85%E5%AD%98%E4%B8%8D%E8%B6%B3%E9%97%AE%E9%A2%98error-in-shuffle-in-fetcher/ 

Further digging showed that the amount of data assigned to each reducer was set too high here, so these lines were commented out and the job rerun:
#set hive.exec.reducers.bytes.per.reducer=1000000000;
#set mapred.min.split.size.per.node=300000000;
#set mapred.min.split.size.per.rack=300000000;

Tune the map and reduce memory parameters in mapred-site.xml:
mapreduce.map.memory.mb      // controls map task memory
mapreduce.reduce.memory.mb   // controls reduce task memory
(3)
http://caiguangguang.blog.51cto.com/1652935/1418895/ 

Follow-up fixes:

1) Increase the JournalNode write timeout

dfs.qjournal.write-txns.timeout.ms

2) Adjust the NameNode's Java parameters so that full GC is triggered earlier; that way each full GC is shorter.

3) By default the NameNode's full GC uses the parallel collector, which is fully stop-the-world; switch it to CMS. Adjust the NameNode's startup parameters:

-XX:+UseCompressedOops 
-XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSClassUnloadingEnabled 
-XX:+UseCMSCompactAtFullCollection -XX:CMSFullGCsBeforeCompaction=0 
-XX:+CMSParallelRemarkEnabled -XX:+DisableExplicitGC 
-XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=75
 -XX:SoftRefLRUPolicyMSPerMB=0
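In practice these flags are applied through the NameNode's JVM options. A minimal sketch of how they could go into hadoop-env.sh (the heap sizes are placeholders chosen for illustration, not this cluster's values):

# hadoop-env.sh -- NameNode JVM options; heap sizes below are illustrative only
export HADOOP_NAMENODE_OPTS="-Xms4g -Xmx4g \
  -XX:+UseCompressedOops \
  -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSClassUnloadingEnabled \
  -XX:+UseCMSCompactAtFullCollection -XX:CMSFullGCsBeforeCompaction=0 \
  -XX:+CMSParallelRemarkEnabled -XX:+DisableExplicitGC \
  -XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=75 \
  -XX:SoftRefLRUPolicyMSPerMB=0 \
  $HADOOP_NAMENODE_OPTS"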
And the JournalNode timeout settings that go with point 1), in hdfs-site.xml:

  <property>
    <name>dfs.qjournal.start-segment.timeout.ms</name>
    <value>600000000</value>
  </property>

  <property>
    <name>dfs.qjournal.prepare-recovery.timeout.ms</name>
    <value>600000000</value>
  </property>

  <property>
    <name>dfs.qjournal.accept-recovery.timeout.ms</name>
    <value>600000000</value>
  </property>

  <property>
    <name>dfs.qjournal.finalize-segment.timeout.ms</name>
    <value>600000000</value>
  </property>

  <property>
    <name>dfs.qjournal.select-input-streams.timeout.ms</name>
    <value>600000000</value>
  </property>

  <property>
    <name>dfs.qjournal.get-journal-state.timeout.ms</name>
    <value>600000000</value>
  </property>

  <property>
    <name>dfs.qjournal.new-epoch.timeout.ms</name>
    <value>600000000</value>
  </property>

  <property>
    <name>dfs.qjournal.write-txns.timeout.ms</name>
    <value>600000000</value>
  </property>

NameNode full GCs are too frequent, and each GC pauses the system for 3-4 minutes

  Because the NameNode holds all of its metadata in memory, pressure on the old generation is heavy, and every full GC freezes the system for several minutes. Fixes:
  (1) Use the CMS GC algorithm.
  (2) Change the young-to-old generation ratio to 1:2; at 1:4, large objects get promoted straight into the old generation during young GC, old-generation usage grows quickly, and full GCs become even more frequent. (A sketch follows below.)
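A minimal sketch of point (2), again via the NameNode's JVM options in hadoop-env.sh; -XX:NewRatio=2 means old = 2 x young, i.e. a 1:2 split (the flag placement and values are assumptions, not this cluster's settings; some JVMs size the young generation differently under CMS, so an explicit -Xmn is the more predictable knob):

# Young:old = 1:2
export HADOOP_NAMENODE_OPTS="-XX:NewRatio=2 $HADOOP_NAMENODE_OPTS"
# Or pin the young generation explicitly, e.g. -Xmn1g on a 4g heap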



5: Further reading:
  In Hadoop, edits and fsimage are two critically important files. edits records the namespace changes made since the latest checkpoint and acts as a journal, while fsimage (the HDFS metadata image file) holds the latest checkpoint itself. Neither file can be read with an ordinary text editor, but fortunately Hadoop ships dedicated tools for inspecting them, oev and oiv, which are invoked through the hdfs command.
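For example (a sketch; the edits file name below is simply the one that appears in the log above, and the fsimage name is a placeholder to replace with a real checkpoint file):

# Dump an fsimage checkpoint to readable XML
hdfs oiv -p XML -i /Data/cloudera/hadoop/dfs/name/current/fsimage_<txid> -o /tmp/fsimage.xml

# Dump a finalized edits segment to XML
hdfs oev -p xml -i /Data/cloudera/hadoop/dfs/name/current/edits_0000000000029706041-0000000000029706137 -o /tmp/edits.xml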

So the following can be used to study how Hadoop 2 implements HA through edits and fsimage:
http://www.iteblog.com/archives/974 

The steps can be summarized as follows:
  (1) Once HA is configured, every update from clients is written to the JournalNodes' shared directory, which is configured with:

<property>
  <name>dfs.namenode.shared.edits.dir</name>
  <value>qjournal://XXXX/mycluster</value>
</property>

<property>
  <name>dfs.journalnode.edits.dir</name>
  <value>/export1/hadoop2x/dfs/journal</value>
</property>

  (2) The Active NameNode and the Standby NameNode both sync edits from the JournalNodes' shared edits directory into their own edits directories;
  (3) The StandbyCheckpointer class in the Standby NameNode periodically checks whether the checkpoint conditions are met and, if so, merges the fsimage and edits files;
  (4) After the merge, the StandbyCheckpointer uploads the new fsimage to the corresponding directory on the Active NameNode;
  (5) Once the Active NameNode has received the latest fsimage, it cleans up the old fsimage and edits files;
  (6) Through these steps the fsimage/edits merge is complete, and thanks to the HA mechanism both the Standby and the Active NameNode end up holding the latest fsimage and edits (unlike the SecondaryNameNode in Hadoop 1.x, whose fsimage and edits were not up to date).


Because the Active NameNode has moved to yhbd02 while the active ResourceManager is still on yhbd01, the NameNode and the ResourceManager now sit on separate machines, and performance is noticeably better than before, since both processes need a lot of resources.
For example, submitted jobs get more sensible resource allocations, and containers are distributed more evenly across the cluster.




Start of the tuning work:
1: modified mapred-site.xml
2: modified hadoop-env.sh

For the full GC problem, the approach is:
Break the user's full-load statement into per-year/per-month chunks and run the data in pieces.
Lengthen the JournalNode timeouts.

Increase the JournalNode write timeout:

dfs.qjournal.write-txns.timeout.ms
http://blog.csdn.net/qq1010885678/article/details/46673079 
After adjusting it, restart the JournalNode cluster (sketched below).
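A restart sketch, assuming the CDH init scripts and that the three JournalNodes are yhbd01, yhbd02, and the host at 10.170.255.96 (whose hostname does not appear in the log):

for host in yhbd01 yhbd02 10.170.255.96; do
  ssh "$host" "sudo service hadoop-hdfs-journalnode restart"
done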
After that, the NameNode no longer fails because it cannot reach the JournalNode cluster; instead it fails with the following:
IPC Server handler 6 on 8020, call org.apache.hadoop.hdfs.server.protocol.NamenodeProtocol.rollEditLog from 10.171.81.96:15527 Call#19 Retry#0: output error
After failing over to the second NameNode it runs stably!
Continue tuning the JVM.


Errors seen after the JournalNode timeouts were increased:
2015-09-02 00:13:03,408 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Starting log segment at 31631554
2015-09-02 00:13:04,166 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Rescanning after 30000 milliseconds
2015-09-02 00:13:04,166 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Scanned 0 directive(s) and 0 block(s) in 0 millisecond(s).
2015-09-02 00:13:34,166 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Rescanning after 30000 milliseconds
2015-09-02 00:13:34,167 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Scanned 0 directive(s) and 0 block(s) in 1 millisecond(s).
2015-09-02 00:14:04,166 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Rescanning after 30000 milliseconds
2015-09-02 00:14:04,167 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Scanned 0 directive(s) and 0 block(s) in 1 millisecond(s).
2015-09-02 00:14:34,166 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Rescanning after 30000 milliseconds
2015-09-02 00:14:34,167 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Scanned 0 directive(s) and 0 block(s) in 1 millisecond(s).
2015-09-02 00:15:04,049 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Roll Edit Log from 10.171.81.96
2015-09-02 00:15:04,050 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Rolling edit logs
2015-09-02 00:15:04,050 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Ending log segment 31631554
2015-09-02 00:15:04,050 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions: 76 Total time for transactions(ms): 6 Number of transactions batched in Syncs: 0 Number of syncs: 1 SyncTimes(ms): 4 54
2015-09-02 00:15:04,121 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions: 76 Total time for transactions(ms): 6 Number of transactions batched in Syncs: 0 Number of syncs: 2 SyncTimes(ms): 73 56
2015-09-02 00:15:04,166 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Rescanning after 30000 milliseconds
2015-09-02 00:15:04,167 INFO org.apache.hadoop.hdfs.server.namenode.FileJournalManager: Finalizing edits file /Data/cloudera/hadoop/dfs/name/current/edits_inprogress_0000000000031631554 -> /Data/cloudera/hadoop/dfs/name/current/edits_0000000000031631554-0000000000031631629
2015-09-02 00:15:04,167 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Starting log segment at 31631630
2015-09-02 00:16:28,744 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions: 1 Total time for transactions(ms): 0 Number of transactions batched in Syncs: 0 Number of syncs: 0 SyncTimes(ms): 0 30
2015-09-02 00:16:28,944 WARN org.apache.hadoop.ipc.Server: IPC Server handler 6 on 8020, call org.apache.hadoop.hdfs.server.protocol.NamenodeProtocol.rollEditLog from 10.171.81.96:15527 Call#19 Retry#0: output error
2015-09-02 00:16:28,944 WARN org.apache.hadoop.ipc.Server: IPC Server handler 7 on 8020, call org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 10.172.7.24:36833 Call#486892 Retry#0: output error
2015-09-02 00:16:28,944 INFO org.apache.hadoop.ipc.Server: IPC Server handler 7 on 8020 caught an exception
java.nio.channels.ClosedChannelException
     at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:265)
     at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:474)
     at org.apache.hadoop.ipc.Server.channelWrite(Server.java:2534)
     at org.apache.hadoop.ipc.Server.access$1900(Server.java:130)
     at org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:965)
     at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:1030)
     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2068)
2015-09-02 00:16:28,944 INFO org.apache.hadoop.ipc.Server: IPC Server handler 7 on 8020: skipped org.apache.hadoop.ha.HAServiceProtocol.getServiceStatus from 10.171.82.67:48509 Call#1099647 Retry#0
2015-09-02 00:16:28,944 WARN org.apache.hadoop.ipc.Server: IPC Server handler 5 on 8020, call org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 10.165.115.190:56256 Call#464826 Retry#0: output error
2015-09-02 00:16:28,945 INFO org.apache.hadoop.ipc.Server: IPC Server handler 5 on 8020 caught an exception
java.nio.channels.ClosedChannelException
     at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:265)
     at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:474)
     at org.apache.hadoop.ipc.Server.channelWrite(Server.java:2534)
     at org.apache.hadoop.ipc.Server.access$1900(Server.java:130)
     at org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:965)
     at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:1030)
     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2068)
2015-09-02 00:16:28,944 INFO org.apache.hadoop.ipc.Server: IPC Server handler 6 on 8020 caught an exception
java.nio.channels.ClosedChannelException
     at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:265)
     at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:474)
     at org.apache.hadoop.ipc.Server.channelWrite(Server.java:2534)
     at org.apache.hadoop.ipc.Server.access$1900(Server.java:130)
     at org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:965)
     at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:1030)
     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2068)
2015-09-02 00:16:28,946 INFO org.apache.hadoop.ipc.Server: IPC Server handler 2 on 8020: skipped org.apache.hadoop.ha.HAServiceProtocol.transitionToStandby from 10.171.81.96:15538 Call#1238456 Retry#0
2015-09-02 00:16:28,946 INFO org.apache.hadoop.ipc.Server: IPC Server handler 7 on 8020: skipped org.apache.hadoop.ha.HAServiceProtocol.transitionToStandby from 10.171.81.96:15554 Call#1238459 Retry#0
2015-09-02 00:16:28,944 WARN org.apache.hadoop.ipc.Server: IPC Server handler 9 on 8020, call org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 10.170.255.188:58665 Call#495767 Retry#0: output error
2015-09-02 00:16:28,946 INFO org.apache.hadoop.ipc.Server: IPC Server handler 9 on 8020 caught an exception
java.nio.channels.ClosedChannelException
     at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:265)
     at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:474)
     at org.apache.hadoop.ipc.Server.channelWrite(Server.java:2534)
     at org.apache.hadoop.ipc.Server.access$1900(Server.java:130)
     at org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:965)
     at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:1030)
     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2068)
2015-09-02 00:16:28,947 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Stopping services started for active state
2015-09-02 00:16:28,946 INFO org.apache.hadoop.ipc.Server: IPC Server handler 0 on 8020: skipped org.apache.hadoop.ha.HAServiceProtocol.transitionToStandby from 10.171.81.96:15561 Call#1238462 Retry#0
2015-09-02 00:16:28,946 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Scanned 0 directive(s) and 0 block(s) in 84780 millisecond(s).
2015-09-02 00:16:28,944 WARN org.apache.hadoop.ipc.Server: IPC Server handler 3 on 8020, call org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 10.170.255.96:15952 Call#452032 Retry#0: output error
2015-09-02 00:16:28,948 INFO org.apache.hadoop.ipc.Server: IPC Server handler 3 on 8020 caught an exception


Tracking down the cause:
step 1:
 org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Rescanning after 30000 milliseconds
Source location:
the org.apache.hadoop.hdfs.server.blockmanagement package
Description:
Scans the namesystem, scheduling blocks to be cached as appropriate.
The CacheReplicationMonitor does a full scan when the NameNode first starts up, and at configurable intervals afterwards.
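The rescan interval mentioned above is configurable. To the best of my knowledge it is controlled by the following hdfs-site.xml property (the property name and its 30-second default are from memory, so treat this as an assumption and check hdfs-default.xml for your version):

<!-- Assumed property controlling the CacheReplicationMonitor rescan interval -->
<property>
  <name>dfs.namenode.path.based.cache.refresh.interval.ms</name>
  <value>30000</value>
</property>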

step 2: this is where the errors start!
org.apache.hadoop.ipc.Server: IPC Server handler 4 on 8020, call org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 10.170.255.188:53949 Call#525896 Retry#0: output error


After the yarn-site.xml and mapred-site.xml parameters were changed, the NameNode stopped going down!


Modified yarn-site.xml:
      <property>
               <description>The minimum allocation for every container request at the RM,
                                                    in MBs. Memory requests lower than this won't take effect,
                   and the specified value will get allocated at minimum.</description>
               <name>yarn.scheduler.minimum-allocation-mb</name>
               <value>1500</value>
        </property>

        <property>
               <description>The maximum allocation for every container request at the RM,
                  in MBs. Memory requests higher than this won't take effect,
                   and will get capped to this value.</description>
               <name>yarn.scheduler.maximum-allocation-mb</name>
               <value>8192</value>
        </property>

mapred-site.xml
         <property>
        <name>mapreduce.map.memory.mb</name>
        <value>1500</value>
        </property>

        <property>
        <name>mapreduce.map.java.opts</name>
        <value>-Xmx1024M</value>
        <description>Larger heap-size for child jvms of maps.</description>
        </property>

        <property>
        <name>mapreduce.reduce.memory.mb</name>
        <value>2100</value>
        </property>

        <property>
        <name>mapreduce.reduce.java.opts</name>
        <value>-Xmx1900M</value>
        <description>Larger heap-size for child jvms of reduces.</description>
        </property>

        <property>
                <name>mapreduce.job.counters.limit</name>
                <value>12000</value>
        </property>

Final summary:
When you hit this kind of failure, work on it from three angles:
(1) the JournalNode timeouts;
(2) JVM parameter tuning;
(3) tuning the yarn-site.xml and mapred-site.xml configuration files.