hbase hregion server 异常退出排查

来源：互联网发布：mac beta 编辑：程序博客网时间：2024/06/16 04:47

---HBASE日志
2017-10-25 23:01:15,788 INFO [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 2907ms
GC pool 'ParNew' had collection(s): count=1 time=3312ms
2017-10-25 23:01:29,859 INFO [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 4569ms
GC pool 'ParNew' had collection(s): count=1 time=4746ms
2017-10-25 23:01:37,146 INFO [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 2285ms
GC pool 'ParNew' had collection(s): count=1 time=2530ms
2017-10-25 23:01:44,387 INFO [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 5240ms
GC pool 'ParNew' had collection(s): count=1 time=5417ms

2017-10-25 23:02:06,587 WARN [regionserver60020] util.Sleeper: We slept 22194ms instead of 3000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
2017-10-25 23:02:06,588 WARN [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 20200ms
GC pool 'ParNew' had collection(s): count=1 time=20520ms

2017-10-25 23:02:06,587 WARN [regionserver60020] util.Sleeper: We slept 22194ms instead of 3000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
2017-10-25 23:02:06,588 WARN [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 20200ms
GC pool 'ParNew' had collection(s): count=1 time=20520ms
2017-10-25 23:02:06,580 WARN [regionserver60020.compactionChecker] util.Sleeper: We slept 22193ms instead of 10000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
2017-10-25 23:02:06,594 WARN [regionserver60020.periodicFlusher] util.Sleeper: We slept 22197ms instead of 10000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
2017-10-25 23:02:06,596 INFO [regionserver60020-SendThread(bis-backup-s-01:2181)] zookeeper.ClientCnxn: Client session timed out, have not heard from server in 29326ms for sessionid 0x505e897ce7f9032d, closing socket connection and attempting reconnect
2017-10-25 23:02:06,595 INFO [regionserver60020-SendThread(bis-hadoop-datanode-s-01:2181)] zookeeper.ClientCnxn: Client session timed out, have not heard from server in 29374ms for sessionid 0x505e897ce7f9032e, closing socket connection and attempting reconnect

2017-10-25 23:02:06,659 WARN [RpcServer.handler=13,port=60020] ipc.RpcServer: (responseTooSlow): {"processingtimems":20589,"call":"Multi(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$MultiRequest)","client":"10.10.10.51:34518","starttimems":1508943706054,"queuetimems":22,"class":"HRegionServer","responsesize":791,"method":"Multi"}
2017-10-25 23:02:06,962 INFO [regionserver60020-SendThread(bis-hadoop-namenode-s-01:2181)] zookeeper.ClientCnxn: Opening socket connection to server bis-hadoop-namenode-s-01/10.10.10.82:2181. Will not attempt to authenticate using SASL (unknown error)
2017-10-25 23:02:06,963 INFO [regionserver60020-SendThread(bis-hadoop-namenode-s-01:2181)] zookeeper.ClientCnxn: Socket connection established to bis-hadoop-namenode-s-01/10.10.10.82:2181, initiating session
2017-10-25 23:02:06,964 INFO [regionserver60020-SendThread(bis-hadoop-namenode-s-01:2181)] zookeeper.ClientCnxn: Session establishment complete on server bis-hadoop-namenode-s-01/10.10.10.82:2181, sessionid = 0x505e897ce7f9032e, negotiated timeout = 40000
2017-10-25 23:02:07,014 INFO [regionserver60020-SendThread(bis-hadoop-namenode-s-01:2181)] zookeeper.ClientCnxn: Opening socket connection to server bis-hadoop-namenode-s-01/10.10.10.82:2181. Will not attempt to authenticate using SASL (unknown error)
2017-10-25 23:02:07,020 INFO [regionserver60020-SendThread(bis-hadoop-namenode-s-01:2181)] zookeeper.ClientCnxn: Socket connection established to bis-hadoop-namenode-s-01/10.10.10.82:2181, initiating session
2017-10-25 23:02:07,021 INFO [regionserver60020-SendThread(bis-hadoop-namenode-s-01:2181)] zookeeper.ClientCnxn: Session establishment complete on server bis-hadoop-namenode-s-01/10.10.10.82:2181, sessionid = 0x505e897ce7f9032d, negotiated timeout = 40000
2017-10-25 23:02:58,896 INFO [regionserver60020-SendThread(bis-hadoop-namenode-s-01:2181)] zookeeper.ClientCnxn: Client session timed out, have not heard from server in 51874ms for sessionid 0x505e897ce7f9032d, closing socket connection and attempting reconnect
2017-10-25 23:02:58,896 WARN [regionserver60020.periodicFlusher] util.Sleeper: We slept 52300ms instead of 10000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
2017-10-25 23:02:58,896 INFO [regionserver60020-SendThread(bis-hadoop-namenode-s-01:2181)] zookeeper.ClientCnxn: Client session timed out, have not heard from server in 51932ms for sessionid 0x505e897ce7f9032e, closing socket connection and attempting reconnect
2017-10-25 23:02:58,898 WARN [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 50309ms
GC pool 'ParNew' had collection(s): count=1 time=50566ms
2017-10-25 23:02:58,898 WARN [regionserver60020] util.Sleeper: We slept 52273ms instead of 3000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
2017-10-25 23:02:58,907 WARN [regionserver60020.compactionChecker] util.Sleeper: We slept 52311ms instead of 10000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
2017-10-25 23:02:58,910 WARN [ResponseProcessor for block BP-267413815-16.6.10.82-1400466369233:blk_1317254303_243853019] hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block BP-267413815-16.6.10.82-1400466369233:blk_1317254303_243853019
java.io.EOFException: Premature EOF: no length prefix available

2017-10-25 23:02:59,088 FATAL [regionserver60020] regionserver.HRegionServer: ABORTING region server bis-hadoop-datanode-s-01,60020,1508493905842: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing bis-hadoop-datanode-s-01,60020,1508493905842 as dead server

2017-10-25 23:02:59,369 FATAL [regionserver60020.logRoller] regionserver.HRegionServer: ABORTING region server bis-hadoop-datanode-s-01,60020,1508493905842: IOE in log roller
java.io.IOException: cannot get log writer
Caused by: java.io.FileNotFoundException: Parent directory doesn't exist: /hbase/WALs/bis-hadoop-datanode-s-01,60020,1508493905842

2017-10-25 23:02:59,424 FATAL [RpcServer.handler=26,port=60020] wal.FSHLog: Could not sync. Requesting roll of hlog
org.apache.hadoop.ipc.RemoteException(java.io.IOException): BP-267413815-16.6.10.82-1400466369233:blk_1317254303_243853019 does not exist or is not under Constructionblk_1317254303_243853123

----zookeeper server 日志
2017-10-25 23:02:07,022 [myid:82] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@595] - Established session 0x505e897ce7f9032d with negotiated timeout 40000 for client /10.10.10.80:40602

2017-10-25 23:02:48,000 [myid:82] - INFO [SessionTracker:ZooKeeperServer@325] - Expiring session 0x505e897ce7f9032d, timeout of 40000ms exceeded
2017-10-25 23:02:48,000 [myid:82] - INFO [SessionTracker:ZooKeeperServer@325] - Expiring session 0x505e897ce7f9032e, timeout of 40000ms exceeded
2017-10-25 23:02:48,001 [myid:82] - INFO [ProcessThread(sid:82 cport:-1)::PrepRequestProcessor@476] - Processed session termination for sessionid: 0x505e897ce7f9032d
2017-10-25 23:02:48,004 [myid:82] - INFO [ProcessThread(sid:82 cport:-1)::PrepRequestProcessor@476] - Processed session termination for sessionid: 0x505e897ce7f9032e
2017-10-25 23:02:48,013 [myid:82] - INFO [CommitProcessor:82:NIOServerCnxn@1001] - Closed socket connection for client /10.10.10.80:40602 which had sessionid 0x505e897ce7f9032d

2017-10-25 23:02:48,013 [myid:82] - INFO [CommitProcessor:82:NIOServerCnxn@1001] - Closed socket connection for client /10.10.10.80:40601 which had sessionid 0x505e897ce7f9032e

分析hbase和zk日志：

发现23点01-03分左右出现好几次大的gc,时间都在20s，最后一次在52s左右，期间服务全被堵塞了，无法响应外部请求；由于超出了zk允许的最大中断时间（40s)，zk服务自动剔除该hbase节点。hbase gc后试图继续与zk连续，但得不到响应，故退出服务。

阅读全文

0 0