Troubleshooting HBase regionservers that keep shutting themselves down

I have recently been debugging HBase on a 10-node cluster. Once the services were up and data was being written in, regionservers kept going down on their own. The log looks like this:

2016-05-04 13:29:09,690 WARN  [regionserver/ma7.cloud/192.168.1.46:16020] wal.ProtobufLogWriter: Failed to write trailer, non-fatal, continuing...
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /apps/hbase/data/oldWALs/ma7.cloud%2C16020%2C1461926336242.default.1462336775368 (inode 294646): File is not open for writing. Holder DFSClient_NONMAPREDUCE_-309271655_1 does not have any open files.
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3454)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalDatanode(FSNamesystem.java:3354)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getAdditionalDatanode(NameNodeRpcServer.java:823)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getAdditionalDatanode(ClientNamenodeProtocolServerSideTranslatorPB.java:515)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2151)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2147)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2145)

    at org.apache.hadoop.ipc.Client.call(Client.java:1411)
    at org.apache.hadoop.ipc.Client.call(Client.java:1364)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
    at com.sun.proxy.$Proxy16.getAdditionalDatanode(Unknown Source)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getAdditionalDatanode(ClientNamenodeProtocolTranslatorPB.java:393)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
    at com.sun.proxy.$Proxy17.getAdditionalDatanode(Unknown Source)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:279)
    at com.sun.proxy.$Proxy18.getAdditionalDatanode(Unknown Source)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1028)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1184)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:933)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:487)
2016-05-04 13:29:09,692 ERROR [regionserver/ma7.cloud/192.168.1.46:16020] regionserver.HRegionServer: Shutdown / close of WAL failed: org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on /apps/hbase/data/oldWALs/ma7.cloud%2C16020%2C1461926336242.default.1462336775368 (inode 294646): File is not open for writing. Holder DFSClient_NONMAPREDUCE_-309271655_1 does not have any open files.
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3454)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalDatanode(FSNamesystem.java:3354)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getAdditionalDatanode(NameNodeRpcServer.java:823)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getAdditionalDatanode(ClientNamenodeProtocolServerSideTranslatorPB.java:515)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2151)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2147)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2145)

2016-05-04 13:29:09,702 INFO  [regionserver/ma7.cloud/192.168.1.46:16020] regionserver.Leases: regionserver/ma7.cloud/192.168.1.46:16020 closing leases
2016-05-04 13:29:09,702 INFO  [regionserver/ma7.cloud/192.168.1.46:16020] regionserver.Leases: regionserver/ma7.cloud/192.168.1.46:16020 closed leases
2016-05-04 13:29:09,702 INFO  [regionserver/ma7.cloud/192.168.1.46:16020] hbase.ChoreService: Chore service for: ma7.cloud,16020,1461926336242 had [[ScheduledChore: Name: ma7.cloud,16020,1461926336242-MemstoreFlusherChore Period: 10000 Unit: MILLISECONDS], [ScheduledChore: Name: MovedRegionsCleaner for region ma7.cloud,16020,1461926336242 Period: 120000 Unit: MILLISECONDS]] on shutdown
2016-05-04 13:29:09,702 INFO  [regionserver/ma7.cloud/192.168.1.46:16020] regionserver.CompactSplitThread: Waiting for Split Thread to finish...
2016-05-04 13:29:09,703 INFO  [regionserver/ma7.cloud/192.168.1.46:16020] regionserver.CompactSplitThread: Waiting for Merge Thread to finish...
2016-05-04 13:29:09,703 INFO  [regionserver/ma7.cloud/192.168.1.46:16020] regionserver.CompactSplitThread: Waiting for Large Compaction Thread to finish...
2016-05-04 13:29:09,703 INFO  [regionserver/ma7.cloud/192.168.1.46:16020] regionserver.CompactSplitThread: Waiting for Small Compaction Thread to finish...
2016-05-04 13:29:09,703 WARN  [regionserver/ma7.cloud/192.168.1.46:16020] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=mapping2.100.cloud:2181,mapping1.100.cloud:2181,mapping3.100.cloud:2181, exception=org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/replication/rs/ma7.cloud,16020,1461926336242
2016-05-04 13:29:10,703 WARN  [regionserver/ma7.cloud/192.168.1.46:16020] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=mapping2.100.cloud:2181,mapping1.100.cloud:2181,mapping3.100.cloud:2181, exception=org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/replication/rs/ma7.cloud,16020,1461926336242
2016-05-04 13:29:12,704 WARN  [regionserver/ma7.cloud/192.168.1.46:16020] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=mapping2.100.cloud:2181,mapping1.100.cloud:2181,mapping3.100.cloud:2181, exception=org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/replication/rs/ma7.cloud,16020,1461926336242
2016-05-04 13:29:16,704 WARN  [regionserver/ma7.cloud/192.168.1.46:16020] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=mapping2.100.cloud:2181,mapping1.100.cloud:2181,mapping3.100.cloud:2181, exception=org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/replication/rs/ma7.cloud,16020,1461926336242
2016-05-04 13:29:18,007 INFO  [regionserver/ma7.cloud/192.168.1.46:16020.leaseChecker] regionserver.Leases: regionserver/ma7.cloud/192.168.1.46:16020.leaseChecker closing leases
2016-05-04 13:29:18,008 INFO  [regionserver/ma7.cloud/192.168.1.46:16020.leaseChecker] regionserver.Leases: regionserver/ma7.cloud/192.168.1.46:16020.leaseChecker closed leases
2016-05-04 13:29:24,704 WARN  [regionserver/ma7.cloud/192.168.1.46:16020] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=mapping2.100.cloud:2181,mapping1.100.cloud:2181,mapping3.100.cloud:2181, exception=org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/replication/rs/ma7.cloud,16020,1461926336242
2016-05-04 13:29:24,704 ERROR [regionserver/ma7.cloud/192.168.1.46:16020] zookeeper.RecoverableZooKeeper: ZooKeeper getChildren failed after 4 attempts
2016-05-04 13:29:24,704 WARN  [regionserver/ma7.cloud/192.168.1.46:16020] zookeeper.ZKUtil: regionserver:16020-0x15460f0ceb70046, quorum=mapping2.100.cloud:2181,mapping1.100.cloud:2181,mapping3.100.cloud:2181, baseZNode=/hbase-unsecure Unable to list children of znode /hbase-unsecure/replication/rs/ma7.cloud,16020,1461926336242
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/replication/rs/ma7.cloud,16020,1461926336242
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1472)
    at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.getChildren(RecoverableZooKeeper.java:295)
    at org.apache.hadoop.hbase.zookeeper.ZKUtil.listChildrenAndWatchForNewChildren(ZKUtil.java:454)
    at org.apache.hadoop.hbase.zookeeper.ZKUtil.listChildrenAndWatchThem(ZKUtil.java:482)
    at org.apache.hadoop.hbase.zookeeper.ZKUtil.listChildrenBFSAndWatchThem(ZKUtil.java:1461)
    at org.apache.hadoop.hbase.zookeeper.ZKUtil.deleteNodeRecursivelyMultiOrSequential(ZKUtil.java:1383)
    at org.apache.hadoop.hbase.zookeeper.ZKUtil.deleteNodeRecursively(ZKUtil.java:1265)
    at org.apache.hadoop.hbase.replication.ReplicationQueuesZKImpl.removeAllQueues(ReplicationQueuesZKImpl.java:187)
    at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.join(ReplicationSourceManager.java:292)
    at org.apache.hadoop.hbase.replication.regionserver.Replication.join(Replication.java:180)
    at org.apache.hadoop.hbase.replication.regionserver.Replication.stopReplicationService(Replication.java:172)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.stopServiceThreads(HRegionServer.java:2137)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:1071)
    at java.lang.Thread.run(Thread.java:745)
2016-05-04 13:29:24,705 ERROR [regionserver/ma7.cloud/192.168.1.46:16020] zookeeper.ZooKeeperWatcher: regionserver:16020-0x15460f0ceb70046, quorum=mapping2.100.cloud:2181,mapping1.100.cloud:2181,mapping3.100.cloud:2181, baseZNode=/hbase-unsecure Received unexpected KeeperException, re-throwing exception
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/replication/rs/ma7.cloud,16020,1461926336242
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1472)
    at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.getChildren(RecoverableZooKeeper.java:295)
    at org.apache.hadoop.hbase.zookeeper.ZKUtil.listChildrenAndWatchForNewChildren(ZKUtil.java:454)
    at org.apache.hadoop.hbase.zookeeper.ZKUtil.listChildrenAndWatchThem(ZKUtil.java:482)
    at org.apache.hadoop.hbase.zookeeper.ZKUtil.listChildrenBFSAndWatchThem(ZKUtil.java:1461)
    at org.apache.hadoop.hbase.zookeeper.ZKUtil.deleteNodeRecursivelyMultiOrSequential(ZKUtil.java:1383)
    at org.apache.hadoop.hbase.zookeeper.ZKUtil.deleteNodeRecursively(ZKUtil.java:1265)
    at org.apache.hadoop.hbase.replication.ReplicationQueuesZKImpl.removeAllQueues(ReplicationQueuesZKImpl.java:187)
    at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.join(ReplicationSourceManager.java:292)
    at org.apache.hadoop.hbase.replication.regionserver.Replication.join(Replication.java:180)
    at org.apache.hadoop.hbase.replication.regionserver.Replication.stopReplicationService(Replication.java:172)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.stopServiceThreads(HRegionServer.java:2137)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:1071)
    at java.lang.Thread.run(Thread.java:745)
2016-05-04 13:29:24,705 INFO  [regionserver/ma7.cloud/192.168.1.46:16020] ipc.RpcServer: Stopping server on 16020
2016-05-04 13:29:24,705 INFO  [RpcServer.listener,port=16020] ipc.RpcServer: RpcServer.listener,port=16020: stopping
2016-05-04 13:29:24,706 INFO  [RpcServer.responder] ipc.RpcServer: RpcServer.responder: stopped
2016-05-04 13:29:24,706 INFO  [RpcServer.responder] ipc.RpcServer: RpcServer.responder: stopping
2016-05-04 13:29:24,706 WARN  [regionserver/ma7.cloud/192.168.1.46:16020] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=mapping2.100.cloud:2181,mapping1.100.cloud:2181,mapping3.100.cloud:2181, exception=org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/rs/ma7.cloud,16020,1461926336242
2016-05-04 13:29:25,706 WARN  [regionserver/ma7.cloud/192.168.1.46:16020] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=mapping2.100.cloud:2181,mapping1.100.cloud:2181,mapping3.100.cloud:2181, exception=org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/rs/ma7.cloud,16020,1461926336242
2016-05-04 13:29:27,707 WARN  [regionserver/ma7.cloud/192.168.1.46:16020] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=mapping2.100.cloud:2181,mapping1.100.cloud:2181,mapping3.100.cloud:2181, exception=org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/rs/ma7.cloud,16020,1461926336242
2016-05-04 13:29:31,707 WARN  [regionserver/ma7.cloud/192.168.1.46:16020] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=mapping2.100.cloud:2181,mapping1.100.cloud:2181,mapping3.100.cloud:2181, exception=org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/rs/ma7.cloud,16020,1461926336242
2016-05-04 13:29:39,707 WARN  [regionserver/ma7.cloud/192.168.1.46:16020] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=mapping2.100.cloud:2181,mapping1.100.cloud:2181,mapping3.100.cloud:2181, exception=org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/rs/ma7.cloud,16020,1461926336242
2016-05-04 13:29:39,707 ERROR [regionserver/ma7.cloud/192.168.1.46:16020] zookeeper.RecoverableZooKeeper: ZooKeeper delete failed after 4 attempts
2016-05-04 13:29:39,707 WARN  [regionserver/ma7.cloud/192.168.1.46:16020] regionserver.HRegionServer: Failed deleting my ephemeral node
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/rs/ma7.cloud,16020,1461926336242
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:873)
    at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.delete(RecoverableZooKeeper.java:178)
    at org.apache.hadoop.hbase.zookeeper.ZKUtil.deleteNode(ZKUtil.java:1221)
    at org.apache.hadoop.hbase.zookeeper.ZKUtil.deleteNode(ZKUtil.java:1210)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.deleteMyEphemeralNode(HRegionServer.java:1403)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:1079)
    at java.lang.Thread.run(Thread.java:745)
2016-05-04 13:29:39,708 INFO  [regionserver/ma7.cloud/192.168.1.46:16020] regionserver.HRegionServer: stopping server ma7.cloud,16020,1461926336242; zookeeper connection closed.
2016-05-04 13:29:39,708 INFO  [regionserver/ma7.cloud/192.168.1.46:16020] regionserver.HRegionServer: regionserver/ma7.cloud/192.168.1.46:16020 exiting
2016-05-04 13:29:39,708 ERROR [main] regionserver.HRegionServerCommandLine: Region server exiting
java.lang.RuntimeException: HRegionServer Aborted
    at org.apache.hadoop.hbase.regionserver.HRegionServerCommandLine.start(HRegionServerCommandLine.java:68)
    at org.apache.hadoop.hbase.regionserver.HRegionServerCommandLine.run(HRegionServerCommandLine.java:87)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.hadoop.hbase.util.ServerCommandLine.doMain(ServerCommandLine.java:126)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.main(HRegionServer.java:2651)
2016-05-04 13:29:39,710 INFO  [Thread-7] regionserver.ShutdownHook: Shutdown hook starting; hbase.shutdown.hook=true; fsShutdownHook=org.apache.hadoop.fs.FileSystem$Cache$ClientFinalizer@7a7471ce
2016-05-04 13:29:39,710 INFO  [Thread-7] regionserver.ShutdownHook: Starting fs shutdown hook thread.
2016-05-04 13:29:39,710 INFO  [Thread-7] regionserver.ShutdownHook: Shutdown hook finished.


Analysis:

At first glance this looks like a ZooKeeper problem, but the SessionExpiredException means the regionserver's ZooKeeper session timed out and the server aborted itself, which typically happens when the process stalls, for example during a long stop-the-world GC pause. Checking the monitoring, available memory occasionally dropped to zero and network traffic was heavy, consistent with the ongoing data ingestion. Based on what I found online, this problem calls for tuning the JVM parameters.
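
If GC logging is enabled on the regionservers, this theory can be checked directly. A minimal sketch, assuming GC logging was already turned on with -Xloggc and -XX:+PrintGCDetails on JDK 8; the log path below is an assumption, not taken from this cluster:

# List full GC events in the regionserver GC log; pauses longer than the ZooKeeper
# session timeout would explain the SessionExpiredException seen above.
grep -i "Full GC" /var/log/hbase/gc.log-regionserver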


Changes made through Ambari:

Before:
export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS -Xmn{{regionserver_xmn_size}} -XX:CMSInitiatingOccupancyFraction=70  -Xms{{regionserver_heapsize}} -Xmx{{regionserver_heapsize}} $JDK_DEPENDED_OPTS"
After:
export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS -XX:MaxTenuringThreshold=3 -XX:SurvivorRatio=8 -XX:+UseG1GC -XX:MaxGCPauseMillis=50 -XX:InitiatingHeapOccupancyPercent=75 -XX:NewRatio=39 -Xms{{regionserver_heapsize}} -Xmx{{regionserver_heapsize}} $JDK_DEPENDED_OPTS"

Before:
export HBASE_OPTS="$HBASE_OPTS -XX:+UseConcMarkSweepGC -XX:ErrorFile={{log_dir}}/hs_err_pid%p.log -Djava.io.tmpdir={{java_io_tmpdir}}"
After:
export HBASE_OPTS="$HBASE_OPTS -XX:ErrorFile={{log_dir}}/hs_err_pid%p.log -Djava.io.tmpdir={{java_io_tmpdir}}"
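
After saving the change in Ambari and restarting the regionservers, it is worth confirming that the new flags actually reached the running JVM. A minimal sketch using standard JDK tooling (the grep pattern is only illustrative):

# jps -lvm prints every local Java process together with its JVM flags;
# the HRegionServer line should now show -XX:+UseG1GC and no -XX:+UseConcMarkSweepGC.
jps -lvm | grep HRegionServer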


Explanation: the heap is sized at 40 GB, the young generation at about 1 GB, and the collector is switched to G1.

-XX:NewRatio=39 is the ratio of the old generation to the young generation, so the young generation gets 1/(39+1) of the heap: 1/(39+1) * 40 GB = 1 GB.

With the default CMS collector these failures kept recurring and caused the regionserver to kill itself.

-Xmn{{regionserver_xmn_size}} sets the young generation size explicitly; it may no longer be appropriate under G1, so it was removed.
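
As a quick sanity check on that sizing, using only the values from the flags above:

# young generation ≈ heap / (NewRatio + 1) = 40 GB / (39 + 1) = 1 GB
echo $(( 40 * 1024 / (39 + 1) ))   # prints 1024 (MB), i.e. roughly 1 GB for the young generation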


References:

http://www.cnblogs.com/chengxin1982/p/3818448.html

http://www.cnblogs.com/zhenjing/archive/2012/11/13/hbase_is_OK.html
