Problems with fully distributed HBase, part 1: forgetting to turn off the firewall

Source: Internet  Published by: 知世鼓励小狼  Editor: 程序博客网  Time: 2024/04/20 05:31

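Over the past two days I have been learning how to set up HBase in fully distributed mode. Because I had been careless in several places, several problems showed up at once, and it took quite a while to untangle them. To keep things easy to follow, after solving them I reproduced the errors one at a time and wrote each one up separately. This post covers the firewall.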

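I remembered turning off the firewalls on all nodes back when I was learning Hadoop, so at first this possibility never crossed my mind. Only after searching online for a long time and finding articles describing the same experience did I go back and check.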

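Symptom: after running start-hbase.sh, the HMaster process is up, but the web UI at http://<masternode>:16010 cannot be reached (here masternode is hadoop.lsd1.com; I won't repeat this below). After a while, HMaster and HRegionServer die on all nodes. The master node's log, hbase-root-master-hadoop.lsd1.com.log, contains the following errors: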

2017-03-13 12:02:29,850 INFO  [main-SendThread(hadoop.lsd3.com:2181)] zookeeper.ClientCnxn: Opening socket connection to server hadoop.lsd3.com/192.168.56.13:2181. Will not attempt to authenticate using SASL (unknown error)
2017-03-13 12:02:29,851 WARN  [main-SendThread(hadoop.lsd3.com:2181)] zookeeper.ClientCnxn: Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.NoRouteToHostException: No route to host
	at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
	at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:712)
	at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
	at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
2017-03-13 12:02:30,934 INFO  [main-SendThread(hadoop.lsd1.com:2181)] zookeeper.ClientCnxn: Opening socket connection to server hadoop.lsd1.com/192.168.56.11:2181. Will not attempt to authenticate using SASL (unknown error)
2017-03-13 12:02:30,935 INFO  [main-SendThread(hadoop.lsd1.com:2181)] zookeeper.ClientCnxn: Socket connection established to hadoop.lsd1.com/192.168.56.11:2181, initiating session
2017-03-13 12:02:30,938 INFO  [main-SendThread(hadoop.lsd1.com:2181)] zookeeper.ClientCnxn: Unable to read additional data from server sessionid 0x0, likely server has closed socket, closing socket connection and attempting reconnect
2017-03-13 12:02:31,038 ERROR [main] zookeeper.RecoverableZooKeeper: ZooKeeper create failed after 4 attempts
2017-03-13 12:02:31,040 ERROR [main] master.HMasterCommandLine: Master exiting
java.lang.RuntimeException: Failed construction of Master: class org.apache.hadoop.hbase.master.HMaster. 
	at org.apache.hadoop.hbase.master.HMaster.constructMaster(HMaster.java:2426)
	at org.apache.hadoop.hbase.master.HMasterCommandLine.startMaster(HMasterCommandLine.java:231)
	at org.apache.hadoop.hbase.master.HMasterCommandLine.run(HMasterCommandLine.java:137)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
	at org.apache.hadoop.hbase.util.ServerCommandLine.doMain(ServerCommandLine.java:126)
	at org.apache.hadoop.hbase.master.HMaster.main(HMaster.java:2436)
Caused by: org.apache.hadoop.hbase.ZooKeeperConnectionException: master:160000x0, quorum=hadoop.lsd1.com:2181,hadoop.lsd2.com:2181,hadoop.lsd3.com:2181,hadoop.lsd4.com:2181, baseZNode=/hbase Unexpected KeeperException creating base node
	at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.createBaseZNodes(ZooKeeperWatcher.java:206)
	at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.<init>(ZooKeeperWatcher.java:187)
	at org.apache.hadoop.hbase.regionserver.HRegionServer.<init>(HRegionServer.java:585)
	at org.apache.hadoop.hbase.master.HMaster.<init>(HMaster.java:381)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:408)
	at org.apache.hadoop.hbase.master.HMaster.constructMaster(HMaster.java:2419)
	... 5 more
Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase
	at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
	at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
	at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:783)
	at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.createNonSequential(RecoverableZooKeeper.java:565)
	at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.create(RecoverableZooKeeper.java:544)
	at org.apache.hadoop.hbase.zookeeper.ZKUtil.createWithParents(ZKUtil.java:1204)
	at org.apache.hadoop.hbase.zookeeper.ZKUtil.createWithParents(ZKUtil.java:1182)
	at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.createBaseZNodes(ZooKeeperWatcher.java:194)
	... 13 more
[root@hadoop hbase-1.2.4]# 


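This looks like ZooKeeper communication is failing: the master cannot open a connection to the ZooKeeper server at hadoop.lsd3.com:2181 (NoRouteToHostException), the creation of the /hbase base znode fails after 4 attempts, and the master exits. Network problems like this probably have other causes besides a firewall that was left on, and I can't say what all of them are; I can only say that in my case the firewall was one cause.

In later experiments I turned off the firewalls on all other nodes and turned on the firewall of just one RegionServer node; this alone also made every HMaster and HRegionServer die (which is somewhat puzzling: doesn't HBase tolerate node failures?). By contrast, if the firewalls were off at startup and I turned on one node's firewall only after the cluster had started successfully (the master node included), the cluster did not exit; that node simply became unreachable, and in every case access came back once its firewall was turned off again.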
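Before touching any firewall it can help to confirm from the master node that the ZooKeeper client port really is unreachable. The following is a minimal sketch, not part of the original setup: it assumes bash (for its /dev/tcp redirection), the coreutils `timeout` command, and the default ZooKeeper client port 2181; the function names `check_port` and `check_cluster` and the script name are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: probe the ZooKeeper client port on each quorum host from the master node.
# Assumption: the client port is the default 2181 (override with ZK_PORT=...).
ZK_PORT=${ZK_PORT:-2181}

# Try a TCP connect to host $1, port $2 via bash's /dev/tcp, giving up after 3 s.
# Prints OK, or FAIL plus the tail of bash's error message (empty on timeout);
# returns non-zero on failure.
check_port() {
  local err
  if err=$(timeout 3 bash -c "exec 3<>/dev/tcp/$1/$2" 2>&1); then
    echo "OK    $1:$2"
  else
    echo "FAIL  $1:$2 ${err##*: }"
    return 1
  fi
}

# Check every host passed as an argument; returns non-zero if any host failed.
check_cluster() {
  local host rc=0
  for host in "$@"; do
    check_port "$host" "$ZK_PORT" || rc=1
  done
  return "$rc"
}

check_cluster "$@"
```

Usage, with the quorum from the log above: `bash check_zk.sh hadoop.lsd1.com hadoop.lsd2.com hadoop.lsd3.com hadoop.lsd4.com`. A "No route to host" failure typically points at a firewall rejecting the packets, while "Connection refused" means the host is reachable but nothing is listening on that port. The HBase 1.x default ports (16000, 16010, 16020, 16030) can be probed the same way by calling `check_port` directly.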


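Solution:

On every node, check whether the firewall is turned off:

service firewalld status

If it is not, stop it and disable it so that it does not come back after a reboot:

systemctl stop firewalld.service

systemctl disable firewalld.service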
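Running these commands by hand on every node is easy to get wrong, which is how this problem started. A minimal sketch that does it from one machine, assuming systemd hosts (CentOS/RHEL 7 style) and passwordless SSH as root; the function names, the DRY_RUN switch, and the script name are hypothetical additions, not part of the original setup.

```shell
#!/usr/bin/env bash
# Sketch: stop firewalld now and keep it from starting at boot on every node.
# Assumptions: systemd hosts (CentOS/RHEL 7 style), passwordless SSH as root.
# Set DRY_RUN=1 to print the commands instead of running them.

# Run command $2 on host $1 over SSH (or just print it in dry-run mode).
run_on() {
  if [ "${DRY_RUN:-0}" = 1 ]; then
    echo "ssh root@$1 $2"
  else
    ssh "root@$1" "$2"
  fi
}

# For each host: stop + disable firewalld, then show its state ("inactive" expected).
disable_firewall_all() {
  local host
  for host in "$@"; do
    run_on "$host" "systemctl stop firewalld.service && systemctl disable firewalld.service"
    run_on "$host" "systemctl is-active firewalld.service"
  done
}

disable_firewall_all "$@"
```

Usage: `DRY_RUN=1 bash disable_fw.sh hadoop.lsd1.com hadoop.lsd2.com hadoop.lsd3.com hadoop.lsd4.com` to review the commands, then the same without `DRY_RUN=1`. Disabling the firewall is fine on an isolated lab network like this one; on a shared network, opening only the ZooKeeper and HBase ports (for example `firewall-cmd --permanent --add-port=2181/tcp` followed by `firewall-cmd --reload`) is the more conservative choice.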

Then restart HBase. This problem is now solved and the web UI is reachable on port 16010. The remaining problems are described in separate posts.

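Summary: if, after running start-hbase.sh, jps shows the HMaster process running normally but the master's web UI cannot be reached, and the log contains connection errors like the ones above, check whether the firewall was properly turned off. The HMaster and HRegionServer processes may well all die a short while later, so the log is the main thing to go by.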
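The three checks in the summary (jps, web UI, log) can be bundled into one quick script run on the master node. This is a sketch under assumptions: HBASE_HOME is set and logs live in the default `$HBASE_HOME/logs` directory, the master web UI uses the default port 16010, and `curl` is installed; the function names and script name are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch of the checks from the summary, run on the master node.
# Assumptions: HBASE_HOME is set and logs are in $HBASE_HOME/logs (the default
# location), and the master web UI uses the default port 16010.

# 1) Is an HMaster JVM running? jps prints one "<pid> <MainClass>" line per JVM.
has_hmaster() { jps 2>/dev/null | grep -qw HMaster; }

# 2) Does the master web UI answer? -f: fail on HTTP errors, -m 5: give up after 5 s.
ui_up() { curl -sf -m 5 -o /dev/null "http://$1:16010/"; }

# 3) Count lines in log file $1 showing the connection errors seen above.
count_conn_errors() { grep -cE 'NoRouteToHostException|ConnectionLoss' "$1"; }

main() {
  local log
  if has_hmaster; then echo "HMaster process: running"; else echo "HMaster process: NOT running"; fi
  if ui_up "$1"; then echo "Web UI: reachable"; else echo "Web UI: NOT reachable"; fi
  for log in "${HBASE_HOME:-.}"/logs/hbase-*-master-*.log; do
    if [ -f "$log" ]; then
      echo "$log: $(count_conn_errors "$log") matching lines"
    fi
  done
}

# Usage: bash hbase_check.sh <master-host>
if [ $# -gt 0 ]; then main "$1"; fi
```

Usage: `bash hbase_check.sh hadoop.lsd1.com`. "HMaster process: running" together with "Web UI: NOT reachable" and a non-zero count of matching lines is the pattern described in this post; since the processes may exit shortly afterwards, the log count is the most reliable of the three signals.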

