Hadoop HA QJM, ResourceManager automatic failover, HBase with Hadoop HA, and changing Hive after enabling HA









------------------------------------------------------------------------------------------------------------------------------------






When initializing the edit log, if it is locked, delete the in_use.lock file.
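For reference, a minimal sketch of removing the lock file (the exact location depends on dfs.namenode.name.dir; the path below is only an assumption based on the layout seen in the NameNode log later in this post):

>rm -f /home/hadoop/hadoop/tmp/dfs/name/in_use.lock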




ResourceManager automatic failover -----------------------------------------------------------------------------------------


【yarn-site.xml】
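The property values were not written down at this point; a minimal sketch of the RM failover block, using the hostnames and zk quorum from the cluster plan later in these notes (s102/s103 as rm1/rm2 are assumptions):

<property>
    <name>yarn.resourcemanager.ha.enabled</name>
    <value>true</value>
</property>
<property>
    <name>yarn.resourcemanager.ha.rm-ids</name>
    <value>rm1,rm2</value>
</property>
<property>
    <name>yarn.resourcemanager.hostname.rm1</name>
    <value>s102</value>
</property>
<property>
    <name>yarn.resourcemanager.hostname.rm2</name>
    <value>s103</value>
</property>
<property>
    <name>yarn.resourcemanager.zk-address</name>
    <value>s100:2181,s101:2181,s102:2181</value>
</property>

The full version, including the cluster-id, appears in Part 2 below.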


Start on each of the two hosts: yarn-daemon.sh start resourcemanager

Integrating HBase with Hadoop HA:

To run HBase MapReduce jobs afterwards, you also need to modify 【hbase-site.xml】.

Before that, clear the old ZooKeeper data: xcall rm -rf /home/ubuntu/zk/version-2 (that is, the version-2 directory under the dataDir from [zoo.cfg]).

Modify 【hbase-site.xml】:
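The actual contents were not recorded here; a minimal sketch of what typically has to change so that HBase resolves the logical HDFS name (the nameservice neworigin and the zk hosts are assumptions taken from the rest of these notes):

<property>
    <name>hbase.rootdir</name>
    <value>hdfs://neworigin/hbase</value>
</property>
<property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
</property>
<property>
    <name>hbase.zookeeper.quorum</name>
    <value>s100,s101,s102</value>
</property>

HBase also needs to be able to resolve the nameservice, which usually means copying (or symlinking) hdfs-site.xml and core-site.xml into HBase's conf directory.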




$s100>zkCli.sh -server s101:2181


HBase HMaster HA ----------------

Just run this on whichever node should host a master: $>hbase-daemon.sh start master
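Alternatively (assuming the standard HBase layout), listing the standby host in conf/backup-masters lets start-hbase.sh bring up the extra master automatically; s101 below is just an assumed example:

[$HBASE_HOME/conf/backup-masters]
s101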


Modifying Hive: once the HDFS name becomes a logical nameservice, tables created under the old cluster no longer work.

Even after editing them they mostly still don't work; inspect with hdfs dfs -lsr /

Drop the original Hive database and create a new one.

If there is no /user/hive directory, delete the temporary directory and re-initialize the metastore:

schematool -initSchema -dbType mysql
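Instead of dropping the metastore, an alternative sketch for Hive versions that ship the metatool utility (the logical name neworigin and the old address s100:9000 are assumptions based on these notes) is to rewrite the stored HDFS locations in place:

hive --service metatool -listFSRoot
hive --service metatool -updateLocation hdfs://neworigin hdfs://s100:9000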


------------------------------------------------------------------------------------------------------------------------------------

Recently, while setting up a Hadoop HA QJM cluster, I ran into the problem described in this article's title.

There are many HA blog posts online, but the best reference is really the official documentation, which is already very detailed, so I won't repeat the HA setup here.

This article just presents a fix for the org.apache.hadoop.ipc.Client: Retrying connect to server error.

Typing the error into a search engine turned up nothing that solved it, so I'm writing this down as a memo and as a hint for anyone who runs into the same problem.


1. Problem description

HA was configured according to plan, but after startup the NameNode would not stay up. Right after starting, jps showed the NameNode, but a minute or two later it was gone.

After some testing, I found the following two behaviors:

1) If the JournalNodes are started first and then HDFS, the NameNode starts and runs normally.

2) If everything is started with start-dfs.sh, all the services come up, but the NameNode exits after about two minutes; starting it again separately with hadoop-daemon.sh start namenode then runs stably.

Now look at the NameNode log. Don't mind its length; the clues to the error are all in there:

2016-03-09 10:50:27,123 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: STARTUP_MSG: 

/************************************************************

STARTUP_MSG: Starting NameNode

STARTUP_MSG:   host = node1/192.168.56.201

STARTUP_MSG:   args = []

STARTUP_MSG:   version = 2.5.1

STARTUP_MSG:   build = Unknown -r Unknown; compiled by 'root' on 2014-10-20T05:53Z

STARTUP_MSG:   java = 1.7.0_09

************************************************************/

2016-03-09 10:50:27,132 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: registered UNIX signal handlers for [TERM, HUP, INT]

2016-03-09 10:50:27,138 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: createNameNode []

2016-03-09 10:50:27,465 INFO org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties from hadoop-metrics2.properties

2016-03-09 10:50:27,623 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot period at 10 second(s).

2016-03-09 10:50:27,623 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NameNode metrics system started

2016-03-09 10:50:27,625 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: fs.defaultFS is hdfs://hadoopha

2016-03-09 10:50:27,626 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: Clients are to use hadoopha to access this namenode/service.

2016-03-09 10:50:28,048 INFO org.apache.hadoop.hdfs.DFSUtil: Starting web server as: ${dfs.web.authentication.kerberos.principal}

2016-03-09 10:50:28,048 INFO org.apache.hadoop.hdfs.DFSUtil: Starting Web-server for hdfs at: http://node1:50070

2016-03-09 10:50:28,121 INFO org.mortbay.log: Logging to org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via org.mortbay.log.Slf4jLog

2016-03-09 10:50:28,128 INFO org.apache.hadoop.http.HttpRequestLog: Http request log for http.requests.namenode is not defined

2016-03-09 10:50:28,145 INFO org.apache.hadoop.http.HttpServer2: Added global filter 'safety' (class=org.apache.hadoop.http.HttpServer2$QuotingInputFilter)

2016-03-09 10:50:28,149 INFO org.apache.hadoop.http.HttpServer2: Added filter static_user_filter (class=org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter) to context hdfs

2016-03-09 10:50:28,149 INFO org.apache.hadoop.http.HttpServer2: Added filter static_user_filter (class=org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter) to context static

2016-03-09 10:50:28,149 INFO org.apache.hadoop.http.HttpServer2: Added filter static_user_filter (class=org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter) to context logs

2016-03-09 10:50:28,209 INFO org.apache.hadoop.http.HttpServer2: Added filter 'org.apache.hadoop.hdfs.web.AuthFilter' (class=org.apache.hadoop.hdfs.web.AuthFilter)

2016-03-09 10:50:28,211 INFO org.apache.hadoop.http.HttpServer2: addJerseyResourcePackage: packageName=org.apache.hadoop.hdfs.server.namenode.web.resources;org.apache.hadoop.hdfs.web.resources, pathSpec=/webhdfs/v1/*

2016-03-09 10:50:28,268 INFO org.apache.hadoop.http.HttpServer2: Jetty bound to port 50070

2016-03-09 10:50:28,269 INFO org.mortbay.log: jetty-6.1.26

2016-03-09 10:50:28,580 WARN org.apache.hadoop.security.authentication.server.AuthenticationFilter: 'signature.secret' configuration not set, using a random value as secret

2016-03-09 10:50:28,648 INFO org.mortbay.log: Started HttpServer2$SelectChannelConnectorWithSafeStartup@node1:50070

2016-03-09 10:50:28,687 WARN org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Only one image storage directory (dfs.namenode.name.dir) configured. Beware of data loss due to lack of redundant storage directories!

2016-03-09 10:50:28,741 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: fsLock is fair:true

2016-03-09 10:50:28,802 INFO org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: dfs.block.invalidate.limit=1000

2016-03-09 10:50:28,802 INFO org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: dfs.namenode.datanode.registration.ip-hostname-check=true

2016-03-09 10:50:28,805 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: dfs.namenode.startup.delay.block.deletion.sec is set to 000:00:00:00.000

2016-03-09 10:50:28,807 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: The block deletion will start around 2016 Mar 09 10:50:28

2016-03-09 10:50:28,810 INFO org.apache.hadoop.util.GSet: Computing capacity for map BlocksMap

2016-03-09 10:50:28,810 INFO org.apache.hadoop.util.GSet: VM type       = 64-bit

2016-03-09 10:50:28,813 INFO org.apache.hadoop.util.GSet: 2.0% max memory 966.7 MB = 19.3 MB

2016-03-09 10:50:28,813 INFO org.apache.hadoop.util.GSet: capacity      = 2^21 = 2097152 entries

2016-03-09 10:50:28,852 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: dfs.block.access.token.enable=false

2016-03-09 10:50:28,852 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: defaultReplication         = 3

2016-03-09 10:50:28,852 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: maxReplication             = 512

2016-03-09 10:50:28,852 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: minReplication             = 1

2016-03-09 10:50:28,853 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: maxReplicationStreams      = 2

2016-03-09 10:50:28,853 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: shouldCheckForEnoughRacks  = false

2016-03-09 10:50:28,853 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: replicationRecheckInterval = 3000

2016-03-09 10:50:28,853 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: encryptDataTransfer        = false

2016-03-09 10:50:28,853 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: maxNumBlocksToLog          = 1000

2016-03-09 10:50:28,859 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: fsOwner             = hadoop (auth:SIMPLE)

2016-03-09 10:50:28,859 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: supergroup          = supergroup

2016-03-09 10:50:28,859 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: isPermissionEnabled = true

2016-03-09 10:50:28,865 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Determined nameservice ID: hadoopha

2016-03-09 10:50:28,865 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: HA Enabled: true

2016-03-09 10:50:28,866 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Append Enabled: true

2016-03-09 10:50:29,120 INFO org.apache.hadoop.util.GSet: Computing capacity for map INodeMap

2016-03-09 10:50:29,120 INFO org.apache.hadoop.util.GSet: VM type       = 64-bit

2016-03-09 10:50:29,120 INFO org.apache.hadoop.util.GSet: 1.0% max memory 966.7 MB = 9.7 MB

2016-03-09 10:50:29,120 INFO org.apache.hadoop.util.GSet: capacity      = 2^20 = 1048576 entries

2016-03-09 10:50:29,174 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: Caching file names occuring more than 10 times

2016-03-09 10:50:29,186 INFO org.apache.hadoop.util.GSet: Computing capacity for map cachedBlocks

2016-03-09 10:50:29,186 INFO org.apache.hadoop.util.GSet: VM type       = 64-bit

2016-03-09 10:50:29,186 INFO org.apache.hadoop.util.GSet: 0.25% max memory 966.7 MB = 2.4 MB

2016-03-09 10:50:29,186 INFO org.apache.hadoop.util.GSet: capacity      = 2^18 = 262144 entries

2016-03-09 10:50:29,188 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: dfs.namenode.safemode.threshold-pct = 0.9990000128746033

2016-03-09 10:50:29,188 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: dfs.namenode.safemode.min.datanodes = 0

2016-03-09 10:50:29,188 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: dfs.namenode.safemode.extension     = 30000

2016-03-09 10:50:29,190 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Retry cache on namenode is enabled

2016-03-09 10:50:29,190 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Retry cache will use 0.03 of total heap and retry cache entry expiry time is 600000 millis

2016-03-09 10:50:29,194 INFO org.apache.hadoop.util.GSet: Computing capacity for map NameNodeRetryCache

2016-03-09 10:50:29,194 INFO org.apache.hadoop.util.GSet: VM type       = 64-bit

2016-03-09 10:50:29,194 INFO org.apache.hadoop.util.GSet: 0.029999999329447746% max memory 966.7 MB = 297.0 KB

2016-03-09 10:50:29,194 INFO org.apache.hadoop.util.GSet: capacity      = 2^15 = 32768 entries

2016-03-09 10:50:29,199 INFO org.apache.hadoop.hdfs.server.namenode.NNConf: ACLs enabled? false

2016-03-09 10:50:29,199 INFO org.apache.hadoop.hdfs.server.namenode.NNConf: XAttrs enabled? true

2016-03-09 10:50:29,199 INFO org.apache.hadoop.hdfs.server.namenode.NNConf: Maximum size of an xattr: 16384

2016-03-09 10:50:29,208 INFO org.apache.hadoop.hdfs.server.common.Storage: Lock on /home/hadoop/hadoop/tmp/dfs/name/in_use.lock acquired by nodename 4394@node1

2016-03-09 10:50:29,610 WARN org.apache.hadoop.security.ssl.FileBasedKeyStoresFactory: The property 'ssl.client.truststore.location' has not been set, no TrustStore will be loaded

2016-03-09 10:50:31,053 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node2/192.168.56.202:8485. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)

2016-03-09 10:50:31,054 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node3/192.168.56.203:8485. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)

2016-03-09 10:50:31,054 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node4/192.168.56.204:8485. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)

2016-03-09 10:50:32,055 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node2/192.168.56.202:8485. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)

(repeated lines omitted)

2016-03-09 10:50:35,807 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 6001 ms (timeout=20000 ms) for a response for selectInputStreams. No responses yet.

(repeated lines omitted)

2016-03-09 10:50:39,812 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 10006 ms (timeout=20000 ms) for a response for selectInputStreams. No responses yet.

2016-03-09 10:50:40,065 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node3/192.168.56.203:8485. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)

2016-03-09 10:50:40,065 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node4/192.168.56.204:8485. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)

2016-03-09 10:50:40,065 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node2/192.168.56.202:8485. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)

2016-03-09 10:50:40,069 WARN org.apache.hadoop.hdfs.server.namenode.FSEditLog: Unable to determine input streams from QJM to [192.168.56.202:8485, 192.168.56.203:8485, 192.168.56.204:8485]. Skipping.

org.apache.hadoop.hdfs.qjournal.client.QuorumException: Got too many exceptions to achieve quorum size 2/3. 3 exceptions thrown:

192.168.56.202:8485: Call From node1/192.168.56.201 to node2:8485 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused

192.168.56.203:8485: Call From node1/192.168.56.201 to node3:8485 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused

192.168.56.204:8485: Call From node1/192.168.56.201 to node4:8485 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused

 at org.apache.hadoop.hdfs.qjournal.client.QuorumException.create(QuorumException.java:81)

 at org.apache.hadoop.hdfs.qjournal.client.QuorumCall.rethrowException(QuorumCall.java:223)

 at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:142)

 at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.selectInputStreams(QuorumJournalManager.java:471)

 at org.apache.hadoop.hdfs.server.namenode.JournalSet.selectInputStreams(JournalSet.java:260)

 at org.apache.hadoop.hdfs.server.namenode.FSEditLog.selectInputStreams(FSEditLog.java:1430)

 at org.apache.hadoop.hdfs.server.namenode.FSEditLog.selectInputStreams(FSEditLog.java:1450)

 at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:636)

 at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:279)

 at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:955)

 at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:700)

 at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:529)

 at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:585)

 at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:751)

 at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:735)

 at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1407)

 at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1473)

2016-03-09 10:50:40,071 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: No edit log streams selected.

2016-03-09 10:50:40,116 INFO org.apache.hadoop.hdfs.server.namenode.FSImageFormatPBINode: Loading 1 INodes.

2016-03-09 10:50:40,174 INFO org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf: Loaded FSImage in 0 seconds.

2016-03-09 10:50:40,174 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Loaded image for txid 0 from /home/hadoop/hadoop/tmp/dfs/name/current/fsimage_0000000000000000000

2016-03-09 10:50:40,184 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Need to save fs image? false (staleImage=true, haEnabled=true, isRollingUpgrade=false)

2016-03-09 10:50:40,185 INFO org.apache.hadoop.hdfs.server.namenode.NameCache: initialized with 0 entries 0 lookups

2016-03-09 10:50:40,185 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Finished loading FSImage in 10986 msecs

2016-03-09 10:50:40,408 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: RPC server is binding to node1:8020

2016-03-09 10:50:40,414 INFO org.apache.hadoop.ipc.CallQueueManager: Using callQueue class java.util.concurrent.LinkedBlockingQueue

2016-03-09 10:50:40,429 INFO org.apache.hadoop.ipc.Server: Starting Socket Reader #1 for port 8020

2016-03-09 10:50:40,461 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Registered FSNamesystemState MBean

2016-03-09 10:50:40,474 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Number of blocks under construction: 0

2016-03-09 10:50:40,474 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Number of blocks under construction: 0

2016-03-09 10:50:40,475 INFO org.apache.hadoop.hdfs.StateChange: STATE* Leaving safe mode after 11 secs

2016-03-09 10:50:40,475 INFO org.apache.hadoop.hdfs.StateChange: STATE* Network topology has 0 racks and 0 datanodes

2016-03-09 10:50:40,475 INFO org.apache.hadoop.hdfs.StateChange: STATE* UnderReplicatedBlocks has 0 blocks

2016-03-09 10:50:40,536 INFO org.apache.hadoop.ipc.Server: IPC Server Responder: starting

2016-03-09 10:50:40,539 INFO org.apache.hadoop.ipc.Server: IPC Server listener on 8020: starting

2016-03-09 10:50:40,542 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: NameNode RPC up at: node1/192.168.56.201:8020

2016-03-09 10:50:40,542 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Starting services required for standby state

2016-03-09 10:50:40,545 INFO org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Will roll logs on active node at node5/192.168.56.205:8020 every 120 seconds.

2016-03-09 10:50:40,550 INFO org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer: Starting standby checkpoint thread...

Checkpointing active NN at http://node5:50070

Serving checkpoints at http://node1:50070

2016-03-09 10:50:41,551 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node2/192.168.56.202:8485. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)

(repeated lines omitted)

2016-03-09 10:50:50,557 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 10007 ms (timeout=20000 ms) for a response for selectInputStreams. No responses yet.

2016-03-09 10:50:50,561 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node3/192.168.56.203:8485. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)

2016-03-09 10:50:50,626 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node4/192.168.56.204:8485. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)

2016-03-09 10:50:50,676 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node2/192.168.56.202:8485. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)

2016-03-09 10:50:50,677 WARN org.apache.hadoop.hdfs.server.namenode.FSEditLog: Unable to determine input streams from QJM to [192.168.56.202:8485, 192.168.56.203:8485, 192.168.56.204:8485]. Skipping.

org.apache.hadoop.hdfs.qjournal.client.QuorumException: Got too many exceptions to achieve quorum size 2/3. 3 exceptions thrown:

192.168.56.202:8485: Call From node1/192.168.56.201 to node2:8485 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused

192.168.56.203:8485: Call From node1/192.168.56.201 to node3:8485 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused

192.168.56.204:8485: Call From node1/192.168.56.201 to node4:8485 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused

 at org.apache.hadoop.hdfs.qjournal.client.QuorumException.create(QuorumException.java:81)

 at org.apache.hadoop.hdfs.qjournal.client.QuorumCall.rethrowException(QuorumCall.java:223)

 at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:142)

 at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.selectInputStreams(QuorumJournalManager.java:471)

 at org.apache.hadoop.hdfs.server.namenode.JournalSet.selectInputStreams(JournalSet.java:260)

 at org.apache.hadoop.hdfs.server.namenode.FSEditLog.selectInputStreams(FSEditLog.java:1430)

 at org.apache.hadoop.hdfs.server.namenode.FSEditLog.selectInputStreams(FSEditLog.java:1450)

 at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:212)

 at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:324)

 at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$200(EditLogTailer.java:282)

 at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:299)

 at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:411)

 at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:295)

2016-03-09 10:50:50,677 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Stopping services started for standby state

2016-03-09 10:50:50,678 WARN org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Edit log tailer interrupted

java.lang.InterruptedException: sleep interrupted

 at java.lang.Thread.sleep(Native Method)

 at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:337)

 at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$200(EditLogTailer.java:282)

 at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:299)

 at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:411)

 at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:295)

2016-03-09 10:50:50,682 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Starting services required for active state

2016-03-09 10:50:50,684 WARN org.apache.hadoop.security.ssl.FileBasedKeyStoresFactory: The property 'ssl.client.truststore.location' has not been set, no TrustStore will be loaded

2016-03-09 10:50:50,690 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Starting recovery process for unclosed journal segments...

2016-03-09 10:50:51,698 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node3/192.168.56.203:8485. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)

(repeated lines omitted)

2016-03-09 10:51:00,715 FATAL org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: recoverUnfinalizedSegments failed for required journal (JournalAndStream(mgr=QJM to [192.168.56.202:8485, 192.168.56.203:8485, 192.168.56.204:8485], stream=null))

org.apache.hadoop.hdfs.qjournal.client.QuorumException: Got too many exceptions to achieve quorum size 2/3. 3 exceptions thrown:

192.168.56.203:8485: Call From node1/192.168.56.201 to node3:8485 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused

192.168.56.202:8485: Call From node1/192.168.56.201 to node2:8485 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused

192.168.56.204:8485: Call From node1/192.168.56.201 to node4:8485 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused

 at org.apache.hadoop.hdfs.qjournal.client.QuorumException.create(QuorumException.java:81)

 at org.apache.hadoop.hdfs.qjournal.client.QuorumCall.rethrowException(QuorumCall.java:223)

 at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:142)

 at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.createNewUniqueEpoch(QuorumJournalManager.java:182)

 at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.recoverUnfinalizedSegments(QuorumJournalManager.java:436)

 at org.apache.hadoop.hdfs.server.namenode.JournalSet$7.apply(JournalSet.java:590)

 at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:359)

 at org.apache.hadoop.hdfs.server.namenode.JournalSet.recoverUnfinalizedSegments(JournalSet.java:587)

 at org.apache.hadoop.hdfs.server.namenode.FSEditLog.recoverUnclosedStreams(FSEditLog.java:1361)

 at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startActiveServices(FSNamesystem.java:1068)

 at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.startActiveServices(NameNode.java:1624)

 at org.apache.hadoop.hdfs.server.namenode.ha.ActiveState.enterState(ActiveState.java:61)

 at org.apache.hadoop.hdfs.server.namenode.ha.HAState.setStateInternal(HAState.java:63)

 at org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.setState(StandbyState.java:49)

 at org.apache.hadoop.hdfs.server.namenode.NameNode.transitionToActive(NameNode.java:1502)

 at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.transitionToActive(NameNodeRpcServer.java:1197)

 at org.apache.hadoop.ha.protocolPB.HAServiceProtocolServerSideTranslatorPB.transitionToActive(HAServiceProtocolServerSideTranslatorPB.java:107)

 at org.apache.hadoop.ha.proto.HAServiceProtocolProtos$HAServiceProtocolService$2.callBlockingMethod(HAServiceProtocolProtos.java:4460)

 at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)

 at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)

 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)

 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)

 at java.security.AccessController.doPrivileged(Native Method)

 at javax.security.auth.Subject.doAs(Subject.java:415)

 at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)

 at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)

2016-03-09 10:51:00,717 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1

2016-03-09 10:51:00,718 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG: 

/************************************************************

SHUTDOWN_MSG: Shutting down NameNode at node1/192.168.56.201

************************************************************/


2. Problem analysis

The log is long, but let's go through it, paying attention to the key lines.

Clearly the NameNode cannot run normally, not because the configuration is wrong, but because it cannot connect to the JournalNodes.

The JournalNode logs show nothing wrong, so the problem is on the JournalNodes' client side, i.e. the NameNode.

2016-03-09 10:50:31,053 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node2/192.168.56.202:8485. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)

Let's analyze the log line above:

The NameNode, acting as a client of the JournalNodes, issues a connection request that fails; it then tries the other nodes in turn, and those fail too, until the maximum number of retries is reached.

The experiments show that starting the JournalNodes first, or simply starting the NameNode again, works; this means the JournalNodes were not ready yet while the NameNode had already used up all of its retries.
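So, besides the ipc tuning described below, a simple workaround sketch is to bring the JournalNodes up before starting HDFS (hostnames taken from the log above):

for h in node2 node3 node4; do
    ssh "$h" "hadoop-daemon.sh start journalnode"
done
start-dfs.sh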


3. Solution

Modify the ipc parameters in core-site.xml:

 <property>

   <name>ipc.client.connect.max.retries</name>

   <value>100</value>

   <description>Indicates the number of retries a client will make to establish

       a server connection.

   </description>

 </property>

 <property>

   <name>ipc.client.connect.retry.interval</name>

   <value>10000</value>

   <description>Indicates the number of milliseconds a client will wait for

  before retrying to establish a server connection.

   </description>

 </property>


These control the retry interval and retry count for the IPC connection requests the NameNode makes to the JournalNodes. On my virtual-machine cluster it takes roughly two minutes before the NameNode can connect to the JournalNodes, and the connection is stable after that.

Note: adjusting the ipc parameters in core-site.xml only helps with connection timeouts caused by a service that has not finished starting yet. If the target service itself failed to start, tuning the ipc parameters will not help.
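A quick way to tell which case you are in is to check, on a JournalNode host, that the process is running and that its RPC port (8485 in this cluster) is listening:

jps | grep JournalNode
netstat -nlp | grep 8485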


Part 1: HDFS HA

        NN   DN   ZK   JN   ZKFC
s100    1         1         1
s101    1    1    1    1    1
s102         1    1    1
s103         1         1

 

 

1. Passwordless SSH (all four hosts, to each other)

ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
ssh-copy-id s100
ssh-copy-id s101
ssh-copy-id s102
ssh-copy-id s103

2. Install the JDK

[/etc/environment]

JAVA_HOME=jdk
PATH="$PATH:jdk/bin"

 

3. Install Hadoop

Upload the Hadoop package, unpack it, and set the environment variables.

 

4. Edit the slaves file

[/hadoop-2.7.0/etc/hadoop/slaves]

s101

s102

s103

Send it to the other three hosts.

5. Install the ZooKeeper cluster

1. Upload the zk tarball to s100

>cp /mnt/hgfs/BigData/第九天/zookeeper-3.4.10.tar.gz /data/

Unpack it:

>cd /data

>tar -xzvf zookeeper-3.4.10.tar.gz

 

2. Send it to the other nodes

>xsync /data/zookeeper-3.4.10

 

3. Configure the environment

[/etc/environment]

ZOOKEEPER_HOME=/data/zookeeper-3.4.10
PATH="$PATH:/data/zookeeper-3.4.10/bin"

 

4. Reboot

>sudo reboot

5. Test

>zk

 

6. Edit the configuration file

>cd /data/zookeeper-3.4.10/conf

>cp zoo_sample.cfg zoo.cfg

>sudo nano zoo.cfg

dataDir=/home/neworigin/zookeeper
server.1=s100:2888:3888
server.2=s101:2888:3888
server.3=s102:2888:3888

 

In server.X=host:port1:port2, X is the id of the ZooKeeper server running on that host (you will need it when filling in myid below), port1 is the port followers use to connect to the leader, and port2 is the port used for leader election.

 

7. Distribute the configuration file

>xsync zoo.cfg

 

8. Create the data directories

>xcall mkdir -p /home/neworigin/zookeeper

>cd /home/neworigin/zookeeper

[s100]
>echo 1 > myid
[s101]
>echo 2 > myid
[s102]
>echo 3 > myid

 

The file content is the X of the server.X entry in zoo.cfg that corresponds to this host.

 

9. Start the cluster (start zk on s100, s101 and s102 together; avoid big gaps between their start times)

>zkServer.sh start

>xcall jps    (check the processes)

------------s100-----------------

3862 QuorumPeerMain

4351 Jps

------------s101-----------------

3650 QuorumPeerMain

3980 Jps

------------s102-----------------

3961 Jps

3791 QuorumPeerMain

------------s103-----------------

3883 Jps

 

[Check status: leader or follower]
>zkServer.sh status

[Stop]
>zkServer.sh stop

 

10. On every server, delete the old Hadoop directory /home/neworigin/hadoop and recreate it

>xcall rm -rf /home/neworigin/hadoop

>xcall mkdir -p /home/neworigin/hadoop

 

 

11. Edit the Hadoop configuration

[/data/hadoop-2.7.0/etc/hadoop/hdfs-site.xml]

<property>          -----> nameservice
    <name>dfs.nameservices</name>
    <value>neworigin</value>
</property>
<property>
    <name>dfs.ha.namenodes.neworigin</name>
    <value>nn1,nn2</value>
</property>
<property>          -----> RPC addresses
    <name>dfs.namenode.rpc-address.neworigin.nn1</name>
    <value>s100:9000</value>
</property>
<property>
    <name>dfs.namenode.rpc-address.neworigin.nn2</name>
    <value>s101:9000</value>
</property>
<property>          -----> http addresses
    <name>dfs.namenode.http-address.neworigin.nn1</name>
    <value>s100:50070</value>
</property>
<property>
    <name>dfs.namenode.http-address.neworigin.nn2</name>
    <value>s101:50070</value>
</property>
<property>          -----> JN cluster
    <name>dfs.namenode.shared.edits.dir</name>
    <value>qjournal://s101:8485;s102:8485;s103:8485/neworigin</value>
</property>
<property>          -----> failover proxy provider
    <name>dfs.client.failover.proxy.provider.neworigin</name>
    <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
<property>          -----> fencing method and its private key
    <name>dfs.ha.fencing.methods</name>
    <value>sshfence</value>
</property>
<property>
    <name>dfs.ha.fencing.ssh.private-key-files</name>
    <value>/home/neworigin/.ssh/id_dsa</value>    (use /root/.ssh/id_rsa if you used the second passwordless-ssh method)
</property>
<property>          -----> directory for edits on the JNs (created automatically when formatting)
    <name>dfs.journalnode.edits.dir</name>
    <value>/home/neworigin/journal/data</value>
</property>
<property>          -----> enable automatic failover; defaults to false if omitted
    <name>dfs.ha.automatic-failover.enabled</name>
    <value>true</value>
</property>

 

 

[core-site.xml]

<property>
    <name>fs.defaultFS</name>
    <value>hdfs://neworigin</value>
</property>

<property>
    <name>hadoop.tmp.dir</name>
    <value>/home/neworigin/hadoop</value>
</property>

<property>
    <name>ha.zookeeper.quorum</name>
    <value>s100:2181,s101:2181,s102:2181</value>
</property>

 

 

 

 

 

 

(1) dfs.nameservices

The logical name of the HDFS nameservice; you can pick any name, e.g. mycluster. Note that this name will be used by systems built on HDFS, such as HBase. Also, if you want to enable HDFS Federation, you can list multiple logical names here, separated by commas.

(2) dfs.ha.namenodes.[$nameservice ID]:

The list of NameNodes under a given nameservice; each NameNode gets a custom ID. For example, a nameservice nn could contain two NameNodes named nn1 and nn2.

(3) dfs.namenode.rpc-address.[$nameservice ID].[$namenode ID]

Sets the RPC address for each NameNode.

(4) dfs.namenode.http-address.[$nameservice ID].[$namenode ID]

Sets the externally visible HTTP address for each NameNode.

(5) dfs.namenode.shared.edits.dir

Sets the URI of a group of JournalNodes. The active NameNode writes the edit log to these JournalNodes, and the standby NameNode reads the edit log from them and applies it to its in-memory directory tree. The value must have the following format:

[qjournal://host1:port1;host2:port2;host3:port3/journalId]
where journalId is the unique ID of this namespace.

(6) dfs.client.failover.proxy.provider.[$nameservice ID]

The Java class clients use to interact with the active NameNode; DFS clients use it to find the current active NameNode. You can implement your own, but the default is ConfiguredFailoverProxyProvider.

(7) dfs.ha.fencing.methods

When an active/standby architecture is used to remove the single point of failure, split brain is the problem that must be handled carefully: two masters serving at the same time leaves the system in an inconsistent state and can lead to data loss. In HDFS HA, the JournalNodes only allow one NameNode to write, so two active NameNodes cannot both commit edits; however, during a failover the previous active NameNode may still be answering client RPC requests, so a fencing mechanism is needed to kill the previous active NameNode.
HDFS lets you configure multiple fencing methods; on failover they are executed in order until one succeeds. Hadoop 2.0 ships two fencing methods: shell and sshfence.

 

1) sshfence
sshfence logs in to the previous active NameNode over ssh and kills it. For this to work,
passwordless ssh login must be configured, by pointing dfs.ha.fencing.ssh.private-key-files at a private key file.

<property>  

<name>dfs.ha.fencing.methods</name>  

<value>sshfence</value>  

</property>  

<property>  

<name>dfs.ha.fencing.ssh.private-key-files</name>  

<value>/home/hadoop/.ssh/id_rsa</value>  

</property>  

You can also specify an ssh user and port, and set a timeout; if the ssh command exceeds the timeout, the fencing attempt is considered failed.

<property>  

<name>dfs.ha.fencing.methods</name>  

<value>sshfence([[username][:port]])</value>  

</property>  

<property>  

<name>dfs.ha.fencing.ssh.connect-timeout</name>  

<value>30000</value>  

</property>  


2) shell
Runs an arbitrary shell command to fence the old active NameNode; configured as follows:

<property>  

<name>dfs.ha.fencing.methods</name>  


<value>shell(/path/to/my/script.sh arg1 arg2 ...)</value> 

</property>  

 

Note that all Hadoop configuration parameters are passed to this shell command as environment variables, with every '.' replaced by '_'; for example, dfs.namenode.rpc-address.ns1.nn1 becomes dfs_namenode_rpc-address.
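As a small illustration only, a hypothetical fence script (the script path and the kill strategy are assumptions, not part of the original notes; $target_host and $target_port are among the variables the fencer exports for the node being fenced):

#!/bin/bash
# fence-nn.sh (hypothetical): kill whatever is listening on the old active NameNode's RPC port
ssh "$target_host" "fuser -k -n tcp $target_port"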

 

(8) fs.defaultFS

Sets the default path prefix; it must be set in core-site.xml. For example, if the nameservice ID (the one given by dfs.nameservices) is mycluster, configure:

<property>  

<name>fs.defaultFS</name>  

<value>hdfs://nn</value>  

</property>  

 

(9) dfs.journalnode.edits.dir

A directory on the JournalNode host used to store the editlog and other state. Only one directory can be set here; you can put the disk on RAID for better reliability.

<property>  

<name>dfs.journalnode.edits.dir</name>  

<value>/opt/journal/node/local/data</value>  

</property>  

12. Start the JournalNode service on the three JN hosts (s101, s102, s103)

>hadoop-daemon.sh start journalnode

>xcall jps

------------s100-----------------

4363 Jps

3421 QuorumPeerMain

------------s101-----------------

4034 JournalNode

4227 Jps

3118 QuorumPeerMain

------------s102-----------------

3105 QuorumPeerMain

4210 Jps

4015 JournalNode

------------s103-----------------

3392 JournalNode

3592 Jps

 

13. Pick one NN and format it -----> prerequisite: the journalnodes started without problems, and the /opt/hadoop directory is empty

s100:
>hdfs namenode -format

At this point you can look inside the JournalNodes for edits files; if they are there, the NN has handed edit-log management over to the JNs.

Start this NN:
>hadoop-daemon.sh start namenode
>jps

 

14. On the NN that was not formatted, run the sync command

s101:
>hdfs namenode -bootstrapStandby    (sync from the already-formatted NameNode)

 

15. Format ZK ----- make sure the zk cluster is already running

s100:
>hdfs zkfc -formatZK

 

16. Start the cluster

s100:

>start-dfs.sh

>jps

17. Test

s100:50070------>s100:8020(active)
s101:50070------>s101:8020(standby)

Stop it on s100: hadoop-daemon.sh stop namenode

 

Check the states:
>hdfs haadmin -getServiceState nn1
>hdfs haadmin -getServiceState nn2

Switch states:
>hdfs haadmin -transitionToActive nn1
>hdfs haadmin -transitionToStandby nn2

s100:50070 (the page fails to load)
s101:50070------>s101:8020(active)
start s100 again:
s100:50070------>s100:8020(standby)

 

18. Troubleshooting:

If both namenodes are standby: check the zkfc logs
    cd /data/hadoop-2.7.0/logs

 

19. Connect to ZooKeeper: zkCli.sh

   Browse: ls /hadoop-ha

   Inspect the node: get /hadoop-ha/neworigin    ----------> neworigin is the nameservice

   Disconnect: quit

 

 

 

 

Part 2: HDFS HA + MR

        NN   DN   ZK   JN   ZKFC   RM   NM
s100    1         1         1
s101    1    1    1    1    1           1
s102         1    1    1           1    1
s103         1         1           1    1

 

 

Stop the cluster above:

s100:
>stop-dfs.sh
>jps

Configure on s100:

[mapred-site.xml]

<property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
</property>

[yarn-site.xml]

<property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
</property>

 

YARN RM high availability:

<property>

  <name>yarn.resourcemanager.ha.enabled</name>

   <value>true</value>

 </property>

 <property>

  <name>yarn.resourcemanager.cluster-id</name>

   <value>cluster1</value>

 </property>

 <property>

   <name>yarn.resourcemanager.ha.rm-ids</name>

   <value>rm1,rm2</value>

 </property>

 <property>

  <name>yarn.resourcemanager.hostname.rm1</name>

   <value>s102</value>

 </property>

 <property>

  <name>yarn.resourcemanager.hostname.rm2</name>

   <value>s103</value>

 </property>

 <property>

  <name>yarn.resourcemanager.zk-address</name>

   <value>s100:2181,s101:2181,s102:2181</value>

 </property>

 

Sync all the configuration:

s100:
>xsync /data/hadoop-2.7.0/etc/hadoop/

Note: the slaves file is both the DataNode configuration and the NodeManager configuration.

 

s100:

>start-all.sh

 

s102 and s103:
>yarn-daemon.sh start resourcemanager    (start the resourcemanager separately)

s103:
>netstat -npl | grep java    (check the ports)

Test: s102:8088
      s103:8088
(the standby RM's web page automatically redirects to the active RM)
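You can also check the two RMs from the command line (rm1/rm2 are the ids configured above):

>yarn rmadmin -getServiceState rm1
>yarn rmadmin -getServiceState rm2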

 

{To start everything again:
 1. Start zk on s100, s101 and s102 together:
    zkServer.sh start
 2. s100: start-all.sh
 3. s102 and s103:
    yarn-daemon.sh start resourcemanager    (start the resourcemanager separately)}

 


1. Cluster plan:

Host     IP              Installed software          Running processes
drguo1   192.168.80.149  jdk, hadoop                 NameNode, DFSZKFailoverController(zkfc), ResourceManager
drguo2   192.168.80.150  jdk, hadoop                 NameNode, DFSZKFailoverController(zkfc), ResourceManager
drguo3   192.168.80.151  jdk, hadoop, zookeeper      DataNode, NodeManager, JournalNode, QuorumPeerMain
drguo4   192.168.80.152  jdk, hadoop, zookeeper      DataNode, NodeManager, JournalNode, QuorumPeerMain
drguo5   192.168.80.153  jdk, hadoop, zookeeper      DataNode, NodeManager, JournalNode, QuorumPeerMain


2. Preparation:

Prepare five machines: set static IPs, hostnames, and hostname-to-IP mappings, disable the firewall, install the JDK and configure the environment variables (see http://blog.csdn.net/dr_guo/article/details/50886667 if you don't know how), create the user:group, and set up passwordless SSH login (see http://blog.csdn.net/dr_guo/article/details/50967442 if you get errors).


Note: comment out the 127.0.1.1 line in /etc/hosts, otherwise jps will show a datanode but the web UI will report 0 live nodes.
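Concretely, /etc/hosts on each machine ends up looking like this sketch (addresses from the plan above):

#127.0.1.1     drguo1        <- comment this line out
192.168.80.149 drguo1
192.168.80.150 drguo2
192.168.80.151 drguo3
192.168.80.152 drguo4
192.168.80.153 drguo5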


After commenting it out everything worked. Apparently some people are fine without commenting it out; I don't know why 0.0


3. Set up the zookeeper cluster (drguo3/drguo4/drguo5)

See: the "ZooKeeper fully distributed cluster setup" post.

4. Building the Hadoop HA cluster

Download the latest Hadoop from the official site (http://apache.opencas.org/hadoop/common/stable/); the latest at the time of writing is 2.7.2. After downloading, put it under /opt/Hadoop.

guo@guo:~/下载$ mv ./hadoop-2.7.2.tar.gz /opt/Hadoop/
mv: cannot create regular file '/opt/Hadoop/hadoop-2.7.2.tar.gz': Permission denied
guo@guo:~/下载$ su root
Password:
root@guo:/home/guo/下载# mv ./hadoop-2.7.2.tar.gz /opt/Hadoop/

Unpack it:
guo@guo:/opt/Hadoop$ sudo tar -zxf hadoop-2.7.2.tar.gz
[sudo] password for guo:

When I unpacked the jdk I used tar -zxvf; the v just prints each file as it is extracted, and you can leave it out if you don't want to see that.

Change the owner (user:group) of the opt directory; I simply changed the owner/group of /opt to guo. The details were covered in the ZooKeeper cluster setup post.

root@guo:/opt/Hadoop# chown -R guo:guo /opt

Set the environment variables:

guo@guo:/opt/Hadoop$ sudo gedit /etc/profile

Add at the end (so you can run the scripts in bin/sbin without cd-ing into those directories):

#hadoop
export HADOOP_HOME=/opt/Hadoop/hadoop-2.7.2
export PATH=$PATH:$HADOOP_HOME/sbin
export PATH=$PATH:$HADOOP_HOME/bin

Then reload the configuration:

guo@guo:/opt/Hadoop$ source /etc/profile

Edit hadoop-env.sh under /opt/Hadoop/hadoop-2.7.2/etc/hadoop:

guo@guo:/opt/Hadoop$ cd hadoop-2.7.2
guo@guo:/opt/Hadoop/hadoop-2.7.2$ cd etc/hadoop/
guo@guo:/opt/Hadoop/hadoop-2.7.2/etc/hadoop$ sudo gedit ./hadoop-env.sh

Inside the file:

export JAVA_HOME=${JAVA_HOME}    # change this to the JDK path, as below
export JAVA_HOME=/opt/Java/jdk1.8.0_73

Then source the file:

guo@guo:/opt/Hadoop/hadoop-2.7.2/etc/hadoop$ source ./hadoop-env.sh

The configuration up to here is the same as single-node mode, so I just copied it.
Note: the comments inside the configuration files below are only there for you to read; delete them all when you copy-paste!!!
Edit core-site.xml:
<configuration>
    <!-- set the hdfs nameservice to ns1 -->
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://ns1/</value>
    </property>
    <!-- hadoop temp directory -->
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/opt/Hadoop/hadoop-2.7.2/tmp</value>
    </property>
    <!-- zookeeper addresses -->
    <property>
        <name>ha.zookeeper.quorum</name>
        <value>drguo3:2181,drguo4:2181,drguo5:2181</value>
    </property>
</configuration>
Edit hdfs-site.xml:

<configuration>
    <!-- set the hdfs nameservice to ns1; must match core-site.xml -->
    <property>
        <name>dfs.nameservices</name>
        <value>ns1</value>
    </property>
    <!-- ns1 has two NameNodes, nn1 and nn2 -->
    <property>
        <name>dfs.ha.namenodes.ns1</name>
        <value>nn1,nn2</value>
    </property>
    <!-- RPC address of nn1 -->
    <property>
        <name>dfs.namenode.rpc-address.ns1.nn1</name>
        <value>drguo1:9000</value>
    </property>
    <!-- http address of nn1 -->
    <property>
        <name>dfs.namenode.http-address.ns1.nn1</name>
        <value>drguo1:50070</value>
    </property>
    <!-- RPC address of nn2 -->
    <property>
        <name>dfs.namenode.rpc-address.ns1.nn2</name>
        <value>drguo2:9000</value>
    </property>
    <!-- http address of nn2 -->
    <property>
        <name>dfs.namenode.http-address.ns1.nn2</name>
        <value>drguo2:50070</value>
    </property>
    <!-- where the NameNode metadata is stored on the JournalNodes -->
    <property>
        <name>dfs.namenode.shared.edits.dir</name>
        <value>qjournal://drguo3:8485;drguo4:8485;drguo5:8485/ns1</value>
    </property>
    <!-- where the JournalNodes store data on local disk -->
    <property>
        <name>dfs.journalnode.edits.dir</name>
        <value>/opt/Hadoop/hadoop-2.7.2/journaldata</value>
    </property>
    <!-- enable automatic NameNode failover -->
    <property>
        <name>dfs.ha.automatic-failover.enabled</name>
        <value>true</value>
    </property>
    <!-- how failover is implemented -->
    <property>
        <name>dfs.client.failover.proxy.provider.ns1</name>
        <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
    </property>
    <!-- fencing methods, separated by newlines, one per line -->
    <property>
        <name>dfs.ha.fencing.methods</name>
        <value>
            sshfence
            shell(/bin/true)
        </value>
    </property>
    <!-- sshfence requires passwordless ssh -->
    <property>
        <name>dfs.ha.fencing.ssh.private-key-files</name>
        <value>/home/guo/.ssh/id_rsa</value>
    </property>
    <!-- sshfence timeout -->
    <property>
        <name>dfs.ha.fencing.ssh.connect-timeout</name>
        <value>30000</value>
    </property>
</configuration>
First rename mapred-site.xml.template to mapred-site.xml, then edit mapred-site.xml:

<configuration>
    <!-- use yarn as the mr framework -->
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>
Edit yarn-site.xml:

<configuration>
    <!-- enable RM HA -->
    <property>
        <name>yarn.resourcemanager.ha.enabled</name>
        <value>true</value>
    </property>
    <!-- RM cluster id -->
    <property>
        <name>yarn.resourcemanager.cluster-id</name>
        <value>yrc</value>
    </property>
    <!-- RM ids -->
    <property>
        <name>yarn.resourcemanager.ha.rm-ids</name>
        <value>rm1,rm2</value>
    </property>
    <!-- RM hostnames -->
    <property>
        <name>yarn.resourcemanager.hostname.rm1</name>
        <value>drguo1</value>
    </property>
    <property>
        <name>yarn.resourcemanager.hostname.rm2</name>
        <value>drguo2</value>
    </property>
    <!-- zk cluster addresses -->
    <property>
        <name>yarn.resourcemanager.zk-address</name>
        <value>drguo3:2181,drguo4:2181,drguo5:2181</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
</configuration>
Edit slaves:

drguo3
drguo4
drguo5

Copy the whole Hadoop directory to drguo2/3/4/5; before copying, delete the doc directory under share (no need to copy the documentation), which speeds things up.
5. Start the zookeeper cluster (start zookeeper on drguo3, drguo4 and drguo5)

guo@drguo3:~$ zkServer.sh start
ZooKeeper JMX enabled by default
Using config: /opt/zookeeper-3.4.8/bin/../conf/zoo.cfg
Starting zookeeper ... STARTED
guo@drguo3:~$ jps
2005 Jps
1994 QuorumPeerMain
guo@drguo3:~$ ssh drguo4
Welcome to Ubuntu 15.10 (GNU/Linux 4.2.0-16-generic x86_64)

 * Documentation:  https://help.ubuntu.com/

Last login: Fri Mar 25 14:04:43 2016 from 192.168.80.151
guo@drguo4:~$ zkServer.sh start
ZooKeeper JMX enabled by default
Using config: /opt/zookeeper-3.4.8/bin/../conf/zoo.cfg
Starting zookeeper ... STARTED
guo@drguo4:~$ jps
1977 Jps
1966 QuorumPeerMain
guo@drguo4:~$ exit
logout
Connection to drguo4 closed.
guo@drguo3:~$ ssh drguo5
Welcome to Ubuntu 15.10 (GNU/Linux 4.2.0-16-generic x86_64)

 * Documentation:  https://help.ubuntu.com/

Last login: Fri Mar 25 14:04:56 2016 from 192.168.80.151
guo@drguo5:~$ zkServer.sh start
ZooKeeper JMX enabled by default
Using config: /opt/zookeeper-3.4.8/bin/../conf/zoo.cfg
Starting zookeeper ... STARTED
guo@drguo5:~$ jps
2041 Jps
2030 QuorumPeerMain
guo@drguo5:~$ exit
logout
Connection to drguo5 closed.
guo@drguo3:~$ zkServer.sh status
ZooKeeper JMX enabled by default
Using config: /opt/zookeeper-3.4.8/bin/../conf/zoo.cfg
Mode: leader
6. Start the journalnodes (on drguo3, drguo4 and drguo5). Note this manual start is only needed the first time; afterwards, starting hdfs will include the journalnodes.

guo@drguo3:~$ hadoop-daemon.sh start journalnode
starting journalnode, logging to /opt/Hadoop/hadoop-2.7.2/logs/hadoop-guo-journalnode-drguo3.out
guo@drguo3:~$ jps
2052 Jps
2020 JournalNode
1963 QuorumPeerMain
guo@drguo3:~$ ssh drguo4
Welcome to Ubuntu 15.10 (GNU/Linux 4.2.0-16-generic x86_64)

 * Documentation:  https://help.ubuntu.com/

Last login: Fri Mar 25 00:09:08 2016 from 192.168.80.149
guo@drguo4:~$ hadoop-daemon.sh start journalnode
starting journalnode, logging to /opt/Hadoop/hadoop-2.7.2/logs/hadoop-guo-journalnode-drguo4.out
guo@drguo4:~$ jps
2103 Jps
2071 JournalNode
1928 QuorumPeerMain
guo@drguo4:~$ exit
logout
Connection to drguo4 closed.
guo@drguo3:~$ ssh drguo5
Welcome to Ubuntu 15.10 (GNU/Linux 4.2.0-16-generic x86_64)

 * Documentation:  https://help.ubuntu.com/

Last login: Thu Mar 24 23:52:17 2016 from 192.168.80.152
guo@drguo5:~$ hadoop-daemon.sh start journalnode
starting journalnode, logging to /opt/Hadoop/hadoop-2.7.2/logs/hadoop-guo-journalnode-drguo5.out
guo@drguo5:~$ jps
2276 JournalNode
2308 Jps
1959 QuorumPeerMain
guo@drguo5:~$ exit
logout
Connection to drguo5 closed.

On drguo4/5 there was a problem at startup: no journalnode appeared. The logs showed it was caused by the Chinese comments; after deleting them all on drguo4/5, the problem was solved. The pinyin input method on drguo4/5 also stopped working, which was annoying... the images were all copies, so how did they mutate?
7. Format HDFS (run on drguo1)

guo@drguo1:/opt$ hdfs namenode -format

This failed again, also because of the Chinese comments; after deleting them all on drguo1/2/3 as well, the problem was solved.
Note: after formatting, the tmp directory has to be copied to drguo2 (otherwise the namenode on drguo2 won't start):

guo@drguo1:/opt/Hadoop/hadoop-2.7.2$ scp -r tmp/ drguo2:/opt/Hadoop/hadoop-2.7.2/
8. Format ZKFC (run on drguo1)

guo@drguo1:/opt$ hdfs zkfc -formatZK

9. Start HDFS (run on drguo1)

guo@drguo1:/opt$ start-dfs.sh

10. Start YARN (run on drguo1)

guo@drguo1:/opt$ start-yarn.sh
PS:
1. The resourcemanager on drguo2 has to be started manually and separately:
yarn-daemon.sh start resourcemanager
2. The namenode and datanode can also be started individually:
hadoop-daemon.sh start namenode
hadoop-daemon.sh start datanode
3. To switch a NN from standby to active:
hdfs haadmin -transitionToActive nn1 --forcemanual
All done!!!
Just like the plan at the beginning 0.0
