Upgrading Hadoop 2.2.0 to 2.7.2

1. Install and configure a non-HA cluster
    Configured 1 master and 2 slaves; the cluster started normally and some test data was added.
2. Upgrade to a manual-failover HA cluster (consistent with the production environment)

    2.1 Configure manual-failover HDFS HA (ZooKeeper is not needed here; only automatic failover depends on ZK)
        ---backup
            cp -r /home/test/hadoop-2.2.0/etc/hadoop /home/test/hadoop-2.2.0/etc/hadoopbak
        ---core-site.xml
                <property>
                    <name>fs.defaultFS</name>
                    <value>hdfs://testcluster</value>
                </property>
    ---hdfs-site.xml
       delete: dfs.namenode.secondary.http-address
       add:  
            <property>
                    <name>dfs.nameservices</name>
                    <value>testcluster</value>
            </property>
            <property>
                    <name>dfs.ha.namenodes.testcluster</name>
                    <value>master,slave1</value>
            </property>
            <property>
                    <name>dfs.namenode.rpc-address.testcluster.master</name>
                    <value>master:9000</value>
            </property>
            <property>
                    <name>dfs.namenode.rpc-address.testcluster.slave1</name>
                    <value>slave1:9000</value>
            </property>
            <property>
                    <name>dfs.namenode.http-address.testcluster.master</name>
                    <value>master:50070</value>
            </property>
            <property>
                    <name>dfs.namenode.http-address.testcluster.slave1</name>
                    <value>slave1:50070</value>
            </property>
            <property>
                    <name>dfs.ha.automatic-failover.enabled.testcluster</name>
                    <value>false</value>
            </property>
            <property>
                    <name>dfs.namenode.shared.edits.dir</name>
                    <value>qjournal://master:8485;slave1:8485;slave2:8485/testcluster</value>
            </property>
            <property>
                    <name>dfs.client.failover.proxy.provider.testcluster</name>
                    <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
            </property>
            <property>
                    <name>dfs.ha.fencing.methods</name>
                    <value>sshfence</value>
            </property>
            <property>
                    <name>dfs.ha.fencing.ssh.private-key-files</name>
                    <value>/home/test/.ssh/id_rsa</value>
            </property>
            <property>
                    <name>dfs.journalnode.edits.dir</name>
                    <value>/data/test/tmp/journal</value>
            </property>
       ---copy the config to the other nodes:
               scp core-site.xml hdfs-site.xml slave1:/home/test/hadoop-2.2.0/etc/hadoop
               scp core-site.xml hdfs-site.xml slave2:/home/test/hadoop-2.2.0/etc/hadoop
       ---Initialize the journalnodes (run on each JN node; see the sketch below):
               hadoop-daemon.sh start journalnode
               check the JN web UI: http://master:8480/
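               A minimal sketch for doing this from master instead of logging in to each JN by hand (assumes passwordless ssh as user test and that hadoop-daemon.sh and jps are on each node's PATH):
               for host in master slave1 slave2; do
                   ssh "$host" "hadoop-daemon.sh start journalnode"
                   ssh "$host" "jps | grep JournalNode"   # confirm the process is up
               done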
       ---Format all JournalNodes (running this on the master node is enough):
               hdfs namenode -initializeSharedEdits -force
               Note: this formats all JournalNodes and copies the namenode metadata from master to every JournalNode; a quick check is sketched below.
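               A hedged check that the shared edits storage was created on every JN (the path comes from dfs.journalnode.edits.dir above; testcluster is the journal directory for our nameservice):
               for host in master slave1 slave2; do
                   ssh "$host" "ls -l /data/test/tmp/journal/testcluster/current/VERSION"
               done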
       ---Sync the namenode metadata (from master to slave1; run the command on slave1):
               hdfs namenode -bootstrapStandby
       ---Start the cluster:
                 start-all.sh
       ---After startup both namenodes are in standby state; manually transition one namenode to active:
                 hdfs haadmin -transitionToActive master
       ---Verify the NN states:
                 hdfs haadmin -getServiceState master
                 hdfs haadmin -getServiceState slave1
       ---Manual failover command:
                 hdfs haadmin -failover master slave1
3. Upgrade steps from Hadoop 2.2.0 to 2.7.2
    3.1 Prepare the 2.7.2 package, rename its etc directory, and symlink it to the existing 2.2.0 configuration:
            mv etc etcbak
            ln -s /home/test/hadoop-2.2.0/etc etc
    3.2 Stop external applications
    3.3 Back up the namenode metadata
        a. Enter safe mode: hadoop dfsadmin -safemode enter
        b. Merge the edits and save the namenode metadata: hadoop dfsadmin -saveNamespace
        c. Back it up: cp -r /data/hadoop/dfs/name /data/hadoop/dfs/name_bak
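        A sanity check of the backup before continuing (not part of the original steps; with the namenode idle in safe mode the two directories should normally match):
            diff -r /data/hadoop/dfs/name /data/hadoop/dfs/name_bak && echo "backup matches"
            du -sh /data/hadoop/dfs/name /data/hadoop/dfs/name_bak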
    3.4 Stop HDFS
        stop-all.sh
    3.5 Update the environment variables on every node
        ./dcopy /home/test/.bash_profile /home/test/.bash_profile_`date +%Y%m%d%H%M`
        ./drun "ls -la  /home/test/"
        ./drun "sed -i 's/hadoop-2.2.0/hadoop-2.7.2/g' /home/test/.bash_profile"
        ./drun "source /home/test/.bash_profile"
        ./drun "grep -i \"hadoop-2.7.2\"  /home/test/.bash_profile"
        ./drun "echo \$HADOOP_HOME"   # escape the $ so it is expanded on the remote node, not locally
    3.6 Perform the upgrade
        a. Start the journalnode on each JN node: hadoop-daemon.sh start journalnode
        b. Upgrade the namenodes:
           upgrade one namenode first:       hadoop-daemon.sh start namenode -upgrade
           then sync the other namenode:     hdfs namenode -bootstrapStandby && hadoop-daemon.sh start namenode
      c. Upgrade the datanodes:
           hadoop-daemons.sh start datanode
      c2. Adjust the Spark configuration:
                update the Hadoop-related settings in spark-env.sh (a sketch follows below)
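                A hedged example of the kind of change meant here (variable names depend on the existing spark-env.sh; SPARK_DIST_CLASSPATH is only relevant for "Hadoop-free" Spark builds):
                # spark-env.sh: point Spark at the 2.7.2 installation
                export HADOOP_HOME=/home/test/hadoop-2.7.2
                export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
                export SPARK_DIST_CLASSPATH=$(hadoop classpath)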

      d. Verify (a fuller sketch follows below):
            check data integrity:  hadoop fsck /
            spot-check some files: hadoop fs -cat .....
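            A fuller verification sketch (standard Hadoop CLI commands; the file path is a placeholder to replace with one of your own files):
            hadoop fsck / | tail -n 5            # should end with "The filesystem under path '/' is HEALTHY"
            hdfs dfsadmin -report | head -n 20   # all datanodes reporting, capacity as expected
            hadoop fs -ls /                      # top-level layout looks right
            hadoop fs -cat /path/to/known/file | head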
        e. Roll back if something goes wrong:
            Method 1 (not tried):
                    hadoop-daemon.sh start namenode -rollback
                    hadoop-daemons.sh start datanode -rollback
            Method 2:
                    ./dcopy /home/test/.bash_profile /home/test/.bash_profile_`date +%Y%m%d%H%M`
                    ./drun "ls -la  /home/test/"
                    ./drun "rm -rf /home/test/.bash_profile"
                    ./drun "mv /home/test/.bash_profile_201711162042 /home/test/.bash_profile"
                    ./drun "source /home/test/.bash_profile"
                    ./drun "grep -i \"hadoop-2.2.0\"  /home/test/.bash_profile"
                    ./drun "echo \$HADOOP_HOME"
                    mv /data/hadoop/dfs/name /data/hadoop/dfs/name_new
                    mv /data/hadoop/dfs/name_bak /data/hadoop/dfs/name
                    hadoop-daemons.sh start journalnode
                    copy the metadata to the journalnodes:
                        ./drun "rm -rf /data/test/tmp/journal/"
                        hdfs namenode -initializeSharedEdits -force
                    hadoop-daemon.sh start namenode
                    start the other NN:
                        hdfs namenode -bootstrapStandby
                        hadoop-daemon.sh start namenode
                    hadoop-daemons.sh start datanode
                    hdfs haadmin -transitionToActive master
                    hdfs haadmin -failover master slave1
      f. After the cluster has run stably for a while, finalize the upgrade (a quick check follows below):
           hdfs dfsadmin -finalizeUpgrade
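           A hedged check that finalization completed: after finalizeUpgrade the "previous" directories under the name/data dirs are removed (datanodes finalize asynchronously, so this may take a little while):
           ./drun "find /data/hadoop/dfs -maxdepth 4 -type d -name previous"   # expect no output once finalization is done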
      g. Restart the cluster:
         stop-dfs.sh
         start-all.sh
         hdfs haadmin -transitionToActive master
         hdfs haadmin -failover master slave1



FAQ:

Q1: After the upgrade, the HDFS monitoring page reports an error

Opening the active namenode's page gives an error (screenshot not reproduced here); the standby node's page is fine.

Entering the following URL directly works fine:

http://192.168.130.136:50070/dfshealth.html#tab-overview

The cause is not yet clear.

Q2: During rollback, the journalnode startup log reports errors

2017-11-17 10:50:27,513 INFO org.apache.hadoop.hdfs.server.common.Storage: Cannot lock storage /data/test/tmp/journal/testcluster. The directory is already locked
2017-11-17 10:50:27,514 ERROR org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:test (auth:SIMPLE) cause:java.io.IOException: Cannot lock storage /data/test/tmp/journal/testcluster. The directory is already locked
2017-11-17 10:50:27,514 INFO org.apache.hadoop.ipc.Server: IPC Server handler 0 on 8485, call org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocol.getEditLogManifest from 192.168.130.136:50636 Call#26 Retry#0: error: java.io.IOException: Cannot lock storage /data/test/tmp/journal/testcluster. The directory is already locked
java.io.IOException: Cannot lock storage /data/test/tmp/journal/testcluster. The directory is already locked
        at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.lock(Storage.java:637)
        at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.analyzeStorage(Storage.java:460)
        at org.apache.hadoop.hdfs.qjournal.server.JNStorage.analyzeStorage(JNStorage.java:193)
        at org.apache.hadoop.hdfs.qjournal.server.JNStorage.<init>(JNStorage.java:73)
        at org.apache.hadoop.hdfs.qjournal.server.Journal.<init>(Journal.java:140)
        at org.apache.hadoop.hdfs.qjournal.server.JournalNode.getOrCreateJournal(JournalNode.java:83)
        at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.getEditLogManifest(JournalNodeRpcServer.java:181)
        at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.getEditLogManifest(QJournalProtocolServerSideTranslatorPB.java:203)
        at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:17453)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2048)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2044)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2042)

Fix:

Delete the journalnode directory: ./drun "rm -rf /data/test/tmp/journal/"

Copy the metadata to the journalnodes: hdfs namenode -initializeSharedEdits -force

 

Q3: During rollback, datanode startup reports cluster ID / version problems:

org.apache.hadoop.hdfs.server.common.Storage: Lock on /data/hadoop/dfs/data/in_use.lock acquired by nodename 11414@master
java.io.IOException: Incompatible clusterIDs in /data/hadoop/dfs/data: namenode clusterID = CID-be66f5bb-6419-45c5-b95a-7681be449e15; datanode clusterID =
        at org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:808)
        at org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:280)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:222)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:664)
        at java.lang.Thread.run(Thread.java:745)

 

Fix:

Look up the current clusterID in the namenode metadata directory, then run:

./drun "sed -i 's/clusterID=/clusterID=CID-be66f5bb-6419-45c5-b95a-7681be449e15/g' /data/hadoop/dfs/data/current/VERSION"
./drun "cat /data/hadoop/dfs/data/current/VERSION"

After applying the clusterID fix, the following error appears next:

FATAL org.apache.hadoop.hdfs.server.datanode.DataNode: Initialization failed for block pool Block pool BP-1638541588-192.168.130.136-1510661480810 (storage id DS1875847850) service to master/192.168.130.136:9000
org.apache.hadoop.hdfs.server.common.IncorrectVersionException: Unexpected version of storage directory /data/hadoop/dfs/data/current/BP-1638541588-192.168.130.136-1510661480810. Reported: -56. Expecting = -47.
        at org.apache.hadoop.hdfs.server.common.Storage.setLayoutVersion(Storage.java:1082)
        at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.setFieldsFromProperties(BlockPoolSliceStorage.java:217)
        at org.apache.hadoop.hdfs.server.common.Storage.readProperties(Storage.java:921)
        at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.doTransition(BlockPoolSliceStorage.java:244)
        at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.recoverTransitionRead(BlockPoolSliceStorage.java:145)
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:234)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:837)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:808)
        at org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:280)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:222)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:664)
        at java.lang.Thread.run(Thread.java:745)

 

Fix:

./drun "cat /data/hadoop/dfs/data/current/BP-1638541588-192.168.130.136-1510661480810/current/VERSION"
./drun "cp /data/hadoop/dfs/data/current/BP-1638541588-192.168.130.136-1510661480810/current/VERSION /data/hadoop/dfs/data/current/BP-1638541588-192.168.130.136-1510661480810/current/VERSION-bak1"
./drun "sed -i 's/layoutVersion=-56/layoutVersion=-47/g' /data/hadoop/dfs/data/current/BP-1638541588-192.168.130.136-1510661480810/current/VERSION"

After that, the datanode still reports an error:

2017-11-17 15:06:35,405 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: For namenode master/192.168.130.136:9000 using DELETEREPORT_INTERVAL of 300000 msec  BLOCKREPORT_INTERVAL of 21600000msec Initial delay: 0msec; heartBeatInterval=3000
2017-11-17 15:06:35,405 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in BPOfferService for Block pool BP-1638541588-192.168.130.136-1510661480810 (storage id DS347578212) service to master/192.168.130.136:9000
java.lang.NullPointerException
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:439)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:525)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:676)
        at java.lang.Thread.run(Thread.java:745)

 

This is because the cTime in the datanode's VERSION file does not match the namenode's cTime; updating it to match fixes the problem:

./drun "sed -i 's/cTime=1510838012825/cTime=0/g' /data/hadoop/dfs/data/current/BP-1638541588-192.168.130.136-1510661480810/current/VERSION"

Q4: Errors inside Hive

Caused by: java.lang.reflect.InvocationTargetException
       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
       at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
       at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
       at java.lang.reflect.Method.invoke(Method.java:601)
       at org.apache.hive.common.util.ReflectionUtil.setJobConf(ReflectionUtil.java:112)
       ... 21 more
Caused by: java.lang.IllegalArgumentException: Compression codec com.hadoop.compression.lzo.LzoCodec not found.
       at org.apache.hadoop.io.compress.CompressionCodecFactory.getCodecClasses(CompressionCodecFactory.java:139)
       at org.apache.hadoop.io.compress.CompressionCodecFactory.<init>(CompressionCodecFactory.java:179)
       at org.apache.hadoop.mapred.TextInputFormat.configure(TextInputFormat.java:45)
       ... 26 more
Caused by: java.lang.ClassNotFoundException: Class com.hadoop.compression.lzo.LzoCodec not found
       at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2101)
       at org.apache.hadoop.io.compress.CompressionCodecFactory.getCodecClasses(CompressionCodecFactory.java:132)
       ... 28 more

 

Cause: the LZO libraries cannot be found under hadoop-2.7.2.

Copying the LZO jar and native libraries across fixes it:

./drun "cp /home/hadoop/hadoop-2.2.0/share/hadoop/common/hadoop-lzo-0.4.21-SNAPSHOT.jar /home/hadoop/hadoop-2.7.2/share/hadoop/common/"
./drun "cp /home/hadoop/hadoop-2.2.0/lib/native/libgplcompression* /home/hadoop/hadoop-2.7.2/lib/native/"

 

 

Q5: After enabling HA, running HQL fails:

create table test_data_tmp2 as select * from test_data_tmp1;

 

Moving data to: hdfs://bis-newdatanode-s2b-80:9000/user/hive/warehouse/test.db/.hive-staging_hive_2017-11-20_21-12-31_880_1843842567826987200-1/-ext-10001
Failed with exception Wrong FS: hdfs://bis-newdatanode-s2b-80:9000/user/hive/warehouse/test.db/.hive-staging_hive_2017-11-20_21-12-31_880_1843842567826987200-1/-ext-10003, expected: hdfs://testcluster
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MoveTask
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1   Cumulative CPU: 4.78 sec   HDFS Read: 10883134 HDFS Write: 10880084 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 780 msec

 

Fix:

This is a Hive metastore issue. Two tables in the metastore database matter here:

DBS  : the root location of each Hive database (the warehouse path)

SDS  : the location of each Hive table

They still store the old HDFS URI; replace it with the HA nameservice alias:

update DBS set DB_LOCATION_URI=REPLACE(DB_LOCATION_URI,'bis-newdatanode-s2b-80:9000','testcluster');
update SDS set LOCATION=REPLACE(LOCATION,'bis-newdatanode-s2b-80:9000','testcluster');
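It is worth backing up and spot-checking these tables around the update; a hedged sketch assuming a MySQL metastore database named hive (adjust the database name and credentials to your environment):

mysqldump -u root -p hive DBS SDS > hive_dbs_sds_bak.sql
mysql -u root -p hive -e "SELECT DB_LOCATION_URI FROM DBS LIMIT 5; SELECT LOCATION FROM SDS LIMIT 5;"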