CentOS 6 + Hadoop 2.2.0: Complete Installation and Configuration Guide (2014-02-22)



I. Environment
CentOS 6.3 64-bit
JDK 1.7.0_51
hadoop-2.2.0-src (compiled for 64-bit)


Prerequisites for building Hadoop 2.2.0 from source (quoted from the project's BUILDING.txt):
Build instructions for Hadoop
----------------------------------------------------------------------------------
Requirements:
* Unix System
* JDK 1.6+
* Maven 3.0 or later
* Findbugs 1.3.9 (if running findbugs)
* ProtocolBuffer 2.5.0
* CMake 2.6 or newer (if compiling native code)
* Internet connection for first build (to fetch all Maven and Hadoop dependencies)
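Once these tools are installed (the steps below cover the JDK, Maven, protoc and CMake), a quick sanity check before starting the build is to print each one's version; a minimal sketch, assuming they are all on PATH:
# Sketch: verify the build prerequisites are installed and on PATH
java -version
mvn -version
protoc --version     # should report libprotoc 2.5.0
cmake --version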




II. Cluster layout: three machines, one running the NameNode, SecondaryNameNode and ResourceManager, and two running the DataNode and NodeManager:
192.168.100.211 master211
192.168.100.212 slave212
192.168.100.213 slave213
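These hostnames have to resolve on every node. If DNS does not cover them, one option (an assumption, not part of the original steps) is to add them to /etc/hosts on each machine:
# Run as root on every node so the cluster hostnames resolve
cat >> /etc/hosts <<'EOF'
192.168.100.211 master211
192.168.100.212 slave212
192.168.100.213 slave213
EOF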
 


III. Installation
The native libraries in the official Hadoop binary release are built for 32-bit systems; to run properly on a 64-bit system you must download the source and compile it yourself.
All nodes in the cluster need identical software and configuration, so all of the steps below are performed on master211 and the results are then copied to the other two slaves.


1. Install the JDK required to run Hadoop
Official download: http://download.oracle.com/otn-pub/java/jdk/7u51-b13/jdk-7u51-linux-x64.tar.gz
#tar xzvf jdk-7u51-linux-x64.gz 
#cp -r jdk1.7.0_51/  /usr/local/
#vim /etc/profile
 61 export JAVA_HOME=/usr/local/jdk1.7.0_51
 62 export PATH=$PATH:$JAVA_HOME/bin
#source /etc/profile
[root@master211 hadoop-2.2.0]# java -version
java version "1.7.0_51"
Java(TM) SE Runtime Environment (build 1.7.0_51-b13)
Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode)


If the reported version is not the one you expect, another JDK is probably already installed on the system; remove it:
[root@master211 hadoop-2.2.0]# yum list installed|grep java            
java-1.6.0-openjdk.x86_64
java-1.6.0-openjdk-devel.x86_64
tzdata-java.noarch      2012c-1.el6     @anaconda-CentOS-201207061011.x86_64/6.3
[root@master211 hadoop-2.2.0]# yum erase java-1.6.0-openjdk java-1.6.0-openjdk-devel -y


[root@master211 hadoop-2.2.0]# java -version      
java version "1.7.0_51"
Java(TM) SE Runtime Environment (build 1.7.0_51-b13)
Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode)




2. Install Maven, the Java build tool used to compile Hadoop
See http://maven.oschina.net/help.html for more on installing and using Maven.


#wget http://mirror.esocc.com/apache/maven/maven-3/3.1.1/binaries/apache-maven-3.1.1-bin.tar.gz
#tar xzvf apache-maven-3.1.1-bin.tar.gz 
#cp -r apache-maven-3.1.1 /usr/local/maven-3.1.1
#vim /etc/profile
 60 ###for hadoop
 61 export MAVEN_HOME=/usr/local/maven-3.1.1
 62 export PATH=$PATH:$MAVEN_HOME/bin
or: export PATH=/usr/local/apache-maven-3.x.y/bin:$PATH


[root@master211 hadoop-2.2.0]# mvn -version
Apache Maven 3.1.1 (0728685237757ffbf44136acec0402957f723d9a; 2013-09-17 23:22:22+0800)
Maven home: /usr/local/maven-3.1.1
Java version: 1.7.0_51, vendor: Oracle Corporation
Java home: /usr/local/jdk1.7.0_51/jre
Default locale: en_US, platform encoding: UTF-8
OS name: "linux", version: "2.6.32-279.el6.x86_64", arch: "amd64", family: "unix"


Configure Maven to use the OSChina mirror (hosted in China) to speed up the build.
# vim /usr/local/maven-3.1.1/conf/settings.xml
(1) Point Maven's mirror at the OSChina Maven mirror:
    <mirror>
        <id>nexus-osc</id>
        <mirrorOf>*</mirrorOf>
        <name>Nexus osc</name>
        <url>http://maven.oschina.net/content/groups/public/</url>
    </mirror>




(2) Maven also downloads plugins while it runs; point the plugin repositories at the OSChina mirror as well:
<profile>
    <id>jdk-1.7.0_51</id>


    <activation>
        <jdk>1.7.0_51</jdk>
    </activation>
    <repositories>
        <repository>
            <id>nexus</id>
            <name>local private nexus</name>
            <url>http://maven.oschina.net/content/groups/public/</url>
            <releases>
                <enabled>true</enabled>
            </releases>
            <snapshots>
                <enabled>false</enabled>
            </snapshots>
        </repository>
    </repositories>
    <pluginRepositories>
        <pluginRepository>
            <id>nexus</id>
            <name>local private nexus</name>
            <url>http://maven.oschina.net/content/groups/public/</url>
            <releases>
                <enabled>true</enabled>
            </releases>
            <snapshots>
                <enabled>false</enabled>
            </snapshots>
        </pluginRepository>
    </pluginRepositories>
</profile>






3. Build and install 64-bit Hadoop


http://mirrors.hust.edu.cn/apache/hadoop/common/hadoop-2.2.0/
File Name                      File Size    Date
hadoop-2.2.0-src.tar.gz        19492395     07-Oct-2013 06:46
hadoop-2.2.0-src.tar.gz.mds    1116         07-Oct-2013 06:46
hadoop-2.2.0.tar.gz            109229073    07-Oct-2013 06:46
hadoop-2.2.0.tar.gz.mds        958          07-Oct-2013 06:47


Mirror at Beijing Institute of Technology (China): http://mirror.bit.edu.cn/apache/hadoop/core/hadoop-2.2.0/




#tar xzvf hadoop-2.2.0-src.tar.gz
#cd /root/hadoop-2.2.0/hadoop-2.2.0-src


Run a clean install first (skipping tests):
#mvn clean install -DskipTests


[ERROR] Failed to execute goal org.apache.hadoop:hadoop-maven-plugins:2.2.0:protoc (compile-protoc) on project hadoop-common: org.apache.maven.plugin.MojoExecutionException: 'protoc --version' did not return a version -> [Help 1]
[ERROR] 
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR] 
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException
[ERROR] 
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR]   mvn <goals> -rf :hadoop-common


The error says protoc must be installed. What is protoc?
Protocol Buffers (Google Protocol Buffers) is a compact and efficient format for serializing structured data, well suited to data storage and RPC payloads. It is a language-neutral, platform-neutral, extensible serialization format for communication protocols, data storage, and more, with official APIs for C++, Java, and Python.


Install protobuf:
#wget https://protobuf.googlecode.com/files/protobuf-2.5.0.tar.gz
#tar xzvf protobuf-2.5.0.tar.gz
#cd protobuf-2.5.0
#./configure
#make
#make install
[root@master211 protobuf-2.5.0]# protoc  --version
libprotoc 2.5.0
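If protoc fails with an error about a missing libprotoc shared library instead of printing a version, the dynamic linker usually just does not know about /usr/local/lib yet; a hedged fix, assuming protobuf was installed with the default /usr/local prefix:
# Only needed if protoc cannot find libprotoc.so.* after 'make install'
echo "/usr/local/lib" > /etc/ld.so.conf.d/protobuf.conf
ldconfig
protoc --version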




Building Hadoop also requires OpenSSL, CMake, and ncurses support:
yum install cmake
yum install openssl-devel
yum install ncurses-devel




Continue the build and create a binary distribution with native code and without documentation:
#mvn package -Pdist,native -DskipTests -Dtar


The build output is under hadoop-dist/target/hadoop-2.2.0:
# ls hadoop-dist/target/hadoop-2.2.0
bin  etc  include  lib  libexec  sbin  share


Check the built native libraries, for example to confirm they are 64-bit:
# file hadoop-dist/target/hadoop-2.2.0/lib/native/* 
hadoop-dist/target/hadoop-2.2.0/lib/native/libhadoop.a:        current ar archive
hadoop-dist/target/hadoop-2.2.0/lib/native/libhadooppipes.a:   current ar archive
hadoop-dist/target/hadoop-2.2.0/lib/native/libhadoop.so:       symbolic link to `libhadoop.so.1.0.0'
hadoop-dist/target/hadoop-2.2.0/lib/native/libhadoop.so.1.0.0: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, not stripped
hadoop-dist/target/hadoop-2.2.0/lib/native/libhadooputils.a:   current ar archive
hadoop-dist/target/hadoop-2.2.0/lib/native/libhdfs.a:          current ar archive
hadoop-dist/target/hadoop-2.2.0/lib/native/libhdfs.so:         symbolic link to `libhdfs.so.0.0.0'
hadoop-dist/target/hadoop-2.2.0/lib/native/libhdfs.so.0.0.0:   ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, not stripped




# cd  hadoop-dist/target/hadoop-2.2.0
# ./bin/hadoop version
Hadoop 2.2.0
Subversion Unknown -r Unknown
Compiled by root on 2014-02-25T15:55Z
Compiled with protoc 2.5.0
From source with checksum 79e53ce7994d1628b240f09af91e1af4
This command was run using /root/hadoop-2.2.0/hadoop-2.2.0-src/hadoop-dist/target/hadoop-2.2.0/share/hadoop/common/hadoop-common-2.2.0.jar




# cd /root/hadoop-2.2.0/hadoop-2.2.0-src/hadoop-dist/target
# cp -r hadoop-2.2.0 /usr/local/
#chown -R hdfs:hdfs /usr/local/hadoop-2.2.0/
(Note: the hdfs user is created in section IV below; either create it first or run this chown after that step.)


Add Hadoop environment variables (to /etc/profile, as before):
export HADOOP_HOME=/usr/local/hadoop-2.2.0
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin


This completes building and installing Hadoop on the master.




IV. Hadoop cluster configuration and startup
Create a dedicated cluster user, hdfs (the same user must exist on every node):
#useradd hdfs
#passwd  hdfs


Set up one-way SSH trust (passwordless login) from the master to all slaves.
Switch to the hdfs user and run:
$ ssh-keygen
$ ssh-copy-id -i ~/.ssh/id_rsa.pub hdfs@localhost
$ ssh-copy-id -i ~/.ssh/id_rsa.pub hdfs@slave212
$ ssh-copy-id -i ~/.ssh/id_rsa.pub hdfs@slave213
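Before moving on, it is worth confirming that passwordless login really works; a quick check run as the hdfs user on master211:
# Each command should print the remote hostname without asking for a password
for host in localhost slave212 slave213; do
    ssh hdfs@$host hostname
done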




Define and create the following directories (they need to exist on every node in the cluster):
/usr/local/hadoop-2.2.0/tmp
HDFS data and image/edit files:
/usr/local/hadoop-2.2.0/dfs/data
/usr/local/hadoop-2.2.0/dfs/name
MapReduce data:
/usr/local/hadoop-2.2.0/mapred/local
/usr/local/hadoop-2.2.0/mapred/system
[hdfs@master211 ~]$ mkdir -p /usr/local/hadoop-2.2.0/tmp
[hdfs@master211 ~]$ mkdir -p /usr/local/hadoop-2.2.0/dfs/name
[hdfs@master211 ~]$ mkdir -p /usr/local/hadoop-2.2.0/dfs/data
[hdfs@master211 ~]$ mkdir -p /usr/local/hadoop-2.2.0/mapred/local
[hdfs@master211 ~]$ mkdir -p /usr/local/hadoop-2.2.0/mapred/system
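Equivalently, the whole tree can be created in one command with brace expansion (a convenience, not part of the original steps):
# One-shot equivalent of the mkdir commands above
mkdir -p /usr/local/hadoop-2.2.0/{tmp,dfs/{name,data},mapred/{local,system}}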






Edit the cluster configuration files
All of the following files live under $HADOOP_HOME/etc/hadoop.
1. hadoop-env.sh: set JAVA_HOME, e.g.:
# The java implementation to use.
export JAVA_HOME=/usr/local/jdk1.7.0_51




2. yarn-env.sh: set JAVA_HOME here as well:
 export JAVA_HOME=/usr/local/jdk1.7.0_51


3. core-site.xml (remember that the tmp directory must be created by hand):
 <!-- first configure
        <property>
             <name>hadoop.tmp.dir</name>
             <value>/usr/local/hadoop-2.2.0/tmp</value>
        </property>
        <property>
            <name>fs.default.name</name>
            <value>hdfs://master211:9000</value>    
        </property>
 -->


<!-- The new property fs.defaultFS replaces the old fs.default.name -->
<property>
    <name>fs.defaultFS</name>
    <value>hdfs://master211:9000</value>
    <description>The name of the default file system.</description>
</property>


<property>
    <name>hadoop.tmp.dir</name>
    <!-- Make sure this directory has been created -->
    <value>/usr/local/hadoop-2.2.0/tmp</value>
    <description>A base for other temporary directories.</description>
</property>




4. mapred-site.xml
<!-- first configure
        <property>
             <name>mapred.job.tracker</name>
             <value>master211:9001</value>
        </property>
-->


<property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
        <final>true</final>
</property>




5. hdfs-site.xml
<!-- first configure
        <property>
             <name>dfs.replication</name>
             <value>2</value>
        </property>
-->


<property>
    <name>dfs.replication</name>
    <!-- Replication factor; it should not exceed the number of DataNodes (2 here) -->
    <value>2</value>
</property>
<property>
    <name>dfs.namenode.name.dir</name>
    <!-- Make sure this directory has been created -->
    <value>file:/usr/local/hadoop-2.2.0/dfs/name</value>
    <final>true</final>
</property>
<property>
    <name>dfs.datanode.data.dir</name>
    <!-- Make sure this directory has been created -->
    <value>file:/usr/local/hadoop-2.2.0/dfs/data</value>
</property>




6. yarn-site.xml
<property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
</property>
<property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>


<!-- ResourceManager hostname or IP address -->
<property>
    <name>yarn.resourcemanager.hostname</name>
    <value>master211</value>
</property>






7. slaves (one DataNode/NodeManager host per line):
slave212
slave213




Copy the whole Hadoop directory, the JDK, and /etc/profile to the other two machines. (The JDK must end up in /usr/local on the slaves so that it matches JAVA_HOME in the copied /etc/profile.)
scp -r /usr/local/hadoop-2.2.0 root@slave212:/usr/local/
scp -r /usr/local/hadoop-2.2.0 root@slave213:/usr/local/


scp -r /usr/local/jdk1.7.0_51 root@slave212:/usr/local/
scp -r /usr/local/jdk1.7.0_51 root@slave213:/usr/local/


scp /etc/profile root@slave212:/etc/
scp /etc/profile root@slave213:/etc/
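After the copy, the Hadoop tree on each slave is owned by root, but the daemons will run as hdfs, so ownership should be fixed on the slaves too. A sketch, assuming root SSH access to the slaves:
# Run as root on the master: fix ownership of the copied tree on each slave
for host in slave212 slave213; do
    ssh root@$host "chown -R hdfs:hdfs /usr/local/hadoop-2.2.0"
done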




Format the Hadoop file system:
[hdfs@master211 hadoop-2.2.0]$ hdfs namenode -format
Start HDFS:
[hdfs@master211 hadoop-2.2.0]$ start-dfs.sh
Start YARN:
[hdfs@master211 hadoop-2.2.0]$ start-yarn.sh
[hdfs@master211 hadoop-2.2.0]$ jps
32570 Jps
32193 SecondaryNameNode
32327 ResourceManager
32019 NameNode


[root@slave212 hdfs]# jps
14163 DataNode
14267 NodeManager
14367 Jps




Check that DFS is healthy:
$ hdfs dfsadmin -report




To stop Hadoop, run the following commands in order:
$stop-yarn.sh
$stop-dfs.sh




To keep a record of all submitted jobs, the JobHistoryServer needs to be running.
Start the JobHistoryServer:
mr-jobhistory-daemon.sh start historyserver
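To check that it is running, and to stop it later, the stop subcommand mirrors the start subcommand:
# Confirm the JobHistoryServer process is up, then stop it when no longer needed
jps | grep JobHistoryServer
mr-jobhistory-daemon.sh stop historyserver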




V. Cluster testing
Creating a new directory on HDFS is a further check that HDFS is working:
hdfs dfs -mkdir /xxx  
hdfs dfs -ls /




Some common HDFS commands, in preparation for the wordcount demo:
[hdfs@master211 logs]$ hdfs dfs -ls /
[hdfs@master211 logs]$ hdfs dfs -mkdir -p /user/laijingli/wordcount/input
[hdfs@master211 logs]$ hdfs dfs -ls /
Found 1 items
drwxr-xr-x   - hdfs supergroup          0 2014-02-27 17:54 /user




Walking through a job run: the wordcount example
Create the following three files locally:
[hdfs@master211 logs]$ ls /home/hdfs/       
1.txt  2.txt  3.txt
[hdfs@master211 logs]$ cat /home/hdfs/1.txt 
welcome to Hadoop 
This product includes software developed by The Apache Software
[hdfs@master211 logs]$ cat /home/hdfs/2.txt  
welcome to BigData
The Apache Hadoop project contains subcomponents with separate copyright
[hdfs@master211 logs]$ cat /home/hdfs/3.txt  
welcome to Spark 
Licensed under the Apache License, Version 2.0


Upload the three files to HDFS:
[hdfs@master211 logs]$ hdfs dfs -put /home/hdfs/*.txt /user/laijingli/wordcount/input
[hdfs@master211 logs]$ hdfs dfs -ls /user/laijingli/wordcount/input
Found 3 items
-rw-r--r--   2 hdfs supergroup         83 2014-02-27 17:55 /user/laijingli/wordcount/input/1.txt
-rw-r--r--   2 hdfs supergroup         92 2014-02-27 17:55 /user/laijingli/wordcount/input/2.txt
-rw-r--r--   2 hdfs supergroup         65 2014-02-27 17:55 /user/laijingli/wordcount/input/3.txt
[hdfs@master211 logs]$ 


Run the wordcount job:
[hdfs@master211 logs]$ hadoop jar /usr/local/hadoop-2.2.0/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar wordcount /user/laijingli/wordcount/input /user/laijingli/wordcount/output


The wordcount job has now finished; view its output with the following commands:
[hdfs@master211 logs]$ hdfs dfs -ls /user/laijingli/wordcount/output
Found 2 items
-rw-r--r--   2 hdfs supergroup          0 2014-02-27 17:59 /user/laijingli/wordcount/output/_SUCCESS
-rw-r--r--   2 hdfs supergroup        243 2014-02-27 17:59 /user/laijingli/wordcount/output/part-r-00000
[hdfs@master211 logs]$ hdfs dfs -cat /user/laijingli/wordcount/output/part-r-00000
2.0     1
Apache  3
BigData 1
Hadoop  2
License,        1
Licensed        1
Software        1
Spark   1
The     2
This    1
Version 1
by      1
contains        1
copyright       1
developed       1
includes        1
product 1
project 1
separate        1
software        1
subcomponents   1
the     1
to      3
under   1
welcome 3
with    1




Check the job's progress and status in the web UI:
http://master211:8088/cluster




Submit another test job, the pi example; the last two arguments are the number of map tasks and the number of samples per map:
[hdfs@master211 logs]$ hadoop jar /usr/local/hadoop-2.2.0/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar pi 2 2






Testing local (standalone) mode
Out of the box Hadoop is configured for local (standalone) mode, so an unmodified extraction can be tested locally without changing any configuration.
Create a local directory:
mkdir /home/hdfs/inputtest/
Populate it with data:
cp /usr/local/hadoop-2.2.0/etc/hadoop/*.xml /home/hdfs/inputtest/
Run Hadoop:
hadoop jar /usr/local/hadoop-2.2.0/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar grep inputtest outputtest 'dfs[a-z.]+'
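Assuming the job really ran in standalone mode (with the cluster configuration above in place, the relative paths would instead resolve on HDFS), the output is an ordinary local directory and can be read directly:
# View the local-mode grep result (output lands in ./outputtest in standalone mode)
cat outputtest/*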




(Results screenshot omitted.)




VI. Monitoring the Hadoop cluster


NameNode RPC server address: master211/192.168.100.211:9000


Web UI addresses:
NameNode           http://192.168.100.211:50070/
SecondaryNameNode  http://192.168.100.211:50090/
ResourceManager    http://192.168.100.211:8088/
JobHistoryServer   http://192.168.100.211:19888/
NodeManager        http://192.168.100.212:8042/




VII. Problems encountered and solutions
1. Keep cluster clocks in sync
Synchronize the cluster clocks periodically (and at boot, via /etc/rc.local below):
[root@master211 ~]# crontab -l
##sync hadoop cluster time
0 * * * * /usr/sbin/ntpdate pool.ntp.org >>/root/ntp.log;/sbin/hwclock -w


[root@master211 ~]# cat /etc/rc.local    
#!/bin/sh
/usr/sbin/ntpdate pool.ntp.org >>/root/ntp.log;/sbin/hwclock -w




2. A closer look: risk of data loss
2014-02-26 00:54:19,078 WARN org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Only one image storage directory 
(dfs.namenode.name.dir) configured. Beware of dataloss due to lack of redundant storage directories!
2014-02-26 00:54:19,078 WARN org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Only one namespace edits storage 
directory (dfs.namenode.edits.dir) configured. Beware of dataloss due to lack of redundant storage directories!


The fix is to list more than one directory (comma-separated) in dfs.namenode.name.dir (and similarly dfs.datanode.data.dir for block data), each on a separate physical disk or an NFS mount, so the NameNode metadata has redundant copies; for example, see the sketch below.
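A hedged sketch of what redundant NameNode metadata directories could look like in hdfs-site.xml; the second path, /data2/dfs/name, is a hypothetical extra disk or NFS mount, and both directories must exist:
<property>
    <name>dfs.namenode.name.dir</name>
    <!-- Comma-separated list; file:/data2/dfs/name is a hypothetical second mount -->
    <value>file:/usr/local/hadoop-2.2.0/dfs/name,file:/data2/dfs/name</value>
</property>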




3. NodeManager memory warning: the test machines are VMs with only 1 GB of RAM, while the default NodeManager container allocation is 8 GB, so YARN warns that the configured memory exceeds 80% of the physical memory available.
2014-02-27 17:37:44,293 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: The Auxilurary Service named 'mapreduce_shuffle' in the configuration is for class class org.apache.hadoop.mapred.ShuffleHandler which has a name of 'httpshuffle'. Because these are not the same tools trying to send ServiceData and read Service Meta Data may have issues unless the refer to the name in the config.
2014-02-27 17:37:44,390 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: NodeManager configured with 8 G physical memory allocated to containers, which is more than 80% of the total physical memory available (996.8 M). Thrashing might happen.
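On a 1 GB VM the warning can be avoided by telling YARN how much memory the NodeManager may actually hand out to containers; a hedged example for yarn-site.xml, where 768 MB and 256 MB are illustrative values for this small VM, not values from the original text:
<property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <!-- Memory (MB) available to containers on this node; 768 is an assumed value for a 1 GB VM -->
    <value>768</value>
</property>
<property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <!-- Smallest container the scheduler will allocate; 256 is likewise an assumed value -->
    <value>256</value>
</property>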




4. Required directory does not exist
2014-02-26 00:54:20,384 WARN org.apache.hadoop.hdfs.server.common.Storage: Storage directory /usr/local/hadoop-2.2.0/tmp/dfs/name does not exist
2014-02-26 00:54:20,401 FATAL org.apache.hadoop.hdfs.server.namenode.NameNode: Exception in namenode join
org.apache.hadoop.hdfs.server.common.InconsistentFSStateException: Directory /usr/local/hadoop-2.2.0/tmp/dfs/name is in an inconsistent state: storage directory does not exist or is not accessible.
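The cause here is that the hadoop.tmp.dir tree did not exist when the NameNode started (or the NameNode was never formatted). A hedged recovery sketch for a fresh cluster only, since re-formatting wipes all HDFS metadata:
# Create the missing base directory, then re-format the NameNode (fresh cluster only!)
mkdir -p /usr/local/hadoop-2.2.0/tmp
hdfs namenode -format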










References:
Official cluster setup documentation: http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/ClusterSetup.html
http://blog.csdn.net/licongcong_0224/article/details/12972889
http://www.micmiu.com/bigdata/hadoop/hadoop2x-cluster-setup/
http://blog.csdn.net/codepeak/article/details/13170147