Compiling Hadoop from Source, Configuration, Installation, and Testing


一、 Hadoop Architecture Overview

Hadoop is the most widely used open-source software for distributed parallel computing in the big data field; it is stable, reliable, and scalable. Hadoop uses the simple MapReduce programming model to process large data sets distributed across a cluster, and the cluster can scale to thousands or even tens of thousands of nodes. Rather than relying on expensive hardware to achieve high availability, Hadoop assumes that every machine can fail and detects and handles failures at the software layer. A Hadoop cluster provides: HDFS, a distributed file system; YARN, a job scheduling and cluster resource management system; and MapReduce, a parallel analysis and computation framework. For more details, see the official Hadoop website.

A Hadoop cluster uses a master/slave architecture with three node types: master nodes (the YARN ResourceManager and the HDFS NameNode), slave nodes (the YARN NodeManager and the HDFS DataNode), and client nodes (Hadoop Client Node). Users submit MapReduce jobs from a client node and, by interacting with the nodes of the HDFS and YARN clusters, store and retrieve files, run MapReduce jobs, and fetch the results.


二、 Compiling Hadoop from Source

1. Installing and configuring Java

$ sudo yum -y install java
$ vim /etc/profile
export JAVA_HOME=/usr/lib/java
export CLASSPATH=.:$JAVA_HOME/jre/lib/rt.jar:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar:$CLASSPATH
export PATH=$JAVA_HOME/bin:$PATH
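After editing /etc/profile, reload it in the current shell and confirm the JDK is visible (a quick sanity check, not part of the original steps):

$ source /etc/profile
$ java -version
$ echo $JAVA_HOME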

2. Installing and configuring Maven

$ cat ~/.m2/settings.xml
<settings>
    <mirrors>
        <mirror>
            <id>alimaven</id>
            <name>aliyun maven</name>
            <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
            <mirrorOf>central</mirrorOf>
        </mirror>
    </mirrors>
</settings>
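The settings.xml above only points Maven at the Aliyun mirror; it assumes Maven itself is already installed. One way to install and verify it (a sketch assuming a yum-based system; unpacking an Apache Maven binary tarball works as well):

$ sudo yum -y install maven
$ mvn -version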

3. Installing and configuring protobuf

$ tar -xzvf protobuf-2.5.0.tar.gz
$ cd protobuf-2.5.0
$ ./configure
$ make
$ make install
$ protoc --version
libprotoc 2.5.0
4. Installing other build dependencies

autoconf, automake, cmake, libtool
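On a yum-based system these can be installed in one step (a sketch; package names may differ slightly by distribution):

$ sudo yum -y install autoconf automake cmake libtool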

5. Compiling the Hadoop source

$ git clone git://git.apache.org/hadoop.git
$ cd hadoop
$ mvn clean package -Pdist,native -DskipTests -Dtar
$ ls hadoop-dist/target/
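To confirm that the native libraries requested by -Pnative actually made it into the build, you can run checknative from the generated distribution directory (the exact versioned directory name under hadoop-dist/target/ depends on the branch you built, so the path below is illustrative):

$ cd hadoop-dist/target/hadoop-3.0.0-beta1-SNAPSHOT
$ ./bin/hadoop checknative -a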


三、 Hadoop Configuration

1. Configure hostnames

$ vim /etc/hosts
192.168.11.10   DataWorks.Master
192.168.11.11   DataWorks.Node1
192.168.11.12   DataWorks.Node2
2. Create a hadoop account and set up passwordless SSH between hosts for that account
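One way to do this (a sketch; exact commands depend on your distribution, and the key exchange must be repeated so every node can reach every other node):

$ sudo useradd hadoop
$ sudo passwd hadoop
$ su - hadoop
$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
$ ssh-copy-id hadoop@DataWorks.Master
$ ssh-copy-id hadoop@DataWorks.Node1
$ ssh-copy-id hadoop@DataWorks.Node2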

3. Hadoop configuration

$ tar -zxvf hadoop-2.6.2.tar.gz
$ mv hadoop-2.6.2 hadoop

Seven configuration files are involved here:
~/hadoop/etc/hadoop/hadoop-env.sh
~/hadoop/etc/hadoop/yarn-env.sh
~/hadoop/etc/hadoop/slaves
~/hadoop/etc/hadoop/core-site.xml
~/hadoop/etc/hadoop/hdfs-site.xml
~/hadoop/etc/hadoop/mapred-site.xml
~/hadoop/etc/hadoop/yarn-site.xml
Configure the Java environment variables:

$ vim ~/hadoop/etc/hadoop/hadoop-env.sh
export JAVA_HOME=/usr/lib/java
$ vim ~/hadoop/etc/hadoop/yarn-env.sh
export JAVA_HOME=/usr/lib/java
Configure the worker node IP/hostname addresses:

$ vim ~/hadoop/etc/hadoop/slaves
DataWorks.Node1
DataWorks.Node2

Configure the HDFS master node (core-site.xml):

$ mkdir ~/hadoop/tmp
$ vim ~/hadoop/etc/hadoop/core-site.xml
<configuration>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://DataWorks.Master:9000</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/home/hadoop/hadoop/tmp</value>
    </property>
    <property>
        <name>io.file.buffer.size</name>
        <value>131702</value>
    </property>
</configuration>

Configure the NameNode, SecondaryNameNode, and DataNode settings (hdfs-site.xml):

$ vim ~/hadoop/etc/hadoop/hdfs-site.xml
<configuration>
    <property>
        <name>dfs.http.address</name>
        <value>DataWorks.Master:50070</value>
    </property>
    <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>DataWorks.Master:50090</value>
    </property>
    <property>
        <name>dfs.webhdfs.enabled</name>
        <value>true</value>
    </property>
    <property>
        <name>dfs.replication</name>
        <value>3</value>
    </property>
    <property>
        <name>dfs.block.size</name>
        <value>16777216</value>
    </property>
    <property>
        <name>dfs.datanode.max.xcievers</name>
        <value>4096</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:/home/hadoop/hadoop/dfs/name</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:/home/hadoop/hadoop/dfs/data</value>
    </property>
</configuration>

Configure MapReduce (mapred-site.xml; in some Hadoop 2.x releases this file must first be copied from mapred-site.xml.template):

$ vim ~/hadoop/etc/hadoop/mapred-site.xml
<configuration>
    <property>
        <name>mapred.job.tracker</name>
        <value>DataWorks.Master:9001</value>
    </property>
    <property>
        <name>mapred.map.tasks</name>
        <value>20</value>
    </property>
    <property>
        <name>mapred.reduce.tasks</name>
        <value>4</value>
    </property>
    <property>
        <name>mapreduce.map.memory.mb</name>
        <value>4096</value>
    </property>
    <property>
        <name>mapreduce.reduce.memory.mb</name>
        <value>8192</value>
    </property>
    <property>
        <name>mapreduce.map.java.opts</name>
        <value>-Xmx3072m</value>
    </property>
    <property>
        <name>mapreduce.reduce.java.opts</name>
        <value>-Xmx6144m</value>
    </property>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <property>
        <name>mapreduce.jobhistory.address</name>
        <value>DataWorks.Master:10020</value>
    </property>
    <property>
        <name>mapreduce.jobhistory.webapp.address</name>
        <value>DataWorks.Master:19888</value>
    </property>
</configuration>

Configure YARN (yarn-site.xml):
$ vim ~/hadoop/etc/hadoop/yarn-site.xml
<configuration>
    <!-- Site specific YARN configuration properties -->
    <property>
        <name>yarn.resourcemanager.address</name>
        <value>DataWorks.Master:8032</value>
    </property>
    <property>
        <name>yarn.resourcemanager.scheduler.address</name>
        <value>DataWorks.Master:8030</value>
    </property>
    <property>
        <name>yarn.resourcemanager.webapp.address</name>
        <value>DataWorks.Master:8088</value>
    </property>
    <property>
        <name>yarn.resourcemanager.resource-tracker.address</name>
        <value>DataWorks.Master:8031</value>
    </property>
    <property>
        <name>yarn.resourcemanager.admin.address</name>
        <value>DataWorks.Master:8033</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
        <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    </property>
    <property>
        <name>yarn.application.classpath</name>
        <value>/home/hadoop/hadoop/share/hadoop/common/*,
               /home/hadoop/hadoop/share/hadoop/common/lib/*,
               /home/hadoop/hadoop/share/hadoop/hdfs/*,
               /home/hadoop/hadoop/share/hadoop/hdfs/lib/*,
               /home/hadoop/hadoop/share/hadoop/yarn/*,
               /home/hadoop/hadoop/share/hadoop/yarn/lib/*,
               /home/hadoop/hadoop/share/hadoop/mapreduce/*,
               /home/hadoop/hadoop/share/hadoop/mapreduce/lib/*</value>
    </property>
</configuration>

四、 Installing and Running Hadoop

1. Deploy

Copy the configured hadoop directory to every node. pscp here is the parallel copy tool from the pssh package (on some systems the command is installed as pscp.pssh), and all_iplist is a file listing the cluster host addresses:

$ pscp -r -h all_iplist hadoop /home/hadoop/
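If pssh/pscp is not installed, a plain scp loop over the worker hostnames achieves the same thing (a sketch using the hostnames configured above):

$ for host in DataWorks.Node1 DataWorks.Node2; do scp -r ~/hadoop hadoop@$host:/home/hadoop/; done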

2. Run

$ cd ~/hadoop
$ ./bin/hdfs namenode -format   # format the NameNode (only needed on first start)
$ ./sbin/start-all.sh
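start-all.sh starts HDFS and YARN but not the MapReduce JobHistory Server configured in mapred-site.xml above; if you want the history web UI on DataWorks.Master:19888, start it separately (a sketch using the standard 2.x daemon script):

$ ./sbin/mr-jobhistory-daemon.sh start historyserver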

3. Verify

$ jps
185874 NameNode
201765 Jps
187541 ResourceManager
186570 SecondaryNameNode
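The output above is from the master. On each worker node, jps should additionally show a DataNode and a NodeManager process; a quick remote check (assuming passwordless SSH is set up and the JDK's bin directory is on the PATH for non-interactive shells):

$ ssh DataWorks.Node1 jps
$ ssh DataWorks.Node2 jps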

The NameNode web UI should also be reachable at http://DataWorks.Master:50070


五、 Hadoop Testing

1. HDFS functional test

$ ./bin/hdfs dfs -ls /
$ ./bin/hdfs dfs -mkdir /upload
$ ./bin/hdfs dfs -put debian-64-gshell-livecd.iso /upload/
$ ./bin/hdfs dfs -ls /upload/
Found 1 items
-rw-r--r--   3 admin supergroup  629864448 2017-07-28 09:30 /upload/debian-64-gshell-livecd.iso
$ md5sum debian-64-gshell-livecd.iso
fd82dff2ffd326ac8c44bdd799b6018b  debian-64-gshell-livecd.iso
$ rm -rf debian-64-gshell-livecd.iso
$ ./bin/hdfs dfs -get /upload/debian-64-gshell-livecd.iso .
$ md5sum debian-64-gshell-livecd.iso
fd82dff2ffd326ac8c44bdd799b6018b  debian-64-gshell-livecd.iso
$ ./bin/hdfs dfsadmin -report
2. MapReduce functional test

./bin/hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-3.0.0-beta1-SNAPSHOT.jar wordcount /upload/* /wordcount
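When the job finishes, the word counts land under /wordcount in HDFS; they can be inspected like this (part-r-00000 assumes the default single reducer):

./bin/hdfs dfs -ls /wordcount
./bin/hdfs dfs -cat /wordcount/part-r-00000 | head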

3. Benchmark performance test

This test uses Hadoop's bundled benchmark, TestDFSIO, which measures the speed and throughput of distributed HDFS reads and writes. Run the following commands in order:

# Use 6 map tasks to write 1 GB each into 6 files in HDFS
./bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-3.0.0-beta1-SNAPSHOT-tests.jar TestDFSIO -write -nrFiles 6 -size 1GB
# Use 6 map tasks to read 1 GB each back from the 6 files in HDFS
./bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-3.0.0-beta1-SNAPSHOT-tests.jar TestDFSIO -read -nrFiles 6 -size 1GB
# Clean up the data generated above
./bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-3.0.0-beta1-SNAPSHOT-tests.jar TestDFSIO -clean
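Besides printing throughput and average IO rate at the end of each run, TestDFSIO appends its results to a local log file (TestDFSIO_results.log by default) in the working directory, so the numbers can be reviewed afterwards:

cat TestDFSIO_results.log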
4. Sorting a large data set (TeraSort)
# Generate 10 million rows of test data under /teraInput
./bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.0.0-beta1-SNAPSHOT.jar teragen 10000000 /teraInput
# Sort the 10 million rows from /teraInput and write the result to /teraOutput
./bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.0.0-beta1-SNAPSHOT.jar terasort /teraInput /teraOutput
# Validate that the sorted data in /teraOutput is globally ordered (each row no greater than the next)
./bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.0.0-beta1-SNAPSHOT.jar teravalidate -D mapred.reduce.tasks=8 /teraOutput /teraValidate
# Check the validation result
./bin/hdfs dfs -cat /teraValidate/part-r-00000