Pro Hadoop (4) - Getting to Know Hadoop - Installing Hadoop


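1.1 Installing Hadoop

Like any other software, Hadoop has some prerequisites. If you install Cygwin, it is also possible to develop and run Hadoop applications on Windows. However, we strongly recommend Linux as the production platform for running Hadoop.

Note that you need basic knowledge of Linux and Java to use Hadoop. The examples in this book are launched with Bash scripts.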

1.1.1 Prerequisites

The examples in this book were run in the following environment:

Fedora 8
Sun Java 6
Hadoop 0.19.0 or later

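Hadoop versions earlier than 0.18.2 make far less use of Java generics, so the book's examples will not compile against them. Java versions earlier than 1.6 do not support all of the language features that Hadoop Core requires. In addition, Hadoop Core appears to behave best on the Sun JDK; users of other vendors' JDKs regularly ask for help. The examples in the later chapters of this book are based on Hadoop 0.19.0, which requires JDK 1.6.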

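Hadoop runs on any modern Linux distribution. I prefer the Red Hat Package Manager (RPM) used by Red Hat, Fedora, and CentOS, so the examples in this book follow an RPM-based installation.

The Fedora project, which has a large user base, provides torrents (BitTorrent downloads) for each Fedora release at http://torrent.fedoraproject.org/. If you want to skip the update process, Fedora Unity provides all-in-one releases with the updates already applied, called re-spins, at http://spins.fedoraunity.org/spins. Re-spins of older releases are not provided. Downloading a re-spin requires Jigdo, a dedicated download tool.

If you are new to Linux and just want to try it out, a Live CD or a USB stick with persistent storage gives you a simple, quick test environment. More experienced users can download VMware Linux installation images from http://www.vmware.com/appliances/directory/cat/45?sort=changed.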

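1.1.1.1 Installing Hadoop on Linux

Once Linux is installed, you need to determine where the JDK gets installed, because the JAVA_HOME and PATH environment variables must be set from that path.

You can use the rpm command with a few options to list the files contained in an RPM package: -q queries a package, -l lists the files in it, and -p names the package file being queried. The output is then piped to egrep, which searches it for the simple regular expression '/bin/javac$':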

cloud9: ~/Downloads$ rpm -q -l -p ~/Downloads/jdk-6u7-linux-i586.rpm | egrep '/bin/javac$'

On my machine, the output is:

/usr/java/jdk1.6.0_07/bin/javac

Note that the single quotes around /bin/javac$ are essential. Without them, or with double quotes, the shell may treat $ as the start of a variable reference.
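The effect of the anchored, single-quoted pattern can be tried without an RPM file. The following is a minimal sketch: the two paths and the sample_paths function are made up for illustration, and grep -E is used as the equivalent of egrep.

```shell
#!/bin/sh
# Two sample paths (invented for illustration): only the first ends in /bin/javac.
sample_paths() {
    printf '%s\n' \
        '/usr/java/jdk1.6.0_07/bin/javac' \
        '/usr/java/jdk1.6.0_07/bin/javac.old'
}

# Single quotes keep the shell from touching "$", so grep receives it as the
# end-of-line anchor and only the exact javac path matches.
sample_paths | grep -E '/bin/javac$'
```

Only the first path should be printed, because the second one does not end in /bin/javac.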

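We assume the JDK installer was run from the ~/Downloads directory, and that the installer unpacked its bundled RPM files into the current working directory.

The output shows that the JDK is installed in /usr/java/jdk1.6.0_07, with the Java executables in /usr/java/jdk1.6.0_07/bin.

Add the following two lines to your .bashrc or .bash_profile: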

export JAVA_HOME=/usr/java/jdk1.6.0_07

export PATH=${JAVA_HOME}/bin:${PATH}
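The JAVA_HOME value is simply the javac path with the trailing /bin/javac removed. A minimal sketch of that derivation follows; the variable names JAVAC_PATH and JDK_ROOT are illustrative assumptions, and the path is the one reported by the rpm query earlier in this section.

```shell
#!/bin/sh
# Path reported by the rpm query earlier in this section.
JAVAC_PATH=/usr/java/jdk1.6.0_07/bin/javac

# Strip the trailing /bin/javac to get the JDK root directory.
# dirname is applied twice: once to drop "javac", once to drop "bin".
JDK_ROOT=$(dirname "$(dirname "$JAVAC_PATH")")

# Print the two lines that would go into .bashrc or .bash_profile.
echo "export JAVA_HOME=${JDK_ROOT}"
echo 'export PATH=${JAVA_HOME}/bin:${PATH}'
```

The second echo uses single quotes on purpose, so that ${JAVA_HOME} and ${PATH} are written literally and expanded only when the profile file is read.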

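Listing 1-1 shows the update_env.sh script, which sets up the Java environment variables for you (the script is included in the code that accompanies this book). Before running it, download the JDK RPM installer; the script queries the RPM database, so the package must also be installed.

Listing 1-1. The update_env.sh script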

#! /bin/sh
# This script attempts to work out the installation directory of the jdk,
# given the installer file.
# The script assumes that the installer is an rpm based installer and
# that the name of the downloaded installer ends in
# -rpm.bin
#
# The script first attempts to verify there is one argument and the
# argument is an existing file
# The file may be either the installer binary, the -rpm.bin
# or the actual installation rpm that was unpacked by the installer
#
# The script will use the rpm command to work out the
# installation package name from the rpm file, and then
# use the rpm command to query the installation database,
# for where the files of the rpm were installed.
# This query of the installation is done rather than
# directly querying the rpm, on the off
# chance that the installation was installed in a different root
# directory than the default.
# Finally, the proper environment set commands are appended
# to the user's .bashrc and .bash_profile file, if they exist, and
# echoed to the standard out so the user may apply them to
# their currently running shell sessions.

# Verify that there was a single command line argument
# which will be referenced as $1
if [ $# != 1 ]; then
    echo "No jdk rpm specified"
    echo "Usage: $0 jdk.rpm" 1>&2
    exit 1
fi

# Verify that the command argument exists in the file system
if [ ! -e "$1" ]; then
    echo "the argument specified ($1) for the jdk rpm does not exist" 1>&2
    exit 1
fi

# Does the argument end in '-rpm.bin' which is the suggested install
# file, is the argument the actual .rpm file, or something else
# set the variable RPM to the expected location of the rpm file that
# was extracted from the installer file
if echo "$1" | grep -q -e '-rpm\.bin$'; then
    RPM=`dirname "$1"`/`basename "$1" -rpm.bin`.rpm
elif echo "$1" | grep -q -e '\.rpm$'; then
    RPM=$1
else
    echo -n "$1 does not appear to be the downloaded rpm.bin file or" 1>&2
    echo " the extracted rpm file" 1>&2
    exit 1
fi

# Verify that the rpm file exists and is readable
if [ ! -r "$RPM" ]; then
    echo -n "The jdk rpm file (${RPM}) does not appear to exist" 1>&2
    echo " have you run 'sh ${RPM}' as root?" 1>&2
    exit 1
fi

# Work out the actual installed package name using the rpm command
#. man rpm for details
INSTALLED=`rpm -q --qf '%{Name}-%{Version}-%{Release}' -p "${RPM}"`
if [ $? -ne 0 ]; then
    (echo -n "Unable to extract package name from rpm (${RPM}),"
     echo " have you installed it yet?") 1>&2
    exit 1
fi

# Where did the rpm install process place the java compiler program 'javac'
JAVAC=`rpm -q -l ${INSTALLED} | egrep '/bin/javac$'`
# If there was no javac found, then issue an error
if [ $? -ne 0 ]; then
    (echo -n "Unable to determine the JAVA_HOME location from $RPM, "
     echo "was the rpm installed? Try rpm -Uvh ${RPM} as root.") 1>&2
    exit 1
fi

# If we found javac, then we can compute the setting for JAVA_HOME
JAVA_HOME=`echo $JAVAC | sed -e 's;/bin/javac;;'`
echo "The setting for the JAVA_HOME environment variable is ${JAVA_HOME}"

echo -n "update the user's .bashrc if they have one with the"
echo " setting for JAVA_HOME and the PATH."
if [ -w ~/.bashrc ]; then
    echo "Updating the ~/.bashrc file with the java environment variables";
    (echo export JAVA_HOME=${JAVA_HOME} ;
     echo export PATH='${JAVA_HOME}'/bin:'${PATH}' ) >> ~/.bashrc
    echo
fi

echo -n "update the user's .bash_profile if they have one with the"
echo " setting for JAVA_HOME and the PATH."
if [ -w ~/.bash_profile ]; then
    echo "Updating the ~/.bash_profile file with the java environment variables";
    (echo export JAVA_HOME=${JAVA_HOME} ;
     echo export PATH='${JAVA_HOME}'/bin:'${PATH}' ) >> ~/.bash_profile
    echo
fi

echo "paste the following two lines into your running shell sessions"
echo export JAVA_HOME=${JAVA_HOME}
echo export PATH='${JAVA_HOME}'/bin:'${PATH}'

Running the script in Listing 1-1 finds the JDK installation directory and then updates your environment so that the installed JDK is used.

update_env.sh "FULL_PATH_TO_DOWNLOADED_JDK"

./update_env.sh ~/Downloads/jdk-6u7-linux-i586-rpm.bin

The setting for the JAVA_HOME environment variable is /usr/java/jdk1.6.0_07

update the user's .bashrc if they have one with the setting for JAVA_HOME and the PATH.

Updating the ~/.bashrc file with the java environment variables

update the user's .bash_profile if they have one with the setting for JAVA_HOME and the PATH.

Updating the ~/.bash_profile file with the java environment variables

paste the following two lines into your running shell sessions

export JAVA_HOME=/usr/java/jdk1.6.0_07

export PATH=${JAVA_HOME}/bin:${PATH}

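1.1.1.2 Installing Hadoop on Windows: Approach and Common Problems

To use Hadoop on Windows, you first need to install the Sun JDK and the Cygwin environment (Cygwin is available from http://sources.redhat.com/cygwin).

Start the Cygwin Bash shell by clicking the icon shown in Figure 2-3. In your home directory (~), create a symbolic link named java that points to the JDK installation directory, so that cd ~/java takes you to the JDK installation directory. JAVA_HOME should then be set to ~/java. This lets processes find the java executable through their environment; Hadoop, for example, needs to locate the Java installation to run its tasks.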

Figure 2-3. The Cygwin Bash shell icon

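The bin/hadoop script does not work correctly if the path in JAVA_HOME contains spaces. The JDK is normally installed in C:\Program Files\Java\jdkRELEASE_VERSION. If you create a symbolic link to that directory and point JAVA_HOME at the link, bin/hadoop works correctly. This is how I usually set things up in my Cygwin installation: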

$ echo $JAVA_HOME

/home/Jason/jdk1.6.0_12

$ ls -l /home/Jason/jdk1.6.0_12

lrwxrwxrwx 1 Jason None 43 Mar 20 16:32 /home/Jason/jdk1.6.0_12 -> /cygdrive/c/Program Files/Java/jdk1.6.0_12/
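The symbolic-link workaround can be rehearsed in a scratch directory before touching a real JDK. This is a minimal sketch under stated assumptions: the directory name jdk-example and the variable WORK are invented for illustration, and the layout only mimics a JDK installed under a path that contains a space.

```shell
#!/bin/sh
# Work in a scratch directory so nothing on the real system is touched.
# Remove "$WORK" by hand when you are done experimenting.
WORK=$(mktemp -d)

# Stand-in for an install location whose path contains a space.
mkdir -p "$WORK/Program Files/Java/jdk-example/bin"

# Create a symbolic link with a space-free name that points at it.
ln -s "$WORK/Program Files/Java/jdk-example" "$WORK/java"

# JAVA_HOME goes through the link, yet still resolves to the JDK directory.
JAVA_HOME="$WORK/java"
ls -d "$JAVA_HOME/bin"
```

On a real Cygwin system the link target would be the /cygdrive path of the installed JDK, as in the listing above.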

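Cygwin maps Windows drive letters to /cygdrive/X, where X is the drive letter. In addition, the Cygwin path separator is "/", whereas the Windows path separator is "\".

When you run the bin/hadoop script, keep in mind that your files have two sets of paths. The bin/hadoop script and all Cygwin utilities see a file system that is a subtree of the Windows file system, with the Windows drives mapped under /cygdrive. Windows programs, however, see the traditional C:\ file system. Take /tmp as an example: in a standard Cygwin installation, /tmp is the directory C:\cygwin\tmp, but Java resolves /tmp as C:\tmp, a completely different directory. If you launch a Windows application from Cygwin and get a file-not-found error, the application (the Java executable, for example) is usually looking for the file under the wrong path.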
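The /cygdrive mapping can be illustrated with a few lines of shell. The function to_windows_path below is a hypothetical helper written for this sketch; it only rewrites /cygdrive/X/... paths and does not handle Cygwin-internal paths such as /tmp. On a real Cygwin system, the cygpath utility that ships with Cygwin performs the actual conversion.

```shell
#!/bin/sh
# Hypothetical helper: rewrite a /cygdrive/X/... path in Windows form.
# Step 1 turns the /cygdrive/X/ prefix into X:/ ; step 2 turns every
# remaining "/" into a backslash.
to_windows_path() {
    printf '%s\n' "$1" |
        sed -e 's;^/cygdrive/\([a-zA-Z]\)/;\1:/;' -e 's;/;\\;g'
}

to_windows_path "/cygdrive/c/Program Files/Java/jdk1.6.0_12"
# prints: c:\Program Files\Java\jdk1.6.0_12
```

The drive letter keeps the case it had in the Cygwin path; Windows treats drive letters case-insensitively.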

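Note that you may need to adjust these settings for your system, depending on how the Sun JDK and Windows are installed. In particular, the user name may not be Jason, the JDK version may not be 1.6.0_12, and the JDK may not be installed under C:\Program Files\Java.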

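1.1.2 Installing Hadoop

Once you have Linux, or Windows with Cygwin, set up, the next step is to download and install Hadoop.

Go to the Hadoop download page at http://www.apache.org/dyn/closer.cgi/hadoop/core/, find the tar.gz file for the release you have chosen (the one mentioned in the introduction), and download it.

If you are the careful type, go back to the download page and fetch the PGP signature and MD5 digest for the file so that you can verify the download.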
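Checking the MD5 digest can be scripted. The sketch below assumes the md5sum tool from GNU coreutils is available; verify_md5 is a hypothetical helper name, and EXPECTED_DIGEST is a placeholder for the value published on the download page, not a real digest.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the MD5 digest of file $1 equals $2.
verify_md5() {
    # md5sum prints "<digest>  <file>"; keep only the digest field.
    actual=$(md5sum "$1" | cut -d ' ' -f 1)
    [ "$actual" = "$2" ]
}

# Example usage (commented out because EXPECTED_DIGEST is only a placeholder):
# verify_md5 ~/Downloads/hadoop-0.19.0.tar.gz EXPECTED_DIGEST && echo "digest matches"
```

A digest match guards against a corrupted download; checking the PGP signature additionally ties the file to the release signer.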

Unpack the tar file in whichever directory you want to use as a test installation. I usually unpack it into a src directory under my home directory, ~jason/src:

mkdir ~/src

cd ~/src

tar zxf ~/Downloads/hadoop-0.19.0.tar.gz

This creates a new directory, hadoop-0.19.0, under ~/src.

Add the following two lines to your .bashrc or .bash_profile file:

export HADOOP_HOME=~/src/hadoop-0.19.0

export PATH=${HADOOP_HOME}/bin:${PATH}

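If you use a directory other than ~/src, adjust these export statements to match the path you chose.

1.1.3 Checking Your Environment

After installing Hadoop, check that the JAVA_HOME and HADOOP_HOME environment variables are set correctly. Your PATH should contain ${JAVA_HOME}/bin and ${HADOOP_HOME}/bin, and they should come before any other Java or Hadoop installation on the PATH, preferably as its first elements. In addition, your shell's current working directory should be ${HADOOP_HOME}. The examples in this book need these settings.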
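To see where a directory sits in the PATH search order, a small helper is enough. The following is a minimal sketch: path_position is a hypothetical function name and EXAMPLE_PATH is a made-up value used only for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the 1-based position of directory $1 in the
# colon-separated list $2, or print nothing if it is absent.
path_position() {
    printf '%s\n' "$2" | tr ':' '\n' |
        grep -n -x -F -- "$1" | head -n 1 | cut -d ':' -f 1
}

# Example with a made-up PATH value.
EXAMPLE_PATH=/opt/hadoop/bin:/opt/java/bin:/usr/bin
path_position /opt/java/bin "$EXAMPLE_PATH"
# prints: 2
```

In a real session the call would be path_position "${JAVA_HOME}/bin" "$PATH"; a small number means the directory is searched early.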

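The check_basic_env.sh script shown in Listing 1-2 verifies your runtime environment (the script is included in the downloadable sample code for this book).

Listing 1-2. The check_basic_env.sh script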

#! /bin/sh
# This block is trying to do the basics of checking to see if
# the HADOOP_HOME and the JAVA_HOME variables have been set correctly
# and if they have not been set, suggest a setting in line with the earlier examples
# The script actually tests for:
#   the presence of the java binary and the hadoop script,
#   and verifies that the expected versions are present
#   that the version of java and hadoop is as expected (warning if not)
#   that the version of java and hadoop referred to by the
#   JAVA_HOME and HADOOP_HOME environment variables are default version to run.
#
# The 'if [' construct you see is a shortcut for 'if test' ....
# the -z tests for a zero length string
# the -d tests for a directory
# the -x tests for the execute bit
# -eq tests numbers
# = tests strings
# man test will describe all of the options
# The '1>&2' construct directs the standard output of the
# command to the standard error stream.
if [ -z "$HADOOP_HOME" ]; then
    echo "The HADOOP_HOME environment variable is not set" 1>&2
    if [ -d ~/src/hadoop-0.19.0 ]; then
        echo "Try export HADOOP_HOME=~/src/hadoop-0.19.0" 1>&2
    fi
    exit 1;
fi

# This block is trying to do the basics of checking to see if
# the JAVA_HOME variable has been set
# and if it hasn't been set, suggest a setting in line with the earlier examples
if [ -z "$JAVA_HOME" ]; then
    echo "The JAVA_HOME environment variable is not set" 1>&2
    if [ -d /usr/java/jdk1.6.0_07 ]; then
        echo "Try export JAVA_HOME=/usr/java/jdk1.6.0_07" 1>&2
    fi
    exit 1
fi

# We are now going to see if a java program and hadoop programs
# are in the path, and if they are the ones we are expecting.
# The which command returns the full path to the first instance
# of the program in the PATH environment variable
#
JAVA_BIN=`which java`
HADOOP_BIN=`which hadoop`

# Check for the presence of java in the path and suggest an
# appropriate path setting if java is not found
if [ -z "${JAVA_BIN}" ]; then
    echo "The java binary was not found using your PATH settings" 1>&2
    if [ -x ${JAVA_HOME}/bin/java ]; then
        echo 'Try export PATH=${JAVA_HOME}/bin:${PATH}' 1>&2
    fi
    exit 1
fi

# Check for the presence of hadoop in the path and suggest an
# appropriate path setting if hadoop is not found
if [ -z "${HADOOP_BIN}" ]; then
    echo "The hadoop binary was not found using your PATH settings" 1>&2
    if [ -x ${HADOOP_HOME}/bin/hadoop ]; then
        echo 'Try export PATH=${HADOOP_HOME}/bin:${PATH}' 1>&2
    fi
    exit 1
fi

# Double check that the version of java installed in ${JAVA_HOME}
# is the one stated in the examples.
# If you have installed a different version your results may vary.
#
if ! ${JAVA_HOME}/bin/java -version 2>&1 | grep -q 1.6.0_07; then
    (echo -n "Your JAVA_HOME version of java is not the"
     echo -n " 1.6.0_07 version, your results may vary from"
     echo " the book examples.") 1>&2
fi

# Double check that the java in the PATH is the expected version.
if ! java -version 2>&1 | grep -q 1.6.0_07; then
    (echo -n "Your default java version is not the 1.6.0_07 "
     echo -n "version, your results may vary from the book"
     echo " examples.") 1>&2
fi

# Try to get the location of the hadoop core jar file
# This is used to verify the version of hadoop installed
HADOOP_JAR=`ls -1 ${HADOOP_HOME}/hadoop-0.19.0-core.jar`
HADOOP_ALT_JAR=`ls -1 ${HADOOP_HOME}/hadoop-*-core.jar`

# If a hadoop jar was not found, either the installation
# was incorrect or a different version installed
if [ -z "${HADOOP_JAR}" -a -z "${HADOOP_ALT_JAR}" ]; then
    (echo -n "Your HADOOP_HOME does not provide a hadoop"
     echo -n " core jar. Your installation probably needs"
     echo -n " to be redone or the HADOOP_HOME environment"
     echo " variable needs to be correctly set.") 1>&2
    exit 1
fi

if [ -z "${HADOOP_JAR}" -a ! -z "${HADOOP_ALT_JAR}" ]; then
    (echo -n "Your hadoop version appears to be different"
     echo -n " than the 0.19.0 version, your results may vary"
     echo " from the book examples.") 1>&2
fi

# The messages below are single-quoted so that ${HADOOP_HOME} is printed literally.
if [ "`pwd`" != "${HADOOP_HOME}" ]; then
    (echo -n 'Please change your working directory to'
     echo -n ' ${HADOOP_HOME}.'
     echo ' cd ${HADOOP_HOME} <Enter>') 1>&2
    exit 1
fi

echo "You are good to go"
echo -n "your JAVA_HOME is set to ${JAVA_HOME} which "
echo "appears to exist and be the right version for the examples."
echo -n "your HADOOP_HOME is set to ${HADOOP_HOME} which "
echo "appears to exist and be the right version for the examples."
echo "your java program is the one in ${JAVA_HOME}"
echo "your hadoop program is the one in ${HADOOP_HOME}"
echo -n "The shell current working directory is ${HADOOP_HOME} "
echo "as the examples require."

if [ "${JAVA_BIN}" = "${JAVA_HOME}/bin/java" ]; then
    echo "Your PATH appears to have the JAVA_HOME java program as the default java."
else
    echo -n "Your PATH does not appear to provide the JAVA_HOME"
    echo " java program as the default java."
fi

if [ "${HADOOP_BIN}" = "${HADOOP_HOME}/bin/hadoop" ]; then
    echo -n "Your PATH appears to have the HADOOP_HOME"
    echo " hadoop program as the default hadoop."
else
    echo -n "Your PATH does not appear to provide the HADOOP_HOME "
    echo "hadoop program as the default hadoop program."
fi
exit 0

Then run the script:

[scyrus@localhost ~]$ ./check_basic_env.sh

Please change your working directory to ${HADOOP_HOME}. cd ${HADOOP_HOME} <Enter>

[scyrus@localhost ~]$ cd $HADOOP_HOME
[scyrus@localhost hadoop-0.19.0]$
[scyrus@localhost hadoop-0.19.0]$ ~/check_basic_env.sh

You are good to go
your JAVA_HOME is set to /usr/java/jdk1.6.0_07 which appears to exist and be the right version for the examples.
your HADOOP_HOME is set to /home/scyrus/src/hadoop-0.19.0 which appears to exist and be the right version for the examples.
your java program is the one in /usr/java/jdk1.6.0_07
your hadoop program is the one in /home/scyrus/src/hadoop-0.19.0
The shell current working directory is /home/scyrus/src/hadoop-0.19.0 as the examples require.
Your PATH appears to have the JAVA_HOME java program as the default java.
Your PATH appears to have the HADOOP_HOME hadoop program as the default hadoop.