hadoop 0.20 布署的补充

来源：互联网发布：联通机顶盒安装软件编辑：程序博客网时间：2024/05/17 02:55

最近在自己的局域网中又全新安装了hadoop0.20版本，感觉和0.19版本还是有一些变化的。

20版原包中默认取消了hadoop-default.xml配置文件，取而代之的是三个配置文件：

core-site.xml

mapred-site.xml

hdfs-site.xml

默认的这三个文件都是空的，也就是说，这些配置的全局默认值已经在代码中写死了，我们在配置文件中写的是和默认值不同的选项，会覆盖默认选项。

不同的配置选项要放在相应的文件中，不能放错地方。

hadoop 0.20官方英文文档中告诉我们了该怎么写（注意，是英文文档，20版提供了中文文档，但是里面的内容都是旧内容，我就是看了中文文档所以走了不少弯路），参考：http://hadoop.apache.org/common/docs/r0.20.2/cluster_setup.html

原文摘录如下：

This section deals with important parameters to be specified in the following:
conf/core-site.xml:

Parameter Value Notes fs.default.nameURI of NameNode.hdfs://hostname/

conf/hdfs-site.xml:

Parameter Value Notes dfs.name.dir Path on the local filesystem where the NameNode stores the namespace and transactions logs persistently. If this is a comma-delimited list of directories then the name table is replicated in all of the directories, for redundancy. dfs.data.dir Comma separated list of paths on the local filesystem of a DataNode where it should store its blocks. If this is a comma-delimited list of directories, then data will be stored in all named directories, typically on different devices.

conf/mapred-site.xml:

Parameter Value Notes mapred.job.trackerHost or IP and port of JobTracker.host:port pair.mapred.system.dir Path on the HDFS where where the Map/Reduce framework stores system files e.g. /hadoop/mapred/system/. This is in the default filesystem (HDFS) and must be accessible from both the server and client machines. mapred.local.dir Comma-separated list of paths on the local filesystem where temporary Map/Reduce data is written. Multiple paths help spread disk i/o.mapred.tasktracker.{map|reduce}.tasks.maximum The maximum number of Map/Reduce tasks, which are run simultaneously on a given TaskTracker, individually. Defaults to 2 (2 maps and 2 reduces), but vary it depending on your hardware. dfs.hosts/dfs.hosts.excludeList of permitted/excluded DataNodes. If necessary, use these files to control the list of allowable datanodes. mapred.hosts/mapred.hosts.excludeList of permitted/excluded TaskTrackers. If necessary, use these files to control the list of allowable TaskTrackers. mapred.queue.namesComma separated list of queues to which jobs can be submitted. The Map/Reduce system always supports atleast one queue with the name as default. Hence, this parameter's value should always contain the string default. Some job schedulers supported in Hadoop, like the Capacity Scheduler, support multiple queues. If such a scheduler is being used, the list of configured queue names must be specified here. Once queues are defined, users can submit jobs to a queue using the property name mapred.job.queue.name in the job configuration. There could be a separate configuration file for configuring properties of these queues that is managed by the scheduler. Refer to the documentation of the scheduler for information on the same. mapred.acls.enabledSpecifies whether ACLs are supported for controlling job submission and administration If true, ACLs would be checked while submitting and administering jobs. ACLs can be specified using the configuration parameters of the form mapred.queue.queue-name.acl-name, defined below. mapred.queue.queue-name.acl-submit-jobList of users and groups that can submit jobs to the specified queue-name. The list of users and groups are both comma separated list of names. The two lists are separated by a blank. Example: user1,user2 group1,group2. If you wish to define only a list of groups, provide a blank at the beginning of the value. mapred.queue.queue-name.acl-administer-jobList of users and groups that can change the priority or kill jobs that have been submitted to the specified queue-name. The list of users and groups are both comma separated list of names. The two lists are separated by a blank. Example: user1,user2 group1,group2. If you wish to define only a list of groups, provide a blank at the beginning of the value. Note that an owner of a job can always change the priority or kill his/her own job, irrespective of the ACLs.

Typically all the above parameters are marked as final to ensure that they cannot be overriden by user-applications.


在局域网里，每个机器往往都用用户的名字命名，如John-desktop，但是我们在分布式系统中，通常希望用master, slave001,slave002这样的命名规则来命名机器，
这样我们需要编辑/etc/hosts文件，把每一台机器希望的命名都写进去，如：
192.168.1.10     John-desktop
192.168.1.10     master
192.168.1.11     Peter-desktop
192.168.1.11     slave001
依此类推。因为在hadoop中，系统会自动取当前机器名（用hostname），这时，如果hostname不是master, slave001这样的名字，网络通信就会出问题。

以下是我的配置文件

注：我有两台机器，主机IP：192.168.1.10 从机IP：192.168.1.11



core-site.xml:

<?xml version="1.0"?><?xml-stylesheet type="text/xsl" href="configuration.xsl"?><!-- Put site-specific property overrides in this file. --><configuration> <property>  <name>fs.default.name</name>  <value>hdfs://master/</value>  <description>The name of the default file system.  A URI whose  scheme and authority determine the FileSystem implementation.  The  uri's scheme determines the config property (fs.SCHEME.impl) naming  the FileSystem implementation class.  The uri's authority is used to  determine the host, port, etc. for a filesystem.</description> </property> <property>  <name>hadoop.tmp.dir</name>  <value>/home/hadoop/hdfs</value> </property></configuration> 


hdfs-site.xml:

<?xml version="1.0"?><?xml-stylesheet type="text/xsl" href="configuration.xsl"?><!-- Put site-specific property overrides in this file. --><configuration><property>  <name>dfs.name.dir</name>  <value>${hadoop.tmp.dir}/dfs/name</value>  <description>Determines where on the local filesystem the DFS name node      should store the name table(fsimage).  If this is a comma-delimited list      of directories then the name table is replicated in all of the      directories, for redundancy. </description></property><property>  <name>dfs.data.dir</name>  <value>${hadoop.tmp.dir}/dfs/data</value>  <description>Determines where on the local filesystem an DFS data node  should store its blocks.  If this is a comma-delimited  list of directories, then data will be stored in all named  directories, typically on different devices.  Directories that do not exist are ignored.  </description></property></configuration>
 



mapred-site.xml

<?xml version="1.0"?><?xml-stylesheet type="text/xsl" href="configuration.xsl"?><!-- Put site-specific property overrides in this file. --><configuration><property>  <name>mapred.job.tracker</name>  <value>192.168.1.10:9001</value>  <description>The host and port that the MapReduce job tracker runs  at.  If "local", then jobs are run in-process as a single map  and reduce task.  </description></property><property>  <name>mapred.system.dir</name>  <value>${hadoop.tmp.dir}/mapred/system</value>  <description>The shared directory where MapReduce stores control files.  </description></property><property>  <name>mapred.local.dir</name>  <value>${hadoop.tmp.dir}/mapred/local</value>  <description>The local directory where MapReduce stores intermediate  data files.  May be a comma-separated list of  directories on different devices in order to spread disk i/o.  Directories that do not exist are ignored.  </description></property></configuration>
 

另外，
在conf/hadoop-env.sh里，要把JAVA_HOME环境变量指向JDK路境，尽管可能在.profile中已经设置过了，这里还是要设一下，不然有时会提示“没有指定JAVA_HOME"