Hadoop MapReduce setup

1, What Is Apache Hadoop?

The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing.

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, thereby delivering a highly available service on top of a cluster of computers, each of which may be prone to failures.

The project includes these modules:

    Hadoop Common: The common utilities that support the other Hadoop modules.
    Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
    Hadoop YARN: A framework for job scheduling and cluster resource management.
    Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.


2, Set up a single-node cluster

Add the following snippets between the <configuration> ... </configuration> tags in the respective configuration XML file.

In file conf/core-site.xml:

<!-- In: conf/core-site.xml -->
<property>
  <name>hadoop.tmp.dir</name>
  <value>/app/hadoop/tmp</value>
  <description>A base for other temporary directories.</description>
</property>

<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
  <description>The name of the default file system.  A URI whose
  scheme and authority determine the FileSystem implementation.  The
  uri's scheme determines the config property (fs.SCHEME.impl) naming
  the FileSystem implementation class.  The uri's authority is used to
  determine the host, port, etc. for a filesystem.</description>
</property>
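
hadoop.tmp.dir above points at /app/hadoop/tmp, which normally does not exist yet. A minimal sketch for creating it, assuming the Hadoop daemons run as the current user:

$ sudo mkdir -p /app/hadoop/tmp
$ sudo chown $USER:$USER /app/hadoop/tmp   # assumes a group named after the user, the Ubuntu default
$ sudo chmod 750 /app/hadoop/tmp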

In file conf/mapred-site.xml:

<!-- In: conf/mapred-site.xml -->
<property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
  <description>The host and port that the MapReduce job tracker runs
  at.  If "local", then jobs are run in-process as a single map
  and reduce task.
  </description>
</property>

In file conf/hdfs-site.xml:

<!-- In: conf/hdfs-site.xml -->
<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication.
  The actual number of replications can be specified when the file is created.
  The default is used if replication is not specified in create time.
  </description>
</property>

3, Format the HDFS NameNode

$ bin/hadoop namenode -format

4, Step 3 depends on Java

Add these lines to conf/hadoop-env.sh (adjust JAVA_HOME to your actual JDK location):

export JAVA_HOME=/home/wu/mapreduce/jdk1.7.0_07
export CLASSPATH=$CLASSPATH:$JAVA_HOME/lib
export PATH=$JAVA_HOME/bin:$PATH
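
Step 3 will fail if JAVA_HOME points at the wrong place, so a quick sanity check is worthwhile (the JDK path mirrors the example above; substitute your own):

$ /home/wu/mapreduce/jdk1.7.0_07/bin/java -version   # should print the JDK version banner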

5, Disable IPv6

To disable IPv6 on Ubuntu 10.04 LTS, open /etc/sysctl.conf in the editor of your choice and add the following lines to the end of the file:

#disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
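
These settings take effect after a reboot; to apply them immediately and confirm, reload sysctl (a value of 1 means IPv6 is disabled):

$ sudo sysctl -p
$ cat /proc/sys/net/ipv6/conf/all/disable_ipv6
1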

6, Enable passwordless SSH to localhost

cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
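
The command above assumes an RSA key pair already exists under $HOME/.ssh. If it does not, a minimal sketch for generating one and testing the login (an empty passphrase keeps the Hadoop start scripts from prompting; weigh that trade-off on shared machines):

$ ssh-keygen -t rsa -P ""   # accept the default file location
$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
$ ssh localhost             # should log in without a password prompt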


7, Set up the cluster

on the master node

add the host names to conf/masters and conf/slaves (one per line), as in the sketch below
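
For example, with a hypothetical master host named master and two slaves named slave1 and slave2 (placeholder names; use your own), the files would contain:

conf/masters:
master

conf/slaves:
slave1
slave2

(In Hadoop 1.x, conf/masters actually tells start-dfs.sh where to launch the SecondaryNameNode; conf/slaves lists the hosts that run DataNodes and TaskTrackers.)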

8, Modify core-site.xml, mapred-site.xml and hdfs-site.xml

on all nodes

change the localhost:54310 and localhost:54311 values set earlier to the master's host name or IP

in hdfs-site.xml, raise dfs.replication to the number of copies you want (at most the number of DataNodes); see the example below
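
For example, on a two-node cluster whose master host is named master (a placeholder name), the values become:

<!-- conf/core-site.xml -->
<property>
  <name>fs.default.name</name>
  <value>hdfs://master:54310</value>
</property>

<!-- conf/mapred-site.xml -->
<property>
  <name>mapred.job.tracker</name>
  <value>master:54311</value>
</property>

<!-- conf/hdfs-site.xml -->
<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>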

9, Start HDFS from the master node

./bin/start-dfs.sh

it will start the NameNode on the master and a DataNode on every slave listed in conf/slaves
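
You can check which daemons came up with the jps tool that ships with the JDK (expected processes assume the example conf/masters and conf/slaves above):

$ jps   # master: NameNode, SecondaryNameNode; each slave: DataNode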

10, Start MapReduce from the master node

bin/start-mapred.sh

it will start the JobTracker on the master and a TaskTracker on every slave
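
When you are done, the matching stop scripts, run from the master, shut the daemons down in reverse order:

bin/stop-mapred.sh
bin/stop-dfs.sh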

