Hadoop V2.0.3 Cluster Setup Guide

1 Single Node Cluster

1.1  Prerequisites

1.1.1        Sun Java installation

Hadoop requires a working Java installation; here Java 1.7 is used.

Unzip the source file:

$ tar -xzvf jdk-7u17-linux-i586.gz

Set Java environment variable:

$ vi /etc/profile

export JAVA_HOME=/usr/local/jdk-7u17-linux-i586

export CLASSPATH=.:$JAVA_HOME/lib/tools.jar:$JAVA_HOME/lib/dt.jar

export PATH=$JAVA_HOME/bin:$PATH

$ chmod +x /etc/profile

$ source /etc/profile

Check whether Java is installed correctly:

$ java -version
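If the JDK is set up correctly, the first line of the output reports the installed version, for example:

java version "1.7.0_17"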

 

1.1.2        Modify cluster machines’ names

Add the hostnames below to the end of the file /etc/hosts

        X.X.X.X master

        X.X.X.X slave_king (Add the nodes’ hostnames)
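To verify that the names resolve correctly, you can ping each node by hostname, for example:

$ ping -c 1 master

$ ping -c 1 slave_king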

       

 

1.1.3        Install SSH

1.1.3.1 SSH to localhost

SSH is used for communication between the different Linux machines in the cluster.

$ sudo apt-get install ssh

Generate a public key; this key is used to log in to the other nodes without typing a password:

$ ssh-keygen -t rsa

$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

$ chmod 600 ~/.ssh/authorized_keys

To test the configuration, type the command

$ ssh localhost

The authenticity of host 'localhost (::1)' can't be established.

RSA key fingerprint is 06:57:fc:17:5a:ff:6a:8e:a2:56:d2:51:fb:5f:b8:d1.

Are you sure you want to continue connecting (yes/no)? yes

Warning: Permanently added 'localhost' (RSA) to the list of known hosts.

1.1.3.2 SSH to remote machines without typing password

Passwordless SSH is essential for communication across the cluster. Here we take two machines as an example; their hostnames are master and slave_king.

Copy the SSH public key ~/.ssh/id_rsa.pub from master to slave_king and append it to the file authorized_keys on slave_king; ssh-copy-id does both steps in one command:

master$ ssh-copy-id -i ~/.ssh/id_rsa.pub king@slave_king

Do the same on slave_king, copying its public key to master:

slave_king$ ssh-copy-id -i ~/.ssh/id_rsa.pub hduser@master

Test the connection:

master$ ssh king@slave_king

 

1.2  Install Hadoop V2.0.3

Unzip hadoop-2.0.3-alpha.tar.gz

$ tar -zxvf hadoop-2.0.3-alpha.tar.gz

Change the owner of the Hadoop directory to the Hadoop user (hduser):

$ sudo chown -R hduser hadoop-2.0.3-alpha
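The rest of this guide refers to the installation directory as $HADOOP_HOME. A minimal sketch, assuming Hadoop was unpacked under /usr/local and that, as with Java above, the variables are added to /etc/profile:

export HADOOP_HOME=/usr/local/hadoop-2.0.3-alpha

export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH

$ source /etc/profile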

1.3  Configure Hadoop V2.0.3

The configuration files are under $HADOOP_HOME/etc/hadoop. Detailed information about the properties used in these files is available at http://hadoop.apache.org/docs/current/

Create the directories that will be referenced in the configuration files and change their owner to hduser (the Hadoop user); example commands follow the list below:

/home/hadoop/dfs/name

/home/hadoop/dfs/data

/home/hadoop/mapred/local

/home/hadoop/mapred/system
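A minimal sketch for creating them, assuming hduser as the Hadoop user:

$ sudo mkdir -p /home/hadoop/dfs/name /home/hadoop/dfs/data

$ sudo mkdir -p /home/hadoop/mapred/local /home/hadoop/mapred/system

$ sudo chown -R hduser /home/hadoop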

1.3.1        Configure slaves:

The file is under $HADOOP_HOME/etc/hadoop; add the content below:

localhost

 

1.3.2        Configure hadoop-env.sh and set JAVA_HOME in it.

$ vi /usr/local/hadoop-2.0.3-alpha/etc/hadoop/hadoop-env.sh

export JAVA_HOME=/usr/local/jdk1.7

1.3.3        Configure core-site.xml

 

<configuration>

<property>

<name>io.native.lib.available</name>

<value>true</value>

</property>

 

<property>

<name>fs.default.name</name>

<value>hdfs://localhost:8888</value>

<description>King's Hadoop</description>

<final>true</final>

</property>

 

</configuration>

 

 

1.3.4        Configure hdfs-site.xml

 

<configuration>

<property>

<name>dfs.namenode.name.dir</name>

<value>file:/home/hadoop/dfs/name</value>

<description>Determines where on the local filesystem the DFS name node should store the name table. If this is a comma-delimited list of directories, then the name table is replicated in all of the directories, for redundancy.</description>

<final>true</final>

</property>

 

<property>

<name>dfs.datanode.data.dir</name>

<value>file:/home/hadoop/dfs/data</value>

<description>Determines where on the local filesystem a DFS data node should store its blocks. If this is a comma-delimited list of directories, then data will be stored in all named directories, typically on different devices. Directories that do not exist are ignored.</description>

<final>true</final>

</property>

 

<property>

<name>dfs.replication</name>

<value>1</value>

</property>

 

<property>

<name>dfs.permissions.enabled</name>

<value>false</value>

</property>

 

</configuration>

1.3.5        Configure mapred-site.xml

 <configuration>

<property>

<name>mapreduce.framework.name</name>

<value>yarn</value>

</property>

 

<property>

<name>mapreduce.job.tracker</name>

<value>hdfs://localhost:9001</value>

<final>true</final>

</property>

 

<property>

<name>mapreduce.map.memory.mb</name>

<value>1536</value>

</property>

 

<property>

<name>mapreduce.map.java.opts</name>

<value>-Xmx1024M</value>

</property>

 

<property>

<name>mapreduce.reduce.memory.mb</name>

<value>3072</value>

</property>

 

<property>

<name>mapreduce.reduce.java.opts</name>

<value>-Xmx2560M</value>

</property>

 

<property>

<name>mapreduce.task.io.sort.mb</name>

<value>512</value>

</property>

 

<property>

<name>mapreduce.task.io.sort.factor</name>

<value>100</value>

</property>

 

<property>

<name>mapreduce.reduce.shuffle.parallelcopies</name>

<value>50</value>

</property>

 

<property>

<name>mapred.system.dir</name>

<value>file:/home/hadoop/mapred/system</value>

<final>true</final>

</property>

 

<property>

<name>mapred.local.dir</name>

<value>file:/home/hadoop/mapred/local</value>

<final>true</final>

</property>

</configuration>

1.3.6        Configure yarn-site.xml

 <configuration>

 

<property>

<name>yarn.resourcemanager.address</name>

<value>localhost:8080</value>

</property>

 

<property>

<name>yarn.resourcemanager.scheduler.address</name>

<value>localhost:8081</value>

</property>

 

<property>

<name>yarn.resourcemanager.resource-tracker.address</name>

<value>localhost:8082</value>

</property>

 

<property>

<name>yarn.nodemanager.aux-services</name>

<value>mapreduce.shuffle</value>

</property>

 

<property>

<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>

<value>org.apache.hadoop.mapred.ShuffleHandler</value>

</property>

 

</configuration>

1.4  Start Hadoop

Format Namenode

$ $HADOOP_HOME/bin/hdfs namenode -format

Start all

$ $HADOOP_HOME/sbin/start-all.sh
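start-all.sh is deprecated in Hadoop 2.x and simply calls the HDFS and YARN start scripts, so the daemons can also be started separately:

$ $HADOOP_HOME/sbin/start-dfs.sh

$ $HADOOP_HOME/sbin/start-yarn.sh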

$ jps

11148 NameNode

11305 DataNode

12040 Jps

11832 NodeManager

11507 SecondaryNameNode

11676 ResourceManager

2. Multiple Node Cluster

Install Hadoop following the steps above. Make sure that the machines can SSH to each other without typing a password.

On the slave:

Add the master's IP address and hostname to the file /etc/hosts:

       127.0.0.1 localhost

       10.2.31.155 master

On the master:

Then add the slave's IP address and hostname to the file /etc/hosts, and add the slave's hostname to $HADOOP_HOME/etc/hadoop/slaves.

Content of /etc/hosts

       127.0.0.1 localhost

       10.2.31.155 master

       10.2.10.132 slave_king

      

Content of $HADOOP_HOME/etc/hadoop/slaves

       localhost

       slave_king
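After updating these files, restart the cluster from the master and check that the DataNode and NodeManager daemons have come up on slave_king. A minimal check, assuming the passwordless SSH configured earlier (if jps is not on the remote PATH, log in to slave_king and run it there directly):

master$ $HADOOP_HOME/sbin/stop-all.sh

master$ $HADOOP_HOME/sbin/start-all.sh

master$ ssh king@slave_king jps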

 

 

 

Hadoop V1.0.3 Cluster Setup Guide

1 Single Node Cluster

1.1 Prerequisites

1.1.1 Sun Java 6 installation

Hadoop requires a working Java 1.5.x (aka 5.0.x) installation. However, using Java 1.6.x (aka 6.0.x aka 6) is recommended for running Hadoop. We use apt-get to install sun-java6-jdk; the commands are as follows:

$ sudo apt-get install python-software-properties

$ sudo add-apt-repository ppa:ferramroberto/java

$ sudo apt-get update

$ sudo apt-get install sun-java6-jdk

After installation, make a quick check whether Sun’s JDK is correctly set up:

$ java -version

java version "1.6.0_26"

Java(TM) SE Runtime Environment (build 1.6.0_26-b03)

Java HotSpot(TM) Client VM (build 20.1-b02, mixed mode, sharing)

 1.1.2 ssh/rsync Installation

$ sudo apt-get install ssh

$ sudo apt-get install rsync      

1.2 Add a dedicated Hadoop system user

We will use a dedicated Hadoop user account for running Hadoop. While that's not required, it is recommended because it helps to separate the Hadoop installation from other software applications and user accounts running on the same machine.

$ sudo addgroup hadoop

$ sudo adduser --ingroup hadoop hduser

This will add the user hduser and the group hadoop to your local machine.

Generate an SSH key for the hduser user:

$ su - hduser

hduser@ubuntu:~$ ssh-keygen -t rsa -P ""

Generating public/private rsa key pair.

Enter file in which to save the key (/home/hduser/.ssh/id_rsa):

Created directory '/home/hduser/.ssh'.

Your identification has been saved in /home/hduser/.ssh/id_rsa.

Your public key has been saved in /home/hduser/.ssh/id_rsa.pub.

The key fingerprint is:

9b:82:ea:58:b4:e0:35:d7:ff:19:66:a6:ef:ae:0e:d2 hduser@ubuntu

The key's randomart image is:

hduser@ubuntu:~$

Enable SSH access to your local machine with this newly created key:

hduser@ubuntu:~$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
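Depending on your SSH configuration, you may also need to restrict the permissions on authorized_keys, as was done in the Hadoop 2.0.3 guide above:

hduser@ubuntu:~$ chmod 600 $HOME/.ssh/authorized_keys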

You can use the following command to test the SSH connection.

hduser@ubuntu:~$ ssh localhost

The authenticity of host 'localhost (::1)' can't be established.

RSA key fingerprint is d7:87:25:47:ae:02:00:eb:1d:75:4f:bb:44:f9:36:26.

Are you sure you want to continue connecting (yes/no)? yes

Warning: Permanently added 'localhost' (RSA) to the list of known hosts.

Linux ubuntu 2.6.32-22-generic #33-Ubuntu SMP Wed Apr 28 13:27:30 UTC 2010 i686 GNU/Linux

Ubuntu 10.04 LTS

hduser@ubuntu:~$

1.3 Hadoop Installation

Download hadoop-1.0.3.tar.gz from http://labs.renren.com/apache-mirror/hadoop/common/hadoop-1.0.3/.

To install Hadoop, you just need to uncompress the tar file into the designated directory. In general, the Hadoop home is /usr/local/HADOOP-VERSION (i.e. /usr/local/hadoop-1.0.3 in our case).

You need to switch to the root user (or use sudo) to execute the following commands.

$ cd /usr/local

$ sudo tar xzf /usr/local/hadoop-1.0.3.tar.gz

Change the owner of the Hadoop directory to the hduser user.

$ sudo chown -R hduser:hadoop hadoop-1.0.3
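Optionally, create a version-independent symlink so that the installation can also be reached as /usr/local/hadoop, a path some of the prompts later in this guide use:

$ sudo ln -s /usr/local/hadoop-1.0.3 /usr/local/hadoop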

1.4 Configuration

1.4.1 hadoop-env.sh

The only required environment variable we have to configure for Hadoop is JAVA_HOME. Open conf/hadoop-env.sh in the editor of your choice (if you used the installation path in this tutorial, the full path is /usr/local/hadoop-1.0.3/conf/hadoop-env.sh) and set the JAVA_HOME environment variable to the Sun JDK/JRE 6 directory.

Change

# The java implementation to use.  Required.

# export JAVA_HOME=/usr/lib/j2sdk1.5-sun

to

# The java implementation to use.  Required.

export JAVA_HOME=/usr/lib/jvm/java-6-sun

1.4.2 conf
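At a minimum, conf/core-site.xml needs hadoop.tmp.dir and fs.default.name set. A minimal sketch, assuming the /app/hadoop/tmp directory and the NameNode port 54310 that appear elsewhere in this guide (create /app/hadoop/tmp first and chown it to hduser:hadoop):

<property>

<name>hadoop.tmp.dir</name>

<value>/app/hadoop/tmp</value>

</property>

<property>

<name>fs.default.name</name>

<value>hdfs://localhost:54310</value>

</property>

1.5.1 Format the HDFS filesystem via the NameNode

Before starting Hadoop for the first time, format the HDFS filesystem:

hduser@ubuntu:/usr/local/hadoop-1.0.3$ bin/hadoop namenode -format

Sample output: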

12/07/24 01:25:43 INFO util.GSet: VM type       = 32-bit

12/07/24 01:25:43 INFO util.GSet: 2% max memory = 19.33375 MB

12/07/24 01:25:43 INFO util.GSet: capacity      = 2^22 = 4194304 entries

12/07/24 01:25:43 INFO util.GSet: recommended=4194304, actual=4194304

12/07/24 01:25:45 INFO namenode.FSNamesystem: fsOwner=hduser

12/07/24 01:25:45 INFO namenode.FSNamesystem: supergroup=supergroup

12/07/24 01:25:45 INFO namenode.FSNamesystem: isPermissionEnabled=true

12/07/24 01:25:45 INFO namenode.FSNamesystem: dfs.block.invalidate.limit=100

12/07/24 01:25:45 INFO namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s)

12/07/24 01:25:45 INFO namenode.NameNode: Caching file names occuring more than 10 times

12/07/24 01:25:45 INFO common.Storage: Image file of size 112 saved in 0 seconds.

12/07/24 01:25:45 INFO common.Storage: Storage directory /app/hadoop/tmp/dfs/name has been successfully formatted.

12/07/24 01:25:45 INFO namenode.NameNode: SHUTDOWN_MSG:

 

1.5.2 Start Hadoop  

hduser@ubuntu:/usr/local/hadoop-1.0.3$ bin/start-all.sh

starting namenode, logging to /usr/local/hadoop-1.0.3/libexec/../logs/hadoop-hduser-namenode-ubuntu.out

localhost: starting datanode, logging to /usr/local/hadoop-1.0.3/libexec/../logs/hadoop-hduser-datanode-ubuntu.out

localhost: starting secondarynamenode, logging to /usr/local/hadoop-1.0.3/libexec/../logs/hadoop-hduser-secondarynamenode-ubuntu.out

starting jobtracker, logging to /usr/local/hadoop-1.0.3/libexec/../logs/hadoop-hduser-jobtracker-ubuntu.out

localhost: starting tasktracker, logging to /usr/local/hadoop-1.0.3/libexec/../logs/hadoop-hduser-tasktracker-ubuntu.out

hduser@ubuntu:/usr/local/hadoop-1.0.3$

This will start up a NameNode, DataNode, SecondaryNameNode, JobTracker and a TaskTracker on your machine.

You can check the running Hadoop processes using jps:

hduser@ubuntu:/usr/local/hadoop$ jps
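jps prints the PID of each JVM followed by its name; the exact PIDs will differ, but the listing should include these processes:

NameNode

DataNode

SecondaryNameNode

JobTracker

TaskTracker

Jps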

We will now run your first Hadoop MapReduce job. We will use the WordCount example job which reads text files and counts how often words occur. The input is text files and the output is text files, each line of which contains a word and the count of how often it occurred, separated by a tab. 

1.5.3 Download the following example input data as plain-text files to a directory (/tmp/input in our case)

http://www.gutenberg.org/etext/20417

http://www.gutenberg.org/etext/5000

http://www.gutenberg.org/etext/4300
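Save the plain-text versions of these books into the input directory. For example, assuming they were downloaded to the current directory under the file names shown in the HDFS listing below:

$ mkdir -p /tmp/input

$ cp pg20417.txt pg4300.txt pg5000.txt /tmp/input/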

1.5.4 Copy local example data to HDFS

Before we run the actual MapReduce job, we first have to copy the files from our local file system to Hadoop’s HDFS.

hduser@ubuntu:/usr/local/hadoop-1.0.3$ bin/hadoop dfs -copyFromLocal /tmp/input /user/hduser/mapredjob

hduser@ubuntu:/usr/local/hadoop-1.0.3$ bin/hadoop dfs -ls /user/hduser

Found 1 items

drwxr-xr-x   - hduser supergroup          0 2012-07-24 01:35 /user/hduser/mapredjob

hduser@ubuntu:/usr/local/hadoop-1.0.3$ bin/hadoop dfs -ls /user/hduser/mapredjob

Found 3 items

-rw-r--r--   1 hduser supergroup     674387 2012-07-24 01:35 /user/hduser/mapredjob/pg20417.txt

-rw-r--r--   1 hduser supergroup    1317316 2012-07-24 01:35 /user/hduser/mapredjob/pg4300.txt

-rw-r--r--   1 hduser supergroup    1327971 2012-07-24 01:35 /user/hduser/mapredjob/pg5000.txt

hduser@ubuntu:/usr/local/hadoop-1.0.3$

1.5.5 Run the WordCount Example

hduser@ubuntu:/usr/local/hadoop-1.0.3$ bin/hadoop jar hadoop*examples*.jar wordcount /user/hduser/mapredjob /user/hduser/mapredjob-output

12/07/24 01:39:44 INFO input.FileInputFormat: Total input paths to process : 3

12/07/24 01:39:45 INFO util.NativeCodeLoader: Loaded the native-hadoop library

12/07/24 01:39:45 WARN snappy.LoadSnappy: Snappy native library not loaded

12/07/24 01:39:45 INFO mapred.JobClient: Running job: job_201207240126_0001

12/07/24 01:39:46 INFO mapred.JobClient:  map 0% reduce 0%

12/07/24 01:40:07 INFO mapred.JobClient:  map 66% reduce 0%

12/07/24 01:40:13 INFO mapred.JobClient:  map 100% reduce 0%

12/07/24 01:40:22 INFO mapred.JobClient:  map 100% reduce 100%

12/07/24 01:40:27 INFO mapred.JobClient: Job complete: job_201207240126_0001

12/07/24 01:40:27 INFO mapred.JobClient: Counters: 29

12/07/24 01:40:27 INFO mapred.JobClient:   Job Counters

12/07/24 01:40:27 INFO mapred.JobClient:     Launched reduce tasks=1

12/07/24 01:40:27 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=33832

12/07/24 01:40:27 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0

12/07/24 01:40:27 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=1489735680

12/07/24 01:40:27 INFO mapred.JobClient:     Map output records=569024

hduser@ubuntu:/usr/local/hadoop-1.0.3$

Check if the result is successfully stored in HDFS directory /user/hduser/mapredjob-output:

hduser@ubuntu:/usr/local/hadoop-1.0.3$ bin/hadoop dfs -ls /user/hduser

Found 2 items

drwxr-xr-x   - hduser supergroup          0 2012-07-24 01:35 /user/hduser/mapredjob

drwxr-xr-x   - hduser supergroup          0 2012-07-24 01:40 /user/hduser/mapredjob-output

hduser@ubuntu:/usr/local/hadoop-1.0.3$ bin/hadoop dfs -ls /user/hduser/mapredjob-output

Found 3 items

-rw-r--r--   1 hduser supergroup          0 2012-07-24 01:40 /user/hduser/mapredjob-output/_SUCCESS

drwxr-xr-x   - hduser supergroup          0 2012-07-24 01:39 /user/hduser/mapredjob-output/_logs

-rw-r--r--   1 hduser supergroup     808341 2012-07-24 01:40 /user/hduser/mapredjob-output/part-r-00000

hduser@ubuntu:/usr/local/hadoop-1.0.3$

1.5.6 Retrieve the job result from HDFS

To inspect the result, you can print the output file directly from HDFS:

hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -cat /user/hduser/mapredjob-output/part-r-00000
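To copy the result to the local file system instead, you can use -copyToLocal, or -getmerge to concatenate all part files into one local file, for example:

hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -getmerge /user/hduser/mapredjob-output /tmp/mapredjob-output.txt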

Or browse it through the web UI: http://localhost:50075/browseBlock.jsp?blockId=-961065908477792285&blockSize=808341&genstamp=1013&filename=/user/hduser/mapredjob-output/part-r-00000&datanodePort=50010&namenodeInfoPort=50070

1.5.7 Stop Hadoop

hduser@ubuntu:/usr/local/hadoop-1.0.3$ bin/stop-all.sh

stopping jobtracker

localhost: stopping tasktracker

stopping namenode

localhost: stopping datanode

localhost: stopping secondarynamenode

hduser@ubuntu:/usr/local/hadoop-1.0.3$

2 Multi Node Cluster

2.1 Prerequisites

You should have set up a single-node cluster successfully on each of the two Ubuntu boxes according to the steps above. It is recommended that you use the same settings (e.g., installation locations and paths) on both machines; otherwise you might run into problems later when we merge the two machines into the final multi-node cluster setup.

2.2 Networking

Both machines must be able to reach each other over the network. We will assign one computer (IP: 10.2.10.137) as master, and the other (IP: 10.2.10.142) as slave.

Update /etc/hosts on both machines with the following lines:

10.2.10.137 master

10.2.10.142 slave

2.3 SSH access

In general, when hduser on the master connects to hduser on the slave through SSH, you will be prompted for hduser's password on the slave. To avoid this, add hduser@master's public SSH key to the authorized_keys file of hduser@slave using the following command:

hduser@master:~$ ssh-copy-id -i $HOME/.ssh/id_rsa.pub hduser@slave

Then test the connection from master to master…

hduser@master:~$ ssh master

The authenticity of host 'master (10.2.10.137)' can't be established.

RSA key fingerprint is 3b:21:b3:c0:21:5c:7c:54:2f:1e:2d:96:79:eb:7f:95.

Are you sure you want to continue connecting (yes/no)? yes

Warning: Permanently added 'master' (RSA) to the list of known hosts.

Linux master 2.6.20-16-386 #2 Thu Jun 7 20:16:13 UTC 2007 i686

...

hduser@master:~$

…and from master to slave.

hduser@master:~$ ssh slave

The authenticity of host 'slave (10.2.10.142)' can't be established.

RSA key fingerprint is 74:d7:61:86:db:86:8f:31:90:9c:68:b0:13:88:52:72.

Are you sure you want to continue connecting (yes/no)? yes

Warning: Permanently added 'slave' (RSA) to the list of known hosts.

Ubuntu 10.04

...

hduser@slave:~$

2.4 Configuration

2.4.1 conf/masters (master only)

Update conf/masters so that it looks like this:

master

2.4.2 conf/slaves (master only)

Update conf/slaves so that it looks like this:

master

slave

2.4.3 conf
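On both machines, conf/core-site.xml must point at the master instead of localhost (and conf/mapred-site.xml's mapred.job.tracker should likewise name the master). A minimal sketch, assuming the hdfs://master:54310 address that appears in the TaskTracker log later in this guide:

<property>

<name>fs.default.name</name>

<value>hdfs://master:54310</value>

</property>

2.5.1 Format the HDFS filesystem via the NameNode (master only)

When re-formatting HDFS from the master, the NameNode asks for confirmation. The prompt only accepts an uppercase Y; answering with a lowercase y aborts the format, as in the output below: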

Re-format filesystem in /app/hadoop/tmp/dfs/name ? (Y or N) y

Format aborted in /app/hadoop/tmp/dfs/name

12/07/24 23:00:39 INFO namenode.NameNode: SHUTDOWN_MSG:

 

When we hit this issue, we did the following steps:

1) Delete the data directory on the problematic DataNode: the directory is specified by dfs.data.dir in conf/hdfs-site.xml; if you followed this tutorial, the relevant directory is /app/hadoop/tmp/dfs/data

2) Reformat the NameNode (NOTE: all HDFS data is lost during this process!)

hduser@master:/usr/local/hadoop$ bin/hadoop namenode -format

12/07/24 23:04:17 INFO namenode.NameNode: STARTUP_MSG:

 

12/07/24 23:04:17 INFO util.GSet: VM type       = 32-bit

12/07/24 23:04:17 INFO util.GSet: 2% max memory = 19.84625 MB

12/07/24 23:04:17 INFO util.GSet: capacity      = 2^22 = 4194304 entries

12/07/24 23:04:17 INFO util.GSet: recommended=4194304, actual=4194304

12/07/24 23:04:18 INFO namenode.FSNamesystem: fsOwner=hduser

12/07/24 23:04:18 INFO namenode.FSNamesystem: supergroup=supergroup

12/07/24 23:04:18 INFO namenode.FSNamesystem: isPermissionEnabled=true

12/07/24 23:04:18 INFO namenode.FSNamesystem: dfs.block.invalidate.limit=100

12/07/24 23:04:18 INFO namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s)

12/07/24 23:04:18 INFO namenode.NameNode: Caching file names occuring more than 10 times

12/07/24 23:04:18 INFO common.Storage: Image file of size 112 saved in 0 seconds.

12/07/24 23:04:18 INFO common.Storage: Storage directory /app/hadoop/tmp/dfs/name has been successfully formatted.

12/07/24 23:04:18 INFO namenode.NameNode: SHUTDOWN_MSG:

 

2.5.2 Start the multi-node cluster

On master, start the cluster with the following command (alternatively, HDFS and MapReduce can be started separately with bin/start-dfs.sh and bin/start-mapred.sh):

hduser@master:/usr/local/hadoop$ bin/start-all.sh

starting namenode, logging to /usr/local/hadoop-1.0.3/libexec/../logs/hadoop-hduser-namenode-ubuntu.out

slave: starting datanode, logging to /usr/local/hadoop-1.0.3/libexec/../logs/hadoop-hduser-datanode-ubuntu.out

master: starting datanode, logging to /usr/local/hadoop-1.0.3/libexec/../logs/hadoop-hduser-datanode-ubuntu.out

master: starting secondarynamenode, logging to /usr/local/hadoop-1.0.3/libexec/../logs/hadoop-hduser-secondarynamenode-ubuntu.out

starting jobtracker, logging to /usr/local/hadoop-1.0.3/libexec/../logs/hadoop-hduser-jobtracker-ubuntu.out

slave: starting tasktracker, logging to /usr/local/hadoop-1.0.3/libexec/../logs/hadoop-hduser-tasktracker-ubuntu.out

master: starting tasktracker, logging to /usr/local/hadoop-1.0.3/libexec/../logs/hadoop-hduser-tasktracker-ubuntu.out

On master, you can use jps to check the status:

hduser@master:/usr/local/hadoop$ jps

14422 Jps
14291 TaskTracker
14094 JobTracker
13656 NameNode
13830 DataNode
14008 SecondaryNameNode

hduser@master:/usr/local/hadoop$

Also on slave:

hduser@slave:~$ jps

8469 Jps

8104 DataNode

8252 TaskTracker

hduser@slave:/usr/local/hadoop$

2.5.3 Run the WordCount Example

Download the following files as text files and copy them to HDFS (refer to sections 1.5.3 and 1.5.4 above).

http://www.gutenberg.org/etext/20417

http://www.gutenberg.org/etext/5000

http://www.gutenberg.org/etext/4300

http://www.gutenberg.org/etext/132

http://www.gutenberg.org/etext/1661

http://www.gutenberg.org/etext/972

http://www.gutenberg.org/etext/19699

Run the following command:

hduser@master:/usr/local/hadoop$ bin/hadoop jar hadoop-examples-1.0.3.jar wordcount /user/hduser/mapredjob /user/hduser/mapredjob-output

Example output:

... INFO mapred.FileInputFormat: Total input paths to process : 7

... INFO mapred.JobClient: Running job: job_0001

... INFO mapred.JobClient:  map 0% reduce 0%

... INFO mapred.JobClient:  map 28% reduce 0%

... INFO mapred.JobClient:  map 57% reduce 0%

... INFO mapred.JobClient:  map 71% reduce 0%

... INFO mapred.JobClient:  map 100% reduce 9%

... INFO mapred.JobClient:  map 100% reduce 68%

... INFO mapred.JobClient:  map 100% reduce 100%

.... INFO mapred.JobClient: Job complete: job_0001

... INFO mapred.JobClient: Counters: 11

... INFO mapred.JobClient:   org.apache.hadoop.examples.WordCount$Counter

... INFO mapred.JobClient:     WORDS=1173099

... INFO mapred.JobClient:     VALUES=1368295

... INFO mapred.JobClient:   Map-Reduce Framework

... INFO mapred.JobClient:     Map input records=136582

... INFO mapred.JobClient:     Map output records=1173099

... INFO mapred.JobClient:     Map input bytes=6925391

... INFO mapred.JobClient:     Map output bytes=11403568

... INFO mapred.JobClient:     Combine input records=1173099

... INFO mapred.JobClient:     Combine output records=195196

... INFO mapred.JobClient:     Reduce input groups=131275

... INFO mapred.JobClient:     Reduce input records=195196

... INFO mapred.JobClient:     Reduce output records=131275

hduser@master:/usr/local/hadoop$

The DataNode and TaskTracker logs on the slave:

hduser@slave:/usr/local/hadoop$ sudo cat logs/hadoop-hduser-datanode-slave.log

... INFO org.apache.hadoop.dfs.DataNode: Received block blk_5693969390309798974 from  /10.2.10.137

... INFO org.apache.hadoop.dfs.DataNode: Received block blk_7671491277162757352 from /10.2.10.137

...

---------------------------------------------------------------------------------------------------------------------

hduser@slave:/usr/local/hadoop$ sudo cat logs/hadoop-hduser-tasktracker-slave.log

.. INFO org.apache.hadoop.mapred.TaskTracker: LaunchTaskAction: task_0001_m_000000_0

... INFO org.apache.hadoop.mapred.TaskTracker: LaunchTaskAction: task_0001_m_000001_0

... task_0001_m_000001_0 0.08362164% hdfs://master:54310/user/hduser/gutenberg/ulyss12.txt:0+1561677

... task_0001_m_000000_0 0.07951202% hdfs://master:54310/user/hduser/gutenberg/19699.txt:0+1945731

...

2.5.4 Stop the multi-node cluster

hduser@master:/usr/local/hadoop$ bin/stop-all.sh
