My Hadoop: Hadoop 0.23 setup


1 Download 

Choose a mirror: http://www.apache.org/dyn/closer.cgi/hadoop/core/

Download the 0.23 release (e.g. from the renren mirror): hadoop-0.23.0.tar.gz

1.1 untar 

tar zxvf hadoop-0.23.0.tar.gz

2 Run the first hadoop program (locally)

2.1 compute pi

bin/hadoop jar hadoop-mapreduce-examples-0.23.0.jar pi -Dmapreduce.clientfactory.class.name=org.apache.hadoop.mapred.YarnClientFactory -libjars modules/hadoop-mapreduce-client-jobclient-0.23.0.jar 16 10000


Job Finished in 6.014 seconds
Estimated value of Pi is 3.14127500000000000000
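The `pi` example estimates π by Monte Carlo sampling: each of the 16 maps throws 10000 darts at the unit square and counts how many land inside the quarter circle. The core idea (not Hadoop's actual implementation, which uses a Halton quasi-random sequence) can be sketched locally with awk:

```shell
# Monte Carlo estimate of pi: the fraction of random points in the unit
# square that land inside the quarter circle converges to pi/4.
pi_est=$(awk 'BEGIN {
    srand(42)                       # fixed seed for repeatable runs
    n = 200000; inside = 0
    for (i = 0; i < n; i++) {
        x = rand(); y = rand()
        if (x * x + y * y <= 1.0) inside++
    }
    printf "%.4f", 4.0 * inside / n
}')
echo "estimated pi = $pi_est"
```

With 200000 samples the estimate lands near 3.14; Hadoop splits this same work across the 16 map tasks.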

2.2 word count

bin/hadoop jar hadoop-mapreduce-examples-0.23.0.jar wordcount -Dmapreduce.clientfactory.class.name=org.apache.hadoop.mapred.YarnClientFactory -libjars modules/hadoop-mapreduce-client-jobclient-0.23.0.jar LICENSE.txt output

The result is in the output directory.


Congratulations, you have run your first MapReduce program.

Hadoop is built for parallel/distributed computing, so next let's configure a cluster, one node at a time.

3 Setup the first node (master)

3.1 SSH

ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa

cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys


id_dsa.pub is the public key of localhost.

authorized_keys contains all the public keys trusted by the current host.

Append the localhost public key to authorized_keys, and you can then ssh to localhost without a passphrase.


Similarly, you can append id_dsa.pub to the authorized_keys file of other hosts. Then you can ssh to those hosts without a passphrase.
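One common gotcha: sshd silently ignores authorized_keys when ~/.ssh or the file itself is group- or world-accessible, so passwordless login keeps prompting. The sketch below tightens the usual permissions; it runs against a scratch directory so it is safe anywhere (the `stat -c` call assumes GNU coreutils):

```shell
# Passwordless SSH fails silently on loose permissions; the fix is:
#   chmod 700 ~/.ssh && chmod 600 ~/.ssh/authorized_keys
# Demonstrated on a scratch directory instead of the real ~/.ssh.
SSH_DIR="$(mktemp -d)/dot-ssh"
mkdir -p "$SSH_DIR"
touch "$SSH_DIR/authorized_keys"
chmod 700 "$SSH_DIR"                     # only the owner may enter
chmod 600 "$SSH_DIR/authorized_keys"     # only the owner may read/write
stat -c '%A %n' "$SSH_DIR" "$SSH_DIR/authorized_keys"
```

Expect modes drwx------ on the directory and -rw------- on the key file.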

3.2 Config HDFS

etc/hadoop/core-site.xml (overrides the shipped core-default.xml)

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://172.16.100.122:9000</value>
    </property>
</configuration>

etc/hadoop/hdfs-site.xml (overrides the shipped hdfs-default.xml)

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:/home/tntuser/hadoop-0.23.0/data/hdfs/namenode</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:/home/tntuser/hadoop-0.23.0/data/hdfs/datanode</value>
    </property>
</configuration>

A full URI is required for the name dir and the data dir.
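When provisioning several machines it can help to generate these files from variables instead of editing them by hand. A minimal sketch (NAMENODE_HOST and the temp output directory are placeholders; on a real node you would write into etc/hadoop):

```shell
# Generate a core-site.xml from shell variables via a heredoc.
# NAMENODE_HOST is an assumption; substitute your own master IP.
NAMENODE_HOST="172.16.100.122"
CONF_DIR="$(mktemp -d)"          # stand-in for etc/hadoop
cat > "$CONF_DIR/core-site.xml" <<EOF
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://${NAMENODE_HOST}:9000</value>
    </property>
</configuration>
EOF
cat "$CONF_DIR/core-site.xml"
```

The same pattern works for hdfs-site.xml and yarn-site.xml, keeping every node's config derived from one set of variables.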

3.3 Format HDFS

mkdir data/hdfs/namenode
mkdir data/hdfs/datanode
bin/hdfs namenode -format

3.4 Start HDFS

sbin/hadoop-daemon.sh start|stop namenode
sbin/hadoop-daemon.sh start|stop datanode


Check

JPS should show NameNode, DataNode
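This check can be scripted: parse `jps`-style output and report any expected daemon that is missing. The `check_daemons` helper below is a hypothetical convenience, demonstrated against canned input so it runs even without a JVM:

```shell
# Read `jps`-style output ("<pid> <ClassName>" per line) on stdin and
# print each expected daemon name that does not appear.
check_daemons() {
    input="$(cat)"
    for d in "$@"; do
        printf '%s\n' "$input" | awk '{print $2}' | grep -qx "$d" \
            || echo "missing: $d"
    done
}

# On a real node you would pipe live output:
#   jps | check_daemons NameNode DataNode
printf '4242 NameNode\n4243 Jps\n' | check_daemons NameNode DataNode
```

Here the canned input lacks a DataNode, so the helper prints "missing: DataNode".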


Run a few HDFS commands

bin/hadoop fs -ls

bin/hadoop fs -mkdir test

bin/hadoop fs -rm -r test

3.5 Config MapReduce

etc/hadoop/mapred-site.xml

<?xml version="1.0"?><?xml-stylesheet href="configuration.xsl"?><configuration><property><name>mapreduce.framework.name</name><value>yarn</value></property></configuration>

conf/yarn-site.xml

<?xml version="1.0"?><configuration><property><name>yarn.nodemanager.aux-services</name><value>mapreduce.shuffle</value></property><property><name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name><value>org.apache.hadoop.mapred.ShuffleHandler</value></property><property><name>yarn.resourcemanager.resource-tracker.address</name><value>172.16.100.122:8025</value></property><property><name>yarn.resourcemanager.scheduler.address</name><value>172.16.100.122:8030</value></property><property><name>yarn.resourcemanager.address</name><value>172.16.100.122:8040</value></property></configuration>

conf/yarn-env.sh

export HADOOP_CONF_DIR="${HADOOP_CONF_DIR:-$YARN_HOME/etc/hadoop}"
export HADOOP_COMMON_HOME="${HADOOP_COMMON_HOME:-$YARN_HOME}"
export HADOOP_HDFS_HOME="${HADOOP_HDFS_HOME:-$YARN_HOME}"


Note that the conf directory that ships with Hadoop is no longer the default configuration directory; Hadoop now looks in etc/hadoop for configuration files.

sbin/hadoop-daemon.sh calls hdfs-config.sh, which in turn calls $HADOOP_COMMON_HOME/libexec/hadoop-config.sh.
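The `${VAR:-default}` expansions in yarn-env.sh above mean "keep the existing value if the variable is already set, otherwise fall back to the default". A quick illustration (the /opt paths are made up for the demo):

```shell
# ${VAR:-default} substitutes the default only when VAR is unset or empty.
unset HADOOP_HDFS_HOME
YARN_HOME="/opt/yarn"                          # hypothetical install path
export HADOOP_HDFS_HOME="${HADOOP_HDFS_HOME:-$YARN_HOME}"
echo "$HADOOP_HDFS_HOME"                       # falls back to /opt/yarn

HADOOP_HDFS_HOME="/opt/hdfs"                   # now explicitly set
export HADOOP_HDFS_HOME="${HADOOP_HDFS_HOME:-$YARN_HOME}"
echo "$HADOOP_HDFS_HOME"                       # keeps /opt/hdfs
```

This is why you can pre-set these variables in your own environment and yarn-env.sh will respect them.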

3.6 Start MapReduce (YARN) Daemon

bin/yarn-daemon.sh start resourcemanager
bin/yarn-daemon.sh start nodemanager
bin/yarn-daemon.sh start historyserver

The NodeManager may fail to start if port 8080 is already in use (for example by Tomcat). If so, move the shuffle port by adding a property to conf/yarn-site.xml:

<property>
    <name>mapreduce.shuffle.port</name>
    <value>8090</value>
</property>

4 Run the hadoop program on a single node

MapReduce JobHistory Server: http://jhs_host:port/ (default HTTP port 19888). Drill into a job's details to see which node executed each task.

NameNode: http://nn_host:port/ (default HTTP port 50070). Browse HDFS and the HDFS nodes.

ResourceManager: http://rm_host:port/ (default HTTP port 8088). Browse the MapReduce nodes.

5 Setup the slave node

5.1 untar on the slave

5.2 copy config from master

scp 172.16.100.122:/home/tntuser/hadoop-0.23.0/etc/hadoop/*.xml etc/hadoop

scp 172.16.100.122:/home/tntuser/hadoop-0.23.0/conf/yarn-* conf

5.3 (re) format hdfs on master 

Shut down the daemons on the master first.

bin/hdfs namenode -format -clusterid hadoop_cluster

5.4 add slave hosts

conf/slaves

       172.16.100.122
       172.16.100.130

(The master, 172.16.100.122, is listed so that it also runs slave daemons.)
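With the slave list in one file, copying the configuration to every slave (step 5.2) can be scripted. A sketch, shown in dry-run mode; the DRY_RUN switch, temp slaves file, and target path are illustrative, not part of Hadoop:

```shell
# Build the scp commands to push config to every host in a slaves file.
# DRY_RUN=1 only prints the plan; set DRY_RUN=0 on a real cluster.
DRY_RUN=1
SLAVES_FILE="$(mktemp)"          # stand-in for conf/slaves
printf '172.16.100.122\n172.16.100.130\n' > "$SLAVES_FILE"

plan=""
while read -r host; do
    [ -z "$host" ] && continue   # skip blank lines
    cmd="scp etc/hadoop/*.xml $host:/home/tntuser/hadoop-0.23.0/etc/hadoop/"
    plan="$plan$cmd
"
done < "$SLAVES_FILE"

if [ "$DRY_RUN" = 1 ]; then
    printf '%s' "$plan"          # show what would run
else
    printf '%s' "$plan" | sh     # actually copy
fi
```

Running it prints one scp command per slave, which is a cheap way to review the rollout before touching the cluster.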

5.5 Start Master Daemons

sbin/hadoop-daemon.sh start|stop namenode
sbin/hadoop-daemon.sh start|stop datanode
bin/yarn-daemon.sh start resourcemanager
bin/yarn-daemon.sh start nodemanager
bin/yarn-daemon.sh start historyserver

5.6 Start Slave Daemons

sbin/hadoop-daemon.sh start|stop datanode
bin/yarn-daemon.sh start nodemanager

6 Run the hadoop program on the cluster

Issue 1: temp directory already exists


 hdfs://172.16.100.122:9000/user/tntuser/QuasiMonteCarlo_TMP_3_141592654 already exists.  Please remove it first.

bin/hadoop fs -rm -r QuasiMonteCarlo_TMP_3_141592654


Issue 2:

java.io.FileNotFoundException: File does not exist: hdfs://172.16.100.122:9000/user/tntuser/QuasiMonteCarlo_TMP_3_141592654/out/reduce-out
    at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:764)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1614)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1638)
    at org.apache.hadoop.examples.QuasiMonteCarlo.estimatePi(QuasiMonteCarlo.java:314)
    at org.apache.hadoop.examples.QuasiMonteCarlo.run(QuasiMonteCarlo.java:351)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:69)
    at org.apache.hadoop.examples.QuasiMonteCarlo.main(QuasiMonteCarlo.java:360)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:4

1) Check the DNS config in /etc/resolv.conf; make sure the nameserver is right.

2) Add the master/slave hostnames to each other's /etc/hosts:

172.16.100.122          dev122
172.16.100.130          dev130

3) Check the Hadoop slaves config file conf/slaves; make sure each hostname or IP is right.
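A quick sanity check for steps 1-3: verify that every name you rely on actually resolves on this host. A sketch assuming a Linux box with `getent`, demonstrated against `localhost` so it works anywhere (on the cluster you would pass the real names):

```shell
# Print any hostname that fails to resolve on the current host.
check_resolution() {
    for h in "$@"; do
        getent hosts "$h" > /dev/null || echo "unresolved: $h"
    done
}

# On the cluster, e.g.:  check_resolution dev122 dev130
check_resolution localhost
```

No output means everything resolved; any "unresolved:" line points at a missing /etc/hosts or DNS entry.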

Reference

http://www.crobak.org/2011/12/getting-started-with-apache-hadoop-0-23-0/

http://www.cloudera.com/blog/2011/11/building-and-deploying-mr2/

http://www.rpark.com/2011/05/building-hadoop-cluster.html


http://hadoop.apache.org/common/docs/current/single_node_setup.html

http://hadoop.apache.org/common/docs/current/cluster_setup.html

http://hadoop.apache.org/common/docs/r0.23.0/hadoop-yarn/hadoop-yarn-site/SingleCluster.html
