Installing Hadoop 1.2.1 on CentOS 6.4 and Running WordCount


After nearly a month of studying Hadoop, I have a rough understanding of its features, how it works, and the surrounding ecosystem. It feels like time to try building a pseudo-distributed Hadoop setup and run WordCount, the "Hello World" of MapReduce.

Let's go!


Software environment:

1. CentOS 6.4 virtual machine (VirtualBox)

2. Java 1.7

3. Hadoop 1.2.1


Steps:

1. Install the virtual machine. I used a minimal install, with 8 GB of disk space and 1 GB of RAM.


2. After installing the VM, configure its network settings.


2.1 Log in as root.

2.2 Use vi /etc/sysconfig/network-scripts/ifcfg-eth0 to configure the network, as follows:

[root@caixen-1 usr]# vi /etc/sysconfig/network-scripts/ifcfg-eth0 


DEVICE=eth0
HWADDR=08:00:27:50:23:5B
TYPE=Ethernet
UUID=2688ca88-3ed9-4187-802c-2f84729c56ed
ONBOOT=yes
NM_CONTROLLED=yes
BOOTPROTO=none
IPADDR=192.168.2.100
NETMASK=255.255.255.0
GATEWAY=192.168.2.1

Save and exit.


2.3 Add a nameserver so that domain names can be resolved

[root@caixen-1 usr]# vi /etc/resolv.conf 


nameserver 192.168.2.1

Save and exit.
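
Note: on some setups NetworkManager or dhclient rewrites /etc/resolv.conf. If that happens here, one workaround (an assumption worth verifying on your system) is to pin the DNS in the interface config instead, by adding these lines to /etc/sysconfig/network-scripts/ifcfg-eth0:

PEERDNS=no
DNS1=192.168.2.1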


2.4 Restart the network service so the configuration takes effect

[root@caixen-1 usr]# service network restart
Shutting down interface eth0:                              [  OK  ]
Shutting down loopback interface:                          [  OK  ]
Bringing up loopback interface:                            [  OK  ]
Bringing up interface eth0:                                [  OK  ]
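
Optionally, double-check that the static address is now active before testing connectivity:

[root@caixen-1 usr]# ifconfig eth0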


2.5 Check network connectivity

[root@caixen-1 usr]# ping www.baidu.com
PING www.a.shifen.com (61.135.169.121) 56(84) bytes of data.
64 bytes from 61.135.169.121: icmp_seq=1 ttl=55 time=36.9 ms
64 bytes from 61.135.169.121: icmp_seq=2 ttl=54 time=41.2 ms
64 bytes from 61.135.169.121: icmp_seq=3 ttl=54 time=36.4 ms
64 bytes from 61.135.169.121: icmp_seq=4 ttl=55 time=34.8 ms
64 bytes from 61.135.169.121: icmp_seq=5 ttl=55 time=35.7 ms
64 bytes from 61.135.169.121: icmp_seq=6 ttl=54 time=34.9 ms
64 bytes from 61.135.169.121: icmp_seq=7 ttl=54 time=30.4 ms
^C
--- www.a.shifen.com ping statistics ---
7 packets transmitted, 7 received, 0% packet loss, time 6552ms
rtt min/avg/max/mdev = 30.485/35.812/41.270/2.968 ms

Network configuration succeeded!


3. Install JDK 1.7

3.1 Create a java directory under /usr/ to hold the Java files

[root@caixen-1 usr]# mkdir java


3.2 Upload the JDK archive into the java directory and extract it

[root@caixen-1 usr]# cd java
[root@caixen-1 java]# ls -al
total 140052
drwxr-xr-x.  2 root root      4096 Jan 20 23:21 .
drwxr-xr-x. 14 root root      4096 Jan 20 23:05 ..
-rw-r--r--.  1 root root 143398235 Jan 20 23:21 jdk-7u71-linux-i586.tar.gz


[root@caixen-1 java]# tar -xzf jdk-7u71-linux-i586.tar.gz 
[root@caixen-1 java]# ls -al
total 140056
drwxr-xr-x.  3 root root      4096 Jan 20 23:23 .
drwxr-xr-x. 14 root root      4096 Jan 20 23:05 ..
drwxr-xr-x.  8 uucp  143      4096 Sep 27 08:30 jdk1.7.0_71
-rw-r--r--.  1 root root 143398235 Jan 20 23:21 jdk-7u71-linux-i586.tar.gz

3.3 Configure JAVA_HOME and PATH

[root@caixen-1 ~]# vi /etc/profile

# java and hadoop environment
export JAVA_HOME=/usr/java/jdk1.7.0_71
export CLASSPATH=".:$JAVA_HOME/lib:$CLASSPATH"
export PATH="$JAVA_HOME/bin:$PATH"

Save and exit.
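
The new variables take effect only for new login shells; to apply them to the current shell, reload the profile:

[root@caixen-1 ~]# source /etc/profile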


3.4 Test that Java runs

[root@caixen-1 ~]# java -version

java version "1.7.0_71"
Java(TM) SE Runtime Environment (build 1.7.0_71-b14)
Java HotSpot(TM) Client VM (build 24.71-b01, mixed mode)

Configuration succeeded!


4. Install Hadoop 1.2.1


4.1 Create a hadoop directory under /usr/ to hold the Hadoop files

[root@caixen-1 usr]# mkdir hadoop


4.2 Upload the Hadoop archive into the hadoop directory and extract it

[root@caixen-1 usr]# cd hadoop/
[root@caixen-1 hadoop]# ls -al
total 62364
drwxr-xr-x.  2 root root     4096 Jan 21 21:36 .
drwxr-xr-x. 14 root root     4096 Jan 20 23:05 ..
-rw-r--r--.  1 root root 63851630 Jan 21 21:37 hadoop-1.2.1.tar.gz

[root@caixen-1 hadoop]# tar -xzf hadoop-1.2.1.tar.gz 
[root@caixen-1 hadoop]# ls -al
total 62368
drwxr-xr-x.  3 root root     4096 Jan 21 21:38 .
drwxr-xr-x. 14 root root     4096 Jan 20 23:05 ..
drwxr-xr-x. 15 root root     4096 Jul 23  2013 hadoop-1.2.1
-rw-r--r--.  1 root root 63851630 Jan 21 21:37 hadoop-1.2.1.tar.gz

4.3 Configure HADOOP_PREFIX

[root@caixen-1 hadoop]# vi /etc/profile

# java and hadoop environment
export JAVA_HOME=/usr/java/jdk1.7.0_71
export HADOOP_PREFIX=/usr/hadoop/hadoop-1.2.1
export CLASSPATH=".:$JAVA_HOME/lib:$CLASSPATH"
export PATH="$JAVA_HOME/bin:$HADOOP_PREFIX/bin:$PATH"

Save and exit.
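
As before, reload the profile so the current shell picks up HADOOP_PREFIX and the updated PATH, and verify:

[root@caixen-1 hadoop]# source /etc/profile
[root@caixen-1 hadoop]# echo $HADOOP_PREFIX
/usr/hadoop/hadoop-1.2.1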


4.4 Test that the hadoop command runs

[root@caixen-1 hadoop]# hadoop
Usage: hadoop [--config confdir] COMMAND
where COMMAND is one of:
  namenode -format     format the DFS filesystem
  secondarynamenode    run the DFS secondary namenode
  namenode             run the DFS namenode
  datanode             run a DFS datanode
  dfsadmin             run a DFS admin client
  mradmin              run a Map-Reduce admin client
  fsck                 run a DFS filesystem checking utility
  fs                   run a generic filesystem user client
  balancer             run a cluster balancing utility
  oiv                  apply the offline fsimage viewer to an fsimage
  fetchdt              fetch a delegation token from the NameNode
  jobtracker           run the MapReduce job Tracker node
  pipes                run a Pipes job
  tasktracker          run a MapReduce task Tracker node
  historyserver        run job history servers as a standalone daemon
  job                  manipulate MapReduce jobs
  queue                get information regarding JobQueues
  version              print the version
  jar <jar>            run a jar file
  distcp <srcurl> <desturl> copy file or directories recursively
  distcp2 <srcurl> <desturl> DistCp version 2
  archive -archiveName NAME -p <parent path> <src>* <dest> create a hadoop archive
  classpath            prints the class path needed to get the
                       Hadoop jar and the required libraries
  daemonlog            get/set the log level for each daemon
 or
  CLASSNAME            run the class named CLASSNAME
Most commands print help when invoked w/o parameters.

It runs!


5. Configure Hadoop


5.1 Configure core-site.xml

[root@caixen-1 hadoop]# vi hadoop-1.2.1/conf/core-site.xml

<configuration>


<property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:9000</value>
</property>


<property>
        <name>hadoop.tmp.dir</name>
        <value>/home/caixen/hadooptmpdir</value>
        <description>A base for other temporary directories.</description>
</property>


</configuration>
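
Here fs.default.name tells clients where the NameNode listens, and hadoop.tmp.dir is the base directory under which this setup keeps HDFS metadata and block data. Formatting the NameNode (step 8 below) creates that directory automatically, but it can also be pre-created as a precaution against permission problems:

[root@caixen-1 hadoop]# mkdir -p /home/caixen/hadooptmpdir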


5.2 Configure hdfs-site.xml. Replication is set to 1 because a pseudo-distributed cluster runs only a single DataNode.

[root@caixen-1 hadoop]# vi hadoop-1.2.1/conf/hdfs-site.xml 

<configuration>


<property>
        <name>dfs.replication</name>
        <value>1</value>
</property>


</configuration>


5.3 Configure mapred-site.xml. mapred.job.tracker points MapReduce clients at the JobTracker address.

[root@caixen-1 hadoop]# vi hadoop-1.2.1/conf/mapred-site.xml 

<configuration>


<property>
        <name>mapred.job.tracker</name>
        <value>localhost:9001</value>
</property>


</configuration>


5.4 Edit hadoop-env.sh and add the JAVA_HOME path

[root@caixen-1 hadoop]# vi hadoop-1.2.1/conf/hadoop-env.sh

export JAVA_HOME=/usr/java/jdk1.7.0_71



Hadoop configuration complete!



6. Stop iptables and edit /etc/hosts

[root@caixen-1 hadoop]# service iptables stop
iptables: Flushing firewall rules:                         [  OK  ]
iptables: Setting chains to policy ACCEPT: filter          [  OK  ]
iptables: Unloading modules:                               [  OK  ]
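
Note that service iptables stop only lasts until the next reboot; to also keep the firewall from starting at boot:

[root@caixen-1 hadoop]# chkconfig iptables off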


[root@caixen-1 hadoop]# vi /etc/hosts

127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.2.100  caixen-1


7. Install the openSSH server and client (if not already installed), and configure passwordless SSH login

[root@caixen-1 hadoop]# yum install -y openssh-server.i686 openssh-clients.i686


After installation, set up passwordless login. Hadoop's start/stop scripts use SSH to reach each node (here, localhost), so root must be able to connect without a password:

[root@caixen-1 hadoop]# ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
Generating public/private dsa key pair.
Created directory '/root/.ssh'.
Your identification has been saved in /root/.ssh/id_dsa.
Your public key has been saved in /root/.ssh/id_dsa.pub.
The key fingerprint is:
42:c1:bc:5f:45:8e:95:d5:72:59:8d:12:59:b6:a8:40 root@caixen-1
The key's randomart image is:
+--[ DSA 1024]----+
|     o. E  .+*+o=|
|      oo   +=ooo+|
|      ... ..o..o |
|     ..  ...     |
|      ..S..      |
|       ..        |
|                 |
|                 |
|                 |
+-----------------+
[root@caixen-1 hadoop]# cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys


Test with ssh localhost

[root@caixen-1 hadoop]# ssh localhost
The authenticity of host 'localhost (::1)' can't be established.
RSA key fingerprint is 8b:69:86:bf:3c:78:e3:f8:9e:25:a1:09:ce:25:2a:46.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (RSA) to the list of known hosts.
Last login: Wed Jan 21 21:19:57 2015 from 192.168.2.102

Success! Type exit to log out.


8. Format the NameNode

[root@caixen-1 /]# hadoop namenode -format
15/01/21 22:20:16 INFO namenode.NameNode: STARTUP_MSG: 
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = caixen-1/220.250.64.225
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 1.2.1
STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.2 -r 1503152; compiled by 'mattf' on Mon Jul 22 15:23:09 PDT 2013
STARTUP_MSG:   java = 1.7.0_71
************************************************************/
15/01/21 22:20:16 INFO util.GSet: Computing capacity for map BlocksMap
15/01/21 22:20:16 INFO util.GSet: VM type       = 32-bit
15/01/21 22:20:16 INFO util.GSet: 2.0% max memory = 1013645312
15/01/21 22:20:16 INFO util.GSet: capacity      = 2^22 = 4194304 entries
15/01/21 22:20:16 INFO util.GSet: recommended=4194304, actual=4194304
15/01/21 22:20:17 INFO namenode.FSNamesystem: fsOwner=root
15/01/21 22:20:17 INFO namenode.FSNamesystem: supergroup=supergroup
15/01/21 22:20:17 INFO namenode.FSNamesystem: isPermissionEnabled=true
15/01/21 22:20:17 INFO namenode.FSNamesystem: dfs.block.invalidate.limit=100
15/01/21 22:20:17 INFO namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s)
15/01/21 22:20:17 INFO namenode.FSEditLog: dfs.namenode.edits.toleration.length = 0
15/01/21 22:20:17 INFO namenode.NameNode: Caching file names occuring more than 10 times 
15/01/21 22:20:17 INFO common.Storage: Image file /home/caixen/hadooptmpdir/dfs/name/current/fsimage of size 110 bytes saved in 0 seconds.
15/01/21 22:20:17 INFO namenode.FSEditLog: closing edit log: position=4, editlog=/home/caixen/hadooptmpdir/dfs/name/current/edits
15/01/21 22:20:17 INFO namenode.FSEditLog: close success: truncate to 4, editlog=/home/caixen/hadooptmpdir/dfs/name/current/edits
15/01/21 22:20:17 INFO common.Storage: Storage directory /home/caixen/hadooptmpdir/dfs/name has been successfully formatted.
15/01/21 22:20:17 INFO namenode.NameNode: SHUTDOWN_MSG: 
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at caixen-1/220.250.64.225
************************************************************/

Format succeeded!


9. Start the Hadoop daemons

[root@caixen-1 /]# start-all.sh
starting namenode, logging to /usr/hadoop/hadoop-1.2.1/libexec/../logs/hadoop-root-namenode-caixen-1.out
localhost: starting datanode, logging to /usr/hadoop/hadoop-1.2.1/libexec/../logs/hadoop-root-datanode-caixen-1.out
localhost: starting secondarynamenode, logging to /usr/hadoop/hadoop-1.2.1/libexec/../logs/hadoop-root-secondarynamenode-caixen-1.out
starting jobtracker, logging to /usr/hadoop/hadoop-1.2.1/libexec/../logs/hadoop-root-jobtracker-caixen-1.out
localhost: starting tasktracker, logging to /usr/hadoop/hadoop-1.2.1/libexec/../logs/hadoop-root-tasktracker-caixen-1.out
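
With the daemons up, this Hadoop version also serves web UIs on its stock 1.x ports: the NameNode at http://192.168.2.100:50070 and the JobTracker at http://192.168.2.100:50030 (adjust the host to match your VM's address).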

10. Check the daemons with jps

[root@caixen-1 /]# jps
2746 DataNode
2646 NameNode
2847 SecondaryNameNode
3189 Jps
3031 TaskTracker
2927 JobTracker


All five daemons (NameNode, DataNode, SecondaryNameNode, JobTracker, TaskTracker) are present; Hadoop is running.


11. Run the WordCount example


11.1 Write a.txt and b.txt into a local hadoop_data directory

[root@caixen-1 /]# mkdir /usr/hadoop_data
[root@caixen-1 /]# cd /usr/hadoop_data/
[root@caixen-1 hadoop_data]# echo "hello hadoop" > a.txt
[root@caixen-1 hadoop_data]# echo "hello  bala  bala  bala" > b.txt
[root@caixen-1 hadoop_data]# ls -al
total 16
drwxr-xr-x.  2 root root 4096 Jan 21 22:30 .
drwxr-xr-x. 15 root root 4096 Jan 21 22:29 ..
-rw-r--r--.  1 root root   13 Jan 21 22:29 a.txt
-rw-r--r--.  1 root root   24 Jan 21 22:30 b.txt

11.2 Copy the txt files into HDFS

[root@caixen-1 hadoop_data]# hadoop dfs -mkdir /input

[root@caixen-1 hadoop_data]# hadoop dfs -ls /
Found 2 items
drwxr-xr-x   - root supergroup          0 2015-01-21 22:21 /home
drwxr-xr-x   - root supergroup          0 2015-01-21 22:32 /input

[root@caixen-1 hadoop_data]# hadoop dfs -put /usr/hadoop_data/* /input

[root@caixen-1 hadoop_data]# hadoop dfs -ls /input/
Found 2 items
-rw-r--r--   1 root supergroup         13 2015-01-21 22:34 /input/a.txt
-rw-r--r--   1 root supergroup         24 2015-01-21 22:34 /input/b.txt


Copy succeeded!


11.3 Run the WordCount job

[root@caixen-1 hadoop_data]# hadoop jar /usr/hadoop/hadoop-1.2.1/hadoop-examples-1.2.1.jar wordcount /input /output
15/01/21 22:44:52 INFO input.FileInputFormat: Total input paths to process : 2
15/01/21 22:44:52 INFO util.NativeCodeLoader: Loaded the native-hadoop library
15/01/21 22:44:52 WARN snappy.LoadSnappy: Snappy native library not loaded
15/01/21 22:44:53 INFO mapred.JobClient: Running job: job_201501212242_0001
15/01/21 22:44:54 INFO mapred.JobClient:  map 0% reduce 0%
15/01/21 22:45:11 INFO mapred.JobClient:  map 100% reduce 0%
15/01/21 22:45:22 INFO mapred.JobClient:  map 100% reduce 33%
15/01/21 22:45:24 INFO mapred.JobClient:  map 100% reduce 100%
15/01/21 22:45:28 INFO mapred.JobClient: Job complete: job_201501212242_0001
15/01/21 22:45:28 INFO mapred.JobClient: Counters: 29
15/01/21 22:45:28 INFO mapred.JobClient:   Job Counters 
15/01/21 22:45:28 INFO mapred.JobClient:     Launched reduce tasks=1
15/01/21 22:45:28 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=29854
15/01/21 22:45:28 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
15/01/21 22:45:28 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
15/01/21 22:45:28 INFO mapred.JobClient:     Launched map tasks=2
15/01/21 22:45:28 INFO mapred.JobClient:     Data-local map tasks=2
15/01/21 22:45:28 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=11933
15/01/21 22:45:28 INFO mapred.JobClient:   File Output Format Counters 
15/01/21 22:45:28 INFO mapred.JobClient:     Bytes Written=24
15/01/21 22:45:28 INFO mapred.JobClient:   FileSystemCounters
15/01/21 22:45:28 INFO mapred.JobClient:     FILE_BYTES_READ=54
15/01/21 22:45:28 INFO mapred.JobClient:     HDFS_BYTES_READ=233
15/01/21 22:45:28 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=171617
15/01/21 22:45:28 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=24
15/01/21 22:45:28 INFO mapred.JobClient:   File Input Format Counters 
15/01/21 22:45:28 INFO mapred.JobClient:     Bytes Read=37
15/01/21 22:45:28 INFO mapred.JobClient:   Map-Reduce Framework
15/01/21 22:45:28 INFO mapred.JobClient:     Map output materialized bytes=60
15/01/21 22:45:28 INFO mapred.JobClient:     Map input records=2
15/01/21 22:45:28 INFO mapred.JobClient:     Reduce shuffle bytes=60
15/01/21 22:45:28 INFO mapred.JobClient:     Spilled Records=8
15/01/21 22:45:28 INFO mapred.JobClient:     Map output bytes=58
15/01/21 22:45:28 INFO mapred.JobClient:     Total committed heap usage (bytes)=247341056
15/01/21 22:45:28 INFO mapred.JobClient:     CPU time spent (ms)=2940
15/01/21 22:45:28 INFO mapred.JobClient:     Combine input records=6
15/01/21 22:45:28 INFO mapred.JobClient:     SPLIT_RAW_BYTES=196
15/01/21 22:45:28 INFO mapred.JobClient:     Reduce input records=4
15/01/21 22:45:28 INFO mapred.JobClient:     Reduce input groups=3
15/01/21 22:45:28 INFO mapred.JobClient:     Combine output records=4
15/01/21 22:45:28 INFO mapred.JobClient:     Physical memory (bytes) snapshot=323555328
15/01/21 22:45:28 INFO mapred.JobClient:     Reduce output records=3
15/01/21 22:45:28 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=1042567168
15/01/21 22:45:28 INFO mapred.JobClient:     Map output records=6


The job succeeded!


11.4 View the results

[root@caixen-1 hadoop_data]# hadoop dfs -ls /output
Found 3 items
-rw-r--r--   1 root supergroup          0 2015-01-21 22:45 /output/_SUCCESS
drwxr-xr-x   - root supergroup          0 2015-01-21 22:44 /output/_logs
-rw-r--r--   1 root supergroup         24 2015-01-21 22:45 /output/part-r-00000
[root@caixen-1 hadoop_data]# hadoop dfs -cat /output/part-r-00000
bala 3
hadoop 1
hello 2
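
For intuition, the job computed the same thing this local shell pipeline would (a rough analogy only; Hadoop splits the map and reduce work across tasks rather than running a single pipeline):

[root@caixen-1 hadoop_data]# cat /usr/hadoop_data/*.txt | tr -s ' ' '\n' | sort | uniq -c
      3 bala
      1 hadoop
      2 hello

Here tr -s ' ' '\n' plays the role of the mapper (splitting lines into words) and sort | uniq -c plays the combiner/reducer (grouping and counting).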

At this point, the pseudo-distributed Hadoop test environment is complete.


If deleting a file in HDFS fails with a safe mode error, safemode needs to be turned off first:

[root@caixen-1 hadoop_data]# hadoop dfs -rmr /output
rmr: org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot delete /output. Name node is in safe mode.

[root@caixen-1 hadoop_data]# hadoop dfsadmin -safemode leave

Safe mode is OFF
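
The current state can also be queried at any time:

[root@caixen-1 hadoop_data]# hadoop dfsadmin -safemode get
Safe mode is OFF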

Now files in HDFS can be deleted:

[root@caixen-1 hadoop_data]# hadoop dfs -rmr /output
Deleted hdfs://localhost:9000/output
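
When you are done, the whole stack shuts down cleanly with the matching stop script:

[root@caixen-1 hadoop_data]# stop-all.sh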
