Implementing MapReduce in Python

Source: Internet; Published by: 湖南银楼软件下载; Edited by: 程序博客网; Date: 2024/05/20 04:50
1: mapper.py (the map step)


import sys

# Read lines from standard input and emit one "word<TAB>1" pair per word.
for line in sys.stdin:
    line = line.strip()
    words = line.split()
    for word in words:
        print('%s\t%s' % (word, 1))
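The mapper logic can also be written as a function that takes any iterable of lines, which makes it easy to try out without piping data through standard input. This is only a minimal sketch: the name `map_words` and the sample lines are hypothetical and do not come from the article.

```python
def map_words(lines):
    """Yield one 'word<TAB>1' string per whitespace-separated word."""
    for line in lines:
        for word in line.strip().split():
            yield '%s\t%s' % (word, 1)


# Hypothetical sample input: several words on one line, an empty line,
# and a line with surrounding whitespace.
sample = ["ggg hhh ggg", "", "  aaa  "]
pairs = list(map_words(sample))
for pair in pairs:
    print(pair)
```

Empty lines produce no output, and every word is emitted with a count of 1. Adding the counts together is left entirely to the reducer.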

2: reducer.py
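import sys

current_word = None
current_count = 0
word = None

# The input must be sorted by word so that identical words arrive on adjacent lines.
for line in sys.stdin:
    line = line.strip()
    word, count = line.split('\t', 1)
    try:
        count = int(count)
    except ValueError:
        # Skip lines whose count is not an integer.
        continue
    if current_word == word:
        current_count += count
    else:
        # A new word starts here: emit the total for the previous word.
        if current_word:
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# Emit the total for the last word (nothing is printed for empty input).
if current_word:
    print('%s\t%s' % (current_word, current_count))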
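The reducer relies on identical words arriving on consecutive lines. As an alternative sketch (not the article's code), the same grouping can be expressed with `itertools.groupby` from the standard library. The function names `parse` and `reduce_counts` are hypothetical.

```python
from itertools import groupby


def parse(lines):
    """Turn 'word<TAB>count' lines into (word, count) tuples, skipping bad lines."""
    for line in lines:
        word, sep, count = line.strip().partition('\t')
        try:
            yield word, int(count)
        except ValueError:
            # No tab on the line, or the count is not an integer.
            continue


def reduce_counts(lines):
    """Sum the counts of adjacent equal words; the input must be sorted by word."""
    for word, group in groupby(parse(lines), key=lambda pair: pair[0]):
        yield '%s\t%s' % (word, sum(count for _, count in group))


sorted_pairs = ['aaa\t1', 'ggg\t1', 'ggg\t1', 'hhh\t1', 'hhh\t1']
totals = list(reduce_counts(sorted_pairs))
for total in totals:
    print(total)
```

Like the hand-written loop, `groupby` only merges neighbouring keys, so it depends on sorted input in the same way.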
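(Upload both Python files to the /root directory on the master node.)
(Remember to change the permissions of both files so they are executable.)
#: chmod 777 mapper.py
#: chmod 777 reducer.py
3: Testing the code
Create a file named ddd.txt in the /root directory
#: vi ddd.txt
Write the following data into ddd.txt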
ggg
ggg
hhh
hhh
aaa
Save the file and exit.
Run the following command to test the scripts locally
[root@master ~]# more ddd.txt | python ./mapper.py | sort | python ./reduce
Output
aaa 1
ggg 2
hhh 2
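The `sort` stage in the pipeline matters because the reducer only merges a word with the line directly before it. The sketch below illustrates this with a simplified, in-memory version of the reduce step. The function name `count_adjacent` and the interleaved word order are hypothetical, chosen to make the effect visible.

```python
def count_adjacent(pairs):
    """Streaming-style reduce: merge a pair only with the one right before it."""
    result = []
    for word, count in pairs:
        if result and result[-1][0] == word:
            result[-1] = (word, result[-1][1] + count)
        else:
            result.append((word, count))
    return result


# Hypothetical interleaved input (the article's ddd.txt is already grouped).
words = ["ggg", "hhh", "ggg", "hhh", "aaa"]
mapped = [(w, 1) for w in words]

unsorted_result = count_adjacent(mapped)        # words are split across entries
sorted_result = count_adjacent(sorted(mapped))  # one entry per word

print(unsorted_result)
print(sorted_result)
```

Without sorting, `ggg` and `hhh` each appear twice with a count of 1. With sorting, each word appears once with its full total.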
4: Running it as a Hadoop MapReduce job
Start the Hadoop cluster, then create a directory named input in HDFS
hdfs dfs -mkdir /input
Check that the directory was created
hdfs dfs -ls -R /
Upload ddd.txt to the input directory
hdfs dfs -put /root/ddd.txt /input
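Run the following command (a single command, split across lines with backslashes)
#: /usr/local/hadoop/bin/hadoop jar /usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.7.0*.jar \
  -D stream.non.zero.exit.is.failure=false \
  -files 'mapper.py,ccc.py' \
  -input /input/dat0203.log \
  -output /zanni \
  -mapper "python ./mapper.py" \
  -reducer "python ./ccc.py"
(This command uses ccc.py as the reducer and /input/dat0203.log as the input. To run the files from this article, substitute reducer.py and /input/ddd.txt.)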
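Between the map and reduce steps, the framework routes all pairs with the same key to the same reducer and hands them over sorted by key. The sketch below imitates that idea in memory, under stated assumptions: `zlib.crc32` is only a deterministic stand-in for Hadoop's own partitioning, and the names `partition` and `reduce_bucket` as well as the choice of two reducers are hypothetical.

```python
import zlib
from collections import defaultdict


def partition(pairs, num_reducers):
    """Assign each (word, count) pair to a reducer index by hashing the word."""
    buckets = defaultdict(list)
    for word, count in pairs:
        # crc32 is a stand-in hash; Hadoop uses its own partitioner.
        index = zlib.crc32(word.encode('utf-8')) % num_reducers
        buckets[index].append((word, count))
    return buckets


def reduce_bucket(pairs):
    """Sort one reducer's pairs by word, then sum the counts per word."""
    totals = {}
    for word, count in sorted(pairs):
        totals[word] = totals.get(word, 0) + count
    return totals


mapped = [(w, 1) for w in ["ggg", "ggg", "hhh", "hhh", "aaa"]]
buckets = partition(mapped, num_reducers=2)
results = {index: reduce_bucket(pairs) for index, pairs in buckets.items()}
print(results)
```

Because a word always hashes to the same index, no word is split across reducers, and each reducer's total for a word is already the final count.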
When the job finishes, the output files can be found in the job's output directory in HDFS (the path passed to -output).
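Once the output text has been fetched (for example with `hdfs dfs -cat` on a file in the output directory; the exact file names are not given in the article), it can be loaded back into Python. This is a minimal sketch that assumes each line has the `word<TAB>count` form printed by the reducer. The name `parse_counts` and the sample text are hypothetical.

```python
def parse_counts(text):
    """Parse 'word<TAB>count' lines into a dict mapping word to integer count."""
    counts = {}
    for line in text.splitlines():
        word, sep, count = line.partition('\t')
        if not sep:
            # Skip blank lines and lines without a tab separator.
            continue
        counts[word] = int(count)
    return counts


# Hypothetical sample text in the format the reducer prints.
sample_output = "aaa\t1\nggg\t2\nhhh\t2\n"
counts = parse_counts(sample_output)

# List the words from most to least frequent.
for word, count in sorted(counts.items(), key=lambda item: -item[1]):
    print(word, count)
```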