Word count with MapReduce in Python

Source: Internet | Posted by: 梅雨知时节 | Editor: 程序博客网 | Date: 2024/05/16 06:30

The map function

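#!/usr/bin/env python
import sys

# Read text from standard input, one line at a time
for line in sys.stdin:
    line = line.strip()
    words = line.split()
    for word in words:
        # Emit "word<TAB>1" for every word
        print("%s\t%s" % (word, 1))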

The reduce function

#!/usr/bin/env python
import sys

current_word = None
current_count = 0

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue  # skip blank lines
    # Each input line is "word<TAB>count", as emitted by the mapper
    word, count = line.split("\t", 1)
    if current_word == word:
        # Same word as on the previous line: add to the running total
        current_count += int(count)
    else:
        # A new word begins: print the finished word (if any), then reset
        if current_word is not None:
            print("%s\t%s" % (current_word, current_count))
        current_word = word
        current_count = int(count)

# Print the last word once the loop has ended (nothing is printed for empty input)
if current_word is not None:
    print("%s\t%s" % (current_word, current_count))

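Test locally (both scripts must be executable and start with a Python shebang line such as #!/usr/bin/env python):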

echo "hello word hello Hadoop map reduce" | ./mapper.py |sort -k1,1| ./reducer.py
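The same map, sort, reduce pipeline can also be checked without a shell. The following is only a minimal in-process sketch: the function names map_words and reduce_counts are hypothetical stand-ins for mapper.py and reducer.py, and Python's sorted() plays the role of the sort command.

```python
def map_words(lines):
    """Yield (word, 1) for every whitespace-separated word, like mapper.py."""
    for line in lines:
        for word in line.split():
            yield word, 1


def reduce_counts(sorted_pairs):
    """Sum the counts of consecutive equal words, like reducer.py."""
    current_word, current_count = None, 0
    for word, count in sorted_pairs:
        if word == current_word:
            current_count += count
        else:
            # A new word begins: emit the finished one, if any
            if current_word is not None:
                yield current_word, current_count
            current_word, current_count = word, count
    # Emit the last word (nothing is emitted for empty input)
    if current_word is not None:
        yield current_word, current_count


lines = ["hello word hello Hadoop map reduce"]
# sorted() stands in for `sort -k1,1` between the two stages
result = dict(reduce_counts(sorted(map_words(lines))))
print(result)
```

Because both stages are plain generators, they can be exercised with small inputs before the scripts are submitted to a cluster.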

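The Python reducer can only count words that arrive in sorted order, because it compares each word only with the one on the previous line. In the local test the sort command provides that ordering; on Hadoop, the framework sorts the mapper output by key before passing it to the reducer.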
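To illustrate why the ordering matters, here is a small sketch using itertools.groupby, which, like the reducer, only merges adjacent equal keys. The helper name count_consecutive is hypothetical and not part of the original scripts.

```python
from itertools import groupby


def count_consecutive(pairs):
    """Sum counts over runs of adjacent equal words, as the reducer does."""
    return [
        (word, sum(count for _, count in group))
        for word, group in groupby(pairs, key=lambda pair: pair[0])
    ]


words = "hello word hello Hadoop map reduce".split()
pairs = [(word, 1) for word in words]  # what the mapper would emit

unsorted_result = count_consecutive(pairs)        # input left in original order
sorted_result = count_consecutive(sorted(pairs))  # input sorted by word first

print(unsorted_result)  # "hello" shows up as two separate entries
print(sorted_result)    # "hello" shows up once, with a count of 2
```

Without the sort step the two occurrences of "hello" are not adjacent, so they are reported as two separate entries instead of one total.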

Running on Hadoop:

bin/hadoop jar contrib/streaming/hadoop-*streaming*.jar \
-file test/code/mapper.py -mapper test/code/mapper.py \
-file test/code/reducer.py -reducer test/code/reducer.py \
-input /user/rte/hdfs_in/* -output /user/rte/hdfs_out

0 0