Learning Hadoop (2): Word Count

Source: Internet. Editor: 程序博客网. Time: 2024/05/21 07:06

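The previous post briefly explained Hadoop Streaming and how to write a Mapper and Reducer in Python. This post goes straight to the steps and the code; a later post will cover how to do a join.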

1. Implement the mapper and reducer. The code is as follows:

    #!/usr/bin/env python
    # coding:utf-8
    """
    @author: duanmeng@outlook.com
    @file:   word_count.py
    @brief:  count each word in file word_count and output ('word','sum')
    """
    import sys


    def mapper():
        # Emit "<word>\t1" for every space-separated token read from stdin.
        for line in sys.stdin:
            item = line.strip().strip('.').split(' ')
            for word in item:
                if word:  # skip empty tokens, which would break the reducer's split
                    print("%s\t%s" % (word, '1'))


    def reducer():
        # Input arrives sorted by key, so identical words are adjacent.
        (last_word, last_count) = (None, 0)
        for line in sys.stdin:
            item = line.strip().split('\t')
            word = item[0]
            count = item[1]
            if last_word and last_word != word:
                # The key changed: flush the finished word and start a new count.
                print("%s\t%s" % (last_word, last_count))
                last_word = word
                last_count = int(count)
            else:
                last_word = word
                last_count = last_count + int(count)
        if last_word:
            # Flush the final word.
            print("%s\t%s" % (last_word, last_count))


    if __name__ == '__main__':
        # 'm' runs the mapper, 'r' runs the reducer; anything else is an error.
        mode = sys.argv[1] if len(sys.argv) > 1 else ''
        if mode == 'm':
            mapper()
        elif mode == 'r':
            reducer()
        else:
            sys.exit(1)

2. Upload word_count to HDFS:

    hadoop fs -put word_count $HDFS/test

The file contents are as follows:

    The quick brown fox jumps over the lazy dog.
    The quick brown fox jumps over the lazy dog.
    The quick brown fox jumps over the lazy dog.
    The quick brown fox jumps over the lazy dog.
    The quick brown fox jumps over the lazy dog.
    The quick brown fox jumps over the lazy dog.
    The quick brown fox jumps over the lazy dog.
    The quick brown fox jumps over the lazy dog.
    The quick brown fox jumps over the lazy dog.
    The quick brown fox jumps over the lazy dog.

3. Run the Hadoop job:

    hadoop streaming -input $HDFS/test/word_count -output $HDFS/output -mapper 'python word_count.py m' -reducer 'python word_count.py r' -file word_count.py -numReduceTasks 1

4. View the result:

    hadoop fs -cat $HDFS/output/part-00000
    The     10
    brown   10
    dog     10
    fox     10
    jumps   10
    lazy    10
    over    10
    quick   10
    the     10

"The" and "the" are counted separately because the mapper does not normalize case.
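Before submitting the job to a cluster, the map, sort, and reduce flow can be checked locally. The following is a minimal sketch under stated assumptions, not part of the original script: the function names `map_words`, `reduce_counts`, and `local_word_count` are hypothetical, `sorted()` stands in for Hadoop's shuffle/sort phase, and `itertools.groupby` replaces the hand-written `last_word` bookkeeping in the reducer.

```python
from itertools import groupby


def map_words(lines):
    """Yield (word, 1) pairs using the same tokenization as the mapper."""
    for line in lines:
        for word in line.strip().strip('.').split(' '):
            if word:  # skip empty tokens from blank lines
                yield word, 1


def reduce_counts(sorted_pairs):
    """Sum the counts of each run of equal keys; input must be sorted by key."""
    for word, group in groupby(sorted_pairs, key=lambda pair: pair[0]):
        yield word, sum(count for _, count in group)


def local_word_count(lines):
    """Simulate map -> sort -> reduce inside a single process."""
    # sorted() is an assumption standing in for Hadoop's shuffle/sort step.
    return list(reduce_counts(sorted(map_words(lines))))


if __name__ == '__main__':
    sample = ["The quick brown fox jumps over the lazy dog."] * 10
    for word, total in local_word_count(sample):
        print("%s\t%s" % (word, total))
```

For the ten-line sample file, this should produce the same nine words, each with a count of 10. The same check can usually be done from a shell by piping the input file through the mapper, `sort`, and the reducer in turn.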

0 0