Writing a Map-Reduce Program in Python
http://www.michael-noll.com/wiki/Writing_An_Hadoop_MapReduce_Program_In_Python is a good write-up, but some of the Python modules it relies on were not installed on my server, so I made a few small changes to the code. Python is better suited to rapid development than Java, so learning how to write Map-Reduce programs in Python is well worth the effort.
First, write a Python program implementing the map step:
#!/usr/bin/env python
import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    # increase counters
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        print('%s\t%s' % (word, 1))
This program is very simple: it reads lines from standard input, splits each line into words, and prints each word back to standard output. Next comes the Python program implementing the reduce step. The original article sorts the keys with itemgetter, but the operator module of the Python installed on my test server does not provide itemgetter, so I simply removed the sorting. The modified code is as follows:
#!/usr/bin/env python
#from operator import itemgetter
import sys

# maps words to their counts
word2count = {}

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # parse the input we got from mapper.py
    word, count = line.split('\t', 1)
    # convert count (currently a string) to int
    try:
        count = int(count)
        word2count[word] = word2count.get(word, 0) + count
    except ValueError:
        # count was not a number, so silently
        # ignore/discard this line
        pass

# sort the words lexicographically;
#
# this step is NOT required, we just do it so that our
# final output will look more like the official Hadoop
# word count examples
#sorted_word2count = sorted(word2count.items(), key=itemgetter(0))

# write the results to STDOUT (standard output)
for word, count in word2count.items():
    print('%s\t%s' % (word, count))
This program is equally straightforward: it builds a dictionary whose keys are the words and whose values are the number of times each word appears. Once counting is done, a loop prints every entry in the dictionary. The original article also tests the two programs separately, as follows:
1. Testing mapper.py
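The sorting removed above can be restored even without operator.itemgetter: a plain lambda key produces the same lexicographic ordering on any Python version. A minimal sketch (the sample counts match the test run shown in this article):

```python
# counts as produced by the reducer on the sample input
word2count = {'foo': 3, 'bar': 1, 'quux': 2, 'labs': 1}

# sort by word (the dict key) without importing operator.itemgetter
sorted_word2count = sorted(word2count.items(), key=lambda kv: kv[0])

for word, count in sorted_word2count:
    print('%s\t%s' % (word, count))
```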
[henshao@test208011 python_hadoop]$ echo "foo foo quux labs foo bar quux" | /home/henshao/python_hadoop/mapper.py
foo 1
foo 1
quux 1
labs 1
foo 1
bar 1
quux 1
2. Testing reducer.py
[henshao@test208011 python_hadoop]$ echo "foo foo quux labs foo bar quux" | /home/henshao/python_hadoop/mapper.py | sort | /home/henshao/python_hadoop/reducer.py
labs 1
quux 2
foo 3
bar 1
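Because the `sort` in the pipeline guarantees that all lines for a given word arrive consecutively, the reducer does not actually need to keep the whole word2count dictionary in memory. A hedged sketch of a streaming variant using itertools.groupby (this is my own addition, not from the original article; the function names are hypothetical):

```python
from itertools import groupby

def read_pairs(lines):
    """Yield (word, count) pairs parsed from tab-delimited mapper output."""
    for line in lines:
        word, _, count = line.strip().partition('\t')
        try:
            yield word, int(count)
        except ValueError:
            # count was not a number: skip the line silently
            pass

def reduce_sorted(lines):
    """Sum counts per word; assumes lines arrive sorted by word, as `sort` guarantees."""
    for word, group in groupby(read_pairs(lines), key=lambda pair: pair[0]):
        yield word, sum(count for _, count in group)

# the same data the shell test pipes through `sort`
sorted_lines = ['bar\t1', 'foo\t1', 'foo\t1', 'foo\t1',
                'labs\t1', 'quux\t1', 'quux\t1']
totals = list(reduce_sorted(sorted_lines))
for word, total in totals:
    print('%s\t%s' % (word, total))
```

To use it as a real reducer, replace the sample list with sys.stdin; only one group of counts is held at a time, so memory use stays constant.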
Generate some test data and upload the file to HDFS.
[henshao@test208011 python_hadoop]$ echo "foo foo quux labs foo bar quux" > element.txt
[henshao@test208011 python_hadoop]$ cat element.txt
foo foo quux labs foo bar quux
[henshao@test208011 python_hadoop]$ ~/hadoop/bin/hadoop fs -put element.txt /home/python_test/
[henshao@test208011 python_hadoop]$ ~/hadoop/bin/hadoop fs -cat /home/python_test/element.txt
foo foo quux labs foo bar quux
The command to run the job is shown below (the "-file" options are mandatory; without them the job fails):
~/hadoop/bin/hadoop jar ~/hadoop/contrib/streaming/hadoop-0.19.2-streaming.jar -file /home/henshao/python_hadoop/mapper.py -mapper mapper.py -file /home/henshao/python_hadoop/reducer.py -reducer reducer.py -input /home/python_test/element.txt -output /home/python
The output of the run:
[henshao@test208011 python_hadoop]$ ~/hadoop/bin/hadoop jar ~/hadoop/contrib/streaming/hadoop-0.19.2-streaming.jar -file /home/henshao/python_hadoop/mapper.py -mapper mapper.py -file /home/henshao/python_hadoop/reducer.py -reducer reducer.py -input /home/python_test/element.txt -output /home/python
packageJobJar: [/home/henshao/python_hadoop/mapper.py, /home/henshao/python_hadoop/reducer.py, /home/henshao/hadoop-datastore/hadoop-henshao/hadoop-unjar5362045099634515320/] [] /tmp/streamjob7670340198799210833.jar tmpDir=null
10/01/21 19:00:51 INFO mapred.FileInputFormat: Total input paths to process : 1
10/01/21 19:00:51 INFO streaming.StreamJob: getLocalDirs(): [/home/henshao/hadoop-datastore/hadoop-henshao/mapred/local]
10/01/21 19:00:51 INFO streaming.StreamJob: Running job: job_201001211801_0013
10/01/21 19:00:51 INFO streaming.StreamJob: To kill this job, run:
10/01/21 19:00:51 INFO streaming.StreamJob: /home/henshao/hadoop/bin/../bin/hadoop job -Dmapred.job.tracker=127.0.0.1:9001 -kill job_201001211801_0013
10/01/21 19:00:51 INFO streaming.StreamJob: Tracking URL: http://localhost:50030/jobdetails.jsp?jobid=job_201001211801_0013
10/01/21 19:00:52 INFO streaming.StreamJob: map 0% reduce 0%
10/01/21 19:00:56 INFO streaming.StreamJob: map 50% reduce 0%
10/01/21 19:00:57 INFO streaming.StreamJob: map 100% reduce 0%
10/01/21 19:01:03 INFO streaming.StreamJob: map 100% reduce 100%
10/01/21 19:01:04 INFO streaming.StreamJob: Job complete: job_201001211801_0013
10/01/21 19:01:04 INFO streaming.StreamJob: Output: /home/python
[henshao@test208011 python_hadoop]$ ~/hadoop/bin/hadoop fs -ls /home/python/part-00000
Found 1 items
-rw-r--r-- 1 henshao supergroup 26 2010-01-21 19:01 /home/python/part-00000
[henshao@test208011 python_hadoop]$ ~/hadoop/bin/hadoop fs -cat /home/python/part-00000
labs 1
quux 2
foo 3
bar 1
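Before submitting to Hadoop, the whole map → shuffle → reduce chain can also be sanity-checked in plain Python. A minimal sketch, assuming the same word-count semantics as mapper.py and reducer.py above (the function names are my own):

```python
def map_phase(lines):
    # mapper.py: emit (word, 1) for every word on every input line
    for line in lines:
        for word in line.split():
            yield word, 1

def shuffle(pairs):
    # stands in for Hadoop Streaming's sort step: identical keys end up adjacent
    return sorted(pairs)

def reduce_phase(pairs):
    # reducer.py: sum the counts for each word
    counts = {}
    for word, n in pairs:
        counts[word] = counts.get(word, 0) + n
    return counts

result = reduce_phase(shuffle(map_phase(["foo foo quux labs foo bar quux"])))
for word, n in sorted(result.items()):
    print('%s\t%s' % (word, n))
```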
Saving this here for future study.
Adapted from http://blog.163.com/ecy_fu/blog/static/4445126201002191329467/