Writing a Map-Reduce Program in Python
http://www.michael-noll.com/wiki/Writing_An_Hadoop_MapReduce_Program_In_Python is a good write-up, but some of the Python modules it relies on were not installed on my server, so I made a few small changes to the code. Python is better suited to rapid development than Java, so learning how to write Map-Reduce programs in Python is well worth the effort.
First, write a Python program implementing the map step:
#!/usr/bin/env python
import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    # increase counters
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        print('%s\t%s' % (word, 1))
This program is very simple: it reads lines from standard input, splits each line into words, and prints each word back to standard output. Next comes the Python program implementing the reduce step. The original article sorts the keys with itemgetter, but the operator module of the Python installed on my test server does not provide itemgetter, so I simply removed the sorting. The modified code is as follows:
#!/usr/bin/env python
#from operator import itemgetter
import sys

# maps words to their counts
word2count = {}

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # parse the input we got from mapper.py
    word, count = line.split('\t', 1)
    # convert count (currently a string) to int
    try:
        count = int(count)
        word2count[word] = word2count.get(word, 0) + count
    except ValueError:
        # count was not a number, so silently
        # ignore/discard this line
        pass

# sort the words lexicographically;
#
# this step is NOT required, we just do it so that our
# final output will look more like the official Hadoop
# word count examples
#sorted_word2count = sorted(word2count.items(), key=itemgetter(0))

# write the results to STDOUT (standard output)
for word, count in word2count.items():
    print('%s\t%s' % (word, count))
This program is equally straightforward: it builds a dictionary whose keys are the words and whose values are the number of times each word appears. Once counting is done, a loop prints every entry in the dictionary. The original article also tests the two programs separately, as follows:
1. Testing mapper.py
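The sorting removed above can be restored even without operator.itemgetter: a plain lambda key produces the same lexicographic ordering on any Python version. A minimal sketch (the sample counts match the test run shown in this article):

```python
# counts as produced by the reducer on the sample input
word2count = {'foo': 3, 'bar': 1, 'quux': 2, 'labs': 1}

# sort by word (the dict key) without importing operator.itemgetter
sorted_word2count = sorted(word2count.items(), key=lambda kv: kv[0])

for word, count in sorted_word2count:
    print('%s\t%s' % (word, count))
```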
[henshao@test208011 python_hadoop]$ echo "foo foo quux labs foo bar quux" | /home/henshao/python_hadoop/mapper.py
foo 1
foo 1
quux 1
labs 1
foo 1
bar 1
quux 1
2. Testing reducer.py
[henshao@test208011 python_hadoop]$ echo "foo foo quux labs foo bar quux" | /home/henshao/python_hadoop/mapper.py | sort | /home/henshao/python_hadoop/reducer.py
labs 1
quux 2
foo 3
bar 1
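Because the `sort` in the pipeline guarantees that all lines for a given word arrive consecutively, the reducer does not actually need to keep the whole word2count dictionary in memory. A hedged sketch of a streaming variant using itertools.groupby (this is my own addition, not from the original article; the function names are hypothetical):

```python
from itertools import groupby

def read_pairs(lines):
    """Yield (word, count) pairs parsed from tab-delimited mapper output."""
    for line in lines:
        word, _, count = line.strip().partition('\t')
        try:
            yield word, int(count)
        except ValueError:
            # count was not a number: skip the line silently
            pass

def reduce_sorted(lines):
    """Sum counts per word; assumes lines arrive sorted by word, as `sort` guarantees."""
    for word, group in groupby(read_pairs(lines), key=lambda pair: pair[0]):
        yield word, sum(count for _, count in group)

# the same data the shell test pipes through `sort`
sorted_lines = ['bar\t1', 'foo\t1', 'foo\t1', 'foo\t1',
                'labs\t1', 'quux\t1', 'quux\t1']
totals = list(reduce_sorted(sorted_lines))
for word, total in totals:
    print('%s\t%s' % (word, total))
```

To use it as a real reducer, replace the sample list with sys.stdin; only one group of counts is held at a time, so memory use stays constant.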
Generate some test data and upload the file to HDFS.
[henshao@test208011 python_hadoop]$ echo "foo foo quux labs foo bar quux" > element.txt
[henshao@test208011 python_hadoop]$ cat element.txt
foo foo quux labs foo bar quux
[henshao@test208011 python_hadoop]$ ~/hadoop/bin/hadoop fs -put element.txt /home/python_test/
[henshao@test208011 python_hadoop]$ ~/hadoop/bin/hadoop fs -cat /home/python_test/element.txt
foo foo quux labs foo bar quux
The command to run the job is shown below (the "-file" options are mandatory; without them the job fails):
~/hadoop/bin/hadoop jar ~/hadoop/contrib/streaming/hadoop-0.19.2-streaming.jar -file /home/henshao/python_hadoop/mapper.py -mapper mapper.py -file /home/henshao/python_hadoop/reducer.py -reducer reducer.py -input /home/python_test/element.txt -output /home/python
The output of the run:
[henshao@test208011 python_hadoop]$ ~/hadoop/bin/hadoop jar ~/hadoop/contrib/streaming/hadoop-0.19.2-streaming.jar -file /home/henshao/python_hadoop/mapper.py -mapper mapper.py -file /home/henshao/python_hadoop/reducer.py -reducer reducer.py -input /home/python_test/element.txt -output /home/python
packageJobJar: [/home/henshao/python_hadoop/mapper.py, /home/henshao/python_hadoop/reducer.py, /home/henshao/hadoop-datastore/hadoop-henshao/hadoop-unjar5362045099634515320/] [] /tmp/streamjob7670340198799210833.jar tmpDir=null
10/01/21 19:00:51 INFO mapred.FileInputFormat: Total input paths to process : 1
10/01/21 19:00:51 INFO streaming.StreamJob: getLocalDirs(): [/home/henshao/hadoop-datastore/hadoop-henshao/mapred/local]
10/01/21 19:00:51 INFO streaming.StreamJob: Running job: job_201001211801_0013
10/01/21 19:00:51 INFO streaming.StreamJob: To kill this job, run:
10/01/21 19:00:51 INFO streaming.StreamJob: /home/henshao/hadoop/bin/../bin/hadoop job -Dmapred.job.tracker=127.0.0.1:9001 -kill job_201001211801_0013
10/01/21 19:00:51 INFO streaming.StreamJob: Tracking URL: http://localhost:50030/jobdetails.jsp?jobid=job_201001211801_0013
10/01/21 19:00:52 INFO streaming.StreamJob: map 0% reduce 0%
10/01/21 19:00:56 INFO streaming.StreamJob: map 50% reduce 0%
10/01/21 19:00:57 INFO streaming.StreamJob: map 100% reduce 0%
10/01/21 19:01:03 INFO streaming.StreamJob: map 100% reduce 100%
10/01/21 19:01:04 INFO streaming.StreamJob: Job complete: job_201001211801_0013
10/01/21 19:01:04 INFO streaming.StreamJob: Output: /home/python
[henshao@test208011 python_hadoop]$ ~/hadoop/bin/hadoop fs -ls /home/python/part-00000
Found 1 items
-rw-r--r-- 1 henshao supergroup 26 2010-01-21 19:01 /home/python/part-00000
[henshao@test208011 python_hadoop]$ ~/hadoop/bin/hadoop fs -cat /home/python/part-00000
labs 1
quux 2
foo 3
bar 1
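Before submitting to Hadoop, the whole map → shuffle → reduce chain can also be sanity-checked in plain Python. A minimal sketch, assuming the same word-count semantics as mapper.py and reducer.py above (the function names are my own):

```python
def map_phase(lines):
    # mapper.py: emit (word, 1) for every word on every input line
    for line in lines:
        for word in line.split():
            yield word, 1

def shuffle(pairs):
    # stands in for Hadoop Streaming's sort step: identical keys end up adjacent
    return sorted(pairs)

def reduce_phase(pairs):
    # reducer.py: sum the counts for each word
    counts = {}
    for word, n in pairs:
        counts[word] = counts.get(word, 0) + n
    return counts

result = reduce_phase(shuffle(map_phase(["foo foo quux labs foo bar quux"])))
for word, n in sorted(result.items()):
    print('%s\t%s' % (word, n))
```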
Saving this here for future study.
Adapted from http://blog.163.com/ecy_fu/blog/static/4445126201002191329467/