Writing a practical MapReduce application in Python
Writing distributed programs in Python is quick to develop, easy to debug, and genuinely practical. MapReduce is well suited to processing text files and to data mining. On each machine, switch to the hadoop user:
su - hadoop
wget http://www.python.org/ftp/python/3.0.1/Python-3.0.1.tar.bz2
tar jxvf Python-3.0.1.tar.bz2
cd Python-3.0.1
./configure --prefix=/home/hadoop/python && make && make install
vi /home/hadoop/mapper.py
#!/home/hadoop/python/bin/python3.0
import sys

# Read lines from stdin and emit one "word<TAB>1" pair per word
for line in sys.stdin:
    line = line.strip()
    words = line.split()
    for word in words:
        print("%s\t%s" % (word, 1))
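The mapper's core logic can be factored into a plain function and tested without Hadoop; a minimal sketch (the name map_words is mine, not part of the original scripts):

```python
def map_words(line):
    """Mirror the mapper: split a line into whitespace-separated
    words and yield a (word, 1) pair for each occurrence."""
    for word in line.strip().split():
        yield (word, 1)

# Every occurrence is emitted separately; aggregation happens
# later, in the reducer.
pairs = list(map_words("foo foo quux"))
# pairs == [("foo", 1), ("foo", 1), ("quux", 1)]
```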
vi /home/hadoop/reduce.py
#!/home/hadoop/python/bin/python3.0
from operator import itemgetter
import sys

word2count = {}
for line in sys.stdin:
    line = line.strip()
    word, count = line.split('\t', 1)
    try:
        count = int(count)
        word2count[word] = word2count.get(word, 0) + count
    except ValueError:
        # Skip lines whose count is not a number
        pass

# Sort by word and emit "word<TAB>total"
sorted_word2count = sorted(word2count.items(), key=itemgetter(0))
for word, count in sorted_word2count:
    print("%s\t%s" % (word, count))
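Hadoop Streaming sorts the mapper output by key before it reaches the reducer, so the dictionary above is not strictly required: a reducer can stream over runs of identical keys in constant memory. A sketch using itertools.groupby, reading from an in-memory list here instead of sys.stdin:

```python
from itertools import groupby

def reduce_sorted(lines):
    """Sum counts over consecutive identical words in already
    sorted "word<TAB>count" lines, yielding (word, total) pairs."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        yield word, sum(int(count) for _, count in group)

sorted_lines = ["bar\t1", "foo\t1", "foo\t1", "foo\t1", "quux\t1"]
result = dict(reduce_sorted(sorted_lines))
# result == {"bar": 1, "foo": 3, "quux": 1}
```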
Make both scripts executable and check that they work:
chmod +x /home/hadoop/mapper.py /home/hadoop/reduce.py
echo "foo foo quux labs foo bar quux" | /home/hadoop/mapper.py
foo 1
foo 1
quux 1
labs 1
foo 1
bar 1
quux 1
echo "foo foo quux labs foo bar quux" | /home/hadoop/mapper.py | sort | /home/hadoop/reduce.py
bar 1
foo 3
labs 1
quux 2
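The shell pipeline above can be double-checked in pure Python with collections.Counter, which collapses the map, sort, and reduce steps into one; a quick sanity-check sketch:

```python
from collections import Counter

text = "foo foo quux labs foo bar quux"
counts = Counter(text.split())

# Print sorted by word, matching the reducer's output order
for word, count in sorted(counts.items()):
    print("%s\t%s" % (word, count))
# bar 1, foo 3, labs 1, quux 2 (tab-separated, one pair per line)
```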
Make sure both files are in place on every node!
On the master node, run:
# Copy the conf directory into HDFS
$ cd /home/hadoop/hadoop-0.19.1
$ bin/hadoop dfs -copyFromLocal conf 111
# Check that the copy succeeded
$ bin/hadoop dfs -ls
Found 1 items
drwxr-xr-x - hadoop supergroup 0 2009-05-18 15:27 /user/hadoop/111
# Run the distributed job
$ bin/hadoop jar contrib/streaming/hadoop-0.19.1-streaming.jar -mapper /home/hadoop/mapper.py -reducer /home/hadoop/reduce.py -input 111/* -output 111-output
additionalConfSpec_:null
null=@@@userJobConfProps_.get(stream.shipped.hadoopstreaming
packageJobJar: [/tmp/hadoop-hadoop/hadoop-unjar29198/] [] /tmp/streamjob29199.jar tmpDir=null
[...] INFO mapred.FileInputFormat: Total input paths to process : 12
[...] INFO streaming.StreamJob: getLocalDirs(): [/tmp/hadoop-hadoop/mapred/local]
[...] INFO streaming.StreamJob: Running job: job_200905191453_0001
[...] INFO streaming.StreamJob: To kill this job, run:
...
[...]
[...] INFO streaming.StreamJob: map 0% reduce 0%
[...] INFO streaming.StreamJob: map 43% reduce 0%
[...] INFO streaming.StreamJob: map 86% reduce 0%
[...] INFO streaming.StreamJob: map 100% reduce 0%
[...] INFO streaming.StreamJob: map 100% reduce 33%
[...] INFO streaming.StreamJob: map 100% reduce 70%
[...] INFO streaming.StreamJob: map 100% reduce 77%
[...] INFO streaming.StreamJob: map 100% reduce 100%
[...] INFO streaming.StreamJob: Job complete: job_200905191453_0001
[...] INFO streaming.StreamJob: Output: 111-output
$ bin/hadoop dfs -ls 111-output
Found 2 items
drwxr-xr-x - hadoop supergroup 0 2009-05-19 14:54 /user/hadoop/111-output/_logs
-rw-r--r-- 2 hadoop supergroup 30504 2009-05-19 16:26 /user/hadoop/111-output/part-00000
$ bin/hadoop dfs -cat 111-output/part-00000
you 3
you've 1
your 1
zero 3
zero, 1
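Since the part files are plain tab-separated text, they are easy to post-process in Python; a sketch that parses sample lines in the same format (the inline data here is illustrative, not the actual job output):

```python
import io

# Stand-in for an open part file; real code might use
# open("part-00000") after fetching the file with `dfs -get`
part_file = io.StringIO("you\t3\nyou've\t1\nzero\t3\n")

totals = {}
for line in part_file:
    word, count = line.rstrip("\n").split("\t", 1)
    totals[word] = int(count)
# totals == {"you": 3, "you've": 1, "zero": 3}
```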
Done. You can extend this example to build your own applications.