Writing map and reduce in Python to process files on a Hadoop cluster

Source: Internet. Editor: 程序博客网. Date: 2024/06/02 01:40

Create mapper.py and reducer.py in the /home/zkf/File/python/test-PythoninHadoop directory.

mapper.py

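#!/usr/bin/env python
#coding:utf-8

import sys

for line in sys.stdin: # iterate over every line of the input
    line = line.strip() # strip leading and trailing whitespace
    words = line.split() # split the line into words on whitespace
    for word in words:
        print('%s\t%s' % (word, 1)) # emit "word<TAB>1"; this form runs on Python 2 and 3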

reducer.py

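#!/usr/bin/env python
#coding:utf-8

import sys

current_word = None   # word whose counts are currently being summed
current_count = 0     # running total for current_word

# Hadoop Streaming sorts the mapper output by key, so equal words arrive consecutively
for line in sys.stdin:
    line = line.strip() # strip leading and trailing whitespace
    try:
        word, count = line.split('\t', 1) # split into word and count at the tab
        count = int(count) # convert the string '1' to the integer 1
    except ValueError: # skip lines without a tab or with a non-numeric count
        continue

    if current_word == word: # same word as on the previous line
        current_count += count # add this count to the running total
    else:
        if current_word is not None: # a new word begins: emit the finished one
            print('%s\t%s' % (current_word, current_count))
        current_count = count # start counting the new word
        current_word = word

if current_word is not None: # emit the last word, if any input was read
    print('%s\t%s' % (current_word, current_count))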
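
The reducer only compares each word with the previous one, which works because Hadoop Streaming sorts the mapper output by key before the reduce step. The following is a minimal local sketch of that map, sort, reduce flow, using itertools.groupby in place of the manual current_word bookkeeping. The names map_words, reduce_counts, and run_local are made up for illustration and are not part of Hadoop.

```python
from itertools import groupby


def map_words(lines):
    """Mimic mapper.py: yield one (word, 1) pair per whitespace-separated word."""
    for line in lines:
        for word in line.split():
            yield word, 1


def reduce_counts(sorted_pairs):
    """Mimic reducer.py: sum the counts of consecutive pairs that share a word.

    The input must already be sorted by word, as it is after the shuffle phase.
    """
    for word, group in groupby(sorted_pairs, key=lambda pair: pair[0]):
        yield word, sum(count for _, count in group)


def run_local(lines):
    """Run map, sort, and reduce in one process (a stand-in for the cluster)."""
    return list(reduce_counts(sorted(map_words(lines))))


if __name__ == "__main__":
    for word, total in run_local(["hello world", "hello hadoop"]):
        print("%s\t%s" % (word, total))
```

The real scripts can be checked the same way before submitting a job, for example with a shell pipeline such as cat some.txt | ./mapper.py | sort | ./reducer.py, where some.txt is a placeholder for any local text file.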

Make both files executable:

chmod +x mapper.py

chmod +x reducer.py

Upload a text file to the cluster. I put mine in /user/zkf/test-PythoninHadoop/input.

Locate the streaming jar:

cd into the Hadoop directory, find the hadoop-streaming*.jar file, and record its path in an environment variable. The commands are:

cd /usr/local/hadoop-1.2.1

find ./ -name "*streaming*" ### locate hadoop-streaming*.jar

vi ~/.bashrc ### open the shell configuration and add the line below

export STREAM=/usr/local/hadoop-1.2.1/contrib/streaming/hadoop-streaming-1.2.1.jar

Run the job with the following command:

hadoop jar /usr/local/hadoop-1.2.1/contrib/streaming/hadoop-streaming-1.2.1.jar -file /home/zkf/File/python/test-PythoninHadoop/mapper.py -mapper mapper.py -file /home/zkf/File/python/test-PythoninHadoop/reducer.py -reducer reducer.py -input /user/zkf/test-PythoninHadoop/input/*.txt -output /user/zkf/test-PythoninHadoop/output
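
The command is long, and a missing space after an option such as -file or -input is easy to overlook. As a rough sketch, the argument list can be assembled in Python so that each option and its value stay separate. The helper name build_streaming_command is hypothetical; the default jar path is the one from the export line in this post and can be overridden through the STREAM variable.

```python
import os

# Default taken from the export line in this post; override via the STREAM variable.
DEFAULT_JAR = "/usr/local/hadoop-1.2.1/contrib/streaming/hadoop-streaming-1.2.1.jar"


def build_streaming_command(local_dir, hdfs_input, hdfs_output, jar=None):
    """Return the argv list for the streaming job; nothing is executed here."""
    jar = jar or os.environ.get("STREAM", DEFAULT_JAR)
    return [
        "hadoop", "jar", jar,
        "-file", os.path.join(local_dir, "mapper.py"), "-mapper", "mapper.py",
        "-file", os.path.join(local_dir, "reducer.py"), "-reducer", "reducer.py",
        "-input", hdfs_input,
        "-output", hdfs_output,
    ]


if __name__ == "__main__":
    cmd = build_streaming_command(
        "/home/zkf/File/python/test-PythoninHadoop",
        "/user/zkf/test-PythoninHadoop/input/*.txt",
        "/user/zkf/test-PythoninHadoop/output",
    )
    # Print rather than run, so the sketch is safe on a machine without Hadoop.
    print(" ".join(cmd))
```

To actually submit the job, the list could be passed to subprocess.check_call(cmd). Note that Hadoop refuses to start a job whose output directory already exists, so the output path has to be removed or changed between runs.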

If the job succeeds, the results are written to /user/zkf/test-PythoninHadoop/output on the cluster; viewing the files there shows the word counts.
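
Each line of the result has the form word, tab, count, the same format the reducer prints. As a small sketch of how such output might be inspected once it has been copied or streamed from the cluster, the helpers below parse those lines and pick the most frequent words. The names parse_counts and top_words are hypothetical, and the sample data is made up for illustration.

```python
def parse_counts(lines):
    """Turn 'word<TAB>count' lines into a dict, skipping malformed lines."""
    counts = {}
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 2:
            continue  # not exactly one word and one count
        word, count = parts
        try:
            counts[word] = counts.get(word, 0) + int(count)
        except ValueError:
            continue  # count was not an integer
    return counts


def top_words(counts, n=3):
    """Return the n most frequent (word, count) pairs, ties broken alphabetically."""
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:n]


if __name__ == "__main__":
    sample = ["hadoop\t1\n", "hello\t2\n", "world\t1\n"]  # made-up sample output
    print(top_words(parse_counts(sample), n=2))
```

On a real run, the lines would typically come from the part files in the output directory, for instance by piping hadoop fs -cat /user/zkf/test-PythoninHadoop/output/part-* into a script that calls parse_counts(sys.stdin).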

0 0