Writing map and reduce in Python to process files on a cluster

Source: Internet  Editor: 程序博客网  Time: 2024/06/02 01:40

Write mapper.py and reducer.py in the /home/zkf/File/python/test-PythoninHadoop directory.


mapper.py


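#!/usr/bin/env python
# coding: utf-8

import sys

for line in sys.stdin:  # iterate over every input line
    line = line.strip()  # strip leading and trailing whitespace
    words = line.split()  # split the line into words on whitespace
    for word in words:
        # emit "word<TAB>1"; the parenthesized form runs on Python 2 and 3
        print('%s\t%s' % (word, 1))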


reducer.py


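#!/usr/bin/env python
# coding: utf-8

import sys

# Hadoop sorts the mapper output by key before it reaches the reducer,
# so all lines for the same word arrive next to each other.
current_word = None  # word currently being counted
current_count = 0  # running count for current_word

for line in sys.stdin:
    line = line.strip()  # strip leading and trailing whitespace

    try:
        word, count = line.split('\t', 1)  # split into word and count at the tab
        count = int(count)  # convert the string '1' to the integer 1
    except ValueError:
        continue  # skip lines that are not "word<TAB>integer"

    if current_word == word:  # same word as on the previous line
        current_count += count  # add its count to the running total
    else:
        if current_word is not None:  # the previous word is finished: print its total
            print('%s\t%s' % (current_word, current_count))
        current_word = word  # start counting the new word
        current_count = count

# print the total for the last word, but only if any input was read
if current_word is not None:
    print('%s\t%s' % (current_word, current_count))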
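
Before submitting the job, it can help to check the logic locally. The sketch below is not part of the original scripts: it mimics the mapper and reducer as plain functions (the names map_words, reduce_sorted and word_count are made up for illustration), uses sorted() as a stand-in for Hadoop's sort between the two phases, and swaps the hand-written loop for itertools.groupby.

```python
from itertools import groupby


def map_words(lines):
    """Mimic mapper.py: yield a (word, 1) pair for every word."""
    for line in lines:
        for word in line.strip().split():
            yield word, 1


def reduce_sorted(pairs):
    """Mimic reducer.py: sum the counts of adjacent pairs that share a word."""
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield word, sum(count for _, count in group)


def word_count(lines):
    # sorted() is only a local stand-in for Hadoop's shuffle/sort step.
    return list(reduce_sorted(sorted(map_words(lines))))


if __name__ == "__main__":
    for word, total in word_count(["hello world", "hello hadoop"]):
        print("%s\t%s" % (word, total))
```

Like the reducer, groupby only merges neighbouring items, so the result is correct only because the pairs are sorted first.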


Make both files executable:

chmod +x mapper.py
chmod +x reducer.py

Upload a text file to the cluster. I put mine in the /user/zkf/test-PythoninHadoop/input directory.


Locate the streaming jar:

Change to the Hadoop directory, find the hadoop-streaming*.jar file, and store its path in an environment variable. The commands are:

cd /usr/local/hadoop-1.2.1

find ./ -name "*streaming*"     ### locate hadoop-streaming*.jar

vi ~/.bashrc    ### open the shell startup file and add the line below

export STREAM=/usr/local/hadoop-1.2.1/contrib/streaming/hadoop-streaming-1.2.1.jar


Run the job with the following command:

hadoop jar /usr/local/hadoop-1.2.1/contrib/streaming/hadoop-streaming-1.2.1.jar -file /home/zkf/File/python/test-PythoninHadoop/mapper.py -mapper mapper.py -file /home/zkf/File/python/test-PythoninHadoop/reducer.py -reducer reducer.py -input /user/zkf/test-PythoninHadoop/input/*.txt -output /user/zkf/test-PythoninHadoop/output
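
A command this long is easy to mistype, so one option is to assemble it in Python as a list of arguments. This is only a sketch: build_streaming_command is a hypothetical helper, not part of Hadoop. It reads the jar path from the STREAM variable if it is set and otherwise falls back to the path used in this post.

```python
import os


def build_streaming_command(jar, local_dir, hdfs_input, hdfs_output):
    """Return the streaming invocation as a list of separate arguments."""
    mapper = local_dir + "/mapper.py"
    reducer = local_dir + "/reducer.py"
    return [
        "hadoop", "jar", jar,
        "-file", mapper, "-mapper", "mapper.py",
        "-file", reducer, "-reducer", "reducer.py",
        "-input", hdfs_input,
        "-output", hdfs_output,
    ]


if __name__ == "__main__":
    default_jar = "/usr/local/hadoop-1.2.1/contrib/streaming/hadoop-streaming-1.2.1.jar"
    cmd = build_streaming_command(
        os.environ.get("STREAM", default_jar),
        "/home/zkf/File/python/test-PythoninHadoop",
        "/user/zkf/test-PythoninHadoop/input/*.txt",
        "/user/zkf/test-PythoninHadoop/output",
    )
    print(" ".join(cmd))
    # To actually submit the job on a machine with the hadoop CLI installed:
    # import subprocess; subprocess.check_call(cmd)
```

Each option and its value are separate list items, so an option can never run together with its path.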

If the job succeeds, the results are written to /user/zkf/test-PythoninHadoop/output on the cluster; viewing the files there shows the wordcount result.
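
Once the output is on hand (for example after printing the part files with hadoop fs -cat), the lines can be post-processed in Python. The sketch below assumes each line has the form word<TAB>count, as the reducer prints it. The helper names parse_counts and top_words and the sample lines are made up for illustration; they are not real job output.

```python
def parse_counts(lines):
    """Turn 'word<TAB>count' lines into a list of (word, count) tuples."""
    counts = []
    for line in lines:
        word, sep, count = line.rstrip("\n").partition("\t")
        if not sep:
            continue  # skip lines without a tab
        try:
            counts.append((word, int(count)))
        except ValueError:
            continue  # skip lines whose count is not an integer
    return counts


def top_words(counts, n):
    """Return the n most frequent words, ties broken alphabetically."""
    return sorted(counts, key=lambda item: (-item[1], item[0]))[:n]


if __name__ == "__main__":
    sample = ["hadoop\t1\n", "hello\t2\n", "world\t1\n"]  # made-up sample lines
    for word, count in top_words(parse_counts(sample), 2):
        print("%s\t%d" % (word, count))
```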

