hadoop-python——Wordcount程序：python实现详解

来源：互联网发布：公务员取消编制知乎编辑：程序博客网时间：2024/06/02 05:49

从hadoop官网提供用python语言编写的wordcount程序如下：

mapper.py函数如下：

import sys    # 调用标准输入流  for line in sys.stdin:      # 读取文本内容       line = line.strip()      # 对文本内容分词，形成一个列表     words = line.split()      # 读取列表中每一个元素的值      for word in words:          # map函数输出，key为word，下一步将进行shuffle过程，将按照key排序，输出，这两步为map阶段工作为，在本地节点进行          print '%s\t%s' % (word, 1)

reducer.py函数如下：

#!/usr/bin/env python    from operator import itemgetter  import sys    current_word = None  current_count = 0  word = None    # input comes from STDIN  for line in sys.stdin:      # remove leading and trailing whitespace      line = line.strip()        # parse the input we got from mapper.py      word, count = line.split('\t', 1)        # convert count (currently a string) to int      try:          count = int(count)      except ValueError:          # count was not a number, so silently          # ignore/discard this line          continue        # this IF-switch only works because Hadoop sorts map output      # by key (here: word) before it is passed to the reducer      if current_word == word:          current_count += count      else:          if current_word:              # write result to STDOUT              print '%s\t%s' % (current_word, current_count)          current_count = count          current_word = word    # do not forget to output the last word if needed!  if current_word == word:      print '%s\t%s' % (current_word, current_count)

预备知识：mapreduce运作框架

过程详解

调用接口说明：调用python中的标准输入流 sys.stdin ，MAP具体过程是，HadoopStream每次从input文件读取一行数据，然后传到sys.stdin中，运行payhon的map函数脚本，然后用print输出回HadoopStreeam. REDUCE过程一样

所以M和R函数的输入格式为 for line in sys.stdin:

line=line.strip

Mapper 过程如下：

第一步，在每个节点上运行我们编写的map程序：（多节点同时运行）

import sys

# 调用标准输入流

for line in sys.stdin:

# 读取文本内容

line = line.strip()

# 对文本内容分词，形成一个列表

words = line.split()

# 读取列表中每一个元素的值

for word in words:

# map函数输出，key为word，下一步将进行shuffle过程，将按照key排序，输出，这两步为map阶段工作为，在本地节点进行

print '%s\t%s' % (word, 1)

第二步，hadoop框架，把我们运行的结果，进入shuffle过程，每个节点对key单独进行排序，然后输出.

Reducer 过程如下：（单节点运行）

第一步，merge过程，把所有节点汇总到一个节点，合并并且按照key排序.

第二步，运行reducer函数.

实例：

Txt1.txt内容：hello my name is pan

hello hadoop

hello world

123

Txt2.txt内容：hello

123

假如这两个文本分别运行在两个节点上假设节点1运行文本1，节点2运行文本2.

Map函数运行

节点1结果

hello 1

my 1

name 1

is 1

pan 1

hello 1

hadoop 1

hello 1

world 1

123 1

节点2结果

hello 1

123 1

Shuffle步结果

节点1结果

123 1

hadoop 1

hello 1

is 1

my 1

name 1

pan 1

world 1

节点2结果

123 1

hello 1

Merge步运行结果：

123 1

hadoop 1

hello 1

is 1

my 1

name 1

pan 1

world 1

Reducer函数运行结果

123 2

hadoop 1

hello 4

is 1

my 1

name 1

pan 1

world 1

0 0