Passing Parameters to Python Hadoop Streaming Jobs


When writing MapReduce jobs in Python with Hadoop Streaming, you often need to pass parameters in from the launching shell script. There are two common ways to do this.

1. The first way: reading parameter values via sys.argv

Mapper code (count_mapper.py):

```python
import sys

arg1 = sys.argv[1]
arg2 = sys.argv[2]

for line in sys.stdin:
    line = line.strip()
    item, count = line.split(',')
    print('%s\t%s' % (item, count))
```

Note that the parameters passed on the command line start at index 1 in sys.argv; sys.argv[0] holds the script name itself.
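A quick way to confirm this indexing, as a minimal sketch (the inline script and the `foo`/`bar` argument values are made up for illustration):

```python
import subprocess
import sys

# Run a tiny script with two arguments; the parameters land at
# sys.argv[1] and sys.argv[2], not at index 0.
script = "import sys; print(sys.argv[1], sys.argv[2])"
result = subprocess.run(
    [sys.executable, '-c', script, 'foo', 'bar'],
    capture_output=True, text=True,
)
print(result.stdout)  # foo bar
```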

Command-line configuration:

```shell
hadoop jar hadoop-streaming.jar \
    -mapper 'count_mapper.py arg1 arg2' -file count_mapper.py \
    -reducer 'count_reducer.py arg3' -file count_reducer.py \
    ...
```
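Before submitting the job, you can test this pattern locally by piping sample input into the mapper, which is exactly what Hadoop Streaming does. The sketch below uses a stand-in mapper (the threshold/label parameters and the sample data are made up for the demo):

```python
import subprocess
import sys
import textwrap

# A stand-in mapper (hypothetical): arg1 is a count threshold, arg2 a label.
mapper_src = textwrap.dedent("""\
    import sys

    threshold = int(sys.argv[1])   # first parameter is at index 1
    label = sys.argv[2]            # second parameter at index 2

    for line in sys.stdin:
        item, count = line.strip().split(',')
        if int(count) >= threshold:
            print('%s\\t%s\\t%s' % (label, item, count))
""")

# Pipe sample input through the mapper, as Hadoop Streaming would.
result = subprocess.run(
    [sys.executable, '-c', mapper_src, '2', 'hot'],
    input='apple,1\nbanana,3\ncherry,5\n',
    capture_output=True, text=True,
)
print(result.stdout)
```

Only the lines with a count of at least 2 pass the filter, each tagged with the label parameter.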

2. The second way: reading parameter values via os.environ.get

Mapper code:

```python
#!/usr/bin/env python
# vim: set fileencoding=utf-8
import sys
import os


def main():
    card_start = os.environ.get('card_start')
    card_last = os.environ.get('card_last')
    trans_at = float(os.environ.get('trans_at'))

    for line in sys.stdin:
        detail = line.strip().split(',')
        card = detail[0]
        money = float(detail[17])
        if trans_at == money and card_start == card[1:7] and card_last == card[-4:]:
            print('%s\t%s' % (line.strip(), detail[1]))


if __name__ == '__main__':
    main()
```

Here each value is looked up by its parameter name, which arrives as an environment variable.

Command-line configuration:

```shell
hadoop jar ./hadoop-streaming-2.0.0-mr1-cdh4.7.0.jar \
    -input $1 \
    -output trans_record/result \
    -file map.py \
    -file reduce.py \
    -mapper "python map.py" \
    -reducer "python reduce.py" \
    -jobconf mapred.reduce.tasks=1 \
    -jobconf mapred.job.name="qianjc_trans_record" \
    -cmdenv "card_start=$2" \
    -cmdenv "card_last=$3" \
    -cmdenv "trans_at=$4"
```
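Each `-cmdenv` option simply exports a shell value into the mapper's environment. You can reproduce this locally by setting the variable yourself before piping data in; the card numbers and the simplified one-variable mapper below are made up for the sketch:

```python
import os
import subprocess
import sys
import textwrap

# A stand-in mapper (hypothetical) that reads its filter value from an
# environment variable, the same way -cmdenv delivers parameters.
mapper_src = textwrap.dedent("""\
    import os
    import sys

    card_start = os.environ.get('card_start')

    for line in sys.stdin:
        card = line.strip()
        if card[1:7] == card_start:
            print(card)
""")

# What -cmdenv "card_start=$2" does: inject the value into the environment.
env = dict(os.environ, card_start='234567')
result = subprocess.run(
    [sys.executable, '-c', mapper_src],
    input='1234567890\n9999999999\n',
    capture_output=True, text=True, env=env,
)
print(result.stdout)
```

Only the first card matches `card[1:7] == '234567'`, so only it is emitted.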

