Learning Hadoop (3): Joining Log Files


Suppose `$HDFS/genders` holds a log file `$HDFS/genders/id_gender` whose records are `<id, gender>` pairs, and `$HDFS/names` holds a log file `$HDFS/names/id_name` whose records are `<id, name>` pairs.

hadoop fs -cat $HDFS/genders/id_gender
1 male
2 female
3 male
4 female
hadoop fs -cat $HDFS/names/id_name
1 duan
2 meng
3 gu
4 xin

Unlike the previous tasks, this job has two input logs, id_gender and id_name. The question, then, is: how does the mapper tell which log a given input line came from? (Hadoop Streaming only reads from standard input and writes to standard output.) The answer:

Read the input file's path from the environment variable that Hadoop provides, then decide by file name. (Hadoop 2.x uses 'mapreduce_map_input_file'; 'map_input_file' is used here for compatibility with older clusters.)
file_path = os.environ['map_input_file']
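As a minimal sketch (Python 3 here; the helper name is my own, not part of Hadoop), a lookup that works on both old and new clusters can fall back from one variable name to the other:

```python
import os

def input_file_path():
    # Prefer the pre-2.x variable name; fall back to the Hadoop 2.x name.
    # Returns '' when neither is set (e.g. when run outside Hadoop Streaming).
    return os.environ.get('map_input_file') or os.environ.get('mapreduce_map_input_file', '')
```

The mapper can then branch on `'id_gender' in input_file_path()` regardless of cluster version.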

#!/usr/bin/env python
# coding:utf-8
"""
@author: duanmeng@outlook.com
@file:   join_id_gender_name.py
@brief:  id_name contains ('id', 'name') records, id_gender contains
         ('id', 'gender') records; join the two log files and
         output ('id', 'name', 'gender')
"""
import sys
import os


def mapper():
    # Decide which log this mapper instance is reading by file path
    file_path = os.environ['map_input_file']
    is_gender = 'id_gender' in file_path
    for line in sys.stdin:
        item = line.strip().split(' ')
        id = item[0]
        if is_gender:
            # Tag the value with its source so the reducer can tell them apart
            print "%s\t%s" % (id, 'gender ' + item[1])
        else:
            print "%s\t%s" % (id, 'name ' + item[1])


def reducer():
    last_id = None
    last_value = ['', '']  # [name, gender]
    for line in sys.stdin:
        item = line.strip().split('\t')
        id = item[0]
        (type, tv) = tuple(item[1].split(' '))
        if last_id and last_id != id:
            print "%s\t%s" % (last_id, '\t'.join(last_value))
            last_value = ['', '']  # reset so values never leak across keys
        last_id = id
        if type == 'gender':
            last_value[1] = tv
        else:
            last_value[0] = tv
    if last_id:
        print "%s\t%s" % (last_id, '\t'.join(last_value))


if __name__ == '__main__':
    type = sys.argv[1]
    if type == 'm':
        mapper()
    elif type == 'r':
        reducer()
    else:
        sys.exit(1)
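The map/sort/reduce flow above can be exercised locally without a cluster. The sketch below (Python 3, with the sample records hard-coded as assumptions and the helper names my own) mimics what Streaming does: tag each record in the map phase, sort by key, then merge per key in the reduce phase:

```python
def map_record(line, is_gender):
    # Tag each value with its source log so the reducer can tell them apart
    key, value = line.split(' ')
    tag = 'gender' if is_gender else 'name'
    return '%s\t%s %s' % (key, tag, value)

def reduce_records(lines):
    # Merge tagged records that share a key into one joined row
    rows, last_id, last_value = [], None, ['', '']  # last_value = [name, gender]
    for line in lines:
        key, tagged = line.split('\t')
        tag, value = tagged.split(' ')
        if last_id is not None and last_id != key:
            rows.append('%s\t%s' % (last_id, '\t'.join(last_value)))
            last_value = ['', '']
        last_id = key
        last_value[1 if tag == 'gender' else 0] = value
    if last_id is not None:
        rows.append('%s\t%s' % (last_id, '\t'.join(last_value)))
    return rows

genders = ['1 male', '2 female', '3 male', '4 female']
names = ['1 duan', '2 meng', '3 gu', '4 xin']
mapped = [map_record(l, True) for l in genders] + [map_record(l, False) for l in names]
# sorted() stands in for the shuffle/sort between map and reduce
for row in reduce_records(sorted(mapped)):
    print(row)
```

This prints one `id name gender` row per key, matching the cluster output shown below.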


Run the Hadoop job:
hadoop streaming -input $HDFS/genders -input $HDFS/names -output $HDFS/id_gender_name \
    -mapper 'python join_id_gender_name.py m' -reducer 'python join_id_gender_name.py r' \
    -file join_id_gender_name.py -numReduceTasks 1

Check the result:
hadoop fs -cat $HDFS/id_gender_name/part-00000
1       duan    male
2       meng    female
3       gu      male
4       xin     female 
