Learning Hadoop (3): Joining Logs
Source: Internet  Editor: 程序博客网  Time: 2024/05/20 07:51
Suppose $HDFS/genders holds a log file $HDFS/genders/id_gender whose records are <id, gender>, and $HDFS/names holds a log file $HDFS/names/id_name whose records are <id, name>.
hadoop fs -cat $HDFS/genders/id_gender
1 male
2 female
3 male
4 female
hadoop fs -cat $HDFS/names/id_name
1 duan
2 meng
3 gu
4 xin
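Unlike the earlier tasks, this one has two input logs, id_gender and id_name. That raises a question: how does the mapper tell which log a given input came from, when Hadoop Streaming only feeds it standard input and collects its standard output? The answer is:
Hadoop Streaming passes the path of the current input file to the mapper in the environment variable map_input_file, which Python reads through os.environ['map_input_file']; the file name in that path identifies the source log. (Hadoop 2.x names the variable 'mapreduce_map_input_file'; map_input_file is used here for compatibility with older clusters.)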
file_path = os.environ['map_input_file']
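Since the variable name depends on the Hadoop version, a mapper can check both names instead of hard-coding one. The following is a minimal sketch that assumes only the two variable names mentioned above; the helper name get_input_path is hypothetical and is not part of Hadoop.

```python
import os


def get_input_path(environ=None):
    """Return the current input file path, or None if neither variable is set.

    get_input_path is a hypothetical helper for illustration, not a Hadoop API.
    """
    if environ is None:
        environ = os.environ
    # Try the Hadoop 2.x name first, then fall back to the older name.
    for key in ('mapreduce_map_input_file', 'map_input_file'):
        path = environ.get(key)
        if path:
            return path
    return None
```

Passing the environment as an argument keeps the helper easy to check locally, for example get_input_path({'map_input_file': 'genders/id_gender'}).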
#!/usr/bin/env python
# coding:utf-8
"""
@author: duanmeng@outlook.com
@file: join_id_gender_name.py
@brief: id_name contains ('id' 'name'), id_gender contains ('id' 'gender');
        join these two log files and output ('id', 'name', 'gender')
"""
import sys
import os


def mapper():
    # Hadoop Streaming exports the path of the current input file.
    file_path = os.environ['map_input_file']
    is_gender = 'id_gender' in file_path
    for line in sys.stdin:
        item = line.strip().split(' ')
        # Tag each value with its source so the reducer can tell them apart.
        if is_gender:
            print("%s\t%s" % (item[0], 'gender ' + item[1]))
        else:
            print("%s\t%s" % (item[0], 'name ' + item[1]))


def reducer():
    last_id = None
    last_value = ['', '']  # [name, gender]
    for line in sys.stdin:
        item = line.strip().split('\t')
        cur_id = item[0]
        (tag, tv) = tuple(item[1].split(' ', 1))
        # Input arrives sorted by id, so a new id means the previous one is complete.
        if last_id is not None and last_id != cur_id:
            print("%s\t%s" % (last_id, '\t'.join(last_value)))
            # Reset so values from the previous id do not leak into this one.
            last_value = ['', '']
        last_id = cur_id
        if tag == 'gender':
            last_value[1] = tv
        else:
            last_value[0] = tv
    # Flush the final id.
    if last_id is not None:
        print("%s\t%s" % (last_id, '\t'.join(last_value)))


if __name__ == '__main__':
    mode = sys.argv[1]
    if mode == 'm':
        mapper()
    elif mode == 'r':
        reducer()
    else:
        sys.exit(1)
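To see what the join produces without a cluster, the map, shuffle, and reduce phases can be imitated in memory. This is a rough sketch, assuming the space-separated sample records shown earlier; the names map_records and join_sorted are illustrative, and sorted() merely stands in for Hadoop's shuffle, which delivers records to the reducer ordered by key.

```python
from itertools import groupby


def map_records(lines, is_gender):
    """Tag each '<id> <value>' record with its source log, keyed by id."""
    tag = 'gender' if is_gender else 'name'
    for line in lines:
        key, value = line.strip().split(' ', 1)
        yield key, (tag, value)


def join_sorted(pairs):
    """Group key-sorted (id, (tag, value)) pairs and emit (id, name, gender)."""
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        fields = {'name': '', 'gender': ''}
        for _, (tag, value) in group:
            fields[tag] = value
        yield key, fields['name'], fields['gender']


genders = ['1 male', '2 female', '3 male', '4 female']
names = ['1 duan', '2 meng', '3 gu', '4 xin']

# sorted() plays the role of the shuffle phase.
pairs = sorted(list(map_records(genders, True)) + list(map_records(names, False)))
result = list(join_sorted(pairs))
for row in result:
    print('\t'.join(row))
```

Grouping by key with a fresh dictionary per id means an id that appears in only one log gets an empty string for the missing field rather than a value left over from the previous id.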
Run the Hadoop job
hadoop streaming \
    -input $HDFS/genders/id_gender \
    -input $HDFS/names/id_name \
    -output $HDFS/output/id_gender_name \
    -mapper 'python join_id_gender_name.py m' \
    -reducer 'python join_id_gender_name.py r' \
    -file join_id_gender_name.py \
    -numReduceTasks 1
View the result
hadoop fs -cat $HDFS/output/id_gender_name/part-00000
1 duan male
2 meng female
3 gu male
4 xin female