Writing and Running MapReduce in Python: Counting Per-URL UV from an Access Log
Source: Internet | Editor: 程序博客网 | Date: 2024/05/19 22:07
1. mapper
/Users/nisj/PycharmProjects/BiDataProc/hitsCalc3/hitCalc_mapper.py
2. reducer
/Users/nisj/PycharmProjects/BiDataProc/hitsCalc3/hitCalc_reducer.py
3. Testing, data inspection, and verification
/Users/nisj/PycharmProjects/BiDataProc/hitsCalc3/mpBat.sh
1. mapper
/Users/nisj/PycharmProjects/BiDataProc/hitsCalc3/hitCalc_mapper.py
#!/usr/bin/env python
# encoding: utf-8
import sys
import re

# Input arrives on standard input (stdin)
for line in sys.stdin:
    if '/api/opening/screen/get.htm' in line and '_identifier=' in line:
        # Strip leading and trailing whitespace
        line = line.strip()
        # Split the line on '?', '&', or space
        words = re.split(r'\?|&| ', line)
        for word in words:
            if '_identifier' in word:
                # Emit "identifier|-Y-|1" as input for the reducer
                print('%s|-Y-|%s' % (word, 1))
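The mapper's splitting rule can be exercised locally without Hadoop. The log line below is hypothetical (made-up IP, timestamp, and identifier); only the URL path and the `_identifier` query parameter mimic the real access-log format:

```python
import re

# A made-up access-log line in the shape the mapper expects (hypothetical values)
line = ('10.0.0.1 - - [31/Jul/2017:23:00:01 +0800] '
        '"GET /api/opening/screen/get.htm?_identifier=u12345&ver=2 HTTP/1.1" 200 58')

records = []
if '/api/opening/screen/get.htm' in line and '_identifier=' in line:
    # Same split rule as the mapper: '?', '&', or space
    for word in re.split(r'\?|&| ', line.strip()):
        if '_identifier' in word:
            records.append('%s|-Y-|%s' % (word, 1))

# The mapper would print each record on its own line
print(records)  # → ['_identifier=u12345|-Y-|1']
```

Splitting on `?`, `&`, and space isolates each query parameter as its own token, so the `_identifier=...` pair survives as a single key.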
2. reducer
/Users/nisj/PycharmProjects/BiDataProc/hitsCalc3/hitCalc_reducer.py
#!/usr/bin/env python
# encoding: utf-8
import sys

current_word = None
current_count = 0
word = None

# Read from stdin, i.e. the output of hitCalc_mapper.py
for line in sys.stdin:
    line = line.strip()
    # Parse mapper output, using '|-Y-|' as the separator
    word, count = line.split('|-Y-|', 1)
    # Convert count from string to int
    try:
        count = int(count)
    except ValueError:
        # Skip lines whose count is not a number
        continue
    # The mapper output must be sorted so that equal words are adjacent
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # Emit the finished word's total to stdout
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# Emit the total for the last word
if current_word == word:
    print('%s\t%s' % (current_word, current_count))
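The reducer's run-length aggregation only works because `sort -k1,1` (or the shuffle phase) puts equal keys on adjacent lines. A minimal sketch of the same logic over an in-memory list (the identifiers are hypothetical):

```python
# Sorted mapper output: duplicate keys are adjacent (hypothetical identifiers)
lines = [
    '_identifier=u1|-Y-|1',
    '_identifier=u1|-Y-|1',
    '_identifier=u2|-Y-|1',
]

current_word, current_count = None, 0
totals = []
for line in lines:
    word, count = line.strip().split('|-Y-|', 1)
    count = int(count)
    if current_word == word:
        # Same key as the previous line: keep accumulating
        current_count += count
    else:
        if current_word:
            # Key changed: the previous key's run is complete
            totals.append((current_word, current_count))
        current_word, current_count = word, count
if current_word is not None:
    # Flush the final run
    totals.append((current_word, current_count))

print(totals)  # → [('_identifier=u1', 2), ('_identifier=u2', 1)]
```

If the input were unsorted, a key could appear in several separate runs and would be emitted more than once with partial counts.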
3. Testing, data inspection, and verification
/Users/nisj/PycharmProjects/BiDataProc/hitsCalc3/mpBat.sh
#!/usr/bin/env bash
# Pipe test (simulates the MapReduce flow locally)
hadoop dfs -cat /tmp/oss_access/2017-07-31/sz-98-72_localhost_access_log.2017-07-31.23.txt | python hitCalc_mapper.py | sort -k1,1 | python hitCalc_reducer.py

# Old-cluster test
hadoop dfs -rm -r -skipTrash /nisj/mp_result;
hadoop jar /opt/apps/hadoop-2.7.2/share/hadoop/tools/lib/hadoop-streaming-2.7.2.jar \
-mapper /home/hadoop/nisj/hitsCalc3/hitCalc_mapper.py -file /home/hadoop/nisj/hitsCalc3/hitCalc_mapper.py \
-reducer /home/hadoop/nisj/hitsCalc3/hitCalc_reducer.py -file /home/hadoop/nisj/hitsCalc3/hitCalc_reducer.py \
-input /tmp/oss_access/2017-07-30/*_localhost_access_log.2017-07-30.*.txt \
-output /nisj/mp_result

hadoop dfs -rm -r -skipTrash /nisj/mp_result;
hadoop jar /opt/apps/hadoop-2.7.2/share/hadoop/tools/lib/hadoop-streaming-2.7.2.jar \
-mapper /home/hadoop/nisj/hitsCalc3/hitCalc_mapper.py -file /home/hadoop/nisj/hitsCalc3/hitCalc_mapper.py \
-reducer /home/hadoop/nisj/hitsCalc3/hitCalc_reducer.py -file /home/hadoop/nisj/hitsCalc3/hitCalc_reducer.py \
-input /tmp/oss_access/2017-07-31/sz-98-72_localhost_access_log.2017-07-31.23.txt \
-output /nisj/mp_result

# New-cluster test
hadoop dfs -rm -r -skipTrash /nisj/mp_result;
hadoop jar /opt/apps/hadoop-2.7.2/share/hadoop/tools/lib/hadoop-streaming-2.7.2.jar \
-mapper /home/hadoop/nisj/hitsCalc3/hitCalc_mapper.py -file /home/hadoop/nisj/hitsCalc3/hitCalc_mapper.py \
-reducer /home/hadoop/nisj/hitsCalc3/hitCalc_reducer.py -file /home/hadoop/nisj/hitsCalc3/hitCalc_reducer.py \
-input /tmp/oss_access/2017-05-17/sz-98-72_localhost_access_log.2017-05-17.23.txt \
-output /nisj/mp_result

hadoop dfs -rm -r -skipTrash /nisj/mp_result;
hadoop jar /opt/apps/hadoop-2.7.2/share/hadoop/tools/lib/hadoop-streaming-2.7.2.jar \
-mapper /home/hadoop/nisj/hitsCalc3/hitCalc_mapper.py -file /home/hadoop/nisj/hitsCalc3/hitCalc_mapper.py \
-reducer /home/hadoop/nisj/hitsCalc3/hitCalc_reducer.py -file /home/hadoop/nisj/hitsCalc3/hitCalc_reducer.py \
-input /tmp/oss_access/2017-05-17/*_localhost_access_log.2017-05-17.*.txt \
-output /nisj/mp_result

# Inspect the result data
hadoop dfs -cat /nisj/mp_result/*
hadoop dfs -cat /tmp/oss_access/2017-05-17/sz-98-72_localhost_access_log.2017-05-17.23.txt | grep "/api/opening/screen/get.htm" | more
hadoop dfs -cat /tmp/oss_access/2017-05-17/sz-98-72_localhost_access_log.2017-05-17.23.txt | python hitCalc_mapper.py | more

# Compute UV: the reducer emits one line per distinct identifier
hadoop dfs -cat /nisj/mp_result/* | wc -l

# Verify the result by building an external Hive table over it
# (the statements below run in the Hive CLI, not bash)
drop table if exists xx_mp_result;
CREATE EXTERNAL TABLE xx_mp_result(
  id string,
  cnt string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
  'field.delim'='\t',
  'serialization.format'=',')
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '/nisj/mp_result';

select count(*) from xx_mp_result;
-- No row should appear more than once if the identifiers are truly distinct
select trim(id), count(*) from xx_mp_result group by trim(id) having count(*) > 1;
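Why `wc -l` gives UV: after sorting, the reducer emits exactly one output line per distinct `_identifier`, so the line count of the result equals the number of unique visitors. A minimal pure-Python sketch of that end-to-end reasoning (the identifiers are hypothetical):

```python
# Raw mapper records as they would arrive, unsorted (hypothetical identifiers)
mapper_out = [
    '_identifier=u2|-Y-|1',
    '_identifier=u1|-Y-|1',
    '_identifier=u1|-Y-|1',
]

# `sort -k1,1` makes equal keys adjacent; the reducer then collapses each
# run of equal keys into a single output line, so UV = number of distinct keys
distinct = set()
for rec in sorted(mapper_out):
    word, _ = rec.split('|-Y-|', 1)
    distinct.add(word)

uv = len(distinct)  # equivalent to `hadoop dfs -cat /nisj/mp_result/* | wc -l`
print(uv)  # → 2
```

Note the distinction from PV: the `cnt` column in the Hive table is each visitor's hit count (PV per identifier), while the row count is the UV.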