Running MapReduce with Python and Using Multiple Input Files (Counting URL Hits)

1. Script for counting hits on a given link in the logs
The business case does not need a reduce phase for now, so there is only a mapper script.
/Users/nisj/PycharmProjects/BiDataProc/hitsCalc3/filter_mapperOnly.py
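#!/usr/bin/env python
# encoding: utf-8
import sys

# Input comes from standard input (stdin)
for line in sys.stdin:
    # Strip leading and trailing whitespace, including the trailing newline,
    # so that print does not emit an extra blank line after each match
    line = line.strip()
    if '/event/apply/template/yhzrsolo.htm?s_=rmhd' in line:
        print(line)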
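Before submitting the job to the cluster, the filter logic can be checked locally. The following is a minimal sketch that assumes the same substring test as the mapper; the function name `filter_lines` and the sample log lines are hypothetical and made up only for this check.

```python
import io

TARGET = '/event/apply/template/yhzrsolo.htm?s_=rmhd'


def filter_lines(stream, target=TARGET):
    """Yield each stripped line of `stream` that contains `target`."""
    for line in stream:
        line = line.strip()
        if target in line:
            yield line


# Hypothetical access-log lines; real log lines will look different.
sample = io.StringIO(
    u'1.2.3.4 "GET /event/apply/template/yhzrsolo.htm?s_=rmhd HTTP/1.1" 200\n'
    u'1.2.3.4 "GET /index.htm HTTP/1.1" 200\n'
    u'5.6.7.8 "GET /event/apply/template/yhzrsolo.htm?s_=rmhd&x=1 HTTP/1.1" 200\n'
)

matched = list(filter_lines(sample))
print(len(matched))  # two of the three sample lines contain the link
```

Keeping the test in a function makes it possible to feed it any iterable of lines, including `sys.stdin` when the script runs as a streaming mapper.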

2. Running the Python MapReduce job
2.1 Counting by day
Each run processes the log files of a single day and computes the link's hits for that day.
hadoop dfs -rm -r -skipTrash /nisj/mp_result
hadoop jar /opt/apps/hadoop-2.7.2/share/hadoop/tools/lib/hadoop-streaming-2.7.2.jar \
    -mapper /home/hadoop/nisj/hitsCalc3/filter_mapperOnly.py -file /home/hadoop/nisj/hitsCalc3/filter_mapperOnly.py \
    -input /tmp/oss_access/2017-08-21/*_localhost_access_log.2017-08-21.*.txt \
    -output /nisj/mp_result

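2.2 Counting hits for several hours within a day
This can be done with a glob pattern in the input path (a glob, not a full regular expression): a bracket expression such as [1-4] matches exactly one character. Note that the command below puts the bracket on the last digit of the day, so it selects 2017-08-21 through 2017-08-24; to select hours within one day, put the bracket on the hour field of the file name instead.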
hadoop dfs -rm -r -skipTrash /nisj/mp_result
hadoop jar /opt/apps/hadoop-2.7.2/share/hadoop/tools/lib/hadoop-streaming-2.7.2.jar \
    -mapper /home/hadoop/nisj/hitsCalc3/filter_mapperOnly.py -file /home/hadoop/nisj/hitsCalc3/filter_mapperOnly.py \
    -input /tmp/oss_access/2017-08-2[1-4]/*_localhost_access_log.2017-08-2[1-4].*.txt \
    -output /nisj/mp_result
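To see what a bracket expression selects, the pattern can be tried out locally. The sketch below uses Python's `fnmatch` module as a stand-in for Hadoop's path globbing; that the two agree for a simple single-character range like this one is an assumption, and the list of candidate dates is made up for illustration.

```python
from fnmatch import fnmatchcase

# The date part of the input path used in the command above.
pattern = '2017-08-2[1-4]'

# Hypothetical candidate directory names, 2017-08-19 through 2017-08-26.
days = ['2017-08-%02d' % d for d in range(19, 27)]

matched_days = [d for d in days if fnmatchcase(d, pattern)]
print(matched_days)
# ['2017-08-21', '2017-08-22', '2017-08-23', '2017-08-24']

# The bracket stands for exactly one character, so a longer name does not match.
print(fnmatchcase('2017-08-214', pattern))  # False
```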

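2.3 Glob patterns plus multiple input files: counting hits for a range of hours that spans two days
Multiple input files can be given in either of the two ways below; in testing, both produced the same result.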
# Variant 1: one -input option per pattern
hadoop dfs -rm -r -skipTrash /nisj/mp_result
hadoop jar /opt/apps/hadoop-2.7.2/share/hadoop/tools/lib/hadoop-streaming-2.7.2.jar \
    -mapper /home/hadoop/nisj/hitsCalc3/filter_mapperOnly.py -file /home/hadoop/nisj/hitsCalc3/filter_mapperOnly.py \
    -input /tmp/oss_access/2017-08-21/*_localhost_access_log.2017-08-21.1[8-9].txt \
    -input /tmp/oss_access/2017-08-21/*_localhost_access_log.2017-08-21.2[0-3].txt \
    -input /tmp/oss_access/2017-08-22/*_localhost_access_log.2017-08-22.0[0-9].txt \
    -input /tmp/oss_access/2017-08-22/*_localhost_access_log.2017-08-22.1[0-8].txt \
    -output /nisj/mp_result

# Variant 2: a single -input option followed by all patterns, separated by spaces
hadoop dfs -rm -r -skipTrash /nisj/mp_result
hadoop jar /opt/apps/hadoop-2.7.2/share/hadoop/tools/lib/hadoop-streaming-2.7.2.jar \
    -mapper /home/hadoop/nisj/hitsCalc3/filter_mapperOnly.py -file /home/hadoop/nisj/hitsCalc3/filter_mapperOnly.py \
    -input /tmp/oss_access/2017-08-21/*_localhost_access_log.2017-08-21.1[8-9].txt /tmp/oss_access/2017-08-21/*_localhost_access_log.2017-08-21.2[0-3].txt /tmp/oss_access/2017-08-22/*_localhost_access_log.2017-08-22.0[0-9].txt /tmp/oss_access/2017-08-22/*_localhost_access_log.2017-08-22.1[0-8].txt \
    -output /nisj/mp_result
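The four patterns above are meant to cover the window from hour 18 on 2017-08-21 through hour 18 on 2017-08-22. A rough local check of that coverage is sketched below. It again uses Python's `fnmatch` as an approximation of Hadoop's globbing, and it assumes one log file per host and hour with a two-digit hour in the name; the host prefix `web01` and the helper `candidate_paths` are hypothetical.

```python
from fnmatch import fnmatchcase

# The four -input patterns from the commands above.
patterns = [
    '/tmp/oss_access/2017-08-21/*_localhost_access_log.2017-08-21.1[8-9].txt',
    '/tmp/oss_access/2017-08-21/*_localhost_access_log.2017-08-21.2[0-3].txt',
    '/tmp/oss_access/2017-08-22/*_localhost_access_log.2017-08-22.0[0-9].txt',
    '/tmp/oss_access/2017-08-22/*_localhost_access_log.2017-08-22.1[0-8].txt',
]


def candidate_paths(days, host='web01'):
    """Yield one hypothetical log path per day and hour (00 to 23)."""
    for day in days:
        for hour in range(24):
            yield '/tmp/oss_access/%s/%s_localhost_access_log.%s.%02d.txt' % (
                day, host, day, hour)


selected = [
    path for path in candidate_paths(['2017-08-21', '2017-08-22'])
    if any(fnmatchcase(path, pat) for pat in patterns)
]

# Reduce each selected path to "day hour" for readability.
slots = ['%s %s' % tuple(path.split('.')[-3:-1]) for path in selected]
print(len(slots), slots[0], slots[-1])
# 25 2017-08-21 18 2017-08-22 18
```

Six hourly files match on the first day (18 to 23) and nineteen on the second (00 to 18), so the window is covered without gaps or overlap under these naming assumptions.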

Another test:
hadoop dfs -rm -r -skipTrash /nisj/mp_result
hadoop jar /opt/apps/hadoop-2.7.2/share/hadoop/tools/lib/hadoop-streaming-2.7.2.jar \
    -mapper /home/hadoop/nisj/hitsCalc3/xx.py -file /home/hadoop/nisj/hitsCalc3/xx.py \
    -input /tmp/oss_access/2017-08-21/*_localhost_access_log.2017-08-21.1[8-9].txt /tmp/oss_access/2017-08-21/*_localhost_access_log.2017-08-21.2[0-3].txt \
    -output /nisj/mp_result

hadoop dfs -rm -r -skipTrash /nisj/mp_result
hadoop jar /opt/apps/hadoop-2.7.2/share/hadoop/tools/lib/hadoop-streaming-2.7.2.jar \
    -mapper /home/hadoop/nisj/hitsCalc3/xx.py -file /home/hadoop/nisj/hitsCalc3/xx.py \
    -input /tmp/oss_access/2017-08-21/*_localhost_access_log.2017-08-21.1[8-9].txt \
    -input /tmp/oss_access/2017-08-21/*_localhost_access_log.2017-08-21.2[0-3].txt \
    -output /nisj/mp_result

3. Final counting of the results
# View the filtered result lines:
hadoop dfs -cat /nisj/mp_result/*
# Compute the hit count (one output line per hit):
hadoop dfs -cat /nisj/mp_result/* |wc -l
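The same count can be taken in Python instead of `wc -l`. The sketch below counts only lines that contain the link, which keeps the number correct even if the job output contains blank lines; the function name `count_hits` and the sample lines are hypothetical.

```python
TARGET = '/event/apply/template/yhzrsolo.htm?s_=rmhd'


def count_hits(lines, target=TARGET):
    """Count the lines that contain `target`; blank lines are ignored."""
    return sum(1 for line in lines if target in line)


# Hypothetical job output, including a blank line that wc -l would count.
sample_output = [
    '1.2.3.4 "GET /event/apply/template/yhzrsolo.htm?s_=rmhd HTTP/1.1" 200\n',
    '\n',
    '5.6.7.8 "GET /event/apply/template/yhzrsolo.htm?s_=rmhd HTTP/1.1" 200\n',
]

hits = count_hits(sample_output)
print(hits)  # 2
```

To use it on the real output, one option is to save the function in a script that calls `count_hits(sys.stdin)` and pipe the `hadoop dfs -cat /nisj/mp_result/*` output into it.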