Counting Clicks from the Access Log with Shell

Environment:
The access logs have already been loaded into HDFS. The data is stored by date, and each day's data is further split into files named by machine name and hour. On a normal day there are 2,300+ data files, totaling roughly 300-400 GB.
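For reference, the hourly file naming implied by the scripts below follows the pattern <machine>localhost_access_log.<date>.<hour>.txt; a listing would look roughly like this (the host prefixes here are made up for illustration, only the path column printed by awk is real):

hadoop dfs -ls /tmp/oss_access/2017-04-09/ | awk '{print $8;}' | head -2
# /tmp/oss_access/2017-04-09/web01_localhost_access_log.2017-04-09.00.txt   (hypothetical host name)
# /tmp/oss_access/2017-04-09/web02_localhost_access_log.2017-04-09.00.txt   (hypothetical host name)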
1. Split the data by hour and compute the blocks in parallel
Runtime: about 20 minutes. Of the three approaches, this is the most practical and reasonable one.
#!/bin/bash  for(( i = 0; i < 24; i++ ))  do  {hour2=`printf "%02d\n" $i`hadoop dfs -cat /tmp/oss_access/2017-04-09/*localhost_access_log.2017-04-09.$hour2.txt | grep '/api/opening/screen/get.htm' |wc -l > 0409-$hour2.txt} &  done  wait  datecat 0409-*txt|awk '{for(i=1;i<=NF;i++)a[i]+=$i}END{l=length(a);for(j=1;j<=l;j++) printf a[j]" ";printf "\n"}' >kp-0409.txt

#!/bin/bash  for(( i = 0; i < 24; i++ ))  do  {hour2=`printf "%02d\n" $i`hadoop dfs -cat /tmp/oss_access/2017-04-09/*localhost_access_log.2017-04-09.$hour2.txt | grep '/api/remix-index.htm' |wc -l > 0409-syzs-$hour2.txt} &  done  wait  datecat 0409-syzs*txt|awk '{for(i=1;i<=NF;i++)a[i]+=$i}END{l=length(a);for(j=1;j<=l;j++) printf a[j]" ";printf "\n"}' >syzs-0409.txt

#!/bin/bash  for(( i = 0; i < 24; i++ ))  do  {hour2=`printf "%02d\n" $i`hadoop dfs -cat /tmp/oss_access/2017-04-09/*localhost_access_log.2017-04-09.$hour2.txt | grep '/api/room/get.htm'| grep 'fromView=1'| grep 'fromPos=1' |wc -l > 0409-syjd-$hour2.txt} &  done  wait  datecat 0409-syjd*txt|awk '{for(i=1;i<=NF;i++)a[i]+=$i}END{l=length(a);for(j=1;j<=l;j++) printf a[j]" ";printf "\n"}' >syjd-0409.txt

#!/bin/bash  for(( i = 0; i < 24; i++ ))  do  {hour2=`printf "%02d\n" $i`hadoop dfs -cat /tmp/oss_access/2017-04-09/*localhost_access_log.2017-04-09.$hour2.txt | grep '/api/room/get.htm' | grep 'fromView=3'| grep 'fromPos=1' |wc -l > 0409-ymtt-$hour2.txt} &  done  wait  datecat 0409-ymtt*txt|awk '{for(i=1;i<=NF;i++)a[i]+=$i}END{l=length(a);for(j=1;j<=l;j++) printf a[j]" ";printf "\n"}' >ymtt-0409.txt
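The four scripts above differ only in the grep filters and the output prefix, so they could be collapsed into one parameterized script. The following is only a sketch under that assumption: the function name count_by_hour and its arguments are not from the original, and it uses awk's literal substring matching in place of the chained greps.

#!/bin/bash
# Sketch: generalizes the four per-hour scripts above (names are assumed).
# Usage: count_by_hour PREFIX PATTERN...  -- counts lines containing every PATTERN.
count_by_hour() {
  prefix=$1; shift
  for (( i = 0; i < 24; i++ ))
  do
    {
      hour2=`printf "%02d" $i`
      hadoop dfs -cat /tmp/oss_access/2017-04-09/*localhost_access_log.2017-04-09.$hour2.txt \
        | awk -v pats="$*" 'BEGIN{n=split(pats,p," ")}
                            {ok=1; for(k=1;k<=n;k++) if(index($0,p[k])==0) ok=0; if(ok) c++}
                            END{print c+0}' > $prefix-$hour2.txt
    } &
  done
  wait
  # Sum the 24 per-hour counts.
  cat $prefix-*.txt | awk '{s+=$1} END{print s}'
}

# For example, the last script above would become:
# count_by_hour 0409-ymtt '/api/room/get.htm' 'fromView=3' 'fromPos=1'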

2. Compute per file, accumulating the counts in a variable
Runtime: about 1 hour 20 minutes, noticeably slower.
#!/bin/bash
# Walk every file for the day and accumulate the matching line count in i.
i=0
for hdfs_path in `hadoop dfs -ls /tmp/oss_access/2017-04-09/ | awk '{print $8;}'`
do
  {
    echo $hdfs_path
    num=`hadoop dfs -cat $hdfs_path | grep '/api/room/get.htm' | grep 'fromView=3' | grep 'fromPos=1' | wc -l`
    i=$((i + num))
  }
done
echo $i
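The accumulation works because the loop body runs in the current shell, so i survives each iteration; piping the listing into a while read loop would put the accumulation in a subshell and lose the total. A bash-only alternative that reads the paths line by line via process substitution could look like the sketch below (assumed to be equivalent, not taken from the original):

#!/bin/bash
# Sketch: same accumulation, but reading paths line by line instead of
# word-splitting a command substitution (process substitution is bash-specific).
i=0
while read -r hdfs_path
do
  num=`hadoop dfs -cat "$hdfs_path" | grep '/api/room/get.htm' | grep 'fromView=3' | grep 'fromPos=1' | wc -l`
  i=$((i + num))
done < <(hadoop dfs -ls /tmp/oss_access/2017-04-09/ | awk '{print $8;}')
echo $i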

3. Compute per file, with every file in parallel
The immediate result is the error "fork: retry: Resource temporarily unavailable". Launching 2,300+ processes at once is clearly frightening; scheduling the parallelism by hour, as in approach 1, is the more sensible choice.
#!/bin/bash
# One background job per file -- with 2,300+ files this exhausts the process limit.
for hdfs_path in `hadoop dfs -ls /tmp/oss_access/2017-04-09/ | awk '{print $8;}'`
do
  {
    #echo $hdfs_path
    # Turn the HDFS path into a flat local file name by replacing / with -.
    hdfs_path_like=${hdfs_path//\//-}
    #echo xx_$hdfs_path_like
    hadoop dfs -cat $hdfs_path | grep '/api/room/get.htm' | grep 'fromView=3' | grep 'fromPos=1' | wc -l > xx_$hdfs_path_like.txt
  } &
done
wait
date
cat xx*txt | awk '{for(i=1;i<=NF;i++)a[i]+=$i}END{l=length(a);for(j=1;j<=l;j++) printf a[j]" ";printf "\n"}' > ymtt-0409-new.txt
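If per-file parallelism were still wanted, the fork failures could probably be avoided by capping the number of concurrent jobs, for example by waiting after every batch. This is only a sketch: the batch size of 100 and the output name ymtt-0409-throttled.txt are assumptions, and scheduling by hour as in approach 1 remains the simpler option.

#!/bin/bash
# Sketch: same per-file count as above, but at most 100 jobs in flight
# (the limit of 100 is an assumption, not from the original).
n=0
for hdfs_path in `hadoop dfs -ls /tmp/oss_access/2017-04-09/ | awk '{print $8;}'`
do
  {
    hdfs_path_like=${hdfs_path//\//-}
    hadoop dfs -cat $hdfs_path | grep '/api/room/get.htm' | grep 'fromView=3' | grep 'fromPos=1' | wc -l > xx_$hdfs_path_like.txt
  } &
  n=$((n + 1))
  # Drain the current batch before forking more processes.
  if [ $((n % 100)) -eq 0 ]; then
    wait
  fi
done
wait
date
cat xx*txt | awk '{s+=$1} END{print s}' > ymtt-0409-throttled.txt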
