[源码分享] HIVE表数据量统计&邮件

来源:互联网 发布:服务器日志分析软件 编辑:程序博客网 时间:2024/06/10 17:48

概要:

计算HIVE BI库下每天数据表总大小及增量

输出:

总大小:xxxG日同比新增数据量:xxxG周同比新增数据量:xxxG月同比新增数据量:xxxG总表数:xxx日新增表数:xxx周新增表数:xxx月新增表数:xxx最大的20张表:......表数据增量TOP20:......
代码:
bi_report.sh

#!/bin/bashGIGA=1000000000content="\n"prefix='bi'today=`date -d"-1 day" +%Y-%m-%d`yestoday=`date -d"-2 day" +%Y-%m-%d`lastweek=`date -d"-8 day" +%Y-%m-%d`lastmonth=`date -d"-1 month -1 day" +%Y-%m-%d`#此处需要做重试,否则,当有数据被删除的时候,统计失败hadoop fs -du /user/hive/warehouse/bi.db/  > ./tmp.bi.db.statwhile [ $? -ne 0 ]do        hadoop fs -du /user/hive/warehouse/bi.db/  > ./tmp.bi.db.statdone#dpstg开头的表不做统计grep -v dpstg ./tmp.bi.db.stat > bi.db.stat awk '{if(NR!=1) {size=split($2,table_list,"/"); print table_list[size],"\t",$1}}' ./bi.db.stat > $prefix.$todaysum=`awk '{sum+=$2} END{printf("%d",sum/"'$GIGA'")}' $prefix.$today`content+="总大小:"$sum"G\n"yes_sum=`awk '{sum+=$2} END{print sum/"'$GIGA'"}' $prefix.$yestoday`diff_size=`echo "$sum-$yes_sum" | bc`content+="日同比新增数据量:"$diff_size"G\n"week_sum=`awk '{sum+=$2} END{print sum/"'$GIGA'"}' $prefix.$lastweek`diff_size=`echo "$sum-$week_sum" | bc`content+="周同比新增数据量:"$diff_size"G\n"month_sum=`awk '{sum+=$2} END{print sum/"'$GIGA'"}' $prefix.$lastmonth`diff_size=`echo "$sum-$month_sum" | bc`content+="月同比新增数据量:"$diff_size"G\n"table_count=`wc -l $prefix.$today | awk '{print $1}'` content+="\n\n总表数:"$table_count"\n"yes_table_count=`wc -l $prefix.$yestoday | awk '{print $1}'` diff_table_count=`echo "$table_count-$yes_table_count" | bc`content+="日新增表数:"$diff_table_count"\n"lastweek_table_count=`wc -l $prefix.$lastweek | awk '{print $1}'` diff_table_count=`echo "$table_count-$lastweek_table_count" | bc`content+="周新增表数:"$diff_table_count"\n"lastmonth_table_count=`wc -l $prefix.$lastmonth | awk '{print $1}'` diff_table_count=`echo "$table_count-$lastmonth_table_count" | bc`content+="月新增表数:"$diff_table_count"\n"max_table="`sort -r -n -k 2 $prefix.$today | head -20 | awk '{ORS="";print $1":" $2/"'$GIGA'""G\\\\n"}'`"content+="\n\n最大的20张表:\n"$max_table"\n"declare -A ARR1while read linedokey=`echo $line|awk '{print $1}'`value=`echo $line|awk '{print $2}'`ARR1["$key"]=$valuedone < $prefix.$todaydeclare -A ARR2while read linedokey=`echo $line|awk '{print $1}'`value=`echo $line|awk '{print $2}'`ARR2["$key"]=$valuedone < $prefix.$yestodayfor k in "${!ARR1[@]}"dod_val=`echo "${ARR1[${k}]}-${ARR2[${k}]}"|bc` ARR1["$k"]=$d_valdonecat /dev/null > bi_tmp.txtfor k in "${!ARR1[@]}"do    echo "$k ${ARR1[${k}]}" >> bi_tmp.txtdonemax_incre_table="`sort -r -n -k 2 bi_tmp.txt | head -20 | awk '{ORS="";print $1":" $2/"'$GIGA'""G\\\\n"}'`"content+="\n\n表数据增量TOP20:\n"$max_incre_table"\n"#echo -e $contentmsg=`echo $content`python mail.py "HIVE BI每日统计" $msg
mail.py
#! /usr/bin/pythonimport sysimport optparseimport osimport smtplibfrom email.mime.text import MIMEText## mail server configmail_host="mail.51ping.com"mail_user="info"mail_pass="chinese1"mail_postfix="51ping.com"######################def send_mail(to_list,sub,content):    me=mail_user+"<"+mail_user+"@"+mail_postfix+">"    msg = MIMEText(content)    msg['Subject'] = sub    msg['From'] = me    msg['To'] = ";".join(to_list)    try:        s = smtplib.SMTP()        s.connect(mail_host)        s.login(mail_user,mail_pass)        s.sendmail(me, to_list, msg.as_string())        s.close()        return True    except Exception, e:        print str(e)        return Falsemail_to_list=["zxmever@gmail.com"]msg=sys.argv[2]msg=msg.replace("\\n","\r\n")print msgsend_mail(mail_to_list,sys.argv[1],msg)

执行:
sh bi_report.sh

原创粉丝点击