User Behavior Analysis: Apache Log Analysis (Part 2)

Source: Internet. Editor: 程序博客网. Date: 2024/05/16 10:09

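The previous post, "User Behavior Analysis: Apache Log Analysis (Part 1)", ended with the crawlers that show up in Apache's log. Why cover them at all? Simply to serve the goal in the title, "user behavior analysis": crawlers are not real users of our site, so their requests have to be filtered out. And before we can filter them out, we first need to know what they look like, don't we!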

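Considering ease of development and the strengths of each language, Python is a very good fit for this kind of text processing: filter the crawler entries out of the log, then generate an XML file, that is, data a program can consume directly (which is where XPath and XQuery come in handy).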
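A minimal sketch of the "generate an XML file, then query it" step, assuming the log has already been parsed into dicts. The element names `access_log` and `entry`, the helper `records_to_xml`, and the second sample record are all hypothetical and chosen only for illustration. Python's standard `xml.etree.ElementTree` supports only a limited XPath subset, but that is enough for simple filters like the one shown.

```python
import xml.etree.ElementTree as ET

# Illustrative records. The first reuses values from the sample log line in
# this post; the second is made up purely for the example.
records = [
    {"ip_address": "10.29.101.5", "request": "/blog_paper.php?paperid=72", "return_code": "200"},
    {"ip_address": "10.29.101.9", "request": "/index.php", "return_code": "404"},
]

def records_to_xml(records):
    """Build an <access_log> tree with one <entry> element per record (hypothetical layout)."""
    root = ET.Element("access_log")
    for record in records:
        entry = ET.SubElement(root, "entry")
        for key, value in record.items():
            ET.SubElement(entry, key).text = value
    return root

root = records_to_xml(records)
xml_text = ET.tostring(root, encoding="unicode")

# Limited XPath: select entries whose <return_code> child text equals '200'.
ok_requests = [e.findtext("request") for e in root.findall("./entry[return_code='200']")]
print(ok_requests)
```

Full XPath or XQuery would need a third-party library or an external processor; the standard library is used here only to keep the sketch self-contained.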

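I have always been rather lazy, following "laziness", one of the three great virtues of a good programmer, so I found a piece of parse code online: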

    def parse(line):
        """Parse one Apache combined-format log line into a dict (Python 3)."""
        SB = "["
        EB = "]"
        IP_SEPR = "- -"
        output = {}
        try:
            # Clean empty space at the beginning, then split off the client IP.
            line = line.lstrip()
            ip, rest = line.split(IP_SEPR, 1)
            output['ip_address'] = ip.strip()

            # Parse the date between the square brackets.
            s_bracket = rest.index(SB)
            e_bracket = rest.index(EB)
            output['date_time'] = rest[s_bracket + 1:e_bracket].strip()

            # Parse the quoted request string to get method, request and protocol.
            request_start = rest.index('"', e_bracket + 1)
            request_end = rest.index('"', request_start + 1)
            get_str = rest[request_start + 1:request_end].strip()
            method, request, protocol = get_str.split(" ")
            output['method'] = method
            output['request'] = request
            output['protocol'] = protocol

            # Parse return code.
            rest = rest[request_end + 1:].strip()
            ret_code_e_ind = rest.index(" ")
            output['return_code'] = rest[:ret_code_e_ind]

            # Parse bytes sent.
            rest = rest[ret_code_e_ind + 1:].lstrip()
            byte_sent_e_ind = rest.index(" ")
            output['return_byte'] = rest[:byte_sent_e_ind]

            # Parse referring URL: the first quoted field after the byte count.
            # Apache logs a missing referrer as "-", stored here as "".
            after_byte_sent = rest[byte_sent_e_ind + 1:]
            s_quote = after_byte_sent.index('"')
            e_quote = after_byte_sent.index('"', s_quote + 1)
            refering_url = after_byte_sent[s_quote + 1:e_quote]
            output['refering_url'] = "" if refering_url == "-" else refering_url

            # Parse user agent: the next quoted field.
            after_ref_url = after_byte_sent[e_quote + 1:]
            s_quote = after_ref_url.index('"')
            e_quote = after_ref_url.index('"', s_quote + 1)
            user_agent = after_ref_url[s_quote + 1:e_quote]
            output['user_agent'] = "" if user_agent == "-" else user_agent
        except Exception as e:
            # A malformed line leaves output partially filled.
            print(e)
        return output


    if __name__ == '__main__':
        sample = """ 10.29.101.5 - - [31/Dec/2008:00:03:16 +0800] "GET /blog_paper.php?paperid=72 HTTP/1.0" 200 46030 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp China; http://misc.yahoo.com.cn/help.html)" 303 46379 """
        result = parse(sample)
        print(result['return_code'])
        for key in result:
            print(key, " -- ", result[key])
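As an alternative to index-based scanning, the same fields can probably be extracted with a single regular expression. The sketch below is not the code found online but a swapped-in technique; the names `LOG_PATTERN` and `parse_with_regex` are hypothetical, and the pattern assumes the combined log layout of the sample line (it does not handle escaped quotes inside quoted fields, and it leaves a "-" referrer as "-").

```python
import re

# Assumed layout: host ident user [time] "request" status bytes "referer" "user-agent"
# Group names mirror the dict keys used by the parse function in this post.
LOG_PATTERN = re.compile(
    r'(?P<ip_address>\S+) \S+ \S+ '
    r'\[(?P<date_time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<request>\S+) (?P<protocol>[^"]+)" '
    r'(?P<return_code>\d{3}) (?P<return_byte>\S+) '
    r'"(?P<refering_url>[^"]*)" "(?P<user_agent>[^"]*)"'
)

def parse_with_regex(line):
    """Return a dict of fields, or None when the line does not match."""
    match = LOG_PATTERN.match(line.strip())
    return match.groupdict() if match else None

sample = (
    '10.29.101.5 - - [31/Dec/2008:00:03:16 +0800] '
    '"GET /blog_paper.php?paperid=72 HTTP/1.0" 200 46030 "-" '
    '"Mozilla/5.0 (compatible; Yahoo! Slurp China; '
    'http://misc.yahoo.com.cn/help.html)" 303 46379'
)
fields = parse_with_regex(sample)
print(fields["method"], fields["request"], fields["return_code"])
```

Returning None for a non-matching line makes it easy to count or skip malformed entries instead of ending up with a partially filled dict.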

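OK, after this parse step each log line becomes a Python object (a dict), and the real processing can begin. This is in fact the "data cleaning" that the books talk about: besides the crawler records, requests for images or videos also need to be cleaned out, depending of course on what the analysis actually needs. Once cleaning is done, later calculations no longer have to process those irrelevant entries.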
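The cleaning step could look roughly like the sketch below. It assumes records are dicts with `user_agent` and `request` keys, as produced by the parse function. The marker list, the extension list, and the helper names `is_crawler`, `is_media_request` and `clean` are assumptions for illustration; real logs will need their own lists.

```python
# Hypothetical substrings that suggest a crawler user agent; extend as needed.
CRAWLER_MARKERS = ("slurp", "googlebot", "baiduspider", "bot", "spider", "crawler")
# Hypothetical image and video extensions to drop from the analysis.
MEDIA_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif", ".bmp", ".ico", ".flv", ".mp4", ".avi")

def is_crawler(record):
    """True when the user agent contains one of the crawler markers."""
    agent = record.get("user_agent", "").lower()
    return any(marker in agent for marker in CRAWLER_MARKERS)

def is_media_request(record):
    """True when the request path (query string removed) ends in a media extension."""
    path = record.get("request", "").split("?", 1)[0].lower()
    return path.endswith(MEDIA_EXTENSIONS)

def clean(records):
    """Keep only records that are neither crawler hits nor media requests."""
    return [r for r in records if not is_crawler(r) and not is_media_request(r)]

# Illustrative records: only the last one should survive cleaning.
records = [
    {"request": "/blog_paper.php?paperid=72",
     "user_agent": "Mozilla/5.0 (compatible; Yahoo! Slurp China; http://misc.yahoo.com.cn/help.html)"},
    {"request": "/images/logo.PNG?v=2", "user_agent": "Mozilla/5.0 (Windows NT 10.0)"},
    {"request": "/index.php", "user_agent": "Mozilla/5.0 (Windows NT 10.0)"},
]
cleaned = clean(records)
print([r["request"] for r in cleaned])
```

Substring matching on the user agent is a heuristic: it misses crawlers that disguise themselves as browsers, so it should be treated as a first pass rather than a complete filter.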