003 Offline Log Processing with Hadoop + Hive - Solution Analysis

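Background: The data is user-behavior data from an e-commerce website. The task is to analyze and process the portal's access logs.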
Technical approach: Use Hadoop + Hive to process the logs offline and produce PV (page view) and UV (unique visitor) results. The user-behavior logs to be analyzed have the following format:
"06/Jul/2015:00:01:04 +0800" "GET" "http%3A//jf.10086.cn/m/" "HTTP/1.1" "200" "http://jf.10086.cn/m/subject/100000000000009_0.html" "Mozilla/5.0 (Linux; U; Android 4.4.2; zh-cn; Lenovo A3800-d Build/LenovoA3800-d) AppleWebKit/533.1 (KHTML, like Gecko)Version/4.0 MQQBrowser/5.4 TBS/025438 Mobile Safari/533.1 MicroMessenger/6.2.0.70_r1180778.561 NetType/cmnet Language/zh_CN" "10.139.198.176" "480x854" "24" "%u5927%u7C7B%u5217%u8868%u9875_%u4E2D%u56FD%u79FB%u52A8%u79EF%u5206%u5546%u57CE" "0" "3037487029517069460000" "3037487029517069460000" "1" "75"
"06/Jul/2015:01:01:04 +0800" "GET" "http%3A//jf.10086.cn/portal/ware/web/SearchWareAction%3Faction%3DsearchWareInfo%26pager.offset%3D144" "HTTP/1.1" "200" "http://jf.10086.cn/portal/ware/web/SearchWareAction?action=searchWareInfo&pager.offset=156" "Mozilla/5.0 (Linux; U; Android 4.4.2; zh-CN; HUAWEI MT2-L01 Build/HuaweiMT2-L01) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 UCBrowser/10.5.2.598 U3/0.8.0 Mobile Safari/534.30" "223.73.104.224" "720x1208" "32" "%u641C%u7D22_%u4E2D%u56FD%u79FB%u52A8%u79EF%u5206%u5546%u57CE" "0" "3046252153674140570000" "3046252153674140570000" "1" "2699"
"06/Jul/2015:02:01:04 +0800" "GET" "" "HTTP/1.1" "200" "http://jf.10086.cn/" "Mozilla/5.0 (Linux; Android 4.4.4; vivo Y13L Build/KTU84P) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0  Chrome/33.0.0.0 Mobile Safari/537.36 baiduboxapp/5.1 (Baidu; P1 4.4.4)" "10.154.210.240" "480x855" "32" "%u9996%u9875_%u4E2D%u56FD%u79FB%u52A8%u79EF%u5206%u5546%u57CE" "0" "3098781670304015290000" "3098781670304015290000" "0" "831"
"06/Jul/2015:03:01:07 +0800" "GET" "http%3A//wx.10086.cn/wechat-website/wechatwebsite/AccumulatePoints" "HTTP/1.1" "200" "http://jf.10086.cn/m/" "Mozilla/5.0 (Linux; U; Android 4.4.2; zh-cn; Lenovo A3800-d Build/LenovoA3800-d) AppleWebKit/533.1 (KHTML, like Gecko)Version/4.0 MQQBrowser/5.4 TBS/025438 Mobile Safari/533.1 MicroMessenger/6.2.0.70_r1180778.561 NetType/cmnet Language/zh_CN" "10.139.198.176" "480x854" "24" "%u9996%u9875_%u4E2D%u56FD%u79FB%u52A8%u79EF%u5206%u5546%u57CE" "0" "3037487029517069460000" "3037487029517069460000" "1" "135"

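Each record in the log format shown above is a sequence of 16 double-quoted fields, so it can be split with a simple regular expression. The following is a minimal Python sketch of such a parser, not the author's code. The field names in FIELDS are assumptions inferred from the sample records and from the query parameters of the tracking request; the post itself does not name the columns. The helper decode_escaped is also hypothetical and assumes the titles were produced by JavaScript's escape(), which writes non-ASCII characters as %uXXXX.

```python
import re
from datetime import datetime
from urllib.parse import unquote

# Assumed column names (not given in the original post), in record order.
FIELDS = [
    "time", "method", "referrer_page", "protocol", "status", "page_url",
    "user_agent", "ip", "screen_size", "screen_color", "page_title",
    "site_type", "uid", "sid", "sflag", "onload_total_time",
]

QUOTED = re.compile(r'"([^"]*)"')          # one double-quoted field
PCT_U = re.compile(r"%u([0-9A-Fa-f]{4})")  # JavaScript escape() style %uXXXX


def decode_escaped(text):
    """Decode %uXXXX sequences, then ordinary %XX percent-escapes."""
    text = PCT_U.sub(lambda m: chr(int(m.group(1), 16)), text)
    return unquote(text)


def parse_line(line):
    """Return a dict for one record, or None if the field count is wrong."""
    values = QUOTED.findall(line)
    if len(values) != len(FIELDS):
        return None
    rec = dict(zip(FIELDS, values))
    # Example timestamp: 06/Jul/2015:02:01:04 +0800
    rec["time"] = datetime.strptime(rec["time"], "%d/%b/%Y:%H:%M:%S %z")
    rec["referrer_page"] = decode_escaped(rec["referrer_page"])
    rec["page_title"] = decode_escaped(rec["page_title"])
    return rec


# Third sample record from the log excerpt.
SAMPLE = (
    '"06/Jul/2015:02:01:04 +0800" "GET" "" "HTTP/1.1" "200" "http://jf.10086.cn/" '
    '"Mozilla/5.0 (Linux; Android 4.4.4; vivo Y13L Build/KTU84P) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Version/4.0  Chrome/33.0.0.0 Mobile Safari/537.36 '
    'baiduboxapp/5.1 (Baidu; P1 4.4.4)" "10.154.210.240" "480x855" "32" '
    '"%u9996%u9875_%u4E2D%u56FD%u79FB%u52A8%u79EF%u5206%u5546%u57CE" "0" '
    '"3098781670304015290000" "3098781670304015290000" "0" "831"'
)
```

Calling parse_line(SAMPLE) returns a record whose page_title is the decoded string 首页_中国移动积分商城 and whose referrer_page is empty. Because the parser only looks for quoted fields, it tolerates records where the space between two fields is missing.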


For the data source, refer to the following site URL:
http://jf.10086.cn/analyzeVesopera.gif?screenSize=1366x768&screenColor=24&pageTitle=%u9996%u9875_%u4E2D%u56FD%u79FB%u52A8%u79EF%u5206%u5546%u57CE&referrerPage=&siteType=0&uid=20523849176242946000&sid=56080848979763680000&sflag=1&countlog=1443006061700&onloadTotalTime=135
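The query string of this request carries several parameters (screenSize, screenColor, pageTitle, siteType, uid, sid, sflag, onloadTotalTime) whose values appear to match the trailing columns of the log records. As a rough illustration, the sketch below extracts them with Python's standard urllib.parse. The function names beacon_params and countlog_to_utc are hypothetical, and reading countlog as a millisecond epoch timestamp is an assumption based only on its magnitude.

```python
import re
from datetime import datetime, timezone
from urllib.parse import urlsplit, parse_qs

TRACKING_URL = (
    "http://jf.10086.cn/analyzeVesopera.gif?screenSize=1366x768&screenColor=24"
    "&pageTitle=%u9996%u9875_%u4E2D%u56FD%u79FB%u52A8%u79EF%u5206%u5546%u57CE"
    "&referrerPage=&siteType=0&uid=20523849176242946000"
    "&sid=56080848979763680000&sflag=1&countlog=1443006061700&onloadTotalTime=135"
)

PCT_U = re.compile(r"%u([0-9A-Fa-f]{4})")  # JavaScript escape() style %uXXXX


def beacon_params(url):
    """Return the query parameters as a flat dict of decoded strings."""
    query = urlsplit(url).query
    # keep_blank_values=True keeps empty parameters such as referrerPage.
    raw = parse_qs(query, keep_blank_values=True)
    # parse_qs leaves the non-standard %uXXXX sequences untouched,
    # so they are decoded in a second pass.
    return {
        key: PCT_U.sub(lambda m: chr(int(m.group(1), 16)), values[0])
        for key, values in raw.items()
    }


def countlog_to_utc(value):
    """Interpret countlog as milliseconds since the Unix epoch (assumption)."""
    return datetime.fromtimestamp(int(value) / 1000, tz=timezone.utc)
```

For example, beacon_params(TRACKING_URL)["screenSize"] is "1366x768", and the empty referrerPage parameter is preserved as an empty string rather than dropped.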
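Technical steps:
1. Set up a Hadoop cluster for offline batch processing of the log files
For Hadoop cluster installation, see: http://blog.csdn.net/shenfuli/article/category/2803453
For Hive installation, see: http://blog.csdn.net/shenfuli/article/category/5017631
For HBase, see: http://blog.csdn.net/shenfuli/article/category/5570409


2. Enrich the logs with a MapReduce program

3. Build the business data with Hive scripts

4. Present the data through a web application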
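Steps 2 and 3 can be illustrated with a small Python sketch. It is not the author's MapReduce program or Hive script: the post does not say what the enrichment adds, so the output columns chosen in enrich (day, hour, ip, uid, sid, page URL, load time) are assumptions, as is the use of the thirteenth field (taken here to be uid) to identify a visitor. The function pv_uv_by_day computes in memory what a Hive query grouping by day with COUNT(*) and COUNT(DISTINCT uid) would return.

```python
import re
from collections import defaultdict
from datetime import datetime

QUOTED = re.compile(r'"([^"]*)"')  # one double-quoted field
FIELD_COUNT = 16                   # fields per record in the sample logs


def enrich(line):
    """Step 2 sketch: raw record -> tab-separated row, or None if malformed."""
    f = QUOTED.findall(line)
    if len(f) != FIELD_COUNT:
        return None
    ts = datetime.strptime(f[0], "%d/%b/%Y:%H:%M:%S %z")
    # Assumed output columns: day, hour, ip, uid, sid, page_url, onload_ms
    return "\t".join([
        ts.strftime("%Y-%m-%d"), ts.strftime("%H"),
        f[7], f[12], f[13], f[5], f[15],
    ])


def pv_uv_by_day(rows):
    """Step 3 sketch: {day: (pv, uv)} from enriched rows."""
    pv = defaultdict(int)
    visitors = defaultdict(set)
    for row in rows:
        day, _hour, _ip, uid = row.split("\t")[:4]
        pv[day] += 1            # PV: every record counts as one page view
        visitors[day].add(uid)  # UV: distinct uid values per day
    return {day: (pv[day], len(visitors[day])) for day in pv}
```

In a Hadoop Streaming job, enrich would be applied to each line read from standard input and the non-empty results written to standard output. The tab-separated output is chosen so that a Hive table with tab-delimited fields could be defined directly over it.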