Hadoop Study Notes: Large-Scale Log Analysis (Extracting KPI Metrics)
1. Web Log Analysis
From Web logs we can derive the PV (PageView) count of each page, the visiting IPs, the pages where users stay longest, and so on; going further, we can analyze user behavior patterns.
In a Web log, each record represents one user visit. Take the following record as an example:
60.208.6.156 - - [18/Sep/2013:06:49:48 +0000] "GET /wp-content/uploads/2013/07/rcassandra.png HTTP/1.0" 200 185524 "http://cos.name/category/software/packages/" "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.66 Safari/537.36"
It can be split into 8 fields:
remote_addr: 60.208.6.156 // client IP address
remote_user: - // user name
time_local: [18/Sep/2013:06:49:48 +0000] // access time
request: "GET /wp-content/uploads/2013/07/rcassandra.png HTTP/1.0" // requested URL and HTTP protocol
status: 200 // response status code (200 means success)
body_bytes_sent: 185524 // size of the content sent to the client
http_referer: "http://cos.name/category/software/packages/" // the page the visit came from
http_user_agent: "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.66 Safari/537.36" // client browser information
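Since each record follows the standard nginx "combined" log format, the eight fields can also be pulled out with a single regular expression instead of splitting on spaces. A minimal sketch; the `LogRegex` class and its pattern are illustrative, not part of the article's code:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LogRegex {
    // Groups: 1 ip, 2 identity, 3 user, 4 time, 5 request,
    //         6 status, 7 bytes, 8 referer, 9 user agent
    static final Pattern LOG = Pattern.compile(
        "^(\\S+) (\\S+) (\\S+) \\[([^\\]]+)\\] \"([^\"]*)\" (\\d{3}) (\\S+) \"([^\"]*)\" \"([^\"]*)\"$");

    // Returns the matcher if the line is well formed, or null otherwise
    public static Matcher parse(String line) {
        Matcher m = LOG.matcher(line);
        return m.matches() ? m : null;
    }

    public static void main(String[] args) {
        String line = "60.208.6.156 - - [18/Sep/2013:06:49:48 +0000] "
            + "\"GET /wp-content/uploads/2013/07/rcassandra.png HTTP/1.0\" 200 185524 "
            + "\"http://cos.name/category/software/packages/\" "
            + "\"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) "
            + "Chrome/29.0.1547.66 Safari/537.36\"";
        Matcher m = parse(line);
        System.out.println("remote_addr = " + m.group(1)); // 60.208.6.156
        System.out.println("status      = " + m.group(6)); // 200
    }
}
```

The regex is stricter than a whitespace split: it keeps the quoted request, referer, and user agent intact even though they contain spaces.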
2. KPI Design
Typical KPIs for this kind of log:
PV: page view count per page
IP: distinct visiting IPs per page
Time: number of visits per hour
Source: visits by referring domain
Browser: visits by client browser
3. Hadoop Algorithm Design
PV: page view count per page
Map: key = request, value = 1
Reduce: key = request, value = sum of the 1s
IP: distinct visiting IPs per page
Map: key = request, value = remote_addr
Reduce: key = request, value = count of remote_addr values after de-duplication
Time: number of visits per hour
Map: key = time_local (bucketed to the hour), value = 1
Reduce: key = time_local, value = sum of the 1s
Source: visits by referring domain
Map: key = http_referer, value = 1
Reduce: key = http_referer, value = sum of the 1s
Browser: visits by client browser
Map: key = http_user_agent, value = 1
Reduce: key = http_user_agent, value = sum of the 1s
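Only the PV job is implemented in full in this article. For the IP metric, the de-duplicate-then-count step can be sketched as a plain Java method; `DistinctIp` is a hypothetical helper, and in a real Reducer the `Iterable` would hold the shuffled remote_addr values for one request key:

```java
import java.util.HashSet;
import java.util.Set;

public class DistinctIp {
    // Reduce step for the IP metric: count distinct remote_addr values per request key
    public static int distinctCount(Iterable<String> remoteAddrs) {
        Set<String> seen = new HashSet<>();
        for (String ip : remoteAddrs) {
            seen.add(ip); // duplicates are ignored by the set
        }
        return seen.size();
    }

    public static void main(String[] args) {
        // values shuffled to one "request" key
        java.util.List<String> ips = java.util.Arrays.asList(
            "60.208.6.156", "60.208.6.156", "10.0.0.1");
        System.out.println(distinctCount(ips)); // prints 2
    }
}
```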
Below, we design the MapReduce program using PV (page view counting) as the example.
4. MapReduce Implementation
1) Parse each log line
2) Map phase
3) Reduce phase
KPI.java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

public class KPI {
    private String remote_addr;      // client IP address
    private String remote_user;      // user name
    private String time_local;       // access time
    private String request;          // requested URL and HTTP protocol
    private String status;           // response status code
    private String body_bytes_sent;  // bytes sent to the client
    private String http_referer;     // referring page
    private String http_user_agent;  // client browser information
    private boolean valid = true;    // whether the record parsed cleanly

    public String getRemote_addr() { return remote_addr; }
    public void setRemote_addr(String remote_addr) { this.remote_addr = remote_addr; }
    public String getRemote_user() { return remote_user; }
    public void setRemote_user(String remote_user) { this.remote_user = remote_user; }
    public String getTime_local() { return time_local; }
    public Date getTime_local_Date() throws ParseException {
        SimpleDateFormat df = new SimpleDateFormat("dd/MMM/yyyy:HH:mm:ss", Locale.US);
        return df.parse(this.time_local);
    }
    public String getTime_local_Date_hour() throws ParseException {
        SimpleDateFormat df = new SimpleDateFormat("yyyyMMddHH");
        return df.format(this.getTime_local_Date());
    }
    public void setTime_local(String time_local) { this.time_local = time_local; }
    public String getRequest() { return request; }
    public void setRequest(String request) { this.request = request; }
    public String getStatus() { return status; }
    public void setStatus(String status) { this.status = status; }
    public String getBody_bytes_sent() { return body_bytes_sent; }
    public void setBody_bytes_sent(String body_bytes_sent) { this.body_bytes_sent = body_bytes_sent; }
    public String getHttp_referer() { return http_referer; }
    public void setHttp_referer(String http_referer) { this.http_referer = http_referer; }
    public String getHttp_user_agent() { return http_user_agent; }
    public void setHttp_user_agent(String http_user_agent) { this.http_user_agent = http_user_agent; }
    public boolean isValid() { return valid; }
    public void setValid(boolean valid) { this.valid = valid; }

    public void parser(String line) {
        String[] arr = line.split(" ");
        if (arr.length > 11) {
            this.setRemote_addr(arr[0]);
            this.setRemote_user(arr[1]);
            // strip the leading '[' so getTime_local_Date() can parse the time
            this.setTime_local(arr[3].substring(1));
            this.setRequest(arr[6]);
            this.setStatus(arr[8]);
            this.setBody_bytes_sent(arr[9]);
            this.setHttp_referer(arr[10]);
            this.setHttp_user_agent(arr[11]);
            this.setValid(true);
            if (Integer.parseInt(this.getStatus()) >= 400) {
                // error responses are treated as invalid records
                this.setValid(false);
            }
        } else {
            this.setValid(false);
        }
    }
}
The KPI class parses a single log record and stores the extracted fields.
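The Time metric depends on getTime_local_Date_hour() to bucket records by hour. That conversion can be checked in isolation; `HourBucket` is a hypothetical standalone class mirroring the two KPI date methods:

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

public class HourBucket {
    // Mirrors KPI.getTime_local_Date_hour(): "18/Sep/2013:06:49:48" -> "2013091806"
    public static String toHour(String timeLocal) throws ParseException {
        SimpleDateFormat in = new SimpleDateFormat("dd/MMM/yyyy:HH:mm:ss", Locale.US);
        Date d = in.parse(timeLocal);
        return new SimpleDateFormat("yyyyMMddHH").format(d);
    }

    public static void main(String[] args) throws ParseException {
        System.out.println(toHour("18/Sep/2013:06:49:48")); // prints 2013091806
    }
}
```

Because parsing and formatting both use the JVM's default time zone, the result is the same wall-clock hour on any machine.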
KPIPV.java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class KPIPV {

    public static class MapClass extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            KPI kpi = new KPI();
            kpi.parser(value.toString());
            if (kpi.isValid()) {
                // emit (request, 1) for each valid record
                word.set(kpi.getRequest());
                context.write(word, one);
            }
        }
    }

    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int count = 0;
            for (IntWritable val : values) {
                count += val.get();
            }
            context.write(key, new IntWritable(count));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: KPIPV <in> <out>");
            System.exit(2);
        }
        Job job = new Job(conf, "KPIPV");
        job.setJarByClass(KPIPV.class);
        job.setMapperClass(MapClass.class);
        //job.setCombinerClass(Reduce.class);
        job.setReducerClass(Reduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.setInputPaths(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
The data source for this example is the Web access log of a personal site (provided by bsspirit); you can also obtain a site's logs through Baidu Analytics (百度统计).
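The map-shuffle-reduce flow of KPIPV can be sanity-checked without a cluster by letting a HashMap play the role of the shuffle and sum. `LocalPv` is a hypothetical sketch that repeats the same whitespace split and validity rule as KPI.parser (field 6 is the request, field 8 the status):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LocalPv {
    // Simulates KPIPV: map emits (request, 1); the HashMap merge is the shuffle + reduce sum
    public static Map<String, Integer> pv(List<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            String[] arr = line.split(" ");
            // same validity rule as KPI.parser: enough fields, status below 400
            if (arr.length > 11 && Integer.parseInt(arr[8]) < 400) {
                counts.merge(arr[6], 1, Integer::sum); // reduce: sum the 1s per request
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = new ArrayList<>();
        String tail = " \"-\" \"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36\"";
        lines.add("1.2.3.4 - - [18/Sep/2013:06:49:48 +0000] \"GET /a HTTP/1.0\" 200 100" + tail);
        lines.add("1.2.3.4 - - [18/Sep/2013:06:50:00 +0000] \"GET /a HTTP/1.0\" 200 100" + tail);
        lines.add("5.6.7.8 - - [18/Sep/2013:06:51:00 +0000] \"GET /b HTTP/1.0\" 404 0" + tail);
        System.out.println(pv(lines)); // prints {/a=2}; the 404 line is dropped
    }
}
```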
5. Running from Eclipse
Set the input and output directories as program arguments:
hdfs://localhost:9000/user/root/access.log.10 hdfs://localhost:9000/user/root/output9
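Outside Eclipse, the same job can be submitted from the command line; the jar name kpi.jar and the local log file name are placeholders for whatever your build produces:

```shell
# upload the sample log to HDFS (paths match the arguments above)
hdfs dfs -put access.log.10 hdfs://localhost:9000/user/root/access.log.10

# submit the job; kpi.jar should contain KPIPV and KPI
hadoop jar kpi.jar KPIPV \
    hdfs://localhost:9000/user/root/access.log.10 \
    hdfs://localhost:9000/user/root/output9

# inspect the PV results
hdfs dfs -cat hdfs://localhost:9000/user/root/output9/part-r-00000
```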
6. Output
Sample of the reducer output (each line is a request followed by its PV count):
/r-rserve-nodejs/?cf_action=sync_comments&post_id=17695
/r-rserve-nodejs/feed/    1
/r-rstudio-server/    2
/r-rstudio-server/?cf_action=sync_comments&post_id=15062
/rhadoop-demo-email/    3
/rhadoop-demo-email/?cf_action=sync_comments&post_id=3081
/rhadoop-hadoop    2
/rhadoop-hadoop/    10
/rhadoop-hadoop/?cf_action=sync_comments&post_id=872
/rhadoop-hadoop/feed/    1
/rhadoop-hbase-rhase/    4
/rhadoop-hbase-rhase/?cf_action=sync_comments&post_id=972
/rhadoop-hbase-rhase/feed/    1
/rhadoop-java-basic/    3
Source code and data: https://github.com/y521263/Hadoop_in_Action
References:
http://blog.fens.me/hadoop-mapreduce-log-kpi/