Hive导入Apache Nginx等日志与分析
来源:互联网 发布:博购软件 编辑:程序博客网 时间:2024/04/28 10:25
Hive导入Apache Nginx等日志与分析
将nginx日志导入到hive中的两种方法
1 在hive中建表
导入后日志格式为
第二种方法导入
注意:这个方法在建表后,使用查询语句等前要先执行
hive> add jar /home/hjl/hive/lib/hive_contrib.jar;
或者设置hive/conf/hive-default.conf 添加
<property>
<name>hive.aux.jars.path</name>
<value>file:///usr/local/Hadoop/hive/lib/hive-contrib-0.7.0-cdh3u0.jar</value>
</property>
保存配置
203.208.60.91 - - [05/May/2011:01:18:47 +0800] "GET /robots.txt HTTP/1.1" 404 1238 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
如果文件数据是纯文本,可以使用 STORED AS TEXTFILE。如果数据需要压缩,使用 STORED AS SEQUENCE 。
导入日志命令
hive>load data local inpath '/home/log/map.gz' overwrite into table log;
导入日志支持.gz等格式
导入日志后进行分析 例句
统计行数
select count(*) from nginxlog;
统计IP数
select count(DISTINCT ip) from nginxlog;
排行
select t2.ip,t2.xx from (SELECT ip, COUNT(*) AS xx FROM nginxlog GROUP by ip) t2 sort by t2.xx desc
hive>SELECT * from apachelog WHERE ipaddress = '216.211.123.184';
hive> SELECT ipaddress, COUNT(1) AS numrequest FROM apachelog GROUP BY ipaddress SORT BY numrequest DESC LIMIT 1;
hive> set mapred.reduce.tasks=2;
hive> SELECT ipaddress, COUNT(1) AS numrequest FROM apachelog GROUP BY ipaddress SORT BY numrequest DESC LIMIT 1;
hive>CREATE TABLE ipsummary (ipaddress STRING, numrequest INT);
hive>INSERT OVERWRITE TABLE ipsummary SELECT ipaddress, COUNT(1) FROM apachelog GROUP BY ipaddress;
hive>SELECT ipsummary.ipaddress, ipsummary.numrequest FROM (SELECT MAX(numrequest) AS themax FROM ipsummary) ipsummarymax JOIN ipsummary ON ipsummarymax.themax = ipsummary.numrequest;
hive查询结果导出为csv的方法(未测试)
hive> set hive.io.output.fileformat=CSVTextFile;
hive> insert overwrite local directory '/tmp/CSVrepos/' select * from S where ... ;
1 在hive中建表
- CREATE TABLE apachelog (ipaddress STRING, identd STRING, user STRING,finishtime STRING,requestline string, returncode INT, size INT,referer string,agent string) ROW FORMAT SERDE'org.apache.Hadoop.hive.serde2.dynamic_type.DynamicSerDe'WITH SERDEPROPERTIES ('serialization.format'='org.apache.hadoop.hive.serde2.thrift.TCTLSeparatedProtocol','quote.delim'='("|\\[|\\])','field.delim'=' ','serialization.null.format'='-')STORED AS TEXTFILE;
导入后日志格式为
203.208.60.91 - - 05/May/2011:01:18:47 +0800 GET /robots.txt HTTP/1.1 404 1238 Mozilla/5.0
此方法支持hive中函数parse_url(referer,"HOST")
第二种方法导入
注意:这个方法在建表后,使用查询语句等前要先执行
hive> add jar /home/hjl/hive/lib/hive_contrib.jar;
或者设置hive/conf/hive-default.conf 添加
<property>
<name>hive.aux.jars.path</name>
<value>file:///usr/local/Hadoop/hive/lib/hive-contrib-0.7.0-cdh3u0.jar</value>
</property>
保存配置
- CREATE TABLE apilog20110505 (ipaddress STRING,identity STRING,user STRING,time STRING,request STRING,protocol STRING,status STRING,size STRING,referer STRING,agent STRING) ROW FORMAT SERDE'org.apache.Hadoop.hive.contrib.serde2.RegexSerDe' WITH SERDEPROPERTIES ("input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*) ([^ ]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\".*\") ([^ \"]*|\".*\"))?","output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s %10$s")STORED AS TEXTFILE;
203.208.60.91 - - [05/May/2011:01:18:47 +0800] "GET /robots.txt HTTP/1.1" 404 1238 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
此方法中的字段类型string from deserializer 经测试不支持parse_url(referer,"HOST")获取域名
可以用select split(referer,"/")[2] from apilog 获取域名如果文件数据是纯文本,可以使用 STORED AS TEXTFILE。如果数据需要压缩,使用 STORED AS SEQUENCE 。
导入日志命令
hive>load data local inpath '/home/log/map.gz' overwrite into table log;
导入日志支持.gz等格式
导入日志后进行分析 例句
统计行数
select count(*) from nginxlog;
统计IP数
select count(DISTINCT ip) from nginxlog;
排行
select t2.ip,t2.xx from (SELECT ip, COUNT(*) AS xx FROM nginxlog GROUP by ip) t2 sort by t2.xx desc
hive>SELECT * from apachelog WHERE ipaddress = '216.211.123.184';
hive> SELECT ipaddress, COUNT(1) AS numrequest FROM apachelog GROUP BY ipaddress SORT BY numrequest DESC LIMIT 1;
hive> set mapred.reduce.tasks=2;
hive> SELECT ipaddress, COUNT(1) AS numrequest FROM apachelog GROUP BY ipaddress SORT BY numrequest DESC LIMIT 1;
hive>CREATE TABLE ipsummary (ipaddress STRING, numrequest INT);
hive>INSERT OVERWRITE TABLE ipsummary SELECT ipaddress, COUNT(1) FROM apachelog GROUP BY ipaddress;
hive>SELECT ipsummary.ipaddress, ipsummary.numrequest FROM (SELECT MAX(numrequest) AS themax FROM ipsummary) ipsummarymax JOIN ipsummary ON ipsummarymax.themax = ipsummary.numrequest;
hive查询结果导出为csv的方法(未测试)
hive> set hive.io.output.fileformat=CSVTextFile;
hive> insert overwrite local directory '/tmp/CSVrepos/' select * from S where ... ;
0 0
- hive导入apache nginx等日志与分析
- Hive导入Apache Nginx等日志与分析
- hive导入 nginx 或 apache 日志
- hive导入nginx日志
- hive 分析apache日志
- hadoop + hive apache nginx 访问日志分析 (一)
- 用Hive分析nginx日志
- 用Hive分析Nginx日志
- Nginx/Apache日志分析脚本
- Awstats 日志分析 Apache nginx
- hive 分析nginx的access.log日志
- 用python分析apache等web日志
- nginx与apache日志格式的区别
- 使用Hive的正则解析器RegexSerDe分析nginx日志
- 用Hive分析nginx日志——1
- 通过Hive及其Udf函数进行Nginx日志分析
- hive分析nginx日志之UDF清洗数据
- Web服务器- apache与nginx分析
- Oracle连接数查询,处理
- 在Spring下集成ActiveMQ
- win7下libevent实现的rot13服务器
- Android Fragment 深度解析
- 九度OJ 1351 数组中只出现一次的数字 (位操作)
- Hive导入Apache Nginx等日志与分析
- sqlserver 判断当前数据是否存在 不存在则插入
- 查看oracle用户数据库连接数 ,游标数等
- 卢布贬值的前因与后果
- Makefile中= := ?= +=区别
- windows phone关于页面跳转的总结
- ibm aix常用命令
- 从对呼格案辩护律师群起而攻有感而发
- mySql导入导出sql语句(.txt)