Hive: Using RegexSerDe to Process Standard-Format Apache Web Logs
Thanks to the original author: http://blog.csdn.net/niityzu/article/details/42103297
This post walks through an example of using RegexSerDe to process Apache web logs in the standard (combined) format and then run some simple statistics on them. The Hive version used here is apache-hive-0.13.1-bin.
1. Create the table serde_regex in Hive
- CREATE TABLE serde_regex(
- host STRING,
- identity STRING ,
- user STRING,
- time STRING,
- request STRING,
- status STRING,
- size STRING,
- referer STRING,
- agent STRING )
- ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
- WITH SERDEPROPERTIES (
- "input.regex" = "([\\d|.]+)\\s+([^ ]+)\\s+([^ ]+)\\s+\\[(.+)\\]\\s+\"([^ ]+)\\s(.+)\\s([^ ]+)\"\\s+([^ ]+)\\s+([^ ]+)\\s+\"(.+)\"\\s+\"(.+)\"?",
- "output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s"
- )
- STORED AS TEXTFILE ;
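Before loading any data, the regex is worth sanity-checking outside Hive. The Python sketch below (the log line is invented, in the standard combined format) applies the same pattern. One mismatch is worth knowing about: the pattern defines 11 capture groups but the table only 9 columns, because the quoted request line is split into method/URI/protocol. The contrib RegexSerDe appears to fill columns from capture groups in order, which explains the query output in step 3, where request holds only the HTTP method and agent actually holds the response size.

```python
import re

# The pattern from "input.regex" above, with Hive's Java-string double
# backslashes collapsed to single ones ("\\d" in the DDL is the regex \d).
LOG_PATTERN = re.compile(
    r'([\d|.]+)\s+([^ ]+)\s+([^ ]+)\s+\[(.+)\]\s+'
    r'"([^ ]+)\s(.+)\s([^ ]+)"\s+'
    r'([^ ]+)\s+([^ ]+)\s+"(.+)"\s+"(.+)"?'
)

# A made-up log line in the Apache combined format (the real sample data
# is in the attachment mentioned below).
line = ('61.160.224.138 - - [23/Dec/2014:14:30:00 +0800] '
        '"GET /index.html HTTP/1.1" 200 7519 '
        '"http://example.com/" "Mozilla/5.0"')

m = LOG_PATTERN.match(line)
groups = m.groups()

# 11 capture groups vs. 9 table columns: columns are filled from groups
# in order, so the last two groups (referer, user agent) are dropped.
print(len(groups))    # 11
print(groups[0])      # 61.160.224.138  -> column "host"
print(groups[4])      # GET             -> column "request" (method only)
print(groups[8])      # 7519            -> column "agent" (really the size)

# Quirk: the final quote is optional ('"?'), so the greedy group 11
# keeps the user agent's trailing quote.
print(groups[10])     # Mozilla/5.0"
```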
2. Load data into the table
The format of the Apache web log data can be seen in the attachment (download link available in the original post).
Load the sample data into the table:
- hive> LOAD DATA LOCAL INPATH "./data/apache_log.txt" INTO TABLE serde_regex;
3. Query and analyze
The first query failed with the error below. The root cause is that the RegexSerDe class cannot be found, i.e. the jar containing that class is missing from the Hive session's classpath.
- hive> select host,request,agent from serde_regex limit 10;
- Total jobs = 1
- Launching Job 1 out of 1
- Number of reduce tasks is set to 0 since there's no reduce operator
- Starting Job = job_1419317102229_0001, Tracking URL = http:
- Kill Command = /home/hadoopUser/cloud/hadoop/programs/hadoop-2.2.0/bin/hadoop job -kill job_1419317102229_0001
- Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
- 2014-12-23 14:46:15,389 Stage-1 map = 0%, reduce = 0%
- 2014-12-23 14:47:02,249 Stage-1 map = 100%, reduce = 0%
- Ended Job = job_1419317102229_0001 with errors
- Error during job, obtaining debugging information...
- Job Tracking URL: http:
- Examining task ID: task_1419317102229_0001_m_000000 (and more) from job job_1419317102229_0001
-
- Task with the most failures(4):
- -----
- Task ID:
- task_1419317102229_0001_m_000000
-
- URL:
- http:
- -----
- Diagnostic Messages for this Task:
- Error: java.lang.RuntimeException: Error in configuring object
- at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:109)
- at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:75)
- at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
- at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:425)
- at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
- at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:162)
- at java.security.AccessController.doPrivileged(Native Method)
- at javax.security.auth.Subject.doAs(Subject.java:415)
- at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
- at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:157)
- Caused by: java.lang.reflect.InvocationTargetException
- at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
- at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
- at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
- at java.lang.reflect.Method.invoke(Method.java:606)
- at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:106)
- ... 9 more
- Caused by: java.lang.RuntimeException: Error in configuring object
- at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:109)
- at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:75)
- at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
- at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:38)
- ... 14 more
- Caused by: java.lang.reflect.InvocationTargetException
- at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
- at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
- at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
- at java.lang.reflect.Method.invoke(Method.java:606)
- at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:106)
- ... 17 more
- Caused by: java.lang.RuntimeException: Map operator initialization failed
- at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.configure(ExecMapper.java:154)
- ... 22 more
- Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.ClassNotFoundException: Class org.apache.hadoop.hive.contrib.serde2.RegexSerDe not found
- at org.apache.hadoop.hive.ql.exec.MapOperator.getConvertedOI(MapOperator.java:335)
- at org.apache.hadoop.hive.ql.exec.MapOperator.setChildren(MapOperator.java:353)
- at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.configure(ExecMapper.java:123)
- ... 22 more
- Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.hive.contrib.serde2.RegexSerDe not found
- at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:1626)
- at org.apache.hadoop.hive.ql.exec.MapOperator.getConvertedOI(MapOperator.java:305)
- ... 24 more
4. Fixing the exception
Add hive-contrib-0.13.1.jar to the Hive session. The jar lives in the lib directory of the Hive installation; the command is:
- hive> add jar /home/hadoopUser/cloud/hive/apache-hive-0.13.1-bin/lib/hive-contrib-0.13.1.jar;
- Added /home/hadoopUser/cloud/hive/apache-hive-0.13.1-bin/lib/hive-contrib-0.13.1.jar to class path
- Added resource: /home/hadoopUser/cloud/hive/apache-hive-0.13.1-bin/lib/hive-contrib-0.13.1.jar
Re-run the SELECT query from step 3; this time it succeeds:
- hive> select host,request,agent from serde_regex limit 10;
- Total jobs = 1
- Launching Job 1 out of 1
- Number of reduce tasks is set to 0 since there's no reduce operator
- Starting Job = job_1419317102229_0002, Tracking URL = http://secondmgt:8088/proxy/application_1419317102229_0002/
- Kill Command = /home/hadoopUser/cloud/hadoop/programs/hadoop-2.2.0/bin/hadoop job -kill job_1419317102229_0002
- Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
- 2014-12-23 14:48:50,163 Stage-1 map = 0%, reduce = 0%
- 2014-12-23 14:49:01,666 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.31 sec
- MapReduce Total cumulative CPU time: 3 seconds 310 msec
- Ended Job = job_1419317102229_0002
- MapReduce Jobs Launched:
- Job 0: Map: 1 Cumulative CPU: 3.31 sec HDFS Read: 4321 HDFS Write: 238 SUCCESS
- Total MapReduce CPU Time Spent: 3 seconds 310 msec
- OK
- 61.160.224.138 GET 7519
- 61.160.224.138 GET 709
- 61.160.224.138 GET 815
- 113.17.174.44 POST 653
- 61.160.224.138 GET 1670
- 61.160.224.144 GET 2887
- 61.160.224.143 GET 2947
- 61.160.224.145 GET 2581
- 61.160.224.145 GET 2909
- 61.160.224.144 GET 15879
- Time taken: 26.811 seconds, Fetched: 10 row(s)
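With the table now queryable, the statistics mentioned in the introduction are ordinary HiveQL. For example, a sketch (not run here) counting requests per client IP:

```sql
SELECT host, count(*) AS cnt
FROM serde_regex
GROUP BY host
ORDER BY cnt DESC;
```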
This fixes the problem, but only for the current session: after leaving Hive with exit and starting it again, the error reappears.
Open question: there should be a way to fix this permanently, for example by placing the jar under Hadoop's lib directory and restarting the cluster, but I have not tried this and it remains to be verified. (Note also that Hive ships a built-in org.apache.hadoop.hive.serde2.RegexSerDe in hive-serde that needs no ADD JAR, though as far as I know it does not support output.format.string.)
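There are also session-persistent alternatives that do not touch Hadoop's lib directory (untested here; the paths follow the ones used above): have the Hive CLI load the jar on every start via .hiverc, or declare it as an auxiliary jar.

```shell
# Option 1: the Hive CLI executes $HOME/.hiverc on startup,
# so the ADD JAR command can live there.
echo 'add jar /home/hadoopUser/cloud/hive/apache-hive-0.13.1-bin/lib/hive-contrib-0.13.1.jar;' \
  >> ~/.hiverc

# Option 2: declare the jar as an auxiliary jar in hive-site.xml:
#   <property>
#     <name>hive.aux.jars.path</name>
#     <value>file:///home/hadoopUser/cloud/hive/apache-hive-0.13.1-bin/lib/hive-contrib-0.13.1.jar</value>
#   </property>
```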