Using Hive for Mobile Traffic Statistics


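While interviewing recently, I found that many companies use Hive to process their data.
Hive is a member of the Hadoop family: a framework that parses SQL-like statements. It compiles them into MapReduce jobs, so you can work with tables stored on HDFS much as you would run ordinary SQL.
Hive tables come in two kinds: internal (managed) tables and external tables.
When data is loaded into an internal table, Hive moves it into the path of its data warehouse. An external table only records where the data lives and does not change its location.
When a table is dropped, an internal table's metadata and data are deleted together, whereas dropping an external table removes only the metadata and leaves the data in place. External tables are therefore somewhat safer, allow more flexible data organization, and make it easy to share the source data.
Both kinds of table also support partitions, which avoid full-table scans and speed up lookups. A later article will cover them.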
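
To make the difference concrete, here is a minimal HiveQL sketch of an external table. The table name ll_ext and the directory /data/http_logs are hypothetical and chosen only for illustration; the column list mirrors the traffic-log schema used later in this article.

```sql
-- Hypothetical external table over files that already sit in /data/http_logs.
-- Hive records the location only; the files are not moved.
CREATE EXTERNAL TABLE ll_ext (
  reportTime  string,
  msisdn      string,
  apmac       string,
  acmac       string,
  host        string,
  siteType    string,
  upPackNum   bigint,
  downPackNum bigint,
  upPayLoad   bigint,
  downPayLoad bigint,
  httpStatus  string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/http_logs';

-- Dropping the external table removes only its metadata;
-- the files under /data/http_logs remain on HDFS.
DROP TABLE ll_ext;
```

Running the same DROP TABLE against an internal table would also delete the table's directory in the warehouse.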

Now for the main topic: counting mobile traffic with Hive.
Raw data:

1363157985066   13726230503     00-FD-07-A4-72-B8:CMCC  120.196.100.82  i02.c.aliimg.com                24      27      2481    24681   200
1363157995052   13826544101     5C-0E-8B-C7-F1-E0:CMCC  120.197.40.4                    4       0       264     0       200
1363157991076   13926435656     20-10-7A-28-CC-0A:CMCC  120.196.100.99                  2       4       132     1512    200
1363154400022   13926251106     5C-0E-8B-8B-B1-50:CMCC  120.197.40.4                    4       0       240     0       200
1363157993044   18211575961     94-71-AC-CD-E6-18:CMCC-EASY     120.196.100.99  iface.qiyi.com  视频网站        15   2      1527    2106    200
1363157995074   84138413        5C-0E-8B-8C-E8-20:7DaysInn      120.197.40.4    122.72.52.12            20      16      4116    1432    200
1363157993055   13560439658     C4-17-FE-BA-DE-D9:CMCC  120.196.100.99                  18      15      1116    954     200
1363157995033   15920133257     5C-0E-8B-C7-BA-20:CMCC  120.197.40.4    sug.so.360.cn   信息安全        20      20   156    2936    200

Steps:

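# Once Hive is configured, start it with the hive command. Hive loads lazily at startup, so this is fairly slow.
hive
# Use show databases to list the available databases.
hive> show databases;
OK
default
hive
Time taken: 3.389 seconds
# Switch to a specific database with use. The hive database is one I created earlier.
use hive;
# Create the table. This is an internal table: loading data from HDFS into it moves (cuts) the source file away from its original path.
# An external table declared over the data's existing location avoids this, so external tables are recommended in production.
create table ll(
  reportTime string,
  msisdn string,
  apmac string,
  acmac string,
  host string,
  siteType string,
  upPackNum bigint,
  downPackNum bigint,
  upPayLoad bigint,
  downPayLoad bigint,
  httpStatus string
)
row format delimited fields terminated by '\t';
# Load the data. This loads from HDFS; loading from the local Linux filesystem is also possible but needs the local keyword.
load data inpath '/HTTP_20130313143750.dat' into table ll;
# After the load completes, the source file on HDFS has been moved into the table's warehouse directory.
# Run a SQL-like Hive statement to aggregate the data.
select msisdn,sum(uppacknum),sum(downpacknum),sum(uppayload),sum(downpayload) from ll group by msisdn;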
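
As a sketch of the local-filesystem variant mentioned in the steps, the statements below load the same file from Linux instead of HDFS. The directory /home/hadoop is an assumption made for illustration; only the file name and the table ll come from this article.

```sql
-- LOCAL makes Hive read from the client machine's filesystem.
-- The file is copied into the table's directory, so the local copy stays in place.
-- /home/hadoop is a hypothetical location for the data file.
LOAD DATA LOCAL INPATH '/home/hadoop/HTTP_20130313143750.dat' INTO TABLE ll;

-- Quick sanity check that the tab-delimited columns were parsed as expected.
SELECT * FROM ll LIMIT 5;
```

Adding OVERWRITE before INTO TABLE replaces the table's current contents instead of appending to them, which is useful when re-running the load.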

The result is as follows:

hive> select msisdn,sum(uppacknum),sum(downpacknum),sum(uppayload),sum(downpayload) from ll group by msisdn;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_201307160252_0006, Tracking URL = http://hadoop0:50030/jobdetails.jsp?jobid=job_201307160252_0006
Kill Command = /usr/local/hadoop/libexec/../bin/hadoop job  -Dmapred.job.tracker=hadoop0:9001 -kill job_201307160252_0006
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2013-07-17 19:51:42,599 Stage-1 map = 0%,  reduce = 0%
2013-07-17 19:52:40,474 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 48.5 sec
2013-07-17 19:52:41,690 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 48.5 sec
2013-07-17 19:52:42,693 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 48.5 sec
2013-07-17 19:52:43,698 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 48.5 sec
2013-07-17 19:52:44,702 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 48.5 sec
2013-07-17 19:52:45,707 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 48.5 sec
2013-07-17 19:52:46,712 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 48.5 sec
2013-07-17 19:52:47,715 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 48.5 sec
2013-07-17 19:52:48,721 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 48.5 sec
2013-07-17 19:52:49,758 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 48.5 sec
2013-07-17 19:52:50,763 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 48.5 sec
2013-07-17 19:52:51,772 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 50.0 sec
2013-07-17 19:52:52,775 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 50.0 sec
2013-07-17 19:52:53,779 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 50.0 sec
MapReduce Total cumulative CPU time: 50 seconds 0 msec
Ended Job = job_201307160252_0006
MapReduce Jobs Launched:
Job 0: Map: 1  Reduce: 1   Cumulative CPU: 50.0 sec   HDFS Read: 2787075 HDFS Write: 16518 SUCCESS
Total MapReduce CPU Time Spent: 50 seconds 0 msec
OK
13402169727     171108112861302301
13415807477     20671683169668199418
13416127574     15011094161963802756
13416171820     113991063032120
13417106524     1601281868813088
13418002498     2402562213686896
13418090588     4563519893467470
13418117364     2641522943649966
13418173218     3768048348226128673159722
13418666750     2243226482139564839735552
13420637670     202014801480
......
Time taken: 75.24 seconds
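
The output above carries no column names, which makes the four sums hard to tell apart. The sketch below shows one way to label them. The alias names are my own choice rather than part of the original query, and hive.cli.print.header is a Hive CLI setting that prints a header row above the results.

```sql
-- Print column names above the result rows in the Hive CLI.
SET hive.cli.print.header=true;

-- Same aggregation as above, with readable aliases (alias names are illustrative).
SELECT msisdn,
       SUM(upPackNum)   AS up_packets,
       SUM(downPackNum) AS down_packets,
       SUM(upPayLoad)   AS up_bytes,
       SUM(downPayLoad) AS down_bytes
FROM ll
GROUP BY msisdn
ORDER BY down_bytes DESC
LIMIT 10;
```

ORDER BY asks for a global sort, so this version will likely take somewhat longer than the plain GROUP BY. Drop the ORDER BY and LIMIT clauses to reproduce the original result set with headers.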

Using MapReduce for mobile traffic statistics: http://www.maoxiangyi.cn/index.php/archives/256
