Using Hive to Analyze ~500 MB of Web Browsing Data


I recently got hold of user browsing data from Sogou; given its sheer volume, I decided to analyze it with Hive.

1. Create the database sogou

hive> create database sogou;

OK

Time taken: 0.149 seconds
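The transcript does not show it, but the subsequent statements run inside this database (the load output in step 3 reports the table as sogou.sougou), so a use statement was presumably issued first:

hive> use sogou;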

2. Create the internal (managed) table sougou with the following columns:

dt          string   (access time)
userid      string   (user ID)
topic       string   (search keyword)
page_num    int      (rank of this URL among the returned search results)
click_num   int      (order in which the user clicked this URL)
url         string   (the URL the user clicked)

hive> create table sougou(
    > dt string,
    > userid string,
    > topic string,
    > page_num int,
    > click_num int,
    > url string)
    > row format delimited fields terminated by '\t';
OK
Time taken: 0.062 seconds
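Because sougou is a managed (internal) table, dropping it would delete the underlying data files along with the metadata. If in doubt, the table type can be checked with describe formatted, whose output includes a Table Type line (MANAGED_TABLE here):

hive> describe formatted sougou;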

3. Load the data into the sougou table (here the file is loaded from the local filesystem)

hive> load data local inpath '/opt/hive/sogou.500w.utf8' overwrite into table sougou;
Loading data to table sogou.sougou
Table sogou.sougou stats: [numFiles=1, numRows=0, totalSize=573670020, rawDataSize=0]
OK
Time taken: 49.405 seconds
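As a quick sanity check (not part of the original transcript), peek at a few rows; with a bare select * and a limit clause, Hive can usually answer directly from the files without launching a MapReduce job:

hive> select * from sougou limit 3;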

4. Create the external partitioned table sogou1 (to be populated by dynamic-partition insert in step 5)

hive> create external table sogou1(
    > dt string,
    > userid string,
    > topic string,
    > page_num int,
    > click_num int,
    > url string)
    > partitioned by (click int)
    > row format delimited fields terminated by '\t'
    > location '/hivetest/sogou1';
OK
Time taken: 0.128 seconds
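Two points worth noting: click is a partition column, so its value is encoded in the directory layout rather than stored inside the data files, and because the table is external, dropping it later removes only the metadata while the files under /hivetest/sogou1 stay on HDFS. The (initially empty) location can be inspected from the Hive shell:

hive> dfs -ls /hivetest/sogou1;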

5. Query the data from the sougou table and insert it into sogou1

hive> insert overwrite table sogou1 partition(click) select *, click_num from sougou;
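If this statement fails with a complaint about dynamic-partition strict mode, dynamic partitioning must be enabled first; since the partition clause contains no static value, Hive's default strict mode rejects the insert:

hive> set hive.exec.dynamic.partition=true;
hive> set hive.exec.dynamic.partition.mode=nonstrict;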


The rows queried from sougou are now stored under sogou1's location, one partition per distinct click value (populated from click_num).
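To verify which partitions the dynamic insert created, list them; each partition corresponds to an HDFS subdirectory such as click=1 under /hivetest/sogou1:

hive> show partitions sogou1;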


6. The number of rows in sougou can be queried with count(*)

hive> select count(*) from sougou;
Query ID = root_20171023210127_306f009d-1f93-4355-8e21-75dc1a41fde1
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1508755703117_0003, Tracking URL = http://master:8088/proxy/application_1508755703117_0003/
Kill Command = /usr/local/hadoop-2.6.4/bin/hadoop job -kill job_1508755703117_0003
Hadoop job information for Stage-1: number of mappers: 3; number of reducers: 1
2017-10-23 21:01:36,747 Stage-1 map = 0%,  reduce = 0%
2017-10-23 21:01:54,286 Stage-1 map = 67%,  reduce = 0%, Cumulative CPU 17.66 sec
2017-10-23 21:02:02,551 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 20.88 sec
2017-10-23 21:02:03,603 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 22.22 sec
MapReduce Total cumulative CPU time: 22 seconds 220 msec
Ended Job = job_1508755703117_0003
MapReduce Jobs Launched:
Stage-Stage-1: Map: 3  Reduce: 1   Cumulative CPU: 22.22 sec   HDFS Read: 573700855 HDFS Write: 8 SUCCESS
Total MapReduce CPU Time Spent: 22 seconds 220 msec
OK
5000000
Time taken: 38.433 seconds, Fetched: 1 row(s)

From the two queries below, the same count runs slightly faster against the external partitioned table sogou1 (28.49 s wall clock, 10.51 s CPU) than against the internal table sougou (29.91 s, 11.33 s), though both jobs still read the full ~573 MB.

hive> select count(*) from sogou1;
Query ID = root_20171023223752_c8ebbbc3-650c-4535-81b7-8cb28223ce6f
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1508755703117_0010, Tracking URL = http://master:8088/proxy/application_1508755703117_0010/
Kill Command = /usr/local/hadoop-2.6.4/bin/hadoop job -kill job_1508755703117_0010
Hadoop job information for Stage-1: number of mappers: 3; number of reducers: 1
2017-10-23 22:38:00,842 Stage-1 map = 0%,  reduce = 0%
2017-10-23 22:38:08,227 Stage-1 map = 33%,  reduce = 0%, Cumulative CPU 1.35 sec
2017-10-23 22:38:10,308 Stage-1 map = 67%,  reduce = 0%, Cumulative CPU 5.1 sec
2017-10-23 22:38:17,650 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 9.22 sec
2017-10-23 22:38:19,736 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 10.51 sec
MapReduce Total cumulative CPU time: 10 seconds 510 msec
Ended Job = job_1508755703117_0010
MapReduce Jobs Launched:
Stage-Stage-1: Map: 3  Reduce: 1   Cumulative CPU: 10.51 sec   HDFS Read: 573703475 HDFS Write: 8 SUCCESS
Total MapReduce CPU Time Spent: 10 seconds 510 msec
OK
5000000
Time taken: 28.49 seconds, Fetched: 1 row(s)

hive> select count(*) from sougou;
Query ID = root_20171023223830_aeb85da5-4740-4116-b82e-b73207ed822d
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1508755703117_0011, Tracking URL = http://master:8088/proxy/application_1508755703117_0011/
Kill Command = /usr/local/hadoop-2.6.4/bin/hadoop job -kill job_1508755703117_0011
Hadoop job information for Stage-1: number of mappers: 3; number of reducers: 1
2017-10-23 22:38:36,701 Stage-1 map = 0%,  reduce = 0%
2017-10-23 22:38:45,034 Stage-1 map = 33%,  reduce = 0%, Cumulative CPU 2.33 sec
2017-10-23 22:38:46,064 Stage-1 map = 67%,  reduce = 0%, Cumulative CPU 6.04 sec
2017-10-23 22:38:56,396 Stage-1 map = 78%,  reduce = 0%, Cumulative CPU 9.67 sec
2017-10-23 22:38:57,454 Stage-1 map = 100%,  reduce = 22%, Cumulative CPU 10.25 sec
2017-10-23 22:38:59,523 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 11.33 sec
MapReduce Total cumulative CPU time: 11 seconds 330 msec
Ended Job = job_1508755703117_0011
MapReduce Jobs Launched:
Stage-Stage-1: Map: 3  Reduce: 1   Cumulative CPU: 11.33 sec   HDFS Read: 573701453 HDFS Write: 8 SUCCESS
Total MapReduce CPU Time Spent: 11 seconds 330 msec
OK
5000000
Time taken: 29.912 seconds, Fetched: 1 row(s)
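A full-table count reads all ~573 MB whichever table it runs against, so partitioning gains little here. It pays off when a query filters on the partition column, because Hive then prunes the scan to the matching directories; a hypothetical example against sogou1:

hive> select count(*) from sogou1 where click = 1;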

hive> desc sougou;
OK
dt                      string
userid                  string
topic                   string
page_num                int
click_num               int
url                     string
Time taken: 0.148 seconds, Fetched: 6 row(s)

hive> desc sogou1;
OK
dt                      string
userid                  string
topic                   string
page_num                int
click_num               int
url                     string
click                   int

# Partition Information
# col_name              data_type               comment

click                   int
Time taken: 0.099 seconds, Fetched: 12 row(s)


7. Query rows whose URL contains baidu

hive> select * from sogou1 where url like "%baidu%";

(Only part of the query output is shown here.)
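A like predicate on url cannot benefit from partition pruning, so this scans the whole table; if only the number of matches is needed, a count variant (not in the original transcript) keeps the output small:

hive> select count(*) from sogou1 where url like "%baidu%";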


8. Query rows that were ranked first in the search results and were the user's second click

hive> select * from sogou1 where page_num=1 and click_num=2;

(Only part of the query output is shown here.)
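These are records where the top-ranked result was not the user's first click. Because the partition column click was populated from click_num in step 5, the same filter can be expressed against click so that Hive only scans the click=2 partition; a sketch, worth verifying with explain:

hive> select count(*) from sogou1 where page_num = 1 and click = 2;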

9. Group by userid, order by the number of queries per user, and show the top 10 users with the most queries.

hive> select userid, count(*) t from sogou1 group by userid order by t desc limit 10;
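Note that order by in Hive imposes a total ordering through a single reducer. That is harmless here because the group by has already collapsed the data to one row per userid, but it does make the query run as two MapReduce stages (aggregation, then global sort), which can be confirmed with explain:

hive> explain select userid, count(*) t from sogou1 group by userid order by t desc limit 10;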




