Analyzing ~500 MB of Web Log Data with Hive
I recently obtained user browsing data from Sogou. Given its size, I decided to analyze it with Hive.
1. Create the database sogou
hive> create database sogou;
OK
Time taken: 0.149 seconds
2. Create the internal table sougou
The columns are:
dt string (access time)
userid string (user id)
topic string (search keyword)
page_num int (rank of the URL in the returned search results)
click_num int (order in which the user clicked the URL)
url string (the URL the user clicked)
hive> create table sougou(
    > dt string,
    > userid string,
    > topic string,
    > page_num int,
    > click_num int,
    > url string)
    > row format delimited fields terminated by '\t';
OK
Time taken: 0.062 seconds
3. Load the data into the sougou table (here, from the local filesystem)
hive> load data local inpath '/opt/hive/sogou.500w.utf8' overwrite into table sougou;
Loading data to table sogou.sougou
Table sogou.sougou stats: [numFiles=1, numRows=0, totalSize=573670020, rawDataSize=0]
OK
Time taken: 49.405 seconds
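Each line of sogou.500w.utf8 is tab-separated, with fields in the column order listed in step 2. A minimal Python sketch of parsing one such line into a record (the sample line itself is invented for illustration, not taken from the dataset):

```python
# Hypothetical line in the documented format:
# dt \t userid \t topic \t page_num \t click_num \t url
sample = "20111230000005\tu57375476989e\thive tutorial\t1\t1\thttp://www.example.com/"

fields = sample.split("\t")
record = {
    "dt": fields[0],
    "userid": fields[1],
    "topic": fields[2],
    "page_num": int(fields[3]),   # rank in the returned results
    "click_num": int(fields[4]),  # order in which the user clicked
    "url": fields[5],
}
print(record)
```

This mirrors what Hive's delimited SerDe does with `fields terminated by '\t'` when reading the table.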
4. Create the external, dynamically partitioned table sogou1
hive> create external table sogou1(
    > dt string,
    > userid string,
    > topic string,
    > page_num int,
    > click_num int,
    > url string)
    > partitioned by (click int)
    > row format delimited fields terminated by '\t'
    > location '/hivetest/sogou1';
OK
Time taken: 0.128 seconds
5. Insert data from the sougou table into sogou1
hive> insert overwrite table sogou1 partition(click) select *, click_num from sougou;
The query results from the sougou table are stored in sogou1, split by partition.
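A dynamic-partition insert like the one above typically also needs these session settings; with Hive's default strict mode, an insert whose partition columns are all dynamic is rejected:

```sql
-- Run before the insert overwrite ... partition(click) statement
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
```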
6. Use count to get the number of rows in sougou
hive> select count(*) from sougou;
Query ID = root_20171023210127_306f009d-1f93-4355-8e21-75dc1a41fde1
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1508755703117_0003, Tracking URL = http://master:8088/proxy/application_1508755703117_0003/
Kill Command = /usr/local/hadoop-2.6.4/bin/hadoop job -kill job_1508755703117_0003
Hadoop job information for Stage-1: number of mappers: 3; number of reducers: 1
2017-10-23 21:01:36,747 Stage-1 map = 0%, reduce = 0%
2017-10-23 21:01:54,286 Stage-1 map = 67%, reduce = 0%, Cumulative CPU 17.66 sec
2017-10-23 21:02:02,551 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 20.88 sec
2017-10-23 21:02:03,603 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 22.22 sec
MapReduce Total cumulative CPU time: 22 seconds 220 msec
Ended Job = job_1508755703117_0003
MapReduce Jobs Launched:
Stage-Stage-1: Map: 3  Reduce: 1  Cumulative CPU: 22.22 sec  HDFS Read: 573700855 HDFS Write: 8 SUCCESS
Total MapReduce CPU Time Spent: 22 seconds 220 msec
OK
5000000
Time taken: 38.433 seconds, Fetched: 1 row(s)
The two queries below show that counting the external dynamically partitioned table sogou1 is slightly faster than counting the internal table sougou (28.49 s vs 29.912 s).
hive> select count(*) from sogou1;
Query ID = root_20171023223752_c8ebbbc3-650c-4535-81b7-8cb28223ce6f
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1508755703117_0010, Tracking URL = http://master:8088/proxy/application_1508755703117_0010/
Kill Command = /usr/local/hadoop-2.6.4/bin/hadoop job -kill job_1508755703117_0010
Hadoop job information for Stage-1: number of mappers: 3; number of reducers: 1
2017-10-23 22:38:00,842 Stage-1 map = 0%, reduce = 0%
2017-10-23 22:38:08,227 Stage-1 map = 33%, reduce = 0%, Cumulative CPU 1.35 sec
2017-10-23 22:38:10,308 Stage-1 map = 67%, reduce = 0%, Cumulative CPU 5.1 sec
2017-10-23 22:38:17,650 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 9.22 sec
2017-10-23 22:38:19,736 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 10.51 sec
MapReduce Total cumulative CPU time: 10 seconds 510 msec
Ended Job = job_1508755703117_0010
MapReduce Jobs Launched:
Stage-Stage-1: Map: 3  Reduce: 1  Cumulative CPU: 10.51 sec  HDFS Read: 573703475 HDFS Write: 8 SUCCESS
Total MapReduce CPU Time Spent: 10 seconds 510 msec
OK
5000000
Time taken: 28.49 seconds, Fetched: 1 row(s)
hive> select count(*) from sougou;
Query ID = root_20171023223830_aeb85da5-4740-4116-b82e-b73207ed822d
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1508755703117_0011, Tracking URL = http://master:8088/proxy/application_1508755703117_0011/
Kill Command = /usr/local/hadoop-2.6.4/bin/hadoop job -kill job_1508755703117_0011
Hadoop job information for Stage-1: number of mappers: 3; number of reducers: 1
2017-10-23 22:38:36,701 Stage-1 map = 0%, reduce = 0%
2017-10-23 22:38:45,034 Stage-1 map = 33%, reduce = 0%, Cumulative CPU 2.33 sec
2017-10-23 22:38:46,064 Stage-1 map = 67%, reduce = 0%, Cumulative CPU 6.04 sec
2017-10-23 22:38:56,396 Stage-1 map = 78%, reduce = 0%, Cumulative CPU 9.67 sec
2017-10-23 22:38:57,454 Stage-1 map = 100%, reduce = 22%, Cumulative CPU 10.25 sec
2017-10-23 22:38:59,523 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 11.33 sec
MapReduce Total cumulative CPU time: 11 seconds 330 msec
Ended Job = job_1508755703117_0011
MapReduce Jobs Launched:
Stage-Stage-1: Map: 3  Reduce: 1  Cumulative CPU: 11.33 sec  HDFS Read: 573701453 HDFS Write: 8 SUCCESS
Total MapReduce CPU Time Spent: 11 seconds 330 msec
OK
5000000
Time taken: 29.912 seconds, Fetched: 1 row(s)
hive> desc sougou;
OK
dt string
userid string
topic string
page_num int
click_num int
url string
Time taken: 0.148 seconds, Fetched: 6 row(s)
hive> desc sogou1;
OK
dt string
userid string
topic string
page_num int
click_num int
url string
click int
# Partition Information
# col_name    data_type    comment
click int
Time taken: 0.099 seconds, Fetched: 12 row(s)
7. Query rows whose URL contains "baidu"
hive> select * from sogou1 where url like "%baidu%";
(partial query results omitted here)
8. Query rows ranked first in the search results and clicked second
hive> select * from sogou1 where page_num=1 and click_num=2;
(partial query results omitted here)
9. Group by userid, sort by query count, and show the top 10 users with the most queries.
hive> select userid,count(*) t from sogou1 group by userid order by t desc limit 10;
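The same group/count/sort/limit pipeline can be sketched locally in Python, which is a handy way to check the logic before running it over the full table. The userids and topics below are invented sample rows, not data from the Sogou set:

```python
from collections import Counter

# (userid, topic) pairs standing in for rows of the sogou1 table
rows = [
    ("u1", "hive"), ("u2", "hadoop"), ("u1", "spark"),
    ("u3", "hive"), ("u1", "hdfs"), ("u2", "yarn"),
]

# group by userid and count, like: select userid, count(*) t ... group by userid
counts = Counter(userid for userid, _ in rows)

# order by t desc limit 10, like: order by t desc limit 10
top10 = counts.most_common(10)
print(top10)
```

`Counter.most_common(n)` returns the n highest counts in descending order, matching the `order by t desc limit 10` clause of the HiveQL query.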