hive基础操作

来源：互联网发布：紫胤真人知乎编辑：程序博客网时间：2024/05/29 07:34

启动dfs和yarn

start-dfs.shstart-yarn.sh

hive操作

hivehive>show databases;defaulthive>create database test;hive>show databases;defaulttesthive>use test;

内部表

drop表时存储在hdfs上的文件也会被删除

示例：学生表student，选修关系表sc，课程表course

hive>create table student(Sno int,Sname string,Sex string,Sage int,Sdept string)row format delimited fields terminated by ','stored as textfile;hive>create table course(Cno int,Cname string) row format delimited fields terminated by ',' stored as textfile;hive>create table sc(Sno int,Cno int,Grade int)row format delimited fields terminated by ',' stored as textfile;

创建的数据库保存在表DBS中
hdfs上的路径：
/user/hive/warehouse/test.db
其中/user/hive/warehouse在hive-site.xml文件中指定的

创建的表在TBLS中
hdfs上的路径：
/user/hive/warehouse/test.db/student

导入数据

创建文件
student.txt

95001,李勇,男,20,CS95002,刘晨,女,19,IS95003,王敏,女,22,MA95004,张立,男,19,IS95005,刘刚,男,18,MA95006,孙庆,男,23,CS95007,易思玲,女,19,MA95008,李娜,女,18,CS95009,梦圆圆,女,18,MA95010,孔小涛,男,19,CS95011,包小柏,男,18,MA95012,孙花,女,20,CS95013,冯伟,男,21,CS95014,王小丽,女,19,CS95015,王君,男,18,MA95016,钱国,男,21,MA95017,王风娟,女,18,IS95018,王一,女,19,IS95019,邢小丽,女,19,IS95020,赵钱,男,21,IS95021,周二,男,17,MA95022,郑明,男,20,MA

course.txt

1,数据库2,数学3,信息系统4,操作系统5,数据结构6,数据处理

sc.txt

95001,1,8195001,2,8595001,3,8895001,4,7095002,2,9095002,3,8095002,4,7195002,5,6095003,1,8295003,3,9095003,5,10095004,1,8095004,2,9295004,4,9195004,5,7095005,1,7095005,2,9295005,3,9995005,6,8795006,1,7295006,2,6295006,3,10095006,4,5995006,5,6095006,6,9895007,3,6895007,4,9195007,5,9495007,6,7895008,1,9895008,3,8995008,6,9195009,2,8195009,4,8995009,6,10095010,2,9895010,5,9095010,6,8095011,1,8195011,2,9195011,3,8195011,4,8695012,1,8195012,3,7895012,4,8595012,6,9895013,1,9895013,2,5895013,4,8895013,5,9395014,1,9195014,2,10095014,4,9895015,1,9195015,3,5995015,4,10095015,6,9595016,1,9295016,2,9995016,4,8295017,4,8295017,5,10095017,6,5895018,1,9595018,2,10095018,3,6795018,4,7895019,1,7795019,2,9095019,3,9195019,4,6795019,5,8795020,1,6695020,2,9995020,5,9395021,2,9395021,5,9195021,6,9995022,3,6995022,4,9395022,5,8295022,6,100

从本地导入数据
注意创建表结构是指定的分割符’,’与student.txt的行分割符’,’对应

hive>load data local inpath '/home/bingo/data/hive/students.txt' into table student;hive>load data local inpath '/home/bingo/data/hive/course.txt' overwrite into table course;hive>load data local inpath '/home/bingo/data/hive/sc.txt' overwrite into table sc;

基本操作：

查询全体学生的学号与姓名

select Sno,Sname from student;

查询选修了课程的学生姓名

select distinct Sname from student inner join sc on student.Sno=Sc.Sno;

—-hive的group by 和集合函数

查询学生的总人数

select count(distinct Sno) count from student;

计算1号课程的学生平均成绩

select avg(distinct Grade) from sc where Cno=1;

查询各科成绩平均分

select Cno,avg(Grade) from sc group by Cno;

查询选修1号课程的学生最高分数

select Grade from sc where Cno=1 sort by Grade desc limit 1;

比较:
select * from sc where Cno=1 sort by Grade
select Grade from sc where Cno=1 order by Grade

求各个课程号及相应的选课人数

select Cno,count(1) from sc group by Cno;

查询选修了3门以上的课程的学生学号

select Sno from (select Sno,count(Cno) CountCno from sc group by Sno)a where a.CountCno>3;或select Sno from sc group by Sno having count(Cno)>3;

—-hive的Order By/Sort By/Distribute By
　　Order By ，在strict 模式下（hive.mapred.mode=strict),order by 语句必须跟着limit语句，但是在nonstrict下就不是必须的，这样做的理由是必须有一个reduce对最终的结果进行排序，如果最后输出的行数过多，一个reduce需要花费很长的时间。只有一个reduce，如果数据量大，效率很低，甚至程序会崩掉

查询学生信息，结果按学号全局有序

set hive.mapred.mode=strict;   <默认nonstrict>select Sno from student order by Sno;FAILED: Error in semantic analysis: 1:33 In strict mode, if ORDER BY is specified, LIMIT must also be specified. Error encountered near token 'Sno'

　　Sort By，它通常发生在每一个redcue里，“order by” 和“sort by”的区别在于，前者能给保证输出都是有顺序的，而后者如果有多个reduce的时候只是保证了输出的部分有序。set mapred.reduce.tasks=在sort by可以指定，在用sort by的时候，如果没有指定列，它会随机的分配到不同的reduce里去。distribute by 按照指定的字段对数据进行划分到不同的输出reduce中
　　此方法会根据性别划分到不同的reduce中，然后按年龄排序并输出到不同的文件中。

查询学生信息，按性别分区，在分区内按年龄有序

set mapred.reduce.tasks=2;insert overwrite local directory '/home/bingo/out' select * from student distribute by Sex sort by Sage;

—-Join查询,join只支持等值连接
查询每个学生及其选修课程的情况

select student.*,sc.* from student join sc on (student.Sno =sc.Sno);

查询学生的得分情况。

select student.Sname,course.Cname,sc.Grade from student join sc on student.Sno=sc.Sno join course on sc.cno=course.cno;

查询选修2号课程且成绩在90分以上的所有学生。

select student.Sname,sc.Grade from student join sc on student.Sno=sc.Sno where  sc.Cno=2 and sc.Grade>90;

—-LEFT，RIGHT 和 FULL OUTER JOIN ,inner join, left semi join
查询所有学生的信息，如果在成绩表中有成绩，则输出成绩表中的课程号

select student.Sname,sc.Cno from student left outer join sc on student.Sno=sc.Sno;

　　如果student的sno值对应的sc在中没有值，则会输出student.Sname null.如果用right out join会保留右边的值，左边的为null。
　　Join 发生在WHERE 子句之前。如果你想限制 join 的输出，应该在 WHERE 子句中写过滤条件——或是在join 子句中写。
　　
—-LEFT SEMI JOIN Hive 当前没有实现 IN/EXISTS 子查询，可以用 LEFT SEMI JOIN 重写子查询语句

重写以下子查询为LEFT SEMI JOIN

SELECT a.key, a.value  FROM a  WHERE a.key exist in   (SELECT b.key    FROM B);#可以被重写为：   SELECT a.key, a.val   FROM a LEFT SEMI JOIN b on (a.key = b.key)

查询与“刘晨”在同一个系学习的学生

select s1.Sname from student s1 left semi join student s2 on s1.Sdept=s2.Sdept and s2.Sname='刘晨';

注意比较：

select * from student s1 left join student s2 on s1.Sdept=s2.Sdept and s2.Sname='刘晨';select * from student s1 right join student s2 on s1.Sdept=s2.Sdept and s2.Sname='刘晨';select * from student s1 inner join student s2 on s1.Sdept=s2.Sdept and s2.Sname='刘晨';select * from student s1 left semi join student s2 on s1.Sdept=s2.Sdept and s2.Sname='刘晨';select s1.Sname from student s1 right semi join student s2 on s1.Sdept=s2.Sdept and s2.Sname='刘晨';

外部表

drop表时hdfs上的文件不会被删除

hive>create external table student_ext(Sno int,Sname string,Sex string,Sage int,Sdept string)>row format delimited fields terminated by ','>location '/input/hive/student/';

location是hdfs上的路径
导入数据：
与内部表的导入一样

分区表

可以是内部表，也可以是外部表

hive> create external table student_p(Sno int,Sname string,Sex string,Sage int,Sdept string)>partitioned by(part string) >row format delimited fields terminated by ','>location '/input/hive/partition';#创建分区：alter table student_p add if not exists partition (part ='usa');#删除分区：外部表时hdfs上的文件还是存在的alter table student_p drop if exists partition (part ='japan');#导入数据时决定分区hive> load data local inpath '/home/bingo/data/hive/students.txt' overwrite into table student_p    > partition (part='china');hive> load data local inpath '/home/bingo/data/hive/students.txt' overwrite into table student_p    > partition (part='japan');#查询全部select * from student_p;#查询分区select * from student_p where part='china';

详情看官网：
https://cwiki.apache.org/confluence/display/Hive/LanguageManual

阅读全文

0 0