Summary of Common Hive Operations

Source: Internet  Editor: 程序博客网  Time: 2024/05/02 23:33
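Hive directory layout
  ● bin:
        Contains the executables for the various Hive services, for example the CLI (command-line interface).
  ● .hiverc:
        A file in the user's home directory; create it if it does not exist.
        The commands in it are run automatically every time the CLI starts.
  ● metastore (metadata store):
        Metadata is the only thing Hive needs that Hadoop does not provide. The metastore
        holds table schemas, partition information, and similar metadata, which the user
        specifies when running create table x... or alter table y...
        In this setup Hive stores the metadata in MySQL (out of the box it uses an embedded Derby database).
  ● .hivehistory:
        Stores the history of executed commands.

1. Data types in Hive

Primitive types:
tinyint:    1 byte
smallint:   2 bytes
int:        4 bytes
bigint:     8 bytes
boolean:    true or false
float:      single-precision floating point
double:     double-precision floating point
string:     sequence of characters
timestamp:  integer, floating-point number, or string
binary:     array of bytes

Collection types:
struct:
    struct('john', 'doe')
map:
    map('first','join','last','doe')
array:
    array('john', 'doe')

------------------------Example----------------------
create table employee(
   name         string,
   salary       float,
   -- array type
   subordinates array<string>,
   -- map type
   deductions   map<string, float>,
   -- struct type
   address      struct<street:string, city:string, state:string, zip:int>
)
-- "row format delimited" must come before the delimiter clauses below
row format delimited
-- columns are separated by '\t'
fields terminated by '\t'
-- elements inside a collection are separated by ','
collection items terminated by ',';
// Note: the difference between struct and array:
//  a struct can hold fields of several different types,
//  while all elements of an array share one type (string in this example)

2. Databases in Hive: concept and operations
  ● What a database is in Hive:
        Just a directory, or namespace, that holds tables.
  ● Hive creates a directory for each database:
        The tables of a database are stored as subdirectories of that database's directory.
        The default database is the exception: it has no directory of its own, and its tables live at
        /user/hive/warehouse/table_name/<table files>    (on HDFS)
  ● Hive does not support row-level insert, update, or delete, and it does not support transactions.
        This holds for the older Hive releases described here; row-level operations on transactional tables arrived in Hive 0.14.

----------------------Database operations----------------------------
hive> show tables;                  // list the tables in the current database
hive> show tables in mydb;          // list the tables in the given database
hive> show databases like 'h.*';    // list the databases whose names start with h
hive> create table person2 like person;     // create a table with the same schema as person (the data is not copied)
hive> create database mydb location '/home/wenpu_di/';   // location: where the database directory is stored
hive> create database mydb comment 'my database' location '/home/diwenpu/';  // comment: descriptive text for the database; it must come before location
hive> drop database mydb cascade;   // drop a non-empty database (one that still contains tables)

3. Basic table operations
hive> create table table1(i int, name string);   // create a table
hive> desc table1;          // show the table structure
hive> drop table table1;    // drop the table

// Altering a table changes only the metadata, never the data itself
hive> alter table mytable rename to MYTABLE;     // rename
hive> alter table mytable add columns(name string, age int);   // add new columns
hive> alter table log_message add partition(year=2011, month=1, day=1) location '/log/data1';
// This command works only on partitioned tables

  ● External tables: Hive does not fully own the table. Dropping it does not delete the underlying data; only the metadata describing the table is removed.

// Create an external table:
create external table stocks(
    age int,
    name string)
row format delimited fields terminated by ','
location '/data/stocks';    // the semicolon goes at the very end
// Notes:
// external: marks this as an (unpartitioned) external table; the location clause
// tells Hive which path the data lives under

hive> desc extended tablename;   // shows whether the table is managed or external
// tableType:MANAGED_TABLE: managed table
// tableType:EXTERNAL_TABLE: external table

  ● Partitioned tables:
    Partitioning changes how Hive organizes the stored data.

// Create a partitioned table
create table employees(
    name string,
    salary float,
    subordinates array<string>,
    deductions map<string, float>,
    address     struct<street:string, city:string, state:string, zip:int>)
partitioned by (country string, state string);
// Partition columns: country and state. When querying, users do not need to care whether a column is a partition column

hive> describe extended employees;    // show the partition keys
hive> show partitions employees;      // list all partitions of the table
hive> show partitions employees partition(country='US');   // list the partitions under country='US'
country=US/state=AL
country=US/state=AK
...
hive> show partitions employees partition(country='US', state='AL');   // show one specific partition
country=US/state=AL

// Add a partition: alter table table_name add partition(...)
hive> alter table log_message add partition(year=2012, month=1, day=2)
location 'hdfs://master:9000/S1/data';

// Point a partition at a different path (changes where the partition is stored)
hive> alter table log_message partition(year=2011, month=12, day=2)
set location 'hdfs://master:9000/S2/data';

// Drop a partition
hive> alter table log_message drop partition(year=2011, month=12, day=2);

// Copy the data at the given location into the given partition
hive> load data local inpath '/home/wenpu_di/sougou.txt' into table employees partition(country='US', state='AL');
// Note: the data under the given path has to match the partition you specify.
// Hive loads the file as-is and does not check that every row really belongs to that partition.

4. Loading data
  ● Loading from a file path with load
hive> load data local inpath '/home/wenpu_di/data.txt' overwrite into table employees partition(country='US', state='CA');
Notes:
// inpath: the given path must not contain subdirectories
// If the target partition does not exist, it is created first and the data is then copied into its directory
// If the target table is not partitioned, omit the partition clause
// The local keyword:
// with local, the data is copied from the local filesystem to the target location on HDFS;
// without local, the data is moved to the target location,
// i.e. "hdfs://master:9000/S/data" is moved to the table's location

  ● Creating a table and loading data in a single query
create table ca_employee as select name, salary, address from employees where state='CA';
// Extracts the needed subset from one large, wide table. This feature
// cannot be used to create an external table.

  ● Using the result of a query on one table as the input of another
insert overwrite table sougou select uid,url from sougou500;

5. Joins
1) inner join
    select a.no,b.no from a join b on a.no = b.no;
An inner join keeps only the rows whose join keys match in both tables.

2) left outer join
    select a.no,b.no from a left outer join b on a.no = b.no;
The left table is the base table: its rows that do not satisfy the condition are kept as well, with null in the right table's columns.
In other words, every row of the left table appears, and the right side is filled with null where the on condition is not met.

3) right outer join
    select a.no,b.no from a right outer join b on a.no = b.no;
The right table is the base table: its rows that do not satisfy the condition are kept as well, with null in the left table's columns.
In other words, every row of the right table appears, and the left side is filled with null where the on condition is not met.

4) full outer join
    select a.no,b.no from a full outer join b on a.no = b.no;
The rows of both tables all appear: unmatched rows of the left table get null on the right, and unmatched rows of the right table get null on the left.

5) left semi join
    select a.no from a left semi join b on a.no = b.no;
Returns the rows of a that also occur in b. With a as the base table, a row of a is returned as soon as it has a match in b, and only once even if b contains several matches. Columns of b may appear only in the on clause, not in the select list.

6) map join
The join is performed in the mapper's memory; the hint names the table to hold in memory.
    select /*+MAPJOIN(a)*/ a.no,b.no from a join b on a.no = b.no;

6. Other special operations
# hive -e "desc database test";      // -e: run this statement, then exit the CLI automatically
# hive -S -e "desc database test" >>  /home/diwenpu/temp.txt   // -S: silent mode, drops the OK line and the timing; -e: exit the CLI right away
# hive -f /home/wenpu_di/hive.sql    // run a file of SQL commands; -e and -f cannot be used together
# cat hive.sql
use database1;
select * from sougou limit 5...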
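The join types in section 5 are easier to compare side by side on a tiny data set. The following is a minimal Python sketch, not Hive code: it emulates only the row semantics of each join on two hypothetical single-column tables a and b (the column stands in for no), using None where Hive would return null. The function names are made up for this illustration, the keys are assumed to be non-null, and the row order shown is just the order this sketch produces; Hive itself guarantees no order without order by.

```python
# Emulates the row semantics of the join types from section 5.
# Illustration only: a and b are plain lists of join-key values ("no"),
# and None plays the role of null in the output rows.

def inner_join(a, b):
    # keep only the pairs whose keys match on both sides
    return [(x, y) for x in a for y in b if x == y]

def left_outer_join(a, b):
    # every row of a appears; unmatched rows get None on the right
    rows = []
    for x in a:
        matches = [y for y in b if y == x]
        if matches:
            rows.extend((x, y) for y in matches)
        else:
            rows.append((x, None))
    return rows

def right_outer_join(a, b):
    # mirror image of the left outer join, with the columns swapped back
    return [(x, y) for (y, x) in left_outer_join(b, a)]

def full_outer_join(a, b):
    # left outer join plus the rows of b that have no match in a
    return left_outer_join(a, b) + [(None, y) for y in b if y not in a]

def left_semi_join(a, b):
    # rows of a that have at least one match in b, each returned once
    keys = set(b)
    return [x for x in a if x in keys]

a = [1, 2, 3]
b = [2, 3, 3, 4]

print(inner_join(a, b))        # [(2, 2), (3, 3), (3, 3)]
print(left_outer_join(a, b))   # [(1, None), (2, 2), (3, 3), (3, 3)]
print(right_outer_join(a, b))  # [(2, 2), (3, 3), (3, 3), (None, 4)]
print(full_outer_join(a, b))   # [(1, None), (2, 2), (3, 3), (3, 3), (None, 4)]
print(left_semi_join(a, b))    # [2, 3]
```

Note how the duplicate key 3 in b doubles the matching row in the inner and outer joins but not in the left semi join, which is the "returned only once" behavior described above.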
0 0