Hive Operations

Source: Internet · Editor: 程序博客网 · Time: 2024/05/18 23:52

Hive complex data types: array, map, struct

   create table tblName (
       id int comment "id",
       name string comment 'name',
       hobby array<string>,
       score map<string, double>,
       address struct<province:string, city:string, zip:int>
   ) row format delimited
   fields terminated by "\t"
   collection items terminated by ","
   map keys terminated by ":"
   lines terminated by "\n";
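
To make the delimiters concrete, here is a minimal sketch of one matching input line and a query that reads each complex type. The sample values are made up for illustration; only the table and column names come from the DDL above.

```sql
-- One hypothetical input line for tblName (<TAB> stands for a tab character).
-- Array elements, map entries and struct fields are all separated by ",",
-- and each map key is separated from its value by ":".
-- 1<TAB>Tom<TAB>reading,swimming<TAB>math:90.5,english:82<TAB>Zhejiang,Hangzhou,310000

select
    id,
    hobby[0]      as first_hobby,  -- array elements are indexed from 0
    size(hobby)   as hobby_count,  -- number of elements in the array
    score['math'] as math_score,   -- map lookup by key, NULL if the key is absent
    address.city  as city          -- struct fields are accessed with dot notation
from tblName;
```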

--------------------------------------------------------------------------------------------------------------------------------------------------------
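Table types in Hive:
   managed_table: a managed table, also called an internal table
       The data's lifecycle is tied to the table definition: when the table is dropped, its data is deleted along with it.

   external_table: an external table
       The data's lifecycle is independent of the table definition: when the table is dropped, the underlying data remains.
       The table is effectively just a reference to the data.

       Create an external table: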
           create external table t6_external(
               id int
           );

       Add data:
           alter table t6_external set location "/input/hive/hive-t6.txt";

       You can also specify the data location when creating the external table:
           create external table t6_external(
               id int
           ) location "/input/hive/hive-t6.txt";

       The statement above fails with:
           MetaException(message:hdfs://ns1/input/hive/hive-t6.txt is not a directory or unable to create one)
           That is, the location given at table creation must be a directory, not an individual file:
           create external table t6_external_1(
               id int
           ) location "/input/hive/";

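       When to use managed vs. external tables, in brief:
           Use an external table when data safety matters or when the data is shared across several departments.
           Use an external table when integrating Hive with other frameworks (such as HBase).

       Managed and external tables can be converted into each other:
       external ---> managed: alter table t6_external set tblproperties("EXTERNAL"="FALSE");
       managed ---> external: alter table t2 set tblproperties("EXTERNAL"="TRUE");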
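
A short way to check which kind of table you have, and to see the difference on drop, sketched with the tables defined above. The dfs command is assumed to be run from the Hive CLI.

```sql
-- Inspect the table; the "Table Type:" row of the output
-- reads MANAGED_TABLE or EXTERNAL_TABLE.
describe formatted t6_external;

-- Dropping an external table removes only its metastore entry...
drop table t6_external_1;

-- ...so the files under its location can still be listed afterwards.
dfs -ls /input/hive/;
```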
--------------------------------------------------------------------------------------------------------------------------------------------------------
Special-purpose tables:
1. Partitioned tables
Create a partitioned table:
           create table t7_partition (
               id int
           ) partitioned by (dt date comment "date partition field");
   Load data: load data local inpath '/opt/data/hive/hive-t6.txt' into table t7_partition;

           FAILED: SemanticException [Error 10062]: Need to specify partition columns because the destination table is partitioned
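           You cannot load data into a partitioned table directly; you must state which partition (that is, which subdirectory) the data goes into.
           DDL for partitioned tables:
               Add a partition:
                   alter table t7_partition add partition(dt="2017-03-10");
               List partitions:
                   show partitions t7_partition;
               Drop a partition:
                   alter table t7_partition drop partition(dt="2017-03-10");
           Add data:
               Load data into a specific partition:
                   load data local inpath '/opt/data/hive/hive-t6.txt' into table t7_partition partition (dt="2017-03-10");
                   This form creates the partition automatically if it does not already exist.
       With multiple partition columns:
           Use case: for a school, track enrollment and employment per year and per subject, or employment per year.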
           create table t7_partition_1 (
               id int
           ) partitioned by (year int, school string);

    Load data: load data local inpath '/opt/data/hive/hive-t6.txt' into table t7_partition_1 partition(year=2015, school='python');
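
As a sketch of how the two partition columns show up afterwards: each (year, school) pair gets its own nested subdirectory, and the partition columns can be selected and filtered like ordinary columns.

```sql
-- Lists entries of the form year=2015/school=python
show partitions t7_partition_1;

-- Filtering on partition columns lets Hive read only the matching
-- subdirectories instead of scanning the whole table.
select id, year, school
from t7_partition_1
where year = 2015 and school = 'python';
```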
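2. Bucketed tables
       Partitioning can still leave some partitions very large and others very small, which makes queries uneven; that is not what we want.
       A technique is needed to spread the rows more evenly. It is called bucketing, and a table that uses it is a bucketed table.
       Create a bucketed table: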
           create table t8_bucket(
               id int
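           ) clustered by(id) into 3 buckets;
       Adding data to a bucketed table:
           Data must come from another table via insert ... select; the load command used above does not split the data into buckets.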
           insert into t8_bucket select * from t7_partition_1 where year=2016 and school="mysql";
           FAILED: SemanticException [Error 10044]: Line 1:12 Cannot insert into target table because column number/types are different 't8_bucket':
           Table insclause-0 has 1 columns, but query has 3 columns.
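           The bucketed table has only one column, but select * on the partitioned table returns three (id plus the two partition columns),
           so when using insert into, the number of selected columns must match the target table.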
           insert into t8_bucket select id from t7_partition_1 where year=2016 and school="mysql";

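           Local mode does not take effect when operating on bucketed tables.

           Bucketing rule: a row's bucket is chosen by hashing the bucketing column (hash value modulo the number of buckets).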
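
The hashing rule can be explored with a rough sketch like the one below. It assumes the classic hash-based scheme and non-negative ids; newer Hive versions can use a different hash function for bucketing, so treat the computed index as an approximation rather than a guarantee of file placement.

```sql
-- Before Hive 2.0, bucketing has to be enforced explicitly on insert
-- (from 2.0 on it is always enforced and this property was removed).
set hive.enforce.bucketing = true;

-- Approximate bucket index for each row of the 3-bucket table.
select id, pmod(hash(id), 3) as bucket_index from t8_bucket;

-- Sample only the first of the three buckets.
select * from t8_bucket tablesample(bucket 1 out of 3 on id);
```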
----------------------------------------------------------------------------------------------------------------------------------------------------------
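SQL functions:
   A function encapsulates one specific piece of functionality.
   count, sum, max, avg, min

   count: counts rows
   case when: like if/else or switch in Java; tests conditions and returns the matching output
   split(str, regex): splits str around matches of regex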
          hive > SELECT split('oneAtwoBthreeC', '[ABC]') FROM src LIMIT 1;
          ["one", "two", "three"]

   explode：
       hive> desc function extended explode;
       OK
       explode(a) - separates the elements of array a into multiple rows

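   collect_set: aggregates a group's values into an array with duplicates removed
   array: builds an array from the given elements
   row_number(): a window function, used mainly for secondary sorting (numbering rows within each group)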
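
A brief sketch of array and collect_set in use, based on the t7_partition_1 table from the partitioning section. It assumes a Hive version that allows select without a from clause.

```sql
-- array builds an array value from its arguments.
select array('a', 'b', 'c');

-- collect_set gathers the distinct ids seen in each school partition
-- into one array per group.
select school, collect_set(id) as ids
from t7_partition_1
group by school;
```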
----------------------------------------------------------------------------------------------------------------------------------------------------------
case when example: show the department for each id

       select
       id, case id
          when 1 then "design"
          when 2 then "develop"
          when 3 then "market"
          when 4 then "sale"
          else "other"
          end
       from t8_bucket;

   wordcount example:
       Use HQL to count how many times each word occurs.
       create table test(line string);
       load data local inpath '/opt/data/hive/hive-f-2.txt' into table test;
       select * from test;
           hello you
           hello me
           hello he
       Expected result:
           hello 3
           you 1
           me 1
           he 1
       Step 1: use split to break the line column of test into an array of strings.
           select split(line, " ") from test;
       ====>
           ["hello","you"]
           ["hello","me"]
           ["hello","he"]
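       Step 2: counting each word requires count, but count works on rows of a single column,
               so each array has to be expanded into multiple rows, one array element per row.
               Hive's table-generating function explode does this.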
           select explode(split(line, " ")) from test;
       ====>
           hello
           you
           hello
           me
           hello
           he
       Step 3: with one word per row, count and group by complete the task.
           select
               t.word, count(t.word) as count
           from (
               select explode(split(line, " ")) word from test
           ) t
           group by t.word;
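
The same count can also be written without a subquery by using lateral view, which joins each input row to the rows produced by explode. This is an alternative formulation, not the form used above; the aliases w, word and cnt are arbitrary.

```sql
select w.word, count(*) as cnt
from test
lateral view explode(split(line, ' ')) w as word  -- one output row per word
group by w.word
order by cnt desc;                                -- most frequent word first
```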

row_number() secondary-sort example:

       -- Department table
       create table if not exists t_dept(
           id int comment "部门ID",
           name string comment "部门名称"
       ) comment "员工部门表"
       row format delimited
       fields terminated by '\t';

       -- Employee table
       create table if not exists t_employee(
           id int comment "员工ID",
           deptid int comment "部门",
           name string comment "员工名称",
           age int comment "员工年龄",
           sex int comment "员工性别,1表示男，0表示女",
           phone bigint comment "联系电话",
           hometown string comment "籍贯"
       ) comment "员工信息表"
       row format delimited
       fields terminated by '\t';

       -- Salary table
       create table if not exists t_salary(
           id int comment "信息ID",
           empid int comment "员工ID",
           salary float comment "员工薪资"
       ) comment "员工薪资表"
       row format delimited
       fields terminated by '\t';

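       Requirement: rank employees by salary within each department, from highest to lowest.
       Step 1: join the tables so that all of an employee's information appears in one result set.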
       select
           e.id emp_id, e.name as emp_name, e.sex as emp_sex, d.name as dept_name,
           s.salary as emp_salary
       from t_dept d left join t_employee e on d.id = e.deptid
       left join t_salary s on e.id = s.empid;

       -- Show sex as 男 (male) for 1 and 女 (female) for 0, and filter out NULL salaries
       select
           e.id emp_id, e.name as emp_name, if(e.sex == 1, "男", "女") as emp_sex, d.name as dept_name,
           s.salary as emp_salary
       from t_dept d left join t_employee e on d.id = e.deptid
       left join t_salary s on e.id = s.empid
       where s.salary is not null;

       -- Secondary sort: number the rows within each department by salary, highest first
       select
           e.id emp_id, e.name as emp_name, if(e.sex == 1, "男", "女") as emp_sex, d.name as dept_name,
           s.salary as emp_salary,
           row_number() over(partition by d.id order by s.salary desc) as rank
       from t_dept d left join t_employee e on d.id = e.deptid
       left join t_salary s on e.id = s.empid
       where s.salary is not null;

       Building on this, get the top 2 salaries in each department:
           select
               tmp.*
           from (
               select
                   e.id emp_id, e.name as emp_name, if(e.sex == 1, "男", "女") as emp_sex, d.name as dept_name,
                   s.salary as emp_salary,
                   row_number() over(partition by d.id order by s.salary desc) as rank
               from t_dept d left join t_employee e on d.id = e.deptid
               left join t_salary s on e.id = s.empid
               where s.salary is not null
           ) tmp
           where tmp.rank < 3;
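
When two employees have the same salary, the choice of window function decides how a top-N cut behaves. The sketch below compares the three common numbering functions on the t_salary table; the aliases rn, rk and drk are arbitrary.

```sql
select
    empid, salary,
    row_number() over(order by salary desc) as rn,   -- always unique: ties get different numbers
    rank()       over(order by salary desc) as rk,   -- ties share a number, followed by a gap
    dense_rank() over(order by salary desc) as drk   -- ties share a number, no gap
from t_salary;
```

With row_number() a top 2 always returns at most two rows per group, while rank() or dense_rank() may return more when salaries tie.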
----------------------------------------------------------------------------------------------------------------------------------
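Creating user-defined functions
   Hive's built-in functions are powerful, but business requirements vary endlessly, so custom functions are sometimes needed.
       Steps:
           1. Write a UDF class that extends org.apache.hadoop.hive.ql.exec.UDF
           2. Implement the evaluate method; evaluate can be overloaded
           3. Package the program and copy it to the target machine
           4. In the Hive client, add the jar: hive> add jar <path to jar>;
           5. Create a temporary function: hive> create temporary function <function name> as '<fully qualified UDF class name>';
           6. Run your HQL statements
           7. Drop the temporary function: hive> drop temporary function <function name>;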
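
Steps 4 to 7 might look like the session below. The jar path, the function name to_upper and the class com.example.hive.udf.ToUpper are hypothetical placeholders, not names from the original notes.

```sql
-- Step 4: make the packaged UDF available to this session (hypothetical path).
add jar /opt/jars/my-udf.jar;

-- Step 5: bind a function name to the UDF class (hypothetical names).
create temporary function to_upper as 'com.example.hive.udf.ToUpper';

-- Step 6: use it like any built-in function.
select id, to_upper(name) from t_employee;

-- Step 7: remove it; temporary functions also disappear when the session ends.
drop temporary function if exists to_upper;
```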


Fixing garbled Chinese in Hive comments
Make the changes in Hive's metastore database:
   For table and column comments:
       alter table COLUMNS_V2 modify column COMMENT varchar(256) character set utf8;
       alter table TABLE_PARAMS modify column PARAM_VALUE varchar(4000) character set utf8;
   For partition comments:
       alter table PARTITION_PARAMS modify column PARAM_VALUE varchar(4000) character set utf8 ;
       alter table PARTITION_KEYS modify column PKEY_COMMENT varchar(4000) character set utf8;
   For index comments:
       alter table INDEX_PARAMS modify column PARAM_VALUE varchar(4000) character set utf8;
   Afterwards, update the JDBC connection URL to set the character encoding to UTF-8 (inside the XML file, & must be written as &amp;)
       hive-site.xml
       <property>
           <name>javax.jdo.option.ConnectionURL</name>
           <value>jdbc:mysql://IP:3306/db_name?createDatabaseIfNotExist=true&amp;useUnicode=true&amp;characterEncoding=UTF-8</value>
           <description>JDBC connect string for a JDBC metastore</description>
       </property>
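
To confirm the metastore columns were converted, a check along these lines can be run in the MySQL metastore database (the db_name from the URL above). It assumes a MySQL metastore, as the JDBC URL suggests.

```sql
-- The Collation column of the output should show a utf8 collation
-- for each converted column.
show full columns from COLUMNS_V2 like 'COMMENT';
show full columns from TABLE_PARAMS like 'PARAM_VALUE';
show full columns from PARTITION_KEYS like 'PKEY_COMMENT';
```

Comments stored before the change will most likely stay garbled, so affected tables may need their comments set again.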

0 0