Hive Table Operations

Source: Internet  Editor: 程序博客网  Date: 2024/05/20 23:31
  1. Hive Architecture
    The Hive metadata store (also called the metastore) can use an embedded, local, or remote database. Hive servers are built on Apache Thrift Server technology.

  2. Hive uses Derby as its default metastore database, which supports only one client connection at a time. The following properties switch the metastore to MySQL:

    javax.jdo.option.ConnectionURL        // the database URL
    javax.jdo.option.ConnectionDriverName // the JDBC driver name
    javax.jdo.option.ConnectionUserName   // database username
    javax.jdo.option.ConnectionPassword   // database password

    Make sure the MySQL JDBC driver is available at $HIVE_HOME/lib.
    An external database offers a shared service so that users can share Hive metadata. An embedded database, however, is visible only to local users.
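    As a sketch, the four properties above are set in hive-site.xml; the MySQL host, database name, and credentials below are placeholders, not values from this article:

    ```xml
    <!-- hive-site.xml: point the metastore at MySQL (host/db/credentials are placeholders) -->
    <property>
      <name>javax.jdo.option.ConnectionURL</name>
      <value>jdbc:mysql://mysql_host:3306/hive_metastore?createDatabaseIfNotExist=true</value>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionDriverName</name>
      <value>com.mysql.jdbc.Driver</value>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionUserName</name>
      <value>hive</value>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionPassword</name>
      <value>hive_password</value>
    </property>
    ```

    After restarting the Hive service, multiple clients can connect concurrently because the metadata now lives in the shared MySQL database.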

  3. Hive Data Types

    ① Primitive data types: TINYINT, SMALLINT, INT, BIGINT, FLOAT, DOUBLE, DECIMAL (e.g. DECIMAL(3,2) for 3.14), BINARY, BOOLEAN, STRING, CHAR, VARCHAR, DATE ('2013-01-01'),
    TIMESTAMP ('2013-01-01 12:00:01.345')

    ② Complex data types:

    Complex data type   Example                          Description
    ARRAY               ['apple','orange','mango']
    MAP                 {1: "apple", 2: "orange"}
    STRUCT              {1, "apple"}                     Such as {val1, val2, val3, and so on}. By default, STRUCT field names are col1, col2, and so on; reference them as struct_name.column_name, e.g. fruit.col1 = 1.
    NAMED STRUCT        {"apple":"gala","weightkg":1}    Such as (name1, val1, name2, val2, and so on); reference fields as struct_name.column_name, e.g. fruit.apple = "gala".
    UNION               {2:["apple","orange"]}           A structure that holds exactly one of the specified data types.
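    These types can also be built inline with Hive's constructor functions, which is handy for quick experiments; a minimal sketch (the literal values are illustrative only):

    ```sql
    -- Constructor functions for complex types; no table is needed for this query
    SELECT
      array('apple','orange','mango')           AS fruit_array,
      map(1,'apple',2,'orange')                 AS fruit_map,
      struct(1,'apple')                         AS fruit_struct,        -- fields default to col1, col2
      named_struct('apple','gala','weightkg',1) AS fruit_named_struct;
    ```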
  4. Example:

    ① Prepare the data:
    vi employee.txt
    Michael|Montreal,Toronto|Male,30|DB:80|Product:Developer^DLead
    Will|Montreal|Male,35|Perl:85|Product:Lead,Test:Lead
    Shelley|New York|Female,27|Python:80|Test:Lead,COE:Architect
    Lucy|Vancouver|Female,57|Sales:89,HR:94|Sales:Lead

    ② Connect to Hive via Beeline:

    beeline -u 'jdbc:hive2://192.168.6.186:10000/silence1101;principal=hadoop/namenode@SILENCE.COM'

    ③ Create a table using the ARRAY, MAP, and STRUCT composite data types:

    CREATE TABLE employee (
      name string,
      work_place ARRAY<string>,
      sex_age STRUCT<sex:string,age:int>,
      skills_score MAP<string,int>,
      depart_title MAP<string,ARRAY<string>>
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '|'
    COLLECTION ITEMS TERMINATED BY ','
    MAP KEYS TERMINATED BY ':';

    ④ Verify the table's creation:

    !table employee
    !column employee

    ⑤ Load the data:

      LOAD DATA LOCAL INPATH '/home/hadoop/hive/employee.txt' OVERWRITE INTO TABLE employee;

    ⑥ Query the whole array and each array column in the table:

      SELECT work_place FROM employee;
      SELECT work_place[0] AS col_1, work_place[1] AS col_2, work_place[2] AS col_3 FROM employee;

    ⑦ Query the whole struct and each struct column in the table:

      SELECT sex_age FROM employee;
      SELECT sex_age.sex, sex_age.age FROM employee;

    ⑧ Query the whole map and each map column in the table:

      SELECT skills_score FROM employee;
      SELECT name,
        skills_score['DB'] AS DB,
        skills_score['Perl'] AS Perl,
        skills_score['Python'] AS Python,
        skills_score['Sales'] AS Sales,
        skills_score['HR'] AS HR
      FROM employee;

    ⑨ Query the composite type in the table:

      SELECT depart_title FROM employee;
      SELECT name,
        depart_title['Product'] AS Product,
        depart_title['Test'] AS Test,
        depart_title['COE'] AS COE,
        depart_title['Sales'] AS Sales
      FROM employee;
      SELECT name,
        depart_title['Product'][0] AS product_col0,
        depart_title['Test'][0] AS test_col0
      FROM employee;
  5. Hive Data Definition Language (DDL)
    Most Hive DDL statements start with the keywords CREATE, DROP, or ALTER. Comments in Hive start with --.
    ① Hive database
    Whenever a new database is created, Hive creates a directory for it under /user/hive/warehouse,
    as defined by hive.metastore.warehouse.dir.

    CREATE DATABASE IF NOT EXISTS myhivebook;
    CREATE DATABASE IF NOT EXISTS myhivebook
    COMMENT 'hive database demo'
    LOCATION '/hdfs/directory'
    WITH DBPROPERTIES ('creator'='dayongd','date'='2015-01-01');
    SHOW DATABASES LIKE 'my.*';
    DESCRIBE DATABASE default;
    USE myhivebook;
    DROP DATABASE IF EXISTS myhivebook;

    【Note that in order to remove the parent directory, we must remove the subdirectories first. CASCADE, however, drops the tables in the database automatically before dropping the database itself.】

    DROP DATABASE IF EXISTS myhivebook CASCADE;
    ALTER DATABASE myhivebook SET DBPROPERTIES ('edited-by' = 'Dayong');
    ALTER DATABASE myhivebook SET OWNER USER dayongd;

    ② Hive internal and external tables

    CREATE EXTERNAL TABLE employee_external (
      name string,
      work_place ARRAY<string>,
      sex_age STRUCT<sex:string,age:int>,
      skills_score MAP<string,int>,
      depart_title MAP<string,ARRAY<string>>
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '|'
    COLLECTION ITEMS TERMINATED BY ','
    MAP KEYS TERMINATED BY ':';

    【Note: the STORED AS clause defaults to TEXTFILE. Other file formats, such as SEQUENCEFILE, RCFILE, ORC, AVRO, and PARQUET, can also be specified.】
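    For example, assuming we want a columnar copy of the employee data (the table name and column subset here are illustrative), the format is declared at creation time with STORED AS:

    ```sql
    -- Hypothetical table; STORED AS ORC selects the ORC columnar file format
    CREATE TABLE employee_orc (
      name string,
      work_place ARRAY<string>
    )
    STORED AS ORC;
    ```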

    CREATE TABLE ctas_employee AS SELECT * FROM employee_external;  -- CTAS
    SELECT COUNT(*) AS row_cnt FROM empty_ctas_employee;
    DROP TABLE IF EXISTS empty_ctas_employee;
    TRUNCATE TABLE cte_employee;
    !table                                   -- list metadata for all tables
    ALTER TABLE cte_employee RENAME TO c_employee;
    ALTER TABLE c_employee SET TBLPROPERTIES ('comment'='New name, comments');
    ALTER TABLE employee_internal SET SERDEPROPERTIES ('field.delim' = '$');
    ALTER TABLE c_employee SET FILEFORMAT RCFILE;
    ALTER TABLE c_employee SET LOCATION 'hdfs://namenode:9000/user/dayongd/employee';
    ALTER TABLE c_employee ENABLE NO_DROP;
    ALTER TABLE c_employee DISABLE NO_DROP;
    ALTER TABLE c_employee ENABLE OFFLINE;
    ALTER TABLE c_employee DISABLE OFFLINE;
    ALTER TABLE c_employee SET FILEFORMAT ORC;
    ALTER TABLE c_employee CONCATENATE;
    ALTER TABLE c_employee SET FILEFORMAT TEXTFILE;

    【Note: as of Hive release 0.14.0, ORC files support fast stripe-level merging of small ORC files using the CONCATENATE command.】

    DESC employee_internal;
    ALTER TABLE employee_internal CHANGE name employee_name string AFTER sex_age;
    ALTER TABLE employee_internal CHANGE employee_name name string FIRST;
    ALTER TABLE c_employee ADD COLUMNS (work string);
    ALTER TABLE c_employee REPLACE COLUMNS (name string);

    【Note:The ALTER command will only modify Hive’s metadata, NOT the data. Users should make sure the actual data conforms with the metadata definition manually.】

  6. Partitioned Tables
    –Create partitions when creating tables

    CREATE TABLE employee_partitioned (
      name string,
      work_place ARRAY<string>,
      sex_age STRUCT<sex:string,age:int>,
      skills_score MAP<string,int>,
      depart_title MAP<STRING,ARRAY<STRING>>
    )
    PARTITIONED BY (Year INT, Month INT)
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '|'
    COLLECTION ITEMS TERMINATED BY ','
    MAP KEYS TERMINATED BY ':';
    SHOW PARTITIONS employee_partitioned;

    【We have to use ALTER TABLE ADD PARTITION to add partitions to a table.】
    –Add multiple partitions

    ALTER TABLE employee_partitioned ADD PARTITION (year=2014, month=11) PARTITION (year=2014, month=12);

    –Drop the partition

    ALTER TABLE employee_partitioned DROP IF EXISTS PARTITION (year=2014, month=11);

    【To drop the partition including both data and metadata】
    –To load or overwrite data in a partition, we can use the LOAD or INSERT OVERWRITE statements.

    LOAD DATA LOCAL INPATH '/home/hadoop/hive/employee.txt' OVERWRITE INTO TABLE employee_partitioned PARTITION (year=2014, month=12);
    SELECT name, year, month FROM employee_partitioned;
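    Partition columns behave like regular columns in queries, and filtering on them lets Hive read only the matching partition directories (partition pruning). A sketch against the table above:

    ```sql
    -- Only files under year=2014/month=12 are scanned; other partitions are pruned
    SELECT name, year, month
    FROM employee_partitioned
    WHERE year = 2014 AND month = 12;
    ```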
  7. Bucketed Tables
    By using buckets, Hive can easily and efficiently perform sampling and map-side joins.
    –Prepare another dataset and table for bucket table

    CREATE TABLE employee_id (
      name string,
      employee_id int,
      work_place ARRAY<string>,
      sex_age STRUCT<sex:string,age:int>,
      skills_score MAP<string,int>,
      depart_title MAP<string,ARRAY<string>>
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '|'
    COLLECTION ITEMS TERMINATED BY ','
    MAP KEYS TERMINATED BY ':';
    LOAD DATA LOCAL INPATH '/home/hadoop/hive/employee_id.txt' OVERWRITE INTO TABLE employee_id;

    –Create the buckets table

    CREATE TABLE employee_id_buckets (
      name string,
      employee_id int,
      work_place ARRAY<string>,
      sex_age STRUCT<sex:string,age:int>,
      skills_score MAP<string,int>,
      depart_title MAP<string,ARRAY<string>>
    )
    CLUSTERED BY (employee_id) INTO 2 BUCKETS
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '|'
    COLLECTION ITEMS TERMINATED BY ','
    MAP KEYS TERMINATED BY ':';

    【Bucket numbers: to define the proper number of buckets, we should avoid having too much or too little data in each bucket. For example, we can plan for 512 MB of data in each bucket if the Hadoop block size is 256 MB. If possible, use 2^N as the number of buckets.】

    set mapred.reduce.tasks = 2;
    set hive.enforce.bucketing = true;
    INSERT OVERWRITE TABLE employee_id_buckets SELECT * FROM employee_id;

    –Verify the buckets in the HDFS

    hdfs dfs -ls /user/hive/warehouse/employee_id_buckets
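    One payoff of bucketing is sampling: TABLESAMPLE can read a single bucket instead of scanning the whole table. A sketch against employee_id_buckets:

    ```sql
    -- Read bucket 1 of 2; with 2 buckets this touches roughly half the data
    SELECT name, employee_id
    FROM employee_id_buckets
    TABLESAMPLE (BUCKET 1 OUT OF 2 ON employee_id);
    ```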
  8. Views
    Unlike some RDBMS, Hive views do not store data or get materialized.

    CREATE VIEW employee_skills AS
    SELECT name,
      skills_score['DB'] AS DB,
      skills_score['Perl'] AS Perl,
      skills_score['Python'] AS Python,
      skills_score['Sales'] AS Sales,
      skills_score['HR'] AS HR
    FROM employee;

    Use SHOW CREATE TABLE or DESC FORMATTED to display the CREATE VIEW statement that created a view. Other Hive view DDL statements follow:

    ALTER VIEW employee_skills SET TBLPROPERTIES ('comment' = 'This is a view');
    ALTER VIEW employee_skills AS SELECT * FROM employee;
    DROP VIEW employee_skills;
  9. Data Manipulation
    ① Data exchange – LOAD

    LOAD DATA LOCAL INPATH '/home/dayongd/Downloads/employee_hr.txt' OVERWRITE INTO TABLE employee_hr;
    LOAD DATA LOCAL INPATH '/home/dayongd/Downloads/employee.txt' OVERWRITE INTO TABLE employee_partitioned PARTITION (year=2014, month=12);
    LOAD DATA INPATH '/user/dayongd/employee/employee.txt' OVERWRITE INTO TABLE employee;
    LOAD DATA INPATH 'hdfs://[dfs_host]:8020/user/dayongd/employee/employee.txt' OVERWRITE INTO TABLE employee;

    ② Data exchange – INSERT

    INSERT INTO TABLE employee SELECT * FROM ctas_employee;
    WITH a AS (SELECT * FROM ctas_employee)
    FROM a INSERT OVERWRITE TABLE employee SELECT *;
    FROM ctas_employee
    INSERT OVERWRITE TABLE employee SELECT *
    INSERT OVERWRITE TABLE employee_internal SELECT *;

    Dynamic partitioning is not enabled by default. We need to set the following properties to make it work:

    SET hive.exec.dynamic.partition=true;
    SET hive.exec.dynamic.partition.mode=nonstrict;
    INSERT INTO TABLE employee_partitioned PARTITION(year, month)
    SELECT name,
      array('Toronto') AS work_place,
      named_struct("sex","Male","age",30) AS sex_age,
      map("Python",90) AS skills_score,
      map("R&D",array('Developer')) AS depart_title,
      year(start_date) AS year,
      month(start_date) AS month
    FROM employee_hr eh
    WHERE eh.employee_id = 102;