Data Analysis System: Hive (Part 2)


Part 1: How Hive Works

I. Hive CLI (Query Processor)

1. Query Compiler (the plan it produces can be inspected with EXPLAIN; see the sketch after this list)

  • Parser (syntax parser; produces the AST, the abstract syntax tree)
  • Semantic Analyzer (semantic analysis; produces the QB, the query block)
  • Logical Plan Generator (generates the logical plan, an operator tree, from the QB)
  • Logical Optimizer (optimizes the logical plan)
  • Physical Plan Generator (generates the physical plan)
  • Physical Optimizer (optimizes the physical plan)
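The plan that comes out of this pipeline can be viewed directly from the CLI; a minimal sketch, using the user_dimension table created in Part 4 of this article:

-- print the stage plan (operators, map/reduce stages) without running the query
explain
select province, count(*) as user_count
from user_dimension
group by province;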

2. Execution Engine

  • MR
  • Spark
  • Tez

II. MetaStore (Metadata)

1. Stores table metadata

  • Database and table definitions
  • Partition information
  • Column information
  • Storage format information
  • Miscellaneous table properties
  • Permission information

2. Persists the metadata in a relational database
3. MetaStore Server

  • A general-purpose catalog server
  • Accessible from multiple languages
  • Local/remote mode
  • Configuration:
    <property>
        <name>hive.metastore.uris</name>
        <value>thrift://hadoopA:8020</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://hadoopA/metastore?createDatabaseIfNotExist=true</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>root</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>123</value>
    </property>

III. YARN (compute platform)
IV. HDFS (data storage)

Part 2: How to Access Hive

  1. Hive CLI

    • hive
    • hive -h host -p port
  2. Hive Beeline

    • beeline -u jdbc:hive2://host:port -n user -p password
  3. Programmatic access

    • JDBC/ODBC
    • Thrift

Part 3: Hive Query Optimization

1. Design tables sensibly

I. Define column types

  • Primitive types
  • Complex types

II. Use partition columns

  • Partition on the dimensions that queries filter by, driven by business requirements
  • Date partitions; dynamic partitioning (see the sketch after this list)
  • set hive.exec.dynamic.partition=true
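A minimal sketch of a dynamic-partition insert. It assumes a hypothetical staging table record_staging with the same data columns as the record table created in Part 4; Hive derives the partition values from the last columns of the SELECT:

-- enable dynamic partitioning for this session
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

-- record_staging is a hypothetical unpartitioned staging table
insert overwrite table record partition (partition_date, hour)
select rid, uid, bid, trancation_date, price,
       source_province, target_province, site,
       express_number, express_company,
       to_date(trancation_date) as partition_date,
       hour(trancation_date) as hour
from record_staging;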

III. Define bucket columns
IV. Use a columnar storage format (a combined example follows the list below)

  • Parquet
  • ORC
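A minimal sketch combining the two ideas above: a hypothetical copy of the fact table, bucketed on uid and stored as ORC:

-- hypothetical table: bucketed on the join key and stored in a columnar format
create table record_bucketed (
  rid STRING,
  uid STRING,
  bid STRING,
  price INT
)
clustered by (uid) into 32 buckets
stored as orc;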

V. Use compressed storage

  • Bzip2, Gzip
  • LZO
  • Snappy

2. General parameter tuning

I. Enable data compression (see the session-level sketch after this list)

  • Reduces storage and I/O
  • Compress Hive output and intermediate results
    • hive.exec.compress.output=true
    • hive.exec.compress.intermediate=true
  • Set the file format for Hive intermediate/query result files
    • hive.query.result.fileformat=SequenceFile
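A minimal session-level sketch of these settings; the output codec chosen here (Snappy) is only an illustrative assumption, not something the article prescribes:

-- compress final query output and intermediate map/reduce data
set hive.exec.compress.output=true;
set hive.exec.compress.intermediate=true;

-- store intermediate/query result files as SequenceFile
set hive.query.result.fileformat=SequenceFile;

-- example codec for the compressed output
set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;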

II. Job execution tuning (see the sketch after this list)

  • Run multiple jobs in parallel

    • hive.exec.parallel=true (default: false)
    • hive.exec.parallel.thread.number=8 (default: 8)
  • Local execution mode

    • hive.exec.mode.local.auto=true
    • hive.exec.mode.local.auto.inputbytes.max (128 MB by default)
    • hive.exec.mode.local.auto.tasks.max (4 by default)
    • the job must have at most one reduce task (num_reduce_task <= 1)
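A minimal sketch of these switches as session settings:

-- run independent stages of a query in parallel
set hive.exec.parallel=true;
set hive.exec.parallel.thread.number=8;

-- let Hive run sufficiently small jobs locally instead of submitting them to the cluster
set hive.exec.mode.local.auto=true;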

III. Choose a suitable execution engine (switchable per session; see below)

  • MR
  • Spark
  • Tez
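For example, assuming the corresponding engine is installed on the cluster:

-- switch the execution engine for the current session (mr, tez, or spark)
set hive.execution.engine=tez;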

IV. MapReduce parameter tuning (see the sketch after this list)

  • Map-stage tuning

    • mapreduce.job.maps has no effect here
    • num_map_tasks is determined by the input-split parameters
    • the split algorithm
  • Reduce-stage tuning

    • mapred.reduce.tasks sets the count directly
    • otherwise num_reduce_tasks is derived from size-related parameters
    • the split algorithm
  • Shuffle-stage tuning

    • Compress intermediate data
    • Configuration:

      • mapreduce.map.output.compress=true
      • mapreduce.map.output.compress.codec, e.g.
        • org.apache.hadoop.io.compress.LzoCodec
        • org.apache.hadoop.io.compress.SnappyCodec
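A minimal sketch of these knobs; the split size and reducer count are illustrative values, not recommendations from the article:

-- influence the number of map tasks indirectly via the maximum input split size (bytes)
set mapreduce.input.fileinputformat.split.maxsize=268435456;

-- set the number of reduce tasks directly
set mapred.reduce.tasks=8;

-- compress map output during the shuffle
set mapreduce.map.output.compress=true;
set mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;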

3. Join optimization

I. Common join
The join is executed on the reduce side.
II. Map join (see the sketch after this list)

  • The small table is loaded into memory and joined with every map split of the large table during the map phase
  • hive.auto.convert.join=true (default: false)
  • hive.mapjoin.smalltable.filesize=600M (default: 25M)
  • SELECT /*+ MAPJOIN(a) */ ... FROM a JOIN b forces a map join
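A minimal sketch using the record and brand_dimension tables from the project in Part 4, with brand_dimension as the small table held in memory:

-- let Hive convert the join into a map join automatically
set hive.auto.convert.join=true;

-- or force it with a hint on the small table's alias
select /*+ MAPJOIN(b) */ b.brand, sum(r.price) as totalPrice
from record r
join brand_dimension b on r.bid = b.bid
group by b.brand;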

III. Bucket map join (see the sketch after this list)

  • set hive.optimize.bucketmapjoin=true
  • Works together with map join
  • Every table in the join must be bucketed, and the bucket count of the large table must be an integer multiple of that of the small table
  • The bucketing column must be the join column
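A minimal sketch with two hypothetical tables, both bucketed on the join key uid (32 buckets is an integer multiple of 8):

create table big_table (uid STRING, price INT)
clustered by (uid) into 32 buckets stored as orc;

create table small_table (uid STRING, name STRING)
clustered by (uid) into 8 buckets stored as orc;

-- enable the bucket map join optimization, then join on the bucketing column
set hive.optimize.bucketmapjoin=true;

select /*+ MAPJOIN(s) */ s.name, sum(b.price) as totalPrice
from big_table b
join small_table s on b.uid = s.uid
group by s.name;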

Part 4: Hands-on Project

I. Project overview

  • Set up a Hadoop cluster with HDFS, YARN, Hive, and the other required components installed.
  • Create three Hive tables: user_dimension, brand_dimension, and record; user_dimension and brand_dimension are external tables.
  • Load data into these three tables.
  • The source data has already been prepared and matches the format of the three tables.

II. Create the user_dimension, brand_dimension, and record tables

[hadoop@hadoopa command]$ cat create_table_user.sql
create external table if not exists user_dimension (
  uid STRING,
  name STRING,
  gender STRING,
  birth DATE,
  province STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
location 'hdfs://hadoopA:8020/warehouse/user_dimension';
[hadoop@hadoopa command]$
[hadoop@hadoopa command]$ cat create_table_brand.sql
create external table if not exists brand_dimension (
  bid STRING,
  category STRING,
  brand STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
location 'hdfs://hadoopA:8020/warehouse/brand_dimension';
[hadoop@hadoopa command]$
[hadoop@hadoopa command]$ cat create_table_record.sql
create table if not exists record (
  rid STRING,
  uid STRING,
  bid STRING,
  trancation_date TIMESTAMP,
  price INT,
  source_province STRING,
  target_province STRING,
  site STRING,
  express_number STRING,
  express_company STRING
)
PARTITIONED BY (
  partition_date STRING,
  hour INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
[hadoop@hadoopa command]$ hive -f create_table_user.sql
[hadoop@hadoopa command]$ hive -f create_table_brand.sql
[hadoop@hadoopa command]$ hive -f create_table_record.sql

III. Load data into the three tables (a quick sanity check follows the commands below)

hive> load data local inpath '/home/hadoop/hadooptraining/datasource/brand.list' overwrite into table brand_dimension;
hive> load data local inpath '/home/hadoop/hadooptraining/datasource/user.list' overwrite into table user_dimension;
hive> load data inpath 'hdfs://hadoopA:8020/flume/record/2017-03-10/2220/transaction_log.1489155600805' overwrite into table record partition(partition_date='2017-03-10',hour=22);
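A quick sanity check after loading, as a minimal sketch:

hive> show partitions record;
hive> select count(*) from record;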

IV. Verify the results with HQL queries

  • Total sales by brand

    1. HQL statement
[hadoop@hadoopa command]$ cat brand_price_list.sql
select brand, sum(price) as totalPrice
from record
join brand_dimension on record.bid = brand_dimension.bid
group by brand_dimension.brand
order by totalPrice desc;
[hadoop@hadoopa command]$
2. Execution result
[hadoop@hadoopa command]$ hive -f brand_price_list.sql
which: no hbase in (/usr/lib64/qt-3.3/bin:.:/home/hadoop/apache-hive-2.1.0-bin/bin:/home/hadoop/maven3.3.9/bin:/home/hadoop/hadoop-2.7.3/bin:/home/hadoop/hadoop-2.7.3/sbin:/home/hadoop/apache-flume-1.7.0-bin/bin:/home/hadoop/jdk1.7/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/home/hadoop/bin)
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/hadoop/apache-hive-2.1.0-bin/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/hadoop/hadoop-2.7.3/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Logging initialized using configuration in file:/home/hadoop/apache-hive-2.1.0-bin/conf/hive-log4j2.properties Async: true
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. tez, spark) or using Hive 1.X releases.
Query ID = hadoop_20170313114243_fdc4d60e-75f7-426e-ba69-cf7198eaedc6
Total jobs = 2
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/hadoop/apache-hive-2.1.0-bin/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/hadoop/hadoop-2.7.3/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
2017-03-13 11:43:12     Starting to launch local task to process map join;     maximum memory = 518979584
2017-03-13 11:43:15     Dump the side-table for tag: 1 with group count: 1000 into file: file:/tmp/hadoop/5558bbca-d4b6-4112-8932-6e8a662d02ac/hive_2017-03-13_11-42-43_902_6704644829066910043-1/-local-10006/HashTable-Stage-2/MapJoin-mapfile01--.hashtable
2017-03-13 11:43:16     Uploaded 1 File to: file:/tmp/hadoop/5558bbca-d4b6-4112-8932-6e8a662d02ac/hive_2017-03-13_11-42-43_902_6704644829066910043-1/-local-10006/HashTable-Stage-2/MapJoin-mapfile01--.hashtable (34008 bytes)
2017-03-13 11:43:16     End of local task; Time Taken: 3.194 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1489117385526_0031, Tracking URL = http://hadoopA:8088/proxy/application_1489117385526_0031/
Kill Command = /home/hadoop/hadoop-2.7.3/bin/hadoop job  -kill job_1489117385526_0031
Hadoop job information for Stage-2: number of mappers: 1; number of reducers: 1
2017-03-13 11:43:40,015 Stage-2 map = 0%,  reduce = 0%
2017-03-13 11:43:54,259 Stage-2 map = 100%,  reduce = 0%, Cumulative CPU 2.86 sec
2017-03-13 11:44:08,440 Stage-2 map = 100%,  reduce = 100%, Cumulative CPU 5.47 sec
MapReduce Total cumulative CPU time: 5 seconds 470 msec
Ended Job = job_1489117385526_0031
Launching Job 2 out of 2
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1489117385526_0032, Tracking URL = http://hadoopA:8088/proxy/application_1489117385526_0032/
Kill Command = /home/hadoop/hadoop-2.7.3/bin/hadoop job  -kill job_1489117385526_0032
Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 1
2017-03-13 11:44:32,410 Stage-3 map = 0%,  reduce = 0%
2017-03-13 11:44:52,170 Stage-3 map = 100%,  reduce = 0%, Cumulative CPU 2.4 sec
2017-03-13 11:45:07,571 Stage-3 map = 100%,  reduce = 100%, Cumulative CPU 5.73 sec
MapReduce Total cumulative CPU time: 5 seconds 730 msec
Ended Job = job_1489117385526_0032
MapReduce Jobs Launched:
Stage-Stage-2: Map: 1  Reduce: 1   Cumulative CPU: 5.47 sec   HDFS Read: 62768 HDFS Write: 1666 SUCCESS
Stage-Stage-3: Map: 1  Reduce: 1   Cumulative CPU: 5.73 sec   HDFS Read: 6942 HDFS Write: 1480 SUCCESS
Total MapReduce CPU Time Spent: 11 seconds 200 msec
OK
SKYWORTH        11992
SAMSUNG 10240
YILI    9872
TCL     6741
OLAY    6442
MEIZU   6345
ASUS    5705
PEAK    5431
APPLE   5213
MOUTAI  4772
SHARP   4721
PEACEBIRD       4680
MIZUNO  4599
DHC     4585
NIULANSHAN      4582
CAMEL   4569
NIKE    4358
SEPTWOLVES      4345
OPPO    4306
NB      4237
KAPPA   4092
ZARA    4068
GUANGMING       4054
HP      4043
HISENSE 3995
HLA     3963
HUAWEI  3927
KANS    3884
LANGJIU 3857
NIVEA   3579
LINING  3559
CLINIQUE        3552
LENOVO  3534
PUMA    3531
HTC     3405
GXG     3322
UNIQLO  3271
HAIER   3106
LOREAL  2948
WULIANGYE       2912
ADIDAS  2847
MOTOROLA        2819
VIVO    2809
DELL    2785
SANYUAN 2776
LANCOME 2714
SELECTED        2633
INNISFREE       2452
SONY    2353
ACER    2339
XIAOMI  2260
HONGXING        2113
ANTA    1990
MENGNIU 1776
IPHONE  1628
SEMIR   1589
PHILIPS 1205
361     718
MEIFUBAO        448
Time taken: 146.0 seconds, Fetched: 59 row(s)
  • Total spending by age group

    1. HQL statement
[hadoop@hadoopa command]$ cat age_price_list.sql
select cast(DATEDIFF(CURRENT_DATE, birth)/365 as int) as age, sum(price) as totalPrice
from record
join user_dimension on record.uid = user_dimension.uid
group by cast(DATEDIFF(CURRENT_DATE, birth)/365 as int)
order by totalPrice desc;
2. Execution result
[hadoop@hadoopa command]$ hive -f age_price_list.sql
which: no hbase in (/usr/lib64/qt-3.3/bin:.:/home/hadoop/apache-hive-2.1.0-bin/bin:/home/hadoop/maven3.3.9/bin:/home/hadoop/hadoop-2.7.3/bin:/home/hadoop/hadoop-2.7.3/sbin:/home/hadoop/apache-flume-1.7.0-bin/bin:/home/hadoop/jdk1.7/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/home/hadoop/bin)
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/hadoop/apache-hive-2.1.0-bin/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/hadoop/hadoop-2.7.3/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Logging initialized using configuration in file:/home/hadoop/apache-hive-2.1.0-bin/conf/hive-log4j2.properties Async: true
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. tez, spark) or using Hive 1.X releases.
Query ID = hadoop_20170313114910_f92cc413-d572-4574-a30e-2cfa9fdcded8
Total jobs = 2
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/hadoop/apache-hive-2.1.0-bin/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/hadoop/hadoop-2.7.3/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
2017-03-13 11:49:32     Starting to launch local task to process map join;     maximum memory = 518979584
2017-03-13 11:49:35     Dump the side-table for tag: 1 with group count: 1000 into file: file:/tmp/hadoop/e1bbe409-d63f-4ef7-b610-bdd0a4caf873/hive_2017-03-13_11-49-10_204_7283272927191169557-1/-local-10006/HashTable-Stage-2/MapJoin-mapfile01--.hashtable
2017-03-13 11:49:35     Uploaded 1 File to: file:/tmp/hadoop/e1bbe409-d63f-4ef7-b610-bdd0a4caf873/hive_2017-03-13_11-49-10_204_7283272927191169557-1/-local-10006/HashTable-Stage-2/MapJoin-mapfile01--.hashtable (30382 bytes)
2017-03-13 11:49:35     End of local task; Time Taken: 3.073 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1489117385526_0033, Tracking URL = http://hadoopA:8088/proxy/application_1489117385526_0033/
Kill Command = /home/hadoop/hadoop-2.7.3/bin/hadoop job  -kill job_1489117385526_0033
Hadoop job information for Stage-2: number of mappers: 1; number of reducers: 1
2017-03-13 11:49:57,659 Stage-2 map = 0%,  reduce = 0%
2017-03-13 11:50:18,269 Stage-2 map = 100%,  reduce = 0%, Cumulative CPU 4.14 sec
2017-03-13 11:50:32,451 Stage-2 map = 100%,  reduce = 100%, Cumulative CPU 6.71 sec
MapReduce Total cumulative CPU time: 6 seconds 710 msec
Ended Job = job_1489117385526_0033
Launching Job 2 out of 2
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1489117385526_0034, Tracking URL = http://hadoopA:8088/proxy/application_1489117385526_0034/
Kill Command = /home/hadoop/hadoop-2.7.3/bin/hadoop job  -kill job_1489117385526_0034
Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 1
2017-03-13 11:50:50,851 Stage-3 map = 0%,  reduce = 0%
2017-03-13 11:51:06,025 Stage-3 map = 100%,  reduce = 0%, Cumulative CPU 2.22 sec
2017-03-13 11:51:19,065 Stage-3 map = 100%,  reduce = 100%, Cumulative CPU 5.14 sec
MapReduce Total cumulative CPU time: 5 seconds 140 msec
Ended Job = job_1489117385526_0034
MapReduce Jobs Launched:
Stage-Stage-2: Map: 1  Reduce: 1   Cumulative CPU: 6.71 sec   HDFS Read: 64575 HDFS Write: 1104 SUCCESS
Stage-Stage-3: Map: 1  Reduce: 1   Cumulative CPU: 5.14 sec   HDFS Read: 6338 HDFS Write: 1037 SUCCESS
Total MapReduce CPU Time Spent: 11 seconds 850 msec
OK
46      9653
44      9436
10      8068
2       7918
15      7470
9       7456
40      7391
14      7388
43      7109
37      6671
6       5980
11      5969
18      5858
30      5853
29      5841
22      5819
39      5737
20      5597
13      5564
41      5503
21      5306
12      4998
23      4991
28      4830
35      4829
33      4528
19      4347
25      4109
36      4017
32      3910
17      3698
1       3468
7       3459
16      3344
26      3328
31      3286
3       3188
4       3141
8       2639
42      2557
45      2473
24      2457
34      2454
47      2254
27      2137
38      1796
5       1534
0       1232
Time taken: 131.25 seconds, Fetched: 48 row(s)
  • Total spending by province

    1. HQL statement
[hadoop@hadoopa command]$ cat province_prince_list.sql
select province, sum(price) as totalPrice
from record
join user_dimension on record.uid = user_dimension.uid
group by user_dimension.province
order by totalPrice desc;
2. Execution result
[hadoop@hadoopa command]$ hive -f province_prince_list.sql
which: no hbase in (/usr/lib64/qt-3.3/bin:.:/home/hadoop/apache-hive-2.1.0-bin/bin:/home/hadoop/maven3.3.9/bin:/home/hadoop/hadoop-2.7.3/bin:/home/hadoop/hadoop-2.7.3/sbin:/home/hadoop/apache-flume-1.7.0-bin/bin:/home/hadoop/jdk1.7/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/home/hadoop/bin)
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/hadoop/apache-hive-2.1.0-bin/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/hadoop/hadoop-2.7.3/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Logging initialized using configuration in file:/home/hadoop/apache-hive-2.1.0-bin/conf/hive-log4j2.properties Async: true
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. tez, spark) or using Hive 1.X releases.
Query ID = hadoop_20170313115315_d285100f-73a1-44b4-8ffd-9537cbea48e9
Total jobs = 2
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/hadoop/apache-hive-2.1.0-bin/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/hadoop/hadoop-2.7.3/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
2017-03-13 11:53:44     Starting to launch local task to process map join;     maximum memory = 518979584
2017-03-13 11:53:48     Dump the side-table for tag: 1 with group count: 1000 into file: file:/tmp/hadoop/a03a87ad-de38-4374-b1b9-6e0042dd455d/hive_2017-03-13_11-53-15_150_9191727649602197768-1/-local-10006/HashTable-Stage-2/MapJoin-mapfile01--.hashtable
2017-03-13 11:53:48     Uploaded 1 File to: file:/tmp/hadoop/a03a87ad-de38-4374-b1b9-6e0042dd455d/hive_2017-03-13_11-53-15_150_9191727649602197768-1/-local-10006/HashTable-Stage-2/MapJoin-mapfile01--.hashtable (35297 bytes)
2017-03-13 11:53:48     End of local task; Time Taken: 4.051 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1489117385526_0035, Tracking URL = http://hadoopA:8088/proxy/application_1489117385526_0035/
Kill Command = /home/hadoop/hadoop-2.7.3/bin/hadoop job  -kill job_1489117385526_0035
Hadoop job information for Stage-2: number of mappers: 1; number of reducers: 1
2017-03-13 11:54:20,334 Stage-2 map = 0%,  reduce = 0%
2017-03-13 11:54:45,944 Stage-2 map = 100%,  reduce = 0%, Cumulative CPU 5.22 sec
2017-03-13 11:55:06,202 Stage-2 map = 100%,  reduce = 100%, Cumulative CPU 9.18 sec
MapReduce Total cumulative CPU time: 9 seconds 180 msec
Ended Job = job_1489117385526_0035
Launching Job 2 out of 2
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1489117385526_0036, Tracking URL = http://hadoopA:8088/proxy/application_1489117385526_0036/
Kill Command = /home/hadoop/hadoop-2.7.3/bin/hadoop job  -kill job_1489117385526_0036
Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 1
2017-03-13 11:55:31,708 Stage-3 map = 0%,  reduce = 0%
2017-03-13 11:55:51,447 Stage-3 map = 100%,  reduce = 0%, Cumulative CPU 2.66 sec
2017-03-13 11:56:15,088 Stage-3 map = 100%,  reduce = 100%, Cumulative CPU 6.75 sec
MapReduce Total cumulative CPU time: 6 seconds 750 msec
Ended Job = job_1489117385526_0036
MapReduce Jobs Launched:
Stage-Stage-2: Map: 1  Reduce: 1   Cumulative CPU: 9.18 sec   HDFS Read: 62835 HDFS Write: 1045 SUCCESS
Stage-Stage-3: Map: 1  Reduce: 1   Cumulative CPU: 6.75 sec   HDFS Read: 6333 HDFS Write: 938 SUCCESS
Total MapReduce CPU Time Spent: 15 seconds 930 msec
OK
ZheJiang        11501
ShanDong        11336
LiaoNing        10519
JiLin   10341
FuJian  9742
XiangGang       9371
QingHai 9329
ShanXi3 9283
GuiZhou 9148
HaiNan  9037
HuBei   8868
NeiMengGu       8245
BeiJing 8035
JiangXi 7706
AnHui   7382
HuNan   6666
ShangHai        6441
JiangSu 6025
TaiWan  5988
ShanXi1 5862
ChongQing       5854
XinJiang        5557
HeNan   5390
XiZang  5322
TianJin 4984
YunNan  4695
GuangXi 4646
HeiLongJiang    4639
GanSu   4288
GuangDong       3753
SiChuan 3627
Aomen   3233
NingXia 2444
HeBei   1334
Time taken: 182.306 seconds, Fetched: 34 row(s)

V. Create the user_dimension_orc, brand_dimension_orc, and record_orc tables; these three tables store their data in ORC format

[hadoop@hadoopa command]$ cat create_table_user_orc.sql
create external table if not exists user_dimension_orc (
  uid STRING,
  name STRING,
  gender STRING,
  birth DATE,
  province STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS ORC
location 'hdfs://hadoopA:8020/warehouse/user_dimension';
[hadoop@hadoopa command]$ cat create_table_brand_orc.sql
create external table if not exists brand_dimension_orc (
  bid STRING,
  category STRING,
  brand STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS ORC
location 'hdfs://hadoopA:8020/warehouse/brand_dimension';
[hadoop@hadoopa command]$ cat create_table_record_orc.sql
create table if not exists record_orc (
  rid STRING,
  uid STRING,
  bid STRING,
  trancation_date TIMESTAMP,
  price INT,
  source_province STRING,
  target_province STRING,
  site STRING,
  express_number STRING,
  express_company STRING
)
PARTITIONED BY (
  partition_date STRING,
  hour INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS ORC;

VI. Load the data into the three ORC tables

hive> insert into table brand_dimension_orc select * from brand_dimension;
hive> insert into table user_dimension_orc select * from user_dimension;
hive> insert overwrite table record_orc partition (partition_date='2017-03-10',hour=22) select rid,uid,bid,trancation_date,price,source_province,target_province,site,express_number,express_company from record;
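To compare the plain-text and ORC copies of the data, the brand query from section IV can simply be pointed at the ORC tables; a minimal sketch (shown for comparison, not as a measured result):

select brand, sum(price) as totalPrice
from record_orc
join brand_dimension_orc on record_orc.bid = brand_dimension_orc.bid
group by brand_dimension_orc.brand
order by totalPrice desc;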