Data Analysis with Hive (Part 2)
Part 1: How Hive Works
I. Hive CLI (query processor)
1. Query Compiler
- Parser (syntax parser; produces the AST, an abstract syntax tree)
- Semantic Analyzer (produces QB, query blocks)
- Logical Plan Generator (produces the logical plan as a QB tree)
- Logical Optimizer (rewrites the QB tree)
- Physical Plan Generator (produces the physical plan)
- Physical Optimizer (rewrites the physical plan)
2. Execution Engine
- MR
- Spark
- Tez
II. MetaStore (metadata)
1. Stores table metadata:
- basic database and table information
- partition information
- column information
- storage-format information
- miscellaneous properties
- permission information
2. Backed by a relational database
3. MetaStore Server
- a general-purpose catalog server
- accessible from multiple languages
- local/remote mode
- configuration:
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://hadoopA:8020</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://hadoopA/metastore?createDatabaseIfNotExist=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>root</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>123</value>
</property>
III. YARN (compute platform)
IV. HDFS (data storage)
Part 2: Ways to Access Hive
Hive CLI
- hive
- hive -h host -p port
Hive Beeline
- beeline -u jdbc:hive2://host:port user password
Programmatic access
- JDBC/ODBC
- Thrift
Part 3: Hive Query Optimization
1. Design tables sensibly
I. Choose appropriate column types
- primitive types
- complex types
II. Use partition columns
- chosen from the query dimensions and business requirements
- e.g. date partitions; dynamic partitions
- set hive.exec.dynamic.partition=true
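As a sketch of how the setting above is typically used, here is a hedged example of a dynamic-partition load; the table and column names are made up for illustration and are not part of the project below:

```sql
-- Illustrative only: a fact table partitioned by date,
-- loaded with dynamic partitioning from a staging table.
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;  -- allow all partitions to be dynamic

create table if not exists sales (
  order_id STRING,
  amount   INT
)
partitioned by (dt STRING);

-- The partition value is taken from the last column of the select list.
insert overwrite table sales partition (dt)
select order_id, amount, order_date
from sales_staging;
```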
III. Define bucketing columns
IV. Use columnar storage formats
- Parquet
- ORC
V. Use compressed storage
- Bzip2, Gzip
- LZO
- Snappy
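The last three table-design points (bucketing, columnar storage, compression) can be combined in a single DDL statement. A hedged sketch, with made-up table and column names:

```sql
-- Illustrative DDL combining bucketing, ORC storage and compression.
create table if not exists events_orc (
  event_id STRING,
  user_id  STRING,
  price    INT
)
clustered by (user_id) into 32 buckets   -- bucketing column
stored as orc                            -- columnar format
tblproperties ('orc.compress'='SNAPPY'); -- compression codec
```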
2. General parameter tuning
I. Enable data compression
- reduces storage and I/O
- compress Hive output and intermediate results:
  hive.exec.compress.output=true
  hive.exec.compress.intermediate=true
- set the storage format of intermediate results:
  hive.query.result.fileformat=SequenceFile
II. Job execution tuning
Run independent jobs in parallel:
- hive.exec.parallel=true (default false)
- hive.exec.parallel.thread.number=8 (default 8)
Local execution mode:
- hive.exec.mode.local.auto=true
- hive.exec.mode.local.auto.inputbytes.max (128 MB by default)
- hive.exec.mode.local.auto.tasks.max (4 by default)
- applies only when num_reduce_tasks <= 1
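In practice the general parameters above are often set together at the top of a script or session. A minimal sketch (the values are the ones quoted above, not universal recommendations):

```sql
-- Illustrative tuning preamble for a Hive session or script.
set hive.exec.compress.output=true;        -- compress final output
set hive.exec.compress.intermediate=true;  -- compress data between MR jobs
set hive.exec.parallel=true;               -- run independent stages in parallel
set hive.exec.parallel.thread.number=8;
set hive.exec.mode.local.auto=true;        -- run small jobs locally
```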
III. Choose a suitable engine
- MR
- Spark
- Tez
IV. MapReduce parameter tuning
Map-phase tuning
- mapreduce.job.maps has no direct effect
- num_map_tasks is determined by the input-split parameters
- split algorithm
Reduce-phase tuning
- mapred.reduce.tasks sets the count directly
- otherwise num_reduce_tasks is derived from size-related parameters
- split algorithm
Shuffle-phase tuning
- compress intermediate data
Configuration:
- mapreduce.map.output.compress=true
- mapreduce.map.output.compress.codec=
  org.apache.hadoop.io.compress.LzoCodec or
  org.apache.hadoop.io.compress.SnappyCodec
3. Join optimization
I. Common join
- the join is performed on the reduce side
II. Map join
- the small table is loaded into memory and joined against each split of the large table during the map phase
- hive.auto.convert.join=true (default false)
- hive.mapjoin.smalltable.filesize=600M (default 25M)
- SELECT /*+ MAPJOIN(a) */ ... a JOIN b forces a map join
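Spelled out in a full query, the MAPJOIN hint looks like this; a sketch with made-up tables a (small) and b (large) — with hive.auto.convert.join=true, Hive would normally make this choice automatically:

```sql
-- Force small table a into memory and join it on the map side.
select /*+ MAPJOIN(a) */ a.name, sum(b.price) as totalPrice
from b join a on b.uid = a.uid
group by a.name;
```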
III. Bucket map join
- set hive.optimize.bucketmapjoin=true
- works together with map join
- all joined tables must be bucketed, and the bucket count of the large table must be an integer multiple of that of the small table
- the bucketing column must be the join column
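A minimal sketch of the requirements above, with hypothetical tables bucketed on the join key and bucket counts in a 4:1 ratio:

```sql
set hive.optimize.bucketmapjoin=true;

-- Both tables are bucketed on the join column uid; the large table's
-- bucket count (32) is an integer multiple of the small table's (8).
-- (Populate them with hive.enforce.bucketing=true on pre-2.x Hive so
-- the data is actually bucketed.)
create table big_tbl   (uid STRING, price INT)
  clustered by (uid) into 32 buckets;
create table small_tbl (uid STRING, name STRING)
  clustered by (uid) into 8 buckets;

select /*+ MAPJOIN(s) */ s.name, sum(b.price)
from big_tbl b join small_tbl s on b.uid = s.uid
group by s.name;
```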
Part 4: Hands-on Project
I. Project description
- Set up a Hadoop cluster with HDFS, YARN, Hive and related components installed.
- Create three Hive tables: user_dimension, brand_dimension and record; user_dimension and brand_dimension are external tables.
- Load data into the three tables.
- The source data is already prepared and matches the layout of the three tables.
II. Create the user_dimension, brand_dimension and record tables
[hadoop@hadoopa command]$ cat create_table_user.sql
create external table if not exists user_dimension (
  uid STRING,
  name STRING,
  gender STRING,
  birth DATE,
  province STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
location 'hdfs://hadoopA:8020/warehouse/user_dimension';

[hadoop@hadoopa command]$ cat create_table_brand.sql
create external table if not exists brand_dimension (
  bid STRING,
  category STRING,
  brand STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
location 'hdfs://hadoopA:8020/warehouse/brand_dimension';

[hadoop@hadoopa command]$ cat create_table_record.sql
create table if not exists record (
  rid STRING,
  uid STRING,
  bid STRING,
  trancation_date TIMESTAMP,
  price INT,
  source_province STRING,
  target_province STRING,
  site STRING,
  express_number STRING,
  express_company STRING
)
PARTITIONED BY (
  partition_date STRING,
  hour INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

[hadoop@hadoopa command]$ hive -f create_table_user.sql
[hadoop@hadoopa command]$ hive -f create_table_brand.sql
[hadoop@hadoopa command]$ hive -f create_table_record.sql
III. Load data into the three tables

hive> load data local inpath '/home/hadoop/hadooptraining/datasource/brand.list' overwrite into table brand_dimension;
hive> load data local inpath '/home/hadoop/hadooptraining/datasource/user.list' overwrite into table user_dimension;
hive> load data inpath 'hdfs://hadoopA:8020/flume/record/2017-03-10/2220/transaction_log.1489155600805' overwrite into table record partition(partition_date='2017-03-10',hour=22);
IV. Validate the results with HQL queries
Total sales per brand
1. HQL statement
[hadoop@hadoopa command]$ cat brand_price_list.sql
select brand, sum(price) as totalPrice
from record join brand_dimension on record.bid = brand_dimension.bid
group by brand_dimension.brand
order by totalPrice desc;
2. Result
[hadoop@hadoopa command]$ hive -f brand_price_list.sql
Logging initialized using configuration in file:/home/hadoop/apache-hive-2.1.0-bin/conf/hive-log4j2.properties Async: true
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. tez, spark) or using Hive 1.X releases.
Query ID = hadoop_20170313114243_fdc4d60e-75f7-426e-ba69-cf7198eaedc6
Total jobs = 2
2017-03-13 11:43:12  Starting to launch local task to process map join; maximum memory = 518979584
2017-03-13 11:43:15  Dump the side-table for tag: 1 with group count: 1000 into file: file:/tmp/hadoop/5558bbca-d4b6-4112-8932-6e8a662d02ac/hive_2017-03-13_11-42-43_902_6704644829066910043-1/-local-10006/HashTable-Stage-2/MapJoin-mapfile01--.hashtable
2017-03-13 11:43:16  Uploaded 1 File to: file:/tmp/hadoop/5558bbca-d4b6-4112-8932-6e8a662d02ac/hive_2017-03-13_11-42-43_902_6704644829066910043-1/-local-10006/HashTable-Stage-2/MapJoin-mapfile01--.hashtable (34008 bytes)
2017-03-13 11:43:16  End of local task; Time Taken: 3.194 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes): set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers: set hive.exec.reducers.max=<number>
In order to set a constant number of reducers: set mapreduce.job.reduces=<number>
Starting Job = job_1489117385526_0031, Tracking URL = http://hadoopA:8088/proxy/application_1489117385526_0031/
Kill Command = /home/hadoop/hadoop-2.7.3/bin/hadoop job -kill job_1489117385526_0031
Hadoop job information for Stage-2: number of mappers: 1; number of reducers: 1
2017-03-13 11:43:40,015 Stage-2 map = 0%, reduce = 0%
2017-03-13 11:43:54,259 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 2.86 sec
2017-03-13 11:44:08,440 Stage-2 map = 100%, reduce = 100%, Cumulative CPU 5.47 sec
MapReduce Total cumulative CPU time: 5 seconds 470 msec
Ended Job = job_1489117385526_0031
Launching Job 2 out of 2
Number of reduce tasks determined at compile time: 1
Starting Job = job_1489117385526_0032, Tracking URL = http://hadoopA:8088/proxy/application_1489117385526_0032/
Kill Command = /home/hadoop/hadoop-2.7.3/bin/hadoop job -kill job_1489117385526_0032
Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 1
2017-03-13 11:44:32,410 Stage-3 map = 0%, reduce = 0%
2017-03-13 11:44:52,170 Stage-3 map = 100%, reduce = 0%, Cumulative CPU 2.4 sec
2017-03-13 11:45:07,571 Stage-3 map = 100%, reduce = 100%, Cumulative CPU 5.73 sec
MapReduce Total cumulative CPU time: 5 seconds 730 msec
Ended Job = job_1489117385526_0032
MapReduce Jobs Launched:
Stage-Stage-2: Map: 1  Reduce: 1  Cumulative CPU: 5.47 sec  HDFS Read: 62768 HDFS Write: 1666 SUCCESS
Stage-Stage-3: Map: 1  Reduce: 1  Cumulative CPU: 5.73 sec  HDFS Read: 6942 HDFS Write: 1480 SUCCESS
Total MapReduce CPU Time Spent: 11 seconds 200 msec
OK
SKYWORTH    11992
SAMSUNG     10240
YILI        9872
TCL         6741
OLAY        6442
MEIZU       6345
ASUS        5705
PEAK        5431
APPLE       5213
MOUTAI      4772
SHARP       4721
PEACEBIRD   4680
MIZUNO      4599
DHC         4585
NIULANSHAN  4582
CAMEL       4569
NIKE        4358
SEPTWOLVES  4345
OPPO        4306
NB          4237
KAPPA       4092
ZARA        4068
GUANGMING   4054
HP          4043
HISENSE     3995
HLA         3963
HUAWEI      3927
KANS        3884
LANGJIU     3857
NIVEA       3579
LINING      3559
CLINIQUE    3552
LENOVO      3534
PUMA        3531
HTC         3405
GXG         3322
UNIQLO      3271
HAIER       3106
LOREAL      2948
WULIANGYE   2912
ADIDAS      2847
MOTOROLA    2819
VIVO        2809
DELL        2785
SANYUAN     2776
LANCOME     2714
SELECTED    2633
INNISFREE   2452
SONY        2353
ACER        2339
XIAOMI      2260
HONGXING    2113
ANTA        1990
MENGNIU     1776
IPHONE      1628
SEMIR       1589
PHILIPS     1205
361         718
MEIFUBAO    448
Time taken: 146.0 seconds, Fetched: 59 row(s)
Total spend by user age group
1. HQL statement
[hadoop@hadoopa command]$ cat age_price_list.sql
select cast(DATEDIFF(CURRENT_DATE, birth)/365 as int) as age, sum(price) as totalPrice
from record join user_dimension on record.uid=user_dimension.uid
group by cast(DATEDIFF(CURRENT_DATE, birth)/365 as int)
order by totalPrice desc;
2. Result
[hadoop@hadoopa command]$ hive -f age_price_list.sql
Query ID = hadoop_20170313114910_f92cc413-d572-4574-a30e-2cfa9fdcded8
Total jobs = 2
2017-03-13 11:49:32  Starting to launch local task to process map join; maximum memory = 518979584
2017-03-13 11:49:35  Dump the side-table for tag: 1 with group count: 1000 into file: file:/tmp/hadoop/e1bbe409-d63f-4ef7-b610-bdd0a4caf873/hive_2017-03-13_11-49-10_204_7283272927191169557-1/-local-10006/HashTable-Stage-2/MapJoin-mapfile01--.hashtable
2017-03-13 11:49:35  Uploaded 1 File to: file:/tmp/hadoop/e1bbe409-d63f-4ef7-b610-bdd0a4caf873/hive_2017-03-13_11-49-10_204_7283272927191169557-1/-local-10006/HashTable-Stage-2/MapJoin-mapfile01--.hashtable (30382 bytes)
2017-03-13 11:49:35  End of local task; Time Taken: 3.073 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 1
Starting Job = job_1489117385526_0033, Tracking URL = http://hadoopA:8088/proxy/application_1489117385526_0033/
Kill Command = /home/hadoop/hadoop-2.7.3/bin/hadoop job -kill job_1489117385526_0033
Hadoop job information for Stage-2: number of mappers: 1; number of reducers: 1
2017-03-13 11:49:57,659 Stage-2 map = 0%, reduce = 0%
2017-03-13 11:50:18,269 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 4.14 sec
2017-03-13 11:50:32,451 Stage-2 map = 100%, reduce = 100%, Cumulative CPU 6.71 sec
MapReduce Total cumulative CPU time: 6 seconds 710 msec
Ended Job = job_1489117385526_0033
Launching Job 2 out of 2
Number of reduce tasks determined at compile time: 1
Starting Job = job_1489117385526_0034, Tracking URL = http://hadoopA:8088/proxy/application_1489117385526_0034/
Kill Command = /home/hadoop/hadoop-2.7.3/bin/hadoop job -kill job_1489117385526_0034
Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 1
2017-03-13 11:50:50,851 Stage-3 map = 0%, reduce = 0%
2017-03-13 11:51:06,025 Stage-3 map = 100%, reduce = 0%, Cumulative CPU 2.22 sec
2017-03-13 11:51:19,065 Stage-3 map = 100%, reduce = 100%, Cumulative CPU 5.14 sec
MapReduce Total cumulative CPU time: 5 seconds 140 msec
Ended Job = job_1489117385526_0034
MapReduce Jobs Launched:
Stage-Stage-2: Map: 1  Reduce: 1  Cumulative CPU: 6.71 sec  HDFS Read: 64575 HDFS Write: 1104 SUCCESS
Stage-Stage-3: Map: 1  Reduce: 1  Cumulative CPU: 5.14 sec  HDFS Read: 6338 HDFS Write: 1037 SUCCESS
Total MapReduce CPU Time Spent: 11 seconds 850 msec
OK
46  9653
44  9436
10  8068
2   7918
15  7470
9   7456
40  7391
14  7388
43  7109
37  6671
6   5980
11  5969
18  5858
30  5853
29  5841
22  5819
39  5737
20  5597
13  5564
41  5503
21  5306
12  4998
23  4991
28  4830
35  4829
33  4528
19  4347
25  4109
36  4017
32  3910
17  3698
1   3468
7   3459
16  3344
26  3328
31  3286
3   3188
4   3141
8   2639
42  2557
45  2473
24  2457
34  2454
47  2254
27  2137
38  1796
5   1534
0   1232
Time taken: 131.25 seconds, Fetched: 48 row(s)
Total spend by province
1. HQL statement
[hadoop@hadoopa command]$ cat province_prince_list.sql
select province, sum(price) as totalPrice
from record join user_dimension on record.uid=user_dimension.uid
group by user_dimension.province
order by totalPrice desc;
2. Result
[hadoop@hadoopa command]$ hive -f province_prince_list.sql
Query ID = hadoop_20170313115315_d285100f-73a1-44b4-8ffd-9537cbea48e9
Total jobs = 2
2017-03-13 11:53:44  Starting to launch local task to process map join; maximum memory = 518979584
2017-03-13 11:53:48  Dump the side-table for tag: 1 with group count: 1000 into file: file:/tmp/hadoop/a03a87ad-de38-4374-b1b9-6e0042dd455d/hive_2017-03-13_11-53-15_150_9191727649602197768-1/-local-10006/HashTable-Stage-2/MapJoin-mapfile01--.hashtable
2017-03-13 11:53:48  Uploaded 1 File to: file:/tmp/hadoop/a03a87ad-de38-4374-b1b9-6e0042dd455d/hive_2017-03-13_11-53-15_150_9191727649602197768-1/-local-10006/HashTable-Stage-2/MapJoin-mapfile01--.hashtable (35297 bytes)
2017-03-13 11:53:48  End of local task; Time Taken: 4.051 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 1
Starting Job = job_1489117385526_0035, Tracking URL = http://hadoopA:8088/proxy/application_1489117385526_0035/
Kill Command = /home/hadoop/hadoop-2.7.3/bin/hadoop job -kill job_1489117385526_0035
Hadoop job information for Stage-2: number of mappers: 1; number of reducers: 1
2017-03-13 11:54:20,334 Stage-2 map = 0%, reduce = 0%
2017-03-13 11:54:45,944 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 5.22 sec
2017-03-13 11:55:06,202 Stage-2 map = 100%, reduce = 100%, Cumulative CPU 9.18 sec
MapReduce Total cumulative CPU time: 9 seconds 180 msec
Ended Job = job_1489117385526_0035
Launching Job 2 out of 2
Number of reduce tasks determined at compile time: 1
Starting Job = job_1489117385526_0036, Tracking URL = http://hadoopA:8088/proxy/application_1489117385526_0036/
Kill Command = /home/hadoop/hadoop-2.7.3/bin/hadoop job -kill job_1489117385526_0036
Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 1
2017-03-13 11:55:31,708 Stage-3 map = 0%, reduce = 0%
2017-03-13 11:55:51,447 Stage-3 map = 100%, reduce = 0%, Cumulative CPU 2.66 sec
2017-03-13 11:56:15,088 Stage-3 map = 100%, reduce = 100%, Cumulative CPU 6.75 sec
MapReduce Total cumulative CPU time: 6 seconds 750 msec
Ended Job = job_1489117385526_0036
MapReduce Jobs Launched:
Stage-Stage-2: Map: 1  Reduce: 1  Cumulative CPU: 9.18 sec  HDFS Read: 62835 HDFS Write: 1045 SUCCESS
Stage-Stage-3: Map: 1  Reduce: 1  Cumulative CPU: 6.75 sec  HDFS Read: 6333 HDFS Write: 938 SUCCESS
Total MapReduce CPU Time Spent: 15 seconds 930 msec
OK
ZheJiang      11501
ShanDong      11336
LiaoNing      10519
JiLin         10341
FuJian        9742
XiangGang     9371
QingHai       9329
ShanXi3       9283
GuiZhou       9148
HaiNan        9037
HuBei         8868
NeiMengGu     8245
BeiJing       8035
JiangXi       7706
AnHui         7382
HuNan         6666
ShangHai      6441
JiangSu       6025
TaiWan        5988
ShanXi1       5862
ChongQing     5854
XinJiang      5557
HeNan         5390
XiZang        5322
TianJin       4984
YunNan        4695
GuangXi       4646
HeiLongJiang  4639
GanSu         4288
GuangDong     3753
SiChuan       3627
Aomen         3233
NingXia       2444
HeBei         1334
Time taken: 182.306 seconds, Fetched: 34 row(s)
V. Create user_dimension_orc, brand_dimension_orc and record_orc, which store their data in ORC format

[hadoop@hadoopa command]$ cat create_table_user_orc.sql
create external table if not exists user_dimension_orc (
  uid STRING,
  name STRING,
  gender STRING,
  birth DATE,
  province STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS ORC
location 'hdfs://hadoopA:8020/warehouse/user_dimension';

[hadoop@hadoopa command]$ cat create_table_brand_orc.sql
create external table if not exists brand_dimension_orc (
  bid STRING,
  category STRING,
  brand STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS ORC
location 'hdfs://hadoopA:8020/warehouse/brand_dimension';

[hadoop@hadoopa command]$ cat create_table_record_orc.sql
create table if not exists record_orc (
  rid STRING,
  uid STRING,
  bid STRING,
  trancation_date TIMESTAMP,
  price INT,
  source_province STRING,
  target_province STRING,
  site STRING,
  express_number STRING,
  express_company STRING
)
PARTITIONED BY (
  partition_date STRING,
  hour INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS ORC;
VI. Load the data into the three ORC tables

hive> insert into table brand_dimension_orc select * from brand_dimension;
hive> insert into table user_dimension_orc select * from user_dimension;
hive> insert overwrite table record_orc partition (partition_date='2017-03-10',hour=22) select rid,uid,bid,trancation_date,price,source_province,target_province,site,express_number,express_company from record;
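With the ORC tables populated, the validation queries from step IV can be re-run against them to compare performance with the text-format tables. A sketch (not part of the original write-up):

```sql
-- Same per-brand aggregation as brand_price_list.sql,
-- but reading the ORC-backed copies of the tables.
select brand, sum(price) as totalPrice
from record_orc
join brand_dimension_orc on record_orc.bid = brand_dimension_orc.bid
group by brand_dimension_orc.brand
order by totalPrice desc;
```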