Calculating the reduce number in the Hive framework
This article is a repost; the original is at http://blog.csdn.net/wisgood/article/details/42125367
Every time we execute an HQL statement in Hive, the shell prints a message like this:

```
...
Number of reduce tasks not specified. Estimated from input data size: 500
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
...
```

Tuning these settings is a common optimization step. Three properties determine the reducer count:
hive.exec.reducers.bytes.per.reducer — controls how much data each reducer is expected to process. The default is 1 GB.
This controls how many reducers a map-reduce job should have, depending on the total size of input files to the job. Default is 1GB
hive.exec.reducers.max — controls the maximum number of reducers. If input size / bytes-per-reducer exceeds this maximum, the job starts exactly this many reducers. It does not affect the mapred.reduce.tasks setting. The default maximum is 999.
This controls the maximum number of reducers a map-reduce job can have. If input_file_size divided by "hive.exec.bytes.per.reducer" is greater than this value, the map-reduce job will have this value as the number of reducers. Note this does not affect the number of reducers directly specified by the user through "mapred.reduce.tasks" and query hints
mapred.reduce.tasks — if this parameter is set, Hive does not run its estimation logic to compute the reducer count; it starts exactly this many reducers. The default is -1.
This overrides the hadoop configuration to make sure we enable the estimation of the number of reducers by the size of the input files. If this value is non-negative, then hive will pass this number directly to map-reduce jobs instead of doing the estimation.
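Putting the three settings together, the selection logic described above can be sketched as follows. This is a minimal standalone sketch: the class and method names (`ReducerResolution`, `resolveNumReducers`) and the plain `long`/`int` parameters are illustrative stand-ins, not Hive's actual API.

```java
public class ReducerResolution {
    // Sketch of how the three settings interact, per the description above:
    // a non-negative mapred.reduce.tasks wins outright; otherwise Hive
    // estimates from input size and caps the result at hive.exec.reducers.max.
    static int resolveNumReducers(long mapredReduceTasks,
                                  long totalInputFileSize,
                                  long bytesPerReducer,    // hive.exec.reducers.bytes.per.reducer
                                  int maxReducers) {       // hive.exec.reducers.max
        if (mapredReduceTasks >= 0) {
            return (int) mapredReduceTasks;        // user-specified, no estimation
        }
        // Ceiling division: round the estimate up, never down.
        int reducers = (int) ((totalInputFileSize + bytesPerReducer - 1) / bytesPerReducer);
        reducers = Math.max(1, reducers);          // always at least one reducer
        reducers = Math.min(maxReducers, reducers);// never exceed the configured max
        return reducers;
    }

    public static void main(String[] args) {
        long GB = 1024L * 1024L * 1024L;
        // 500 GB of input, default 1 GB per reducer, default max 999, tasks unset (-1)
        System.out.println(resolveNumReducers(-1, 500 * GB, GB, 999)); // 500
        // An explicit mapred.reduce.tasks overrides the estimate entirely
        System.out.println(resolveNumReducers(32, 500 * GB, GB, 999)); // 32
    }
}
```

The first call reproduces the "Estimated from input data size: 500" message shown earlier: 500 GB of input at the default 1 GB per reducer yields 500 reducers.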
The reducer count has a large impact on execution efficiency:
1. Too few reducers: with a large data volume, each reducer runs extremely slowly, the job may never finish, and it may even OOM.
2. Too many reducers: too many small files are produced, merging them is expensive, and NameNode memory usage also grows.
If we do not set mapred.reduce.tasks, Hive computes the number of reducers automatically.
The rough formula is: reducers = InputFileSize / bytesPerReducer
This is only an approximation; the detailed logic lives in:
common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
Let's walk through it:
1. Computing the total input size: this is simple — iterate over every input path, get its length, and accumulate.
```java
/**
 * Calculate the total size of input files.
 * @param job the hadoop job conf.
 * @return the total size in bytes.
 * @throws IOException
 */
public static long getTotalInputFileSize(JobConf job, mapredWork work) throws IOException {
  long r = 0;
  FileSystem fs = FileSystem.get(job);
  // For each input path, calculate the total size.
  for (String path : work.getPathToAliases().keySet()) {
    ContentSummary cs = fs.getContentSummary(new Path(path));
    r += cs.getLength();
  }
  return r;
}
```

2. Estimating the reducer count, and the formula used:
Note the most important line: int reducers = (int)((totalInputFileSize + bytesPerReducer - 1) / bytesPerReducer);
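The `+ bytesPerReducer - 1` term is the standard integer ceiling trick: any remainder pushes the count up by one instead of being truncated away. A quick check with concrete numbers (the 2.5 GB figure is illustrative, not from the source):

```java
public class CeilingDivision {
    public static void main(String[] args) {
        long bytesPerReducer = 1024L * 1024L * 1024L;        // default 1 GB
        long totalInputFileSize = (5L * bytesPerReducer) / 2; // 2.5 GB of input

        // Plain integer division truncates: 2.5 GB / 1 GB -> 2 reducers,
        // silently dropping the last 0.5 GB onto an already-full reducer.
        long floorDiv = totalInputFileSize / bytesPerReducer;

        // The ceiling form rounds up, so the remainder gets its own reducer.
        long ceilDiv = (totalInputFileSize + bytesPerReducer - 1) / bytesPerReducer;

        System.out.println(floorDiv); // 2
        System.out.println(ceilDiv);  // 3
    }
}
```

Exact multiples are unaffected: 500 GB of input still yields exactly 500, not 501, because adding `bytesPerReducer - 1` never crosses the next boundary.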
```java
/**
 * Estimate the number of reducers needed for this job, based on job input,
 * and configuration parameters.
 * @return the number of reducers.
 */
public int estimateNumberOfReducers(HiveConf hive, JobConf job, mapredWork work) throws IOException {
  long bytesPerReducer = hive.getLongVar(HiveConf.ConfVars.BYTESPERREDUCER);
  int maxReducers = hive.getIntVar(HiveConf.ConfVars.MAXREDUCERS);
  long totalInputFileSize = getTotalInputFileSize(job, work);

  LOG.info("BytesPerReducer=" + bytesPerReducer + " maxReducers=" + maxReducers
      + " totalInputFileSize=" + totalInputFileSize);

  int reducers = (int) ((totalInputFileSize + bytesPerReducer - 1) / bytesPerReducer);
  reducers = Math.max(1, reducers);
  reducers = Math.min(maxReducers, reducers);
  return reducers;
}
```

3. The code that actually drives the calculation:
```java
/**
 * Set the number of reducers for the mapred work.
 */
protected void setNumberOfReducers() throws IOException {
  // this is a temporary hack to fix things that are not fixed in the compiler
  Integer numReducersFromWork = work.getNumReduceTasks();

  if (numReducersFromWork != null && numReducersFromWork >= 0) {
    LOG.info("Number of reduce tasks determined at compile: " + work.getNumReduceTasks());
  } else if (work.getReducer() == null) {
    LOG.info("Number of reduce tasks not specified. Defaulting to 0 since there's no reduce operator");
    work.setNumReduceTasks(Integer.valueOf(0));
  } else {
    int reducers = estimateNumberOfReducers(conf, job, work);
    work.setNumReduceTasks(reducers);
    LOG.info("Number of reduce tasks not specified. Estimated from input data size: " + reducers);
  }
}
```

That is the principle behind how the reducer count is computed.