MapReduce commit job optimization
We often see a user's job take several more minutes to finish after all of its maps and reduces have completed. This final phase is spent committing the job output.
MR v2 includes an optimization for this step:
https://issues.apache.org/jira/browse/MAPREDUCE-4815
https://issues.apache.org/jira/browse/MAPREDUCE-6275
https://issues.apache.org/jira/browse/MAPREDUCE-6280
These improvements are only available in Hadoop 2.7 and later, and they still have pitfalls.
In the commitJob method of FileOutputCommitter, you can see that different processing logic is used depending on the value of mapreduce.fileoutputcommitter.algorithm.version.
org/apache/hadoop/mapreduce/lib/output/FileOutputCommitter.java
mapreduce.fileoutputcommitter.algorithm.version
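A minimal sketch of switching to the new algorithm: set the property above to 2 in mapred-site.xml (or pass it per-job with -D); the default value 1 keeps the original behavior.

```xml
<property>
  <name>mapreduce.fileoutputcommitter.algorithm.version</name>
  <value>2</value>
</property>
```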
The description in the official documentation at https://hadoop.apache.org/docs/r2.7.0/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml is very clear:

The file output committer algorithm version. Valid version numbers: 1 or 2; the default is 1, the original algorithm.

In algorithm version 1:
1. commitTask renames the directory $joboutput/_temporary/$appAttemptID/_temporary/$taskAttemptID/ to $joboutput/_temporary/$appAttemptID/$taskID/
2. recoverTask also does a rename, from $joboutput/_temporary/$appAttemptID/$taskID/ to $joboutput/_temporary/($appAttemptID + 1)/$taskID/
3. commitJob merges every task output file in $joboutput/_temporary/$appAttemptID/$taskID/ into $joboutput/, then deletes $joboutput/_temporary/ and writes $joboutput/_SUCCESS

This has a performance regression, discussed in MAPREDUCE-4815: if a job generates many files to commit, the commitJob call at the end of the job can take minutes, because the commit is single-threaded and waits until all tasks have completed before commencing.

Algorithm version 2 changes the behavior of commitTask, recoverTask, and commitJob:
1. commitTask renames all files in $joboutput/_temporary/$appAttemptID/_temporary/$taskAttemptID/ directly to $joboutput/
2. recoverTask doesn't actually need to do anything, but for the version-1-to-version-2 upgrade case it checks whether there are any files in $joboutput/_temporary/($appAttemptID - 1)/$taskID/ and renames them to $joboutput/
3. commitJob simply deletes $joboutput/_temporary and writes $joboutput/_SUCCESS

This algorithm reduces the output commit time for large jobs by having tasks commit directly to the final output directory as they complete, leaving commitJob very little to do.

To summarize: version 2 removes a rename step. In the old version, commitJob renamed a large amount of output serially in a single thread, which was itself time-consuming; in the new version each task's files are moved as the task finishes, so the job-level commit no longer has to do the bulk renaming, which greatly improves speed.
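The two algorithms above can be contrasted with a small, hypothetical simulation that uses local directories in place of HDFS paths (directory names mirror the $joboutput/$appAttemptID/$taskAttemptID placeholders from the documentation; this is a sketch of the commit flow, not Hadoop's actual implementation):

```python
# Simulate FileOutputCommitter algorithm v1 vs v2 commit flow on a
# local filesystem. In v1, commitJob does one serial merge of every
# task directory; in v2, commitTask moves files straight to the final
# output directory and commitJob has almost nothing left to do.
import shutil
import tempfile
from pathlib import Path

def commit_task(joboutput: Path, attempt: str, task: str, version: int):
    src = joboutput / "_temporary" / attempt / "_temporary" / task
    if version == 1:
        # v1: rename the task attempt dir into the job attempt dir;
        # files stay under _temporary until commitJob merges them.
        src.rename(joboutput / "_temporary" / attempt / task)
    else:
        # v2: move each task output file directly into $joboutput/.
        for f in src.iterdir():
            f.rename(joboutput / f.name)
        shutil.rmtree(src)

def commit_job(joboutput: Path, attempt: str, version: int):
    if version == 1:
        # v1: single-threaded merge of every committed task dir -> slow
        # when there are many tasks/files.
        for task_dir in (joboutput / "_temporary" / attempt).iterdir():
            for f in task_dir.iterdir():
                f.rename(joboutput / f.name)
    # Both versions: delete _temporary and write the success marker.
    shutil.rmtree(joboutput / "_temporary")
    (joboutput / "_SUCCESS").touch()

def run(version: int) -> set:
    out = Path(tempfile.mkdtemp())
    for task in ("task_0", "task_1"):
        d = out / "_temporary" / "attempt_0" / "_temporary" / task
        d.mkdir(parents=True)
        (d / f"part-{task}").touch()
        commit_task(out, "attempt_0", task, version)
    commit_job(out, "attempt_0", version)
    return {p.name for p in out.iterdir()}

# Both algorithms produce the same final output layout.
print(run(1) == run(2))  # True
```

The final directory contents are identical; only where the renames happen (job-level vs task-level) differs, which is exactly the source of the speedup.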