Hadoop MapReduce ReduceTask Execution (Part 4): Sorting and Merging
The previous article covered how reduce downloads the map outputs to the local node, a process that already includes file merging. This article covers the next phase on the reduce side: sorting. The unit of merging on the reduce side is the Segment, and the merge itself performs the sort. Readers familiar with Oracle's sort-merge operations will recognize the idea: given two sorted arrays, repeatedly take the smaller of their current heads, and the output is one large sorted array. That is the basic principle of a merge. In Hadoop, a Segment represents one sorted run of key/value pairs. Reduce places multiple Segments into a priority queue, MergeQueue; after every read the queue is re-adjusted so that the Segment with the smallest head is always read first, which makes the overall reduce input sorted.
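To make the MergeQueue principle concrete, here is a minimal, self-contained sketch of a k-way merge over sorted runs using java.util.PriorityQueue. It only illustrates the idea: Hadoop's real MergeQueue works on raw serialized keys via a RawComparator, and the class and method names below are invented for this example.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

public class KWayMergeSketch {

  // A sorted run with a one-element lookahead, ordered by its current head.
  static class Run implements Comparable<Run> {
    final Iterator<Integer> it;
    int head;

    Run(Iterator<Integer> it) {
      this.it = it;
      this.head = it.next(); // caller guarantees the run is non-empty
    }

    // Advance to the next element; return false when the run is exhausted.
    boolean advance() {
      if (!it.hasNext()) return false;
      head = it.next();
      return true;
    }

    public int compareTo(Run o) {
      return Integer.compare(head, o.head);
    }
  }

  static List<Integer> merge(List<List<Integer>> sortedRuns) {
    PriorityQueue<Integer> dummy; // (unused) just to stress: natural ordering
    PriorityQueue<Run> queue = new PriorityQueue<Run>();
    for (List<Integer> run : sortedRuns) {
      if (!run.isEmpty()) queue.add(new Run(run.iterator()));
    }
    List<Integer> out = new ArrayList<Integer>();
    while (!queue.isEmpty()) {
      Run top = queue.poll();            // run whose head is smallest
      out.add(top.head);                 // emit the smallest key
      if (top.advance()) queue.add(top); // re-insert to re-adjust the queue
    }
    return out;
  }

  public static void main(String[] args) {
    List<List<Integer>> runs = new ArrayList<List<Integer>>();
    runs.add(Arrays.asList(1, 4, 9));
    runs.add(Arrays.asList(2, 3, 10));
    runs.add(Arrays.asList(5, 6, 7));
    System.out.println(merge(runs)); // [1, 2, 3, 4, 5, 6, 7, 9, 10]
  }
}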
Segments live in two places: memory and disk. The merge prefers the in-memory Segments, so as to free memory for the reduce phase. Not everything in memory is flushed to disk, however; this data still has to be handed to the reduce function, so the part retained in memory is fed to reduce directly as input. The size of this reduce input buffer is controlled by the parameter mapred.job.reduce.input.buffer.percent, which defaults to 0; if the reduce side is I/O-bound, raising this value reduces disk I/O (a tuning sketch follows the code below). The code for the sort phase:
  /**
   * Create a RawKeyValueIterator from copied map outputs. All copying
   * threads have exited, so all of the map outputs are available either in
   * memory or on disk. We also know that no merges are in progress, so
   * synchronization is more lax, here.
   *
   * The iterator returned must satisfy the following constraints:
   *   1. Fewer than io.sort.factor files may be sources
   *   2. No more than maxInMemReduce bytes of map outputs may be resident
   *      in memory when the reduce begins
   *
   * If we must perform an intermediate merge to satisfy (1), then we can
   * keep the excluded outputs from (2) in memory and include them in the
   * first merge pass. If not, then said outputs must be written to disk
   * first.
   */
  @SuppressWarnings("unchecked")
  private RawKeyValueIterator createKVIterator(
      JobConf job, FileSystem fs, Reporter reporter) throws IOException {

    // merge config params
    Class<K> keyClass = (Class<K>)job.getMapOutputKeyClass();
    Class<V> valueClass = (Class<V>)job.getMapOutputValueClass();
    boolean keepInputs = job.getKeepFailedTaskFiles();
    final Path tmpDir = new Path(getTaskID().toString());
    final RawComparator<K> comparator =
      (RawComparator<K>)job.getOutputKeyComparator();

    // segments required to vacate memory
    List<Segment<K,V>> memDiskSegments = new ArrayList<Segment<K,V>>();
    long inMemToDiskBytes = 0;
    if (mapOutputsFilesInMemory.size() > 0) {
      TaskID mapId = mapOutputsFilesInMemory.get(0).mapId;
      // collect the in-memory segments
      inMemToDiskBytes = createInMemorySegments(memDiskSegments,
          maxInMemReduce);
      final int numMemDiskSegments = memDiskSegments.size();
      // check whether the in-memory segments need to be flushed to disk
      if (numMemDiskSegments > 0 &&
            ioSortFactor > mapOutputFilesOnDisk.size()) {
        // merge the in-memory segments and spill the result to disk
        final Path outputPath =
          mapOutputFile.getInputFileForWrite(mapId, inMemToDiskBytes);
        final RawKeyValueIterator rIter = Merger.merge(job, fs,
            keyClass, valueClass, memDiskSegments, numMemDiskSegments,
            tmpDir, comparator, reporter, spilledRecordsCounter, null);
        final Writer writer = new Writer(job, fs, outputPath,
            keyClass, valueClass, codec, null);
        try {
          Merger.writeFile(rIter, writer, reporter, job);
          addToMapOutputFilesOnDisk(fs.getFileStatus(outputPath));
        } catch (Exception e) {
          if (null != outputPath) {
            fs.delete(outputPath, true);
          }
          throw new IOException("Final merge failed", e);
        } finally {
          if (null != writer) {
            writer.close();
          }
        }
        LOG.info("Merged " + numMemDiskSegments + " segments, " +
                 inMemToDiskBytes + " bytes to disk to satisfy " +
                 "reduce memory limit");
        inMemToDiskBytes = 0;
        memDiskSegments.clear();
      } else if (inMemToDiskBytes != 0) {
        LOG.info("Keeping " + numMemDiskSegments + " segments, " +
                 inMemToDiskBytes + " bytes in memory for " +
                 "intermediate, on-disk merge");
      }
    }

    // handle the on-disk segments: put every on-disk map output into the
    // diskSegments list; they are fed into the merge queue when the merge runs
    List<Segment<K,V>> diskSegments = new ArrayList<Segment<K,V>>();
    long onDiskBytes = inMemToDiskBytes;
    Path[] onDisk = getMapFiles(fs, false);
    for (Path file : onDisk) {
      onDiskBytes += fs.getFileStatus(file).getLen();
      diskSegments.add(new Segment<K, V>(job, fs, file, codec, keepInputs));
    }
    LOG.info("Merging " + onDisk.length + " files, " +
             onDiskBytes + " bytes from disk");
    Collections.sort(diskSegments, new Comparator<Segment<K,V>>() {
      public int compare(Segment<K, V> o1, Segment<K, V> o2) {
        if (o1.getLength() == o2.getLength()) {
          return 0;
        }
        return o1.getLength() < o2.getLength() ? -1 : 1;
      }
    });

    // merge the in-memory segments together with the on-disk segments
    List<Segment<K,V>> finalSegments = new ArrayList<Segment<K,V>>();
    long inMemBytes = createInMemorySegments(finalSegments, 0);
    LOG.info("Merging " + finalSegments.size() + " segments, " +
             inMemBytes + " bytes from memory into reduce");
    if (0 != onDiskBytes) {
      final int numInMemSegments = memDiskSegments.size();
      // merge the retained in-memory segments into the disk merge
      diskSegments.addAll(0, memDiskSegments);
      memDiskSegments.clear();
      RawKeyValueIterator diskMerge = Merger.merge(
          job, fs, keyClass, valueClass, codec, diskSegments,
          ioSortFactor, numInMemSegments, tmpDir, comparator,
          reporter, false, spilledRecordsCounter, null);
      diskSegments.clear();
      // if no segments remain in memory, return the disk merge iterator directly
      if (0 == finalSegments.size()) {
        return diskMerge;
      }
      finalSegments.add(new Segment<K,V>(
            new RawKVIteratorReader(diskMerge, onDiskBytes), true));
    }
    return Merger.merge(job, fs, keyClass, valueClass,
                 finalSegments, finalSegments.size(), tmpDir,
                 comparator, reporter, spilledRecordsCounter, null);
  }
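As noted above, mapred.job.reduce.input.buffer.percent (the MRv1 property name) controls what fraction of the reduce task's maximum heap may keep map output in memory as direct reduce input instead of merging it to disk first. A minimal tuning sketch on an old-style JobConf; the 0.7f value and class name are illustrative assumptions, not values taken from the source:

import org.apache.hadoop.mapred.JobConf;

public class ReduceInputBufferTuning {
  public static void main(String[] args) {
    JobConf job = new JobConf();
    // Fraction of the reduce task's maximum heap that may retain map
    // outputs in memory while the reduce runs. The default 0.0f forces
    // everything to be merged to disk before the reduce phase begins.
    // 0.7f is an illustrative value chosen for this sketch.
    job.setFloat("mapred.job.reduce.input.buffer.percent", 0.7f);
    // ... set input/output formats, mapper, reducer, then submit the job
  }
}

Raising this value trades reduce-side heap for fewer disk reads, so it mainly pays off when the reducer itself needs little memory and the job is I/O-bound.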