Hadoop Source Code Analysis 36: The Child's Reduce Task
Analyzing task reduce_1:
args = [127.0.0.1, 42767, attempt_201405060431_0003_r_000001_0, /opt/hadoop-1.0.0/logs/userlogs/job_201405060431_0003/attempt_201405060431_0003_r_000001_0, 1844231936]
myTask = JvmTask{shouldDie=false, t=ReduceTask{jobFile="/tmp/hadoop-admin/mapred/local/taskTracker/admin/jobcache/job_201405060431_0003/job.xml", taskId=attempt_201405020918_0003_r_000001_0, taskProgress=reduce, taskStatus=ReduceTaskStatus{UNASSIGNED}}}
job = JobConf{Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, hdfs-default.xml, hdfs-site.xml, /tmp/hadoop-admin/mapred/local/taskTracker/admin/jobcache/job_201405060431_0003/job.xml}
outputFormat= TextOutputFormat@51386c70
committer = FileOutputCommitter{outputFileSystem=DFSClient,
outputPath=/user/admin/out/123,
workPath=hdfs://server1:9000/user/admin/out/123/_temporary/_attempt_201405060431_0003_r_000001_0}
ReduceCopier.workDir=/tmp/hadoop-admin/mapred/local/taskTracker/admin/jobcache/job_201405020918_0003/_attempt_201405060431_0003_r_000001_0
ReduceCopier.jar=/tmp/hadoop-admin/mapred/local/taskTracker/admin/jobcache/job_201405020918_0003/jars/job.jar
ReduceCopier.jobCacheDir=/tmp/hadoop-admin/mapred/local/taskTracker/admin/jobcache/job_201405020918_0003/jars
ReduceCopier.numCopiers=5 (one per MapOutputCopier thread below)
ReduceCopier.maxInFlight=20
ReduceCopier.combinerRunner=CombinerRunner{job={Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, hdfs-default.xml, hdfs-site.xml, /tmp/hadoop-admin/mapred/local/taskTracker/admin/jobcache/job_201405020918_0003/job.xml}, committer=null, keyClass=org.apache.hadoop.io.Text, valueClass=org.apache.hadoop.io.IntWritable}
ReduceCopier.combineCollector=Task$CombineOutputCollector@72447399{progressBar=10000}
ReduceCopier.ioSortFactor=10
ReduceCopier.maxInMemOutputs=1000
ReduceCopier.maxInMemCopyPer=0.66
ReduceCopier.maxRedPer=0.0
ReduceCopier.ramManager=ReduceTask$ReduceCopier$ShuffleRamManager@46e9d255{maxSize=141937872, maxSingleShuffleLimit=35484468}
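For reference, the two ramManager limits follow from the heap size and the shuffle parameters. A minimal sketch of the arithmetic, assuming Hadoop 1.0's defaults (mapred.job.shuffle.input.buffer.percent = 0.70, single-segment cap = 25% of the buffer):

public class ShuffleRamSizing {
    public static void main(String[] args) {
        // mapred.job.shuffle.input.buffer.percent, default 0.70 in Hadoop 1.0
        float maxInMemCopyUse = 0.70f;
        // The in-memory shuffle buffer is a fraction of the JVM heap,
        // capped at Integer.MAX_VALUE.
        long maxSize = (long) (Math.min(Runtime.getRuntime().maxMemory(),
                                        Integer.MAX_VALUE) * maxInMemCopyUse);
        // A single map output may occupy at most 25% of that buffer;
        // in the trace above: 141937872 * 0.25 = 35484468.
        long maxSingleShuffleLimit = (long) (maxSize * 0.25f);
        System.out.println("maxSize=" + maxSize
            + ", maxSingleShuffleLimit=" + maxSingleShuffleLimit);
    }
}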
ReduceCopier threads copiers={
[Thread[MapOutputCopier attempt_201405020918_0003_r_000000_1.0,5,main],
Thread[MapOutputCopier attempt_201405020918_0003_r_000000_1.1,5,main],
Thread[MapOutputCopier attempt_201405020918_0003_r_000000_1.2,5,main],
Thread[MapOutputCopier attempt_201405020918_0003_r_000000_1.3,5,main],
Thread[MapOutputCopier attempt_201405020918_0003_r_000000_1.4,5,main]]}
ReduceCopier thread localFSMergerThread=Thread[Thread for merging on-disk files,5,main]
ReduceCopier thread inMemFSMergeThread=Thread[Thread for merging in memory files,5,main]
ReduceCopier thread getMapEventsThread=Thread[Thread for polling Map Completion Events,5,main]
RPC request from thread getMapEventsThread: getMapCompletionEvents(JobID=job_201405060431_0003, fromEventId=2, MAX_EVENTS_TO_FETCH=10000, TaskID=attempt_201405060431_0003_r_000001_0, jvmContext={jvmId=jvm_201405060431_0003_r_1844231936, pid=10727})
RPC response to thread getMapEventsThread:
[Task Id : attempt_201405060431_0003_m_000001_0,Status : SUCCEEDED,
Task Id : attempt_201405060431_0003_m_000000_0,Status : SUCCEEDED]
These are filed into mapLocations={
server2=[ReduceTask$ReduceCopier$MapOutputLocation{taskAttemptId=attempt_201405060431_0003_m_000000_0,taskId=task_201405060431_0003_m_000000,taskOutput=http://server2:50060/mapOutput?job=job_201405060431_0003&map=attempt_201405060431_0003_m_000000_0&reduce=0}],
server3=[ReduceTask$ReduceCopier$MapOutputLocation{taskAttemptId=attempt_201405060431_0003_m_000001_0,taskId=task_201405060431_0003_m_000001,taskOutput=http://server3:50060/mapOutput?job=job_201405060431_0003&map=attempt_201405060431_0003_m_000001_0&reduce=0}]
}
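Each SUCCEEDED event is turned into a MapOutputLocation and filed under the host that serves it. A simplified, self-contained sketch (plain Java stand-ins for ReduceCopier's inner classes; the URLs are the two from this trace):

import java.net.URL;
import java.util.*;

public class MapEventsFiling {
    public static void main(String[] args) throws Exception {
        // host -> map outputs that can be fetched from that host
        Map<String, List<URL>> mapLocations = new HashMap<>();
        // The two SUCCEEDED events from the RPC response above.
        String[][] events = {
            {"server2", "http://server2:50060/mapOutput?job=job_201405060431_0003&map=attempt_201405060431_0003_m_000000_0&reduce=0"},
            {"server3", "http://server3:50060/mapOutput?job=job_201405060431_0003&map=attempt_201405060431_0003_m_000001_0&reduce=0"},
        };
        for (String[] e : events) {
            mapLocations.computeIfAbsent(e[0], h -> new ArrayList<>())
                        .add(new URL(e[1]));
        }
        System.out.println(mapLocations);
    }
}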
Shuffle the hosts, randomizing the order in which they are visited:
hostList.addAll(mapLocations.keySet());
Collections.shuffle(hostList,this.random);
Then each location is added one by one to the scheduling containers:
uniqueHosts.add(host);
scheduledCopies.add(loc);
The main thread then wakes the MapOutputCopier threads: scheduledCopies.notifyAll()
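This is a plain wait()/notifyAll() producer-consumer handoff. A minimal sketch of the pattern, with scheduledCopies simplified to a deque of location strings:

import java.util.ArrayDeque;
import java.util.Deque;

public class ScheduledCopies {
    private final Deque<String> scheduledCopies = new ArrayDeque<>();

    /** Main thread: enqueue a map output location and wake the copiers. */
    public void schedule(String location) {
        synchronized (scheduledCopies) {
            scheduledCopies.add(location);
            scheduledCopies.notifyAll();   // wake idle MapOutputCopier threads
        }
    }

    /** Copier thread: block until a location is available, then take it. */
    public String take() throws InterruptedException {
        synchronized (scheduledCopies) {
            while (scheduledCopies.isEmpty()) {
                scheduledCopies.wait();
            }
            return scheduledCopies.removeFirst();
        }
    }
}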
HTTP request from a MapOutputCopier thread: http://server3:50060/mapOutput?job=job_201405060431_0003&map=attempt_201405060431_0003_m_000000_0&reduce=1
Since the output is small, it is written to memory (shuffleData -> mapOutput.data -> ReduceCopier.mapOutputsFilesInMemory).
HTTP request from a MapOutputCopier thread: http://server2:50060/mapOutput?job=job_201405060431_0003&map=attempt_201405060431_0003_m_000001_0&reduce=1
Since the output is small, it is written to memory (shuffleData -> mapOutput.data -> ReduceCopier.mapOutputsFilesInMemory).
Update copyResults={CopyResult{MapOutputLocation=http://server3:50060/mapOutput?job=job_201405060431_0003&map=attempt_201405060431_0003_m_000000_0&reduce=1}, CopyResult{MapOutputLocation=http://server2:50060/mapOutput?job=job_201405060431_0003&map=attempt_201405060431_0003_m_000001_0&reduce=1}}
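Whether a fetched segment stays in memory is decided by ramManager.canFitInMemory(). A sketch of that check, plugging in the maxSingleShuffleLimit value from this trace:

public class ShuffleDecision {
    static final long MAX_SINGLE_SHUFFLE_LIMIT = 35_484_468L; // from the trace

    /** A segment goes to RAM only if it fits under the single-shuffle cap. */
    static boolean canFitInMemory(long decompressedLength) {
        return decompressedLength < Integer.MAX_VALUE
            && decompressedLength < MAX_SINGLE_SHUFFLE_LIMIT;
    }

    public static void main(String[] args) {
        long decompressedLength = 4096; // a small map output, as in this job
        System.out.println(canFitInMemory(decompressedLength)
            ? "shuffleInMemory -> mapOutputsFilesInMemory"
            : "shuffleToDisk  -> mapOutputFilesOnDisk");
    }
}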
Merge phase: reduceCopier.createKVIterator(job, rfs, reporter)
The two in-memory segments are merged into: /tmp/hadoop-admin/mapred/local/taskTracker/admin/jobcache/job_201405060431_0003/attempt_201405060431_0003_r_000001_0/output/map_0.out
This also uses a priority queue (a min-heap).
The file is added to mapOutputFilesOnDisk = {/tmp/hadoop-admin/mapred/local/taskTracker/admin/jobcache/job_201405060431_0003/attempt_201405060431_0003_r_000001_0/output/map_0.out}
Since there is only one file on disk, no further merging is needed.
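For reference, the k-way merge behind createKVIterator works like this sketch: keep the current head of every sorted segment in a min-heap, pop the smallest, then advance that segment (keys simplified to ints):

import java.util.*;

public class KWayMerge {
    public static List<Integer> merge(List<List<Integer>> sortedSegments) {
        // heap entry: {currentKey, segmentIndex, offsetInSegment}
        PriorityQueue<int[]> heap =
            new PriorityQueue<>(Comparator.comparingInt((int[] e) -> e[0]));
        for (int s = 0; s < sortedSegments.size(); s++) {
            if (!sortedSegments.get(s).isEmpty()) {
                heap.add(new int[]{sortedSegments.get(s).get(0), s, 0});
            }
        }
        List<Integer> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] top = heap.poll();          // smallest current key
            out.add(top[0]);
            List<Integer> seg = sortedSegments.get(top[1]);
            if (top[2] + 1 < seg.size()) {    // advance that segment's cursor
                heap.add(new int[]{seg.get(top[2] + 1), top[1], top[2] + 1});
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(
            merge(List.of(List.of(1, 4, 7), List.of(2, 5), List.of(3, 6))));
        // -> [1, 2, 3, 4, 5, 6, 7]
    }
}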
Run the reduce: reducer.run(reducerContext)
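The run() loop of the new-API Reducer is essentially:

// from org.apache.hadoop.mapreduce.Reducer (paraphrased)
public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    while (context.nextKey()) {
        // one reduce() call per distinct key, with all of its values
        reduce(context.getCurrentKey(), context.getValues(), context);
    }
    cleanup(context);
}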
RPC request: commitPending(taskId=attempt_201405060431_0003_r_000001_0, taskStatus=COMMIT_PENDING, jvmContext)
RPC response: none
RPC request: canCommit(taskId=attempt_201405060431_0003_r_000001_0, jvmContext)
RPC response: true
Committing the task:
copy hdfs://server1:9000/user/admin/out/123/_temporary/_attempt_201405060431_0003_r_000001_0/part-r-00001
to: /user/admin/out/123/part-r-00001
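The commitPending/canCommit exchange above is the task-commit handshake. A simplified sketch; the Umbilical interface here is a stripped-down stand-in for TaskUmbilicalProtocol:

interface Umbilical {
    void commitPending(String taskId) throws java.io.IOException;
    boolean canCommit(String taskId) throws java.io.IOException;
}

public class CommitSketch {
    static void commit(Umbilical umbilical, String taskId) throws Exception {
        umbilical.commitPending(taskId);        // task status -> COMMIT_PENDING
        while (!umbilical.canCommit(taskId)) {  // poll until the JobTracker agrees
            Thread.sleep(1000);
        }
        // FileOutputCommitter.commitTask then promotes the task's work path
        // (_temporary/_attempt_.../part-r-00001) to the final output path.
    }
}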
Finally, the job-cleanup task:
args = [127.0.0.1, 42767, attempt_201405060431_0003_m_000002_0, /opt/hadoop-1.0.0/logs/userlogs/job_201405060431_0003/attempt_201405060431_0003_m_000002_0, 47579841]
JvmTask = {shouldDie=false, t=MapTask{jobCleanup=true, jobFile="/tmp/hadoop-admin/mapred/local/taskTracker/admin/jobcache/job_201405060431_0003/job.xml"}}
Delete: /user/admin/out/123/_temporary
Create: /user/admin/out/123/_SUCCESS
Delete: hdfs://server1:9000/tmp/hadoop-admin/mapred/staging/admin/.staging/job_201405060431_0003
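So the cleanup MapTask (jobCleanup=true) commits the job: it removes the _temporary directory, creates the _SUCCESS marker, and deletes the staging directory. A local-filesystem sketch of the first two steps, with the HDFS calls replaced by java.nio.file for brevity:

import java.io.IOException;
import java.nio.file.*;

public class JobCleanupSketch {
    public static void commitJob(Path outputPath) throws IOException {
        // 1. Remove the temporary working area under the output directory.
        Path tmp = outputPath.resolve("_temporary");
        if (Files.exists(tmp)) {
            try (var walk = Files.walk(tmp)) {
                walk.sorted(java.util.Comparator.reverseOrder()) // children first
                    .forEach(p -> p.toFile().delete());
            }
        }
        // 2. Drop the success marker (controlled in Hadoop by
        //    mapreduce.fileoutputcommitter.marksuccessfuljobs).
        Files.createFile(outputPath.resolve("_SUCCESS"));
    }
}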