Hadoop Source Code Analysis 34: The Child's Map Task
Submitting the job:
hadoop
produces 2 map tasks and 2 reduce tasks.
Executing maps[0]:
args=[127.0.0.1, 40996, attempt_201404282305_0001_m_000000_0,/opt/hadoop-1.0.0/logs/userlogs/job_201404282305_0001/attempt_201404282305_0001_m_000000_0,-518526792]
cwd=/tmp/hadoop-admin/mapred/local/taskTracker/admin/jobcache/job_201404282305_0001/attempt_201404282305_0001_m_000000_0/work
jobTokenFile=/tmp/hadoop-admin/mapred/local/taskTracker/admin/jobcache/job_201404282305_0001/jobToken
jvmContext=JvmContext(jvmId=jvm_201404282305_0001_m_,pid=29184)
myTask=JvmTask{shouldDie=false,t=MapTask{taskId=attempt_201404282305_0001_m_000000_0,jobFile="/tmp/hadoop-admin/mapred/local/taskTracker/admin/jobcache/job_201404282305_0001/job.xml"
job=JobConf{Configuration:core-default.xml, core-site.xml, mapred-default.xml,mapred-site.xml,/tmp/hadoop-admin/mapred/local/taskTracker/admin/jobcache/job_201404282305_0001/job.xml}
The Task's jobContext=JobContext{job=Configuration:core-default.xml, core-site.xml, mapred-default.xml,mapred-site.xml, hdfs-default.xml, hdfs-site.xml,/tmp/hadoop-admin/mapred/local/taskTracker/admin/jobcache/job_201404282305_0001/job.xml,id=job_201404282305_0001,}
The Task's taskContext=TaskAttemptContext(job=Configuration:core-default.xml, core-site.xml, mapred-default.xml,mapred-site.xml, hdfs-default.xml, hdfs-site.xml,/tmp/hadoop-admin/mapred/local/taskTracker/admin/jobcache/job_201404282305_0001/job.xml,id=job_201404282305_0001,taskId=attempt_201404282305_0001_m_000000_0, reporter=org.apache.hadoop.mapred.Task$TaskReporter@67323b17);
The Task's taskStatus=RUNNING
The Task's outputFormat=org.apache.hadoop.mapreduce.lib.output.TextOutputFormat@a27b2d9
The Task's committer=FileOutputCommitter{outputFileSystem=DFS[DFSClient[clientName=DFSClient_attempt_201404282305_0001_m_000000_0,ugi=admin]],outputPath=/user/admin/out/128,workPath=hdfs://server1:9000/user/admin/out/128/_temporary/_attempt_201404282305_0001_m_000000_0}
Variables inside runNewMapper(job, splitMetaInfo, umbilical, reporter):
taskContext=TaskAttemptContext{conf=JobConf
mapper=org.apache.hadoop.examples.WordCount$TokenizerMapper@3e4a762a;
inputFormat=org.apache.hadoop.mapreduce.lib.input.TextInputFormat@21833d8a;
split=hdfs://server1:9000/user/admin/in/yellow2.txt:0+67108864;
input=NewTrackingRecordReader{
inputSplit=hdfs://server1:9000/user/admin/in/yellow2.txt:0+67108864,
job=JobConf{Configuration:core-default.xml, core-site.xml, mapred-default.xml,mapred-site.xml, hdfs-default.xml, hdfs-site.xml,/tmp/hadoop-admin/mapred/local/taskTracker/admin/jobcache/job_201404282305_0001/job.xml}
real=org.apache.hadoop.mapreduce.lib.input.LineRecordReader@5afbee67
};
output = NewOutputCollector{
collector = new MapOutputBuffer{...},
partitions = 2,
partitioner = org.apache.hadoop.mapreduce.lib.partition.HashPartitioner@5c66b7ea,
};
Fields of output's member collector (a MapTask.MapOutputBuffer):
job=JobConf{Configuration:core-default.xml, core-site.xml, mapred-default.xml,mapred-site.xml, hdfs-default.xml, hdfs-site.xml,/tmp/hadoop-admin/mapred/local/taskTracker/admin/jobcache/job_201404282305_0001/job.xml},
localFs=LocalFileSystem@fdcb254,
partitions=2,
rfs=rawLocalFileSystem,
PARTITION = 0; // partition offset in acct
KEYSTART = 1;
VALSTART = 2;
ACCTSIZE = 3;
RECSIZE = (ACCTSIZE + 1) * 4 = 16;
spillper = 0.8,
recper = 0.05,
sortmb = 100,
sorter = org.apache.hadoop.util.QuickSort@aa9502d,
maxMemUsage = sortmb << 20;
recordCapacity = (int)(maxMemUsage * recper);
recordCapacity -= recordCapacity % RECSIZE;
kvbuffer = new byte[maxMemUsage - recordCapacity]; // byte[99614720]
bufvoid = kvbuffer.length;
recordCapacity /= RECSIZE; // 327680
kvoffsets = new int[recordCapacity];
kvindices = new int[recordCapacity * ACCTSIZE];
softBufferLimit = (int)(kvbuffer.length * spillper); // 79691776
softRecordLimit = (int)(kvoffsets.length * spillper); // 262144
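The sizing arithmetic above can be checked standalone. A minimal sketch reproducing MapOutputBuffer's buffer math with the values from this job (sortmb=100, recper=0.05, spillper=0.8); the class name is ours, the field names mirror the dump:

```java
public class BufferSizing {
    public static void main(String[] args) {
        final int RECSIZE = 16;       // (ACCTSIZE + 1) * 4
        final float spillper = 0.8f;  // io.sort.spill.percent
        final float recper = 0.05f;   // io.sort.record.percent
        final int sortmb = 100;       // io.sort.mb

        int maxMemUsage = sortmb << 20;              // 104857600 bytes
        int recordCapacity = (int) (maxMemUsage * recper);
        recordCapacity -= recordCapacity % RECSIZE;  // align to record size
        int kvbufferLen = maxMemUsage - recordCapacity;
        recordCapacity /= RECSIZE;                   // number of records

        System.out.println(kvbufferLen);                       // 99614720
        System.out.println(recordCapacity);                    // 327680
        System.out.println((int) (kvbufferLen * spillper));    // 79691776
        System.out.println((int) (recordCapacity * spillper)); // 262144
    }
}
```

The printed values match the trace: roughly 95 MB of raw byte buffer, 327680 record slots, and spilling triggered at 80% of either capacity.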
comparator=job.getOutputKeyComparator();
keyClass=org.apache.hadoop.io.Text
valClass=org.apache.hadoop.io.IntWritable
combinerRunner=CombinerRunner.create(job, getTaskID(),
//NewCombinerRunner{reducerClass=org.apache.hadoop.examples.WordCount$IntSumReducer,taskId=attempt_201404282305_0001_m_000000_0,keyClass=org.apache.hadoop.io.Text,valueClass=org.apache.hadoop.io.IntWritable,comparator=org.apache.hadoop.io.Text$Comparator@42b5e6a1,committer=null}
combineCollector = new CombineOutputCollector(combineOutputCounter, reporter, conf);
minSpillsForCombine=job.getInt("min.num.spills.for.combine",3);//3
spillThread = new SpillThread();
spillLock = new ReentrantLock();
spillDone = spillLock.newCondition();
spillReady = spillLock.newCondition();
mapperContext=Mapper$Context{taskId=attempt_201404282305_0001_m_000000_0,status="",split=hdfs://server1:9000/user/admin/in/yellow2.txt:0+67108864,jobId=job_201404282305_0001,committer=FileOutputCommitter{outputFileSystem=DFS[DFSClient[clientName=DFSClient_attempt_201404282305_0001_m_000000_0,ugi=admin]],outputPath=/user/admin/out/128,workPath=hdfs://server1:9000/user/admin/out/128/_temporary/_attempt_201404282305_0001_m_000000_0},output=MapTask$NewOutputCollector{...same as output above...},reader=MapTask$NewTrackingRecordReader{...same as input above...},
}
input.initialize(split, mapperContext)
// sets up the LineRecordReader's fields and opens the file
start = split.getStart();
end = start + split.getLength();
pos = start;
FSDataInputStream fileIn = fs.open(split.getPath());
in = new LineReader(fileIn, job);
bufferSize = 65536;
buffer = new byte[this.bufferSize];
mapper.run(mapperContext)
// runs the map code
// nextKeyValue() reads one line: before the read pos=0; afterwards key=0, value="Yellow", pos=7.
......
// before pos=20; afterwards key=20, value="Look at the stars; look how they shine for you", pos=67.
......
// before pos=68; afterwards key=68, value="And everything you do", pos=90.
......
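As the trace shows, the key handed to map() is the line's starting byte offset and pos advances past each line terminator. A simplified sketch of that bookkeeping (our own class, assuming ASCII input and '\n' terminators; the real LineRecordReader additionally handles split boundaries, CRLF, and compressed streams):

```java
public class OffsetLineReader {
    public static void main(String[] args) {
        // "Yellow" is 6 bytes + '\n' = 7, matching the trace: key=0, then pos=7.
        String data = "Yellow\nLook at the stars\n";
        int pos = 0; // byte offset of the next line, like LineRecordReader's pos
        while (pos < data.length()) {
            int nl = data.indexOf('\n', pos);
            if (nl < 0) nl = data.length();
            int key = pos;                           // the key handed to map()
            String value = data.substring(pos, nl);  // the line, without '\n'
            pos = nl + 1;                            // skip the line terminator
            System.out.println(key + "\t" + value);
        }
    }
}
```

Running this prints `0	Yellow` followed by `7	Look at the stars`, reproducing the offset arithmetic seen in the trace.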
The map method:
public void map(Object key, Text value, Context context)
Mapper$Context.write(word, one)
--> TaskInputOutputContext.write(key, value)
--> NewOutputCollector.write(key, value)
--> MapOutputBuffer.collect(key, value, partitioner.getPartition(key, value, partitions))
HashPartitioner.getPartition() is defined as:
return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
That is, the key's hashCode modulo the number of reduce tasks determines which reduce a record belongs to.
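A runnable sketch of that computation (our own demo class; it uses String.hashCode for illustration, whereas Hadoop's Text hashes the UTF-8 bytes, so actual partition numbers for Text keys can differ):

```java
public class PartitionDemo {
    // Same formula as HashPartitioner: mask off the sign bit, then take the modulus.
    static int getPartition(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int partitions = 2; // this job runs 2 reduce tasks
        // "a".hashCode() == 97, so 97 % 2 sends it to reduce 1.
        System.out.println(getPartition("a", partitions));  // 1
        System.out.println(getPartition("stars", partitions));
    }
}
```

Masking with Integer.MAX_VALUE clears the sign bit, guaranteeing a non-negative result even for negative hash codes (Java's % can return negative values otherwise).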
The collect() flow is:
keySerializer.serialize(key);   // write the key into the BlockingBuffer's kvbuffer
valSerializer.serialize(value); // write the value into the BlockingBuffer's kvbuffer
int ind = kvindex * ACCTSIZE;
kvoffsets[kvindex] = ind; // first-level index: the record's position in kvindices
kvindices[ind + PARTITION] = partition;
kvindices[ind + KEYSTART] = keystart;
kvindices[ind + VALSTART] = valstart;
kvindex = kvnext;
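A toy version of that two-level index (the real MapOutputBuffer treats kvoffsets, kvindices, and kvbuffer as circular buffers and serializes actual key/value bytes; here we only exercise the bookkeeping, with hypothetical byte positions):

```java
public class AccountingDemo {
    static final int PARTITION = 0, KEYSTART = 1, VALSTART = 2, ACCTSIZE = 3;

    public static void main(String[] args) {
        int[] kvoffsets = new int[4];                // one slot per record
        int[] kvindices = new int[4 * ACCTSIZE];     // one triple per record
        int kvindex = 0;

        // Record 0: partition 1, key bytes start at 0, value bytes at 6.
        kvindex = collect(kvoffsets, kvindices, kvindex, 1, 0, 6);
        // Record 1: partition 0, key bytes start at 10, value bytes at 15.
        kvindex = collect(kvoffsets, kvindices, kvindex, 0, 10, 15);

        // All lookups go through kvoffsets to reach the per-record triple.
        System.out.println(kvindices[kvoffsets[1] + PARTITION]); // 0
        System.out.println(kvindices[kvoffsets[1] + KEYSTART]);  // 10
    }

    static int collect(int[] kvoffsets, int[] kvindices, int kvindex,
                       int partition, int keystart, int valstart) {
        int ind = kvindex * ACCTSIZE;
        kvoffsets[kvindex] = ind;             // first-level index into kvindices
        kvindices[ind + PARTITION] = partition;
        kvindices[ind + KEYSTART] = keystart;
        kvindices[ind + VALSTART] = valstart;
        return kvindex + 1;                   // kvnext (no wraparound in this sketch)
    }
}
```

The indirection is what makes the later sort cheap: the sorter permutes the 4-byte entries of kvoffsets instead of moving the serialized records themselves.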
The SpillThread sorts the in-memory data and writes it to local disk.
MapTask.MapOutputBuffer.sortAndSpill() processing:
size=2276473, partitions=2, numSpills=0,
filename=/tmp/hadoop-admin/mapred/local/taskTracker/admin/jobcache/job_201404282305_0001/attempt_201404282305_0001_m_000000_0/output/spill0.out,
endPosition=262144
Sort: sorter.sort(MapOutputBuffer.this, kvstart, endPosition, reporter)
Run the combiner: combinerRunner.combine(kvIter, combineCollector);
Write the resulting keys and values to spill0.out, spill1.out, spill2.out, ...:
keySerializer.serialize(key);
valueSerializer.serialize(value);
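sorter.sort orders records by partition first and only then by key (MapOutputBuffer's compare consults the partition stored in kvindices before comparing raw key bytes), so each spill file comes out grouped by reduce task and sorted within each group. A sketch of that ordering, with hypothetical {partition, key} int pairs standing in for accounting entries:

```java
import java.util.Arrays;
import java.util.Comparator;

public class SpillOrder {
    public static void main(String[] args) {
        // {partition, key} pairs standing in for the accounting triples.
        int[][] records = {{1, 5}, {0, 9}, {1, 2}, {0, 3}};
        Arrays.sort(records, Comparator
            .<int[]>comparingInt(r -> r[0])   // partition first...
            .thenComparingInt(r -> r[1]));    // ...then key within the partition
        for (int[] r : records)
            System.out.println(r[0] + ":" + r[1]); // 0:3 0:9 1:2 1:5
    }
}
```

This grouping is why the spill file can be written sequentially: all of reduce 0's records are emitted, then all of reduce 1's, with the index file recording where each partition's segment begins.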
mergeParts() merges the results.
After the merge, the combiner runs once more:
combinerRunner.combine(kvIter, combineCollector);
Finally, two files are produced:
file.out
file.out.index
The sortAndSpill / mergeParts process:
Map phase: the main thread writes all map output into the in-memory kvbuffer; the SpillThread sorts each full block (quick sort), runs the combiner, and writes it out as spill0.out, spill1.out, spill2.out, ... spill51.out, each file containing data for both reduce tasks.
output.close() phase: the main thread sorts (quick sort) and combines the final in-memory block and writes it to spill52.out, again containing data for both reduce tasks.
mergeParts phase: the main thread feeds spill0.out, spill1.out, ... spill52.out into a priority queue,
runs the combiner once more, and writes file.out, which contains data for both reduce tasks.
Note that quick sort and a min-heap are the key data structures and algorithms used here.
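The priority queue in mergeParts is the min-heap: each sorted spill contributes its current head record, the smallest head is popped to the output, and the heap is refilled from the same spill. A minimal sketch of that k-way merge, with sorted int arrays standing in for sorted spill files:

```java
import java.util.PriorityQueue;

public class SpillMerge {
    public static void main(String[] args) {
        int[][] spills = {          // sorted runs, like spill0.out ... spill2.out
            {1, 4, 9},
            {2, 3, 8},
            {5, 6, 7},
        };
        // Heap entries are {value, spillIndex, positionInSpill}, ordered by value.
        PriorityQueue<int[]> heap =
            new PriorityQueue<>((a, b) -> Integer.compare(a[0], b[0]));
        for (int s = 0; s < spills.length; s++)
            heap.add(new int[]{spills[s][0], s, 0});

        StringBuilder merged = new StringBuilder();
        while (!heap.isEmpty()) {
            int[] head = heap.poll();           // smallest head across all runs
            merged.append(head[0]).append(' ');
            int s = head[1], next = head[2] + 1;
            if (next < spills[s].length)        // refill from the same run
                heap.add(new int[]{spills[s][next], s, next});
        }
        System.out.println(merged.toString().trim()); // 1 2 3 4 5 6 7 8 9
    }
}
```

Each pop/refill is O(log k) for k spills, so merging 53 spill files touches each record only O(log 53) times regardless of total data size, which is why the merge scales to many spills.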