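After Hadoop is installed, the default rack-awareness implementation places every node in a single rack, the default rack. No matter which data center, floor, or machine room a node physically sits in, all nodes of the cluster are treated as being on that same rack. To reflect the real topology, we have to customize rack awareness ourselves.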
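A minimal sketch of what a custom mapping could look like, written as plain Java so it runs without Hadoop on the classpath. The class name RackMappingSketch, the IP addresses, and the rack paths are all hypothetical. In a real cluster the same resolve logic would live in a class implementing org.apache.hadoop.net.DNSToSwitchMapping (registered through net.topology.node.switch.mapping.impl in Hadoop 2 and later), or in a script referenced by net.topology.script.file.name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a DNSToSwitchMapping implementation.
class RackMappingSketch {
    // Rack name Hadoop falls back to when no topology is configured.
    static final String DEFAULT_RACK = "/default-rack";

    // Hypothetical topology table; a real deployment would load it from a file.
    private final Map<String, String> rackByHost = new HashMap<>();

    RackMappingSketch() {
        rackByHost.put("192.168.1.101", "/dc1/rack1");
        rackByHost.put("192.168.1.102", "/dc1/rack1");
        rackByHost.put("192.168.2.101", "/dc1/rack2");
    }

    // Same shape as DNSToSwitchMapping.resolve: one rack path per input
    // name, returned in the same order as the input.
    List<String> resolve(List<String> names) {
        List<String> racks = new ArrayList<>(names.size());
        for (String name : names) {
            // Unknown hosts fall back to the default rack.
            racks.add(rackByHost.getOrDefault(name, DEFAULT_RACK));
        }
        return racks;
    }

    public static void main(String[] args) {
        RackMappingSketch mapping = new RackMappingSketch();
        List<String> hosts = Arrays.asList("192.168.1.101", "192.168.2.101", "10.0.0.9");
        // prints [/dc1/rack1, /dc1/rack2, /default-rack]
        System.out.println(mapping.resolve(hosts));
    }
}
```

The key point is that every host gets some rack path back: a host missing from the table ends up on the default rack rather than causing an error.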




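Submitting a job produces job.xml (the job configuration), the split file, the split metadata file, and the jar file. All of these are uploaded to HDFS, because data in HDFS is shared across the cluster and every node can read it. After the upload, the resource manager works out the resources the job needs and returns the list of nodes that will run it, and the job is then run on those nodes. If there are 10 nodes, each of the 10 has to download the uploaded job files from HDFS to its local disk. The Java program (the jar file), for example, cannot run unless it has been downloaded locally, and the same goes for the classpath entries and so on. If the job files were not uploaded to HDFS, the nodes running the job would have no way to download them.

When a job is run through the local runner, these files are placed under a temporary path instead. Later versions added a further local temporary directory, which is also on the local machine. Settings made on the conf object are, in effect, settings for the job.

If there are several reducers, each map partitions its output before writing it. The number of partitions on each map equals the number of reducers, and the partition numbers correspond to the reducers. So when a map writes its key-value pairs locally, it not only sorts and groups the output by key but also places each record into its partition. The shuffle then transfers each partition to its reducer.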
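The partitioning step can be sketched as follows. The formula mirrors the one used by Hadoop's default HashPartitioner (mask off the sign bit of the key's hash, then take it modulo the number of reducers). The class name HashPartitionSketch and the sample keys are hypothetical, and String keys stand in for Hadoop's Text, whose hash codes differ, so the concrete partition numbers here will not match a real job.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of how a map task assigns records to partitions.
class HashPartitionSketch {
    // Clear the sign bit so the result is never negative, then take the
    // remainder so it falls in the range [0, numReduceTasks).
    static int getPartition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int numReduceTasks = 3; // one partition per reducer

        // One bucket per partition, indexed by partition number.
        List<List<String>> partitions = new ArrayList<>();
        for (int i = 0; i < numReduceTasks; i++) {
            partitions.add(new ArrayList<>());
        }

        // Sample map output keys; repeated keys must land in the same partition.
        String[] keys = {"1949", "1950", "1951", "1950"};
        for (String key : keys) {
            partitions.get(getPartition(key, numReduceTasks)).add(key);
        }

        for (int i = 0; i < numReduceTasks; i++) {
            System.out.println("partition " + i + " -> " + partitions.get(i));
        }
    }
}
```

Because the partition depends only on the key and the reducer count, every map sends a given key to the same reducer, which is what lets the reducer see all values for that key.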



Combiner Functions

Many MapReduce jobs are limited by the bandwidth available on the cluster, so it pays to
minimize the data transferred between map and reduce tasks. Hadoop allows the user to
specify a combiner function to be run on the map output, and the combiner function’s
output forms the input to the reduce function. Because the combiner function is an
optimization, Hadoop does not provide a guarantee of how many times it will call it for a
particular map output record, if at all. In other words, calling the combiner function zero,
one, or many times should produce the same output from the reducer.
The contract for the combiner function constrains the type of function that may be used.
This is best illustrated with an example. Suppose that for the maximum temperature
example, readings for the year 1950 were processed by two maps (because they were in
different splits). Imagine the first map produced the output:
    (1950, 0)
    (1950, 20)
    (1950, 10)
and the second produced:
    (1950, 25)
    (1950, 15)
The reduce function would be called with a list of all the values:
     (1950, [0, 20, 10, 25, 15])
with output:
     (1950, 25)
since 25 is the maximum value in the list. We could use a combiner function that, just like
the reduce function, finds the maximum temperature for each map output. The reduce
function would then be called with:
    (1950, [20, 25])
and would produce the same output as before. More succinctly, we may express the
function calls on the temperature values in this case as follows:
      max(0, 20, 10, 25, 15) = max(max(0, 20, 10), max(25, 15)) = max(20, 25) = 25
Not all functions possess this property.[20] For example, if we were calculating mean
temperatures, we couldn’t use the mean as our combiner function, because:
     mean(0, 20, 10, 25, 15) = 14
     mean(mean(0, 20, 10), mean(25, 15)) = mean(10, 20) = 15
The combiner function doesn’t replace the reduce function. (How could it? The reduce
function is still needed to process records with the same key from different maps.) But it
can help cut down the amount of data shuffled between the mappers and the reducers, and
for this reason alone it is always worth considering whether you can use a combiner
function in your MapReduce job.
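The max and mean cases above can be checked with a few lines of plain Java. This is only an illustration of the arithmetic, not Hadoop code, and the class and method names are hypothetical. The last method shows one common workaround for means that is not part of the text above: have each map emit a partial (sum, count) pair and let the reducer divide at the end.

```java
import java.util.Arrays;

// Hypothetical check of which functions are safe to use as a combiner.
class CombinerPropertySketch {
    static int max(int... values) {
        return Arrays.stream(values).max().getAsInt();
    }

    static double mean(double... values) {
        return Arrays.stream(values).average().getAsDouble();
    }

    // Each row is a partial {sum, count} pair produced on one map.
    static double meanFromPartials(int[][] sumAndCount) {
        long sum = 0;
        long count = 0;
        for (int[] partial : sumAndCount) {
            sum += partial[0];
            count += partial[1];
        }
        return (double) sum / count;
    }

    public static void main(String[] args) {
        // max: combining per map first gives the same answer.
        System.out.println(max(0, 20, 10, 25, 15));            // prints 25
        System.out.println(max(max(0, 20, 10), max(25, 15)));  // prints 25

        // mean: a mean of per-map means differs from the true mean.
        System.out.println(mean(0, 20, 10, 25, 15));               // prints 14.0
        System.out.println(mean(mean(0, 20, 10), mean(25, 15)));   // prints 15.0

        // Partial sums and counts recover the true mean.
        int[][] partials = {{0 + 20 + 10, 3}, {25 + 15, 2}};
        System.out.println(meanFromPartials(partials));            // prints 14.0
    }
}
```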

Specifying a combiner function

Going back to the Java MapReduce program, the combiner function is defined using the
Reducer class, and for this application, it is the same implementation as the reduce
function in MaxTemperatureReducer. The only change we need to make is to set the
combiner class on the Job (see Example 2-6).
Example 2-6. Application to find the maximum temperature, using a combiner function for

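package myhadoop;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// MaxTemperatureMapper and MaxTemperatureReducer are assumed to be in this package.
public class MaxTemperatureWithCombiner {

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: MaxTemperatureWithCombiner <input path> "
                    + "<output path>");
            System.exit(-1);
        }

        // This constructor is deprecated in Hadoop 2 and later,
        // where Job.getInstance() is the replacement.
        Job job = new Job();
        job.setJarByClass(MaxTemperatureWithCombiner.class);
        job.setJobName("Max temperature");

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.setMapperClass(MaxTemperatureMapper.class);
        // The combiner reuses the reducer: max is safe to apply to partial map output.
        job.setCombinerClass(MaxTemperatureReducer.class);
        job.setReducerClass(MaxTemperatureReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Exit with 0 on success, 1 on failure.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}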