Hadoop Day 10 - 03. Map Phase Analysis - combiner - partition


Map task optimization strategies

Map tasks: after the map runs on the various nodes, the output is sorted by key and merged by key value, so records with the same key are grouped together before being passed on to the reduce for processing.

Split size should match the block size

During MapReduce, try to keep the split size equal to the block size. First, it reduces copying across the network; second, it reduces the number of splits. It is also the precondition for data locality.
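As a concrete illustration, FileInputFormat computes the split size as max(minSize, min(maxSize, blockSize)), so leaving the two limits at their defaults makes each split exactly one block. A minimal sketch of a driver that pins the limits explicitly (the class name and the 128 MB figure are assumptions, not from the original):

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeExample {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance();
        // splitSize = max(minSize, min(maxSize, blockSize));
        // with these limits a 128 MB block yields exactly one 128 MB split
        FileInputFormat.setMinInputSplitSize(job, 1L);
        FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);
    }
}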

Data locality strategy

Run the map task on whichever node holds the data; that way the file is read locally, which is naturally faster. If the data is not local, it has to be copied over the network before it can be read and processed, and if the copied data is large, the network load is heavy. However, if the data is local but the local CPU, memory, and other resources are already occupied by other tasks, the task can only run on a node holding a replica of that data, preferring an idle node within the same rack, i.e. rack locality. Reduce tasks do not enjoy the advantage of data locality.

Rack locality

Review: rack awareness. Hadoop does not actually know which switch or router the data passes through; what it works with is a network topology. From IPs and hostnames, an algorithm derives a path string, and that path string maps onto a file-system-like structure, which is equivalent to a tree: the network topology tree. The more nodes a transfer has to traverse, the slower it is.

If there are no free resources locally, choose idle resources on the same rack (shorter network path), i.e. rack locality. Only in rare cases are resources on a different rack used, which incurs cross-rack network transfer.

After Hadoop is installed, the default rack-awareness implementation places all nodes on a single rack, the default rack. No matter which data center, floor, or machine room a node sits in, every node of the cluster is treated as being on the same rack, so we need to define rack awareness ourselves.
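A minimal sketch of a custom rack-awareness class (the class name and the rack layout are hypothetical): Hadoop resolves node names to rack paths through the DNSToSwitchMapping interface, and the implementation class is registered in core-site.xml, via the net.topology.node.switch.mapping.impl property in Hadoop 2.x.

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.net.DNSToSwitchMapping;

public class MyRackMapping implements DNSToSwitchMapping {

    // names are hostnames or IPs; return one rack path per name, in the same order
    @Override
    public List<String> resolve(List<String> names) {
        List<String> racks = new ArrayList<String>();
        for (String name : names) {
            if (name.contains("rack1")) {            // assumed naming convention
                racks.add("/datacenter1/rack1");
            } else {
                racks.add("/datacenter1/rack2");
            }
        }
        return racks;
    }

    @Override
    public void reloadCachedMappings() {
        // nothing cached in this sketch
    }

    @Override
    public void reloadCachedMappings(List<String> names) {
        // nothing cached in this sketch
    }
}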

Cross-rack execution

partition


If there is only one reduce, all map output goes to that single reduce. If there are multiple reduces, you have to decide which reduce each key goes to; when developing MapReduce jobs, you should keep the reduces balanced and avoid data skew. One approach is to scatter the data, and one implementation of scattering is hashing (see the partitioner sketch below). There is an API for setting the number of reduce tasks: setNumReduceTasks() (on JobConf in the old API, or on the Job in the new one).
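A hash-based partitioner sketch (the class name is illustrative; the logic mirrors what Hadoop's default HashPartitioner does): identical keys always hash to the same partition, so each reducer sees all the values for its keys.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class YearPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // mask the sign bit so the result is always in [0, numPartitions)
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

In the driver it would be registered with job.setPartitionerClass(YearPartitioner.class) together with job.setNumReduceTasks(4), so the number of partitions matches the number of reduces.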

Creating a job produces job.xml, the split file, the split metadata file, the jar file, and configuration files, all of which are uploaded to HDFS. They go to HDFS because data in HDFS is shared across the cluster, so every node can access it. After the upload, the resource manager calculates the resources the job needs and returns the list of nodes that will execute it, and the job then runs on those nodes. If there are 10 such nodes, all 10 must download the uploaded job files from HDFS to local disk, for example the Java program (the jar file): if the jar is not downloaded locally it cannot run, and the same goes for the classpath and so on. If the job files were not uploaded to HDFS, the executing nodes would have nothing to download.

Incidentally, when a job is run locally with the local runner, the files are placed under a temporary path, and later another local temp directory is used, also on the local machine. Configuring conf is, in effect, configuring the job.

If there are multiple reduces, each map partitions its output before writing it. The number of partitions on every map equals the number of reduces, and the partition numbers correspond one-to-one to the reduces. So when a map writes its key-value output locally, it not only sorts and groups the output by key, it also places it into the partitions. The shuffle then transfers the data to the reduces.
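A small sketch of the point that configuring conf is configuring the job (the property used here is just an example): the Configuration passed to Job.getInstance() becomes the job's configuration, which is what ends up serialized into job.xml.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ConfIsJobConf {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("mapreduce.job.reduces", "4");      // example property
        Job job = Job.getInstance(conf);
        // the job's configuration is a copy of conf, so the setting above
        // travels with the job and is written into job.xml on submission
        System.out.println(job.getConfiguration().get("mapreduce.job.reduces"));
    }
}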

If the number of partitions is set to 4 and nothing is ever written to the fourth partition, that is data skew. Normally partitioning is done by key; records with the same key, however they are hashed, will always land in the same partition, and they must, because the reduce has to aggregate them. The way to resolve data skew is to redefine the key and run another round of processing, as in the sketch below.
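One way to "redefine the key" is to salt it so that a single hot key is spread across several partitions; a second pass (or the reduce itself) then strips the salt and merges the partial results. A minimal sketch, with the class name, bucket count, and input layout all assumed:

import java.io.IOException;
import java.util.Random;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SaltedKeyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final int SALT_BUCKETS = 4;       // assumed to match the reduce count
    private final Random random = new Random();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String rawKey = value.toString().split("\t")[0];   // assumed tab-separated input
        // append a salt so the same hot key is scattered over SALT_BUCKETS partitions
        String saltedKey = rawKey + "#" + random.nextInt(SALT_BUCKETS);
        context.write(new Text(saltedKey), new IntWritable(1));
    }
}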

In the figure above, with a combiner specified, what is transferred to the reduce is 1949=49 and 1950=45. The combiner takes the maximum value on its own node, not the maximum over the data on all nodes. Whether a combiner can be used depends on the specific computation; taking an average, for example, would not work. The combiner exists to reduce the amount of data transferred between map and reduce.

Combiner Functions

Many MapReduce jobs are limited by the bandwidth available on the cluster, so it pays to
minimize the data transferred between map and reduce tasks. Hadoop allows the user to
specify a combiner function to be run on the map output, and the combiner function’s
output forms the input to the reduce function. Because the combiner function is an
optimization, Hadoop does not provide a guarantee of how many times it will call it for a
particular map output record, if at all. In other words, calling the combiner function zero,
one, or many times should produce the same output from the reducer.
              
The contract for the combiner function constrains the type of function that may be used.
This is best illustrated with an example. Suppose that for the maximum temperature
example, readings for the year 1950 were processed by two maps (because they were in
different splits). Imagine the first map produced the output:
    (1950, 0)
    (1950, 20)
    (1950, 10)
and the second produced:
    (1950, 25)
    (1950, 15)
The reduce function would be called with a list of all the values:
     (1950, [0, 20, 10, 25, 15])
with output:
     (1950, 25)
since 25 is the maximum value in the list. We could use a combiner function that, just like
the reduce function, finds the maximum temperature for each map output. The reduce
function would then be called with:
    (1950, [20, 25])
and would produce the same output as before. More succinctly, we may express the
function calls on the temperature values in this case as follows:
      max(0, 20, 10, 25, 15) = max(max(0, 20, 10), max(25, 15)) = max(20, 25) = 25
Not all functions possess this property.[20] For example, if we were calculating mean
temperatures, we couldn’t use the mean as our combiner function, because:
     mean(0, 20, 10, 25, 15) = 14
but:
     mean(mean(0, 20, 10), mean(25, 15)) = mean(10, 20) = 15
The combiner function doesn’t replace the reduce function. (How could it? The reduce
function is still needed to process records with the same key from different maps.) But it
can help cut down the amount of data shuffled between the mappers and the reducers, and
for this reason alone it is always worth considering whether you can use a combiner
function in your MapReduce job.

Specifying a combiner function

Going back to the Java MapReduce program, the combiner function is defined using the
Reducer class, and for this application, it is the same implementation as the reduce
function in MaxTemperatureReducer. The only change we need to make is to set the
combiner class on the Job (see Example 2-6).
Example 2-6. Application to find the maximum temperature, using a combiner function for
efficiency

package myhadoop;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTemperatureWithCombiner {

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: MaxTemperatureWithCombiner <input path> <output path>");
            System.exit(-1);
        }

        Job job = new Job();
        job.setJarByClass(MaxTemperatureWithCombiner.class);
        job.setJobName("Max temperature");

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.setMapperClass(MaxTemperatureMapper.class);
        job.setCombinerClass(MaxTemperatureReducer.class);
        job.setReducerClass(MaxTemperatureReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}