Hadoop Diary Day 17: Counters, Map-Side Combining, and Partitioning

Source: Internet | Published by: File Encryption Lock Software | Editor: 程序博客网 | Time: 2024/05/11 06:00

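I. Hadoop Counters

1.1 What Are Hadoop Counters

Hadoop is built for processing big data and is a poor fit for small data sets; some big-data problems simply cannot be handled by small-data programs. Hadoop jobs are high-latency, and it is normal for a single large job to take several hours. Hadoop counters play a role similar to logs: just as logs let us inspect many aspects of a program's state at run time, counters report what a job actually did. This section looks at Hadoop's built-in counters. The program is shown in Code 1.1; as before, it is the word-count example, run on input consisting of the two lines "hello you" and "hello me".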


Code 1.1

The output is shown in Figure 1.1.

Counters: 19                               // total number of counters: 19, in the 4 counter groups below
  File Output Format Counters              // file output format counter group
    Bytes Written=19                       // bytes written to HDFS by reduce, 19 in total
  FileSystemCounters                       // file system counter group
    FILE_BYTES_READ=481
    HDFS_BYTES_READ=38
    FILE_BYTES_WRITTEN=81316
    HDFS_BYTES_WRITTEN=19
  File Input Format Counters               // file input format counter group
    Bytes Read=19                          // bytes read from HDFS by map
  Map-Reduce Framework                     // MapReduce framework counter group
    Map output materialized bytes=49
    Map input records=2                    // lines read by map: two records, "hello you" and "hello me"
    Reduce shuffle bytes=0                 // bytes shuffled from map to reduce
    Spilled Records=8
    Map output bytes=35
    Total committed heap usage (bytes)=266469376
    SPLIT_RAW_BYTES=105
    Combine input records=0                // records fed into the combiner
    Reduce input records=4                 // records reduce receives from the map side
    Reduce input groups=3                  // keys received by the reduce function, i.e. distinct k2 after merging: <hello,{1,1}>, <you,{1}>, <me,{1}>
    Combine output records=0               // records emitted by the combiner
    Reduce output records=3                // records written by reduce, one per group
    Map output records=4                   // records emitted by map: 4 records

Figure 1.1

From this analysis we can see that counters let us examine the run-time behavior of a MapReduce program.
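To make the numbers in Figure 1.1 easier to trace, here is a minimal sketch in plain Java that imitates the word-count data flow on the two input lines and tallies four of the framework counters by hand. It does not use Hadoop at all; the class name CounterWalkthrough and the count helper are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java imitation of the word-count data flow; no Hadoop classes are used.
class CounterWalkthrough {
    // Returns {mapInputRecords, mapOutputRecords, reduceInputGroups, reduceOutputRecords}.
    static long[] count(List<String> lines) {
        long mapInputRecords = 0;
        long mapOutputRecords = 0;
        Map<String, Long> groups = new TreeMap<>(); // distinct key -> summed count
        for (String line : lines) {
            mapInputRecords++; // one input record per line
            for (String word : line.trim().split("\\s+")) {
                if (word.isEmpty()) {
                    continue;
                }
                mapOutputRecords++; // the mapper emits one <word, 1> pair
                groups.merge(word, 1L, Long::sum);
            }
        }
        long reduceInputGroups = groups.size();   // one group per distinct key
        long reduceOutputRecords = groups.size(); // word count writes one line per group
        return new long[] {mapInputRecords, mapOutputRecords, reduceInputGroups, reduceOutputRecords};
    }

    public static void main(String[] args) {
        long[] c = count(Arrays.asList("hello you", "hello me"));
        System.out.println("Map input records=" + c[0]);
        System.out.println("Map output records=" + c[1]);
        System.out.println("Reduce input groups=" + c[2]);
        System.out.println("Reduce output records=" + c[3]);
    }
}
```

For the two sample lines this yields 2, 4, 3 and 3, matching the corresponding entries in Figure 1.1.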

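1.2 Custom Counters

Having seen what counters are for, we can define one of our own to do what we need. For example, we can define a counter for a sensitive word that records how many times the word appears in a line, as in Code 2.1. The input file contains the lines "hello you" and "hello me".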


Code 2.1

The output is shown in Figure 2.1.

Counters: 20
  Sensitive Words
    hello=2
  File Output Format Counters
    Bytes Written=21
  FileSystemCounters
    FILE_BYTES_READ=359
    HDFS_BYTES_READ=42
    FILE_BYTES_WRITTEN=129080
    HDFS_BYTES_WRITTEN=21
  File Input Format Counters
    Bytes Read=21
  Map-Reduce Framework
    Map output materialized bytes=67
    Map input records=2
    Reduce shuffle bytes=0
    Spilled Records=8
    Map output bytes=53
    Total committed heap usage (bytes)=391774208
    SPLIT_RAW_BYTES=95
    Combine input records=0
    Reduce input records=4
    Reduce input groups=3
    Combine output records=0
    Reduce output records=3
    Map output records=4

Figure 2.1
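The custom counter can be pictured as a named value inside a named group that the mapper bumps whenever it sees the sensitive word. In a real job using the org.apache.hadoop.mapreduce API this is typically done inside map() with context.getCounter("Sensitive Words", "hello").increment(1L). The sketch below is not the original Code 2.1: it is a plain-Java stand-in with no Hadoop dependency, the class SensitiveWordCounterSketch and its methods are hypothetical, and it assumes the counter goes up once per line that contains the word.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Plain-Java stand-in for Hadoop's grouped counters; all names here are illustrative.
class SensitiveWordCounterSketch {
    // group name -> (counter name -> value), like the grouped listing in the job output
    private final Map<String, Map<String, Long>> groups = new LinkedHashMap<>();

    void increment(String group, String name, long amount) {
        groups.computeIfAbsent(group, g -> new LinkedHashMap<>())
              .merge(name, amount, Long::sum);
    }

    long get(String group, String name) {
        Map<String, Long> counters = groups.get(group);
        if (counters == null) {
            return 0L;
        }
        Long value = counters.get(name);
        return value == null ? 0L : value;
    }

    // Stand-in for map(): assumption, the counter is incremented once per matching line.
    void map(String line, String sensitiveWord) {
        if (line.contains(sensitiveWord)) {
            increment("Sensitive Words", sensitiveWord, 1L);
        }
    }

    public static void main(String[] args) {
        SensitiveWordCounterSketch job = new SensitiveWordCounterSketch();
        job.map("hello you", "hello");
        job.map("hello me", "hello");
        System.out.println("Sensitive Words / hello=" + job.get("Sensitive Words", "hello"));
    }
}
```

With the two sample lines the counter ends at 2, which is the hello=2 entry under the Sensitive Words group in Figure 2.1.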

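II. Combiner Programming

2.1 What Is a Combiner

In the output above, under Map-Reduce Framework, the Combine input records field is zero. So how is combine used? It is the fifth step of the Mapper task in a MapReduce program, and it is optional. Using it is simple: in the word-count example above, adding the following line is enough: job.setCombinerClass(MyReducer.class);

Combine is an optional operation and must be set explicitly. Here we use the MyReducer class as the combiner, which means the combiner does the same work as reduce. The MapReduce program with combine enabled is shown in Code 3.1.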


Code 3.1

The output is shown in Figure 3.1.

Counters: 20
  Sensitive Words
    hello=2
  File Output Format Counters
    Bytes Written=21
  FileSystemCounters
    FILE_BYTES_READ=359
    HDFS_BYTES_READ=42
    FILE_BYTES_WRITTEN=129080
    HDFS_BYTES_WRITTEN=21
  File Input Format Counters
    Bytes Read=21
  Map-Reduce Framework
    Map output materialized bytes=67
    Map input records=2
    Reduce shuffle bytes=0
    Spilled Records=8
    Map output bytes=53
    Total committed heap usage (bytes)=391774208
    SPLIT_RAW_BYTES=95
    Combine input records=4
    Reduce input records=3
    Reduce input groups=3
    Combine output records=3
    Reduce output records=3
    Map output records=4

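Figure 3.1

In this output, Combine input records=4, Combine output records=3, and Reduce input records=3. The combine stage sits between the end of the Mapper and the start of the Reducer, so the combiner processes exactly the data that reduce would have received had no combiner been set, hence 4. The combiner's output then becomes the input of the reduce side, so Reduce input records drops from 4 to 3. In this code MyReducer is set as the combiner, so the combine step runs the reduce method. This shows that the reduction logic is generic: a method written for the Reducer stage can also be used in the Mapper stage.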
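The drop from 4 to 3 records can be reproduced with a small plain-Java imitation of the map, combine and reduce steps. This is only a sketch under simplifying assumptions (a single map task, no Hadoop classes); CombinerSketch, map and sumByKey are illustrative names, and the same summing function is used as both combiner and reducer to mirror job.setCombinerClass(MyReducer.class).

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java imitation of map -> combine -> reduce for word count; no Hadoop classes.
class CombinerSketch {
    // Map step: emit one <word, 1> pair per word.
    static List<Map.Entry<String, Integer>> map(List<String> lines) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    out.add(new SimpleEntry<>(word, 1));
                }
            }
        }
        return out;
    }

    // Summing values by key; used here as both the combiner and the reducer.
    static Map<String, Integer> sumByKey(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> sums = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            sums.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> mapOutput = map(Arrays.asList("hello you", "hello me"));
        Map<String, Integer> combined = sumByKey(mapOutput);
        Map<String, Integer> reduced = sumByKey(new ArrayList<>(combined.entrySet()));
        System.out.println("Combine input records=" + mapOutput.size());
        System.out.println("Combine output records=" + combined.size());
        System.out.println("Reduce input records=" + combined.size());
        System.out.println("Final result=" + reduced);
    }
}
```

Because summation gives the same totals whether it is applied once or twice, the final result is unchanged by the extra combine pass.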

2.2 Custom Combiners

To see more clearly how a combiner works, we define our own combiner class instead of reusing MyReducer as the combiner, as shown in Code 3.2.


Code 3.2

The output is shown in Figure 3.2.

14/10/07 18:56:32 INFO mapred.MapTask: record buffer = 262144/327680
Mapper output <hello,1>
14/10/07 18:56:32 INFO mapred.MapTask: Starting flush of map output
Mapper output <world,1>
Mapper output <hello,1>
Mapper output <me,1>
Combiner input group <hello,...>
Combiner input pair <hello,1>
Combiner input pair <hello,1>
Combiner output pair <hello,2>
Combiner input group <me,...>
Combiner input pair <me,1>
Combiner output pair <me,1>
Combiner input group <world,...>
Combiner input pair <world,1>
Combiner output pair <world,1>
14/10/07 18:56:32 INFO mapred.MapTask: Finished spill 0
14/10/07 18:56:32 INFO mapred.Task: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
14/10/07 18:56:32 INFO mapred.LocalJobRunner: 
14/10/07 18:56:32 INFO mapred.Task: Task 'attempt_local_0001_m_000000_0' done.
14/10/07 18:56:32 INFO mapred.Task:  Using ResourceCalculatorPlugin : null
14/10/07 18:56:32 INFO mapred.LocalJobRunner: 
14/10/07 18:56:32 INFO mapred.Merger: Merging 1 sorted segments
14/10/07 18:56:32 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 47 bytes
14/10/07 18:56:32 INFO mapred.LocalJobRunner: 
MyReducer input group <hello,...>
MyReducer input pair <hello,2>
MyReducer input group <me,...>
MyReducer input pair <me,1>
MyReducer input group <world,...>
MyReducer input pair <world,1>
14/10/07 18:56:33 INFO mapred.Task: Task:attempt_local_0001_r_000000_0 is done. And is in the process of commiting
14/10/07 18:56:33 INFO mapred.LocalJobRunner: 
14/10/07 18:56:33 INFO mapred.Task: Task attempt_local_0001_r_000000_0 is allowed to commit now
14/10/07 18:56:33 INFO output.FileOutputCommitter: Saved output of task 'attempt_local_0001_r_000000_0' to hdfs://hadoop:9000/output
14/10/07 18:56:33 INFO mapred.LocalJobRunner: reduce > reduce
14/10/07 18:56:33 INFO mapred.Task: Task 'attempt_local_0001_r_000000_0' done.
14/10/07 18:56:33 INFO mapred.JobClient:  map 100% reduce 100%
14/10/07 18:56:33 INFO mapred.JobClient: Job complete: job_local_0001
14/10/07 18:56:33 INFO mapred.JobClient: Counters: 19
14/10/07 18:56:33 INFO mapred.JobClient:   File Output Format Counters 
14/10/07 18:56:33 INFO mapred.JobClient:     Bytes Written=21
14/10/07 18:56:33 INFO mapred.JobClient:   FileSystemCounters
14/10/07 18:56:33 INFO mapred.JobClient:     FILE_BYTES_READ=343
14/10/07 18:56:33 INFO mapred.JobClient:     HDFS_BYTES_READ=42
14/10/07 18:56:33 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=129572
14/10/07 18:56:33 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=21
14/10/07 18:56:33 INFO mapred.JobClient:   File Input Format Counters 
14/10/07 18:56:33 INFO mapred.JobClient:     Bytes Read=21
14/10/07 18:56:33 INFO mapred.JobClient:   Map-Reduce Framework
14/10/07 18:56:33 INFO mapred.JobClient:     Map output materialized bytes=51
14/10/07 18:56:33 INFO mapred.JobClient:     Map input records=2
14/10/07 18:56:33 INFO mapred.JobClient:     Reduce shuffle bytes=0
14/10/07 18:56:33 INFO mapred.JobClient:     Spilled Records=6
14/10/07 18:56:33 INFO mapred.JobClient:     Map output bytes=53
14/10/07 18:56:33 INFO mapred.JobClient:     Total committed heap usage (bytes)=391774208
14/10/07 18:56:33 INFO mapred.JobClient:     SPLIT_RAW_BYTES=95
14/10/07 18:56:33 INFO mapred.JobClient:     Combine input records=4
14/10/07 18:56:33 INFO mapred.JobClient:     Reduce input records=3
14/10/07 18:56:33 INFO mapred.JobClient:     Reduce input groups=3
14/10/07 18:56:33 INFO mapred.JobClient:     Combine output records=3
14/10/07 18:56:33 INFO mapred.JobClient:     Reduce output records=3
14/10/07 18:56:33 INFO mapred.JobClient:     Map output records=4

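Figure 3.2

From the output above, the role of combine is as follows:

Each map may produce a large amount of output; the combiner merges that output once on the map side to reduce the amount of data transferred to the reducer.
At its most basic, the combiner merges values by key locally; it acts like a local reduce.
Without a combiner, all results are produced by reduce, which is comparatively inefficient. With a combiner, maps that finish first aggregate their output locally, which speeds the job up.

Note: the combiner's output is the reducer's input, and a combiner must never change the final result. In my view, a combiner should only be used where the reduce input key/value types are exactly the same as the output key/value types and the final result is unaffected, for example summation or taking a maximum.

Some explanations:

* Q: Why use a combiner?
   A: The combiner runs on the map side and reduces the data there. Less data is sent to the reduce side, so transfer time drops and the overall job time drops.
* Q: Why is the combiner an optional step rather than a standard part of every MR job?
    A: Because not every algorithm is suited to a combiner, for example computing an average.
* Q: The combiner has already performed a reduce operation; why does the Reducer stage still need to run reduce?
    A: The combiner runs on the map side and can only process the data of a single map task; it cannot work across map tasks. Only reduce can receive the data produced by multiple map tasks.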
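The averaging case mentioned in the second answer can be checked with a few lines of arithmetic. The sketch below uses made-up values for two hypothetical map tasks and compares the true average with what you would get if a combiner emitted each task's local average and the reducer then averaged those. It is plain Java with illustrative names, not Hadoop code.

```java
import java.util.Arrays;
import java.util.List;

// Shows why "average" cannot simply be reused as a combiner; values are invented.
class AverageCombinerPitfall {
    static double mean(List<Double> values) {
        double sum = 0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.size();
    }

    // Naive combiner: each map task emits its own average, reduce averages the averages.
    static double meanOfMeans(List<List<Double>> perMapTask) {
        double sum = 0;
        for (List<Double> task : perMapTask) {
            sum += mean(task);
        }
        return sum / perMapTask.size();
    }

    // What the job computes without a combiner: the average over every value.
    static double trueMean(List<List<Double>> perMapTask) {
        double sum = 0;
        long count = 0;
        for (List<Double> task : perMapTask) {
            for (double v : task) {
                sum += v;
                count++;
            }
        }
        return sum / count;
    }

    public static void main(String[] args) {
        List<List<Double>> tasks = Arrays.asList(
                Arrays.asList(1.0, 2.0, 3.0), // values seen by the first map task
                Arrays.asList(10.0));         // values seen by the second map task
        System.out.println("true mean     = " + trueMean(tasks));
        System.out.println("mean of means = " + meanOfMeans(tasks));
    }
}
```

Here the true mean is 4.0 while the mean of the per-task means is 6.0, so the combiner would change the final result. A common workaround is to have the combiner pass along partial sums and counts and leave the division to the reducer.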

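III. Partitioner Programming

3.1 What Is Partitioning

Partitioning is the third step of the Mapper task in a MapReduce program. What is it for? Data is partitioned so that it can be put to better use: records are split into different partitions according to their attributes, and different partitions then serve different tasks. By default a MapReduce program uses a single partition. Code 4.1 shows the default partitioning setup, again using word count as the example.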


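Code 4.1

The default partitioner in a MapReduce program is HashPartitioner. The call job.setNumReduceTasks(1) sets the number of reduce tasks to run; this is the value the partitioner receives as numReduceTasks, here 1. HashPartitioner extends Partitioner, so Partitioner is the base class of HashPartitioner, and a custom partitioner must extend it as well. Code 4.2 shows how HashPartitioner computes the partition.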

public class HashPartitioner<K, V> extends Partitioner<K, V> {

  /** Use {@link Object#hashCode()} to partition. */
  public int getPartition(K key, V value,
                          int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }

}

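Code 4.2

In the code above, K and V stand for k2 and v2. The class has a single method, getPartition(), which returns "(key.hashCode() & Integer.MAX_VALUE) % numReduceTasks". Here key.hashCode() is the hash code of the key, and the bitwise AND with Integer.MAX_VALUE clears the sign bit so the value is never negative. In the code above numReduceTasks is set to 1, so the modulo can only yield 0. The return value of getPartition() is a label for the partition a record is assigned to: returning 0 means partition 0. It can be thought of as the index of the partition.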
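The formula in Code 4.2 is easy to try outside Hadoop. The following is a minimal sketch that applies the same expression to a raw hash code; HashPartitionSketch and partition are illustrative names, and the String keys in main are only stand-ins, since Hadoop key types such as Text compute their own hashCode() and may land in different partitions.

```java
// Applies the HashPartitioner expression to a raw hash code; no Hadoop classes.
class HashPartitionSketch {
    static int partition(int hashCode, int numReduceTasks) {
        // Masking with Integer.MAX_VALUE clears the sign bit, so the result of
        // the modulo is always in the range [0, numReduceTasks).
        return (hashCode & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // With one reduce task every key falls into partition 0.
        for (String key : new String[] {"hello", "you", "me"}) {
            System.out.println(key + " -> partition " + partition(key.hashCode(), 1));
        }
        // Without the mask a negative hash code would give a negative remainder.
        System.out.println("-1 % 4 = " + (-1 % 4));
        System.out.println("masked: " + partition(-1, 4));
    }
}
```

In Java, -1 % 4 evaluates to -1, which is not a valid partition index; with the mask applied the same hash code maps to partition 3.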

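3.2 Custom Partitioner

Next we try a custom partitioner to process the mobile-phone log data used in an earlier lesson. The log data is shown in Figure 4.1.

Figure 4.1

The figure shows that not every value in the second column is a mobile number. Our task, while computing per-number traffic, is to write mobile numbers and non-mobile numbers to different files. Since the partitions separate mobile numbers from non-mobile numbers, we can split on the length of that field, as in Code 4.3.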


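Code 4.3

Note: the partitioning example must be packaged as a jar to run. The results are shown in Figures 4.3 and 4.4: Figure 4.3 is the traffic for mobile numbers, and Figure 4.4 is the traffic for non-mobile numbers.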
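The length-based split can be sketched as follows. This is not the original Code 4.3 but a plain-Java stand-in for the decision a custom getPartition() would make. It assumes that a mobile number is a key of exactly 11 characters, which is a guess at the rule, and both sample keys are invented. In a real job the logic would live in a class extending Partitioner, registered with job.setPartitionerClass(...), with job.setNumReduceTasks(2) so that each partition gets its own output file.

```java
// Stand-in for a length-based getPartition(); the 11-character rule is an assumption.
class LengthPartitionSketch {
    static final int MOBILE_NUMBER_LENGTH = 11; // assumed length of a mobile number

    // Partition 0 for mobile numbers, partition 1 for everything else.
    static int getPartition(String key) {
        return key.length() == MOBILE_NUMBER_LENGTH ? 0 : 1;
    }

    public static void main(String[] args) {
        // Both keys are made-up examples, not values from the original log file.
        String[] keys = {"13800000000", "84138413"};
        for (String key : keys) {
            System.out.println(key + " -> partition " + getPartition(key));
        }
    }
}
```

A length check is cheap but coarse: any 11-character key is treated as a mobile number, so stricter validation may be needed on other data.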

0 0