MapReduce Counters and Examples

Original source: http://www.aboutyun.com/thread-13745-1-1.html

If you are interested, follow the link above for a more detailed discussion.

Questions to guide your reading
1. Which built-in counters does Hadoop provide?
2. What does job.getCounters() return?
3. Does MapReduce allow user-defined counters?


Overview:

Hadoop counters give developers a global view of a job's execution and its key metrics, so that errors can be diagnosed and handled promptly.
Compared with logs, counters are much easier to analyze.
Built-in counters:
(1) Hadoop's built-in counters mainly record how a job executed.
(2) The built-in counters include the following groups:
- MapReduce framework counters (Map-Reduce Framework)
- File system counters (File System Counters)
- Job counters (Job Counters)
- File input format counters (File Input Format Counters)
- File output format counters (File Output Format Counters)
- Shuffle error counters (Shuffle Errors)
(3) Counters are maintained by each task and periodically reported to the tasktracker, which forwards them to the jobtracker.
(4) The final job counters are maintained by the jobtracker, so they can be aggregated globally without the values having to travel back and forth across the network.
(5) Only after a job has completed successfully are the final counter values complete and reliable.
[Bash shell]
Built-in counters:
15/06/15 08:46:47 INFO mapreduce.Job: Job job_1434248323399_0004 completed successfully
15/06/15 08:46:47 INFO mapreduce.Job: Counters: 49
        File System Counters
                FILE: Number of bytes read=103
                FILE: Number of bytes written=315873
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=116
                HDFS: Number of bytes written=40
                HDFS: Number of read operations=9
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=4
        Job Counters
                Launched map tasks=1
                Launched reduce tasks=2
                Data-local map tasks=1
                Total time spent by all maps in occupied slots (ms)=2893
                Total time spent by all reduces in occupied slots (ms)=6453
                Total time spent by all map tasks (ms)=2893
                Total time spent by all reduce tasks (ms)=6453
                Total vcore-seconds taken by all map tasks=2893
                Total vcore-seconds taken by all reduce tasks=6453
                Total megabyte-seconds taken by all map tasks=2962432
                Total megabyte-seconds taken by all reduce tasks=6607872
        Map-Reduce Framework
                Map input records=7
                Map output records=7
                Map output bytes=77
                Map output materialized bytes=103
                Input split bytes=95
                Combine input records=0
                Combine output records=0
                Reduce input groups=2
                Reduce shuffle bytes=103
                Reduce input records=7
                Reduce output records=2
                Spilled Records=14
                Shuffled Maps =2
                Failed Shuffles=0
                Merged Map outputs=2
                GC time elapsed (ms)=59
                CPU time spent (ms)=3600
                Physical memory (bytes) snapshot=606015488
                Virtual memory (bytes) snapshot=2672865280
                Total committed heap usage (bytes)=602996736
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Input Format Counters
                Bytes Read=21
        File Output Format Counters
                Bytes Written=40

Using counters:
1. Through the Web UI
(note: the job history server must be running)
2. From the command line:
hadoop job -counter (no longer works in Hadoop 2.x; mapred job -counter <job-id> <group-name> <counter-name> is its replacement)
3. Through the Hadoop API
Call job.getCounters() to obtain the Counters object, then call counters.findCounter() to get the individual counter; the final counter values should only be read after the job has completed (see the sketch below).
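As a minimal sketch of the API route (the CounterReader class is hypothetical, and TaskCounter.MAP_INPUT_RECORDS is just one example of a built-in counter):

[Java]
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Counter;
import org.apache.hadoop.mapreduce.Counters;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.TaskCounter;

public class CounterReader {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "counter demo");
        // ... configure mapper, input and output paths here ...
        if (job.waitForCompletion(true)) {
            // the aggregated counters are only complete once the job has finished
            Counters counters = job.getCounters();
            // built-in framework counters are looked up by their enum constant
            Counter inputRecords = counters.findCounter(TaskCounter.MAP_INPUT_RECORDS);
            System.out.println(inputRecords.getDisplayName() + " = " + inputRecords.getValue());
        }
    }
}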
Custom counters and an example:

MapReduce lets users define their own counters. A counter is a global variable, and counters are organized into groups; a counter can be named either with a Java enum type or with strings, as the two getCounter overloads below show.

[Java]
package org.apache.hadoop.mapreduce;

public interface TaskAttemptContext extends JobContext, Progressable {
    // Get the {@link Counter} for the given
    // <code>counterName</code>.
    public Counter getCounter(Enum<?> counterName);

    // Get the {@link Counter} for the given
    // <code>groupName</code> and <code>counterName</code>.
    public Counter getCounter(String groupName, String counterName);
}
The string form (dynamic counters) is more flexible than the enum form, because counters can be added to a group dynamically at runtime. The old API obtained counters through Reporter; the new API uses context.getCounter(groupName, counterName) to obtain the counter, which is then incremented.
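For comparison, a minimal sketch of the enum form (the EnumCounterMapper class and the LineQuality enum are hypothetical names invented for this illustration):

[Java]
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class EnumCounterMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
    // the enum class name becomes the counter group, its constants the counter names
    enum LineQuality { EMPTY, NON_EMPTY }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        if (value.toString().trim().isEmpty()) {
            context.getCounter(LineQuality.EMPTY).increment(1);
        } else {
            context.getCounter(LineQuality.NON_EMPTY).increment(1);
        }
        context.write(key, value);
    }
}

Whichever form is used, the increment itself goes through the Counter interface: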
[Java]
package org.apache.hadoop.mapreduce;

/**
 * A named counter that tracks the progress of a map/reduce job.
 * <p><code>Counters</code> represent global counters, defined either by the
 * Map-Reduce framework or applications. Each <code>Counter</code> is named by
 * an {@link Enum} and has a long for the value.</p>
 * <p><code>Counters</code> are bunched into Groups, each comprising of
 * counters from a particular <code>Enum</code> class.
 */
public interface Counter extends Writable {
  /**
   * Increment this counter by the given value
   * @param incr the value to increase this counter by
   */
  void increment(long incr);
}
A custom-counter example
Count the lines in which the number of words is greater than 2, and the lines in which it is less than 2:
Input data file counter:
[Bash shell]
[root@liguodong file]# vi counter
[root@liguodong file]# hdfs dfs -put counter /counter
[root@liguodong file]# hdfs dfs -cat /counter
hello world
hello hadoop
hi baby
hello 4325 7785993
java hadoop
come
[Java]
package MyCounter;

import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MyCounter {
    private final static String INPUT_PATH = "hdfs://liguodong:8020/counter";
    private final static String OUTPUT_PATH = "hdfs://liguodong:8020/outputcounter";

    public static class MyMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] val = value.toString().split("\\s+");
            if (val.length < 2) {
                // dynamic (string-named) counter: group "ErrorCounter", counter "below_2"
                context.getCounter("ErrorCounter", "below_2").increment(1);
            } else if (val.length > 2) {
                context.getCounter("ErrorCounter", "above_2").increment(1);
            }
            context.write(key, value);
        }
    }

    public static void main(String[] args) throws IllegalArgumentException, IOException,
            URISyntaxException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        final FileSystem fileSystem = FileSystem.get(new URI(INPUT_PATH), conf);
        // remove any previous output so the job can be rerun
        if (fileSystem.exists(new Path(OUTPUT_PATH))) {
            fileSystem.delete(new Path(OUTPUT_PATH), true);
        }
        Job job = Job.getInstance(conf, "define counter");

        job.setJarByClass(MyCounter.class);

        FileInputFormat.addInputPath(job, new Path(INPUT_PATH));
        job.setMapperClass(MyMapper.class);

        // map-only job: mapper output is written straight to HDFS
        job.setNumReduceTasks(0);

        FileOutputFormat.setOutputPath(job, new Path(OUTPUT_PATH));
        // submit the job and wait for it to finish
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
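The results below come from the job log; to read the custom counters programmatically instead, the end of main could be rewritten as follows (a sketch reusing the group and counter names defined in MyMapper):

[Java]
boolean success = job.waitForCompletion(true);
if (success) {
    // the group/counter names must match the strings used in the mapper
    long below = job.getCounters().findCounter("ErrorCounter", "below_2").getValue();
    long above = job.getCounters().findCounter("ErrorCounter", "above_2").getValue();
    System.out.println("below_2 = " + below + ", above_2 = " + above);
}
System.exit(success ? 0 : 1);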
[Bash shell]
Run results:
[main] INFO org.apache.hadoop.mapreduce.Job - Counters: 25
    File System Counters
        FILE: Number of bytes read=148
        FILE: Number of bytes written=187834
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=69
        HDFS: Number of bytes written=86
        HDFS: Number of read operations=8
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=3
    Map-Reduce Framework
        Map input records=6
        Map output records=6
        Input split bytes=94
        Spilled Records=0
        Failed Shuffles=0
        Merged Map outputs=0
        GC time elapsed (ms)=12
        CPU time spent (ms)=0
        Physical memory (bytes) snapshot=0
        Virtual memory (bytes) snapshot=0
        Total committed heap usage (bytes)=16252928
    ErrorCounter
        above_2=1
        below_2=1
    File Input Format Counters
        Bytes Read=69
    File Output Format Counters
        Bytes Written=86

Note the custom ErrorCounter group alongside the built-in groups: above_2=1 corresponds to "hello 4325 7785993" (three words) and below_2=1 to "come" (one word).
