[MapReduce Example] Word Count


I. Example Description

Count the frequency of each word in a set of input files, and output the results sorted by word frequency in descending order.
For example, given the input file file1.txt with the following content:

hello world bye world

and file2.txt with the following content:

hello hadoop goodbye hadoop

For the sample input above, the expected output is:

2 hadoop
2 hello
2 world
1 bye
1 goodbye

II. Design Approach

The results must be output in descending order of word frequency. The standard WordCount example only counts how many times each word appears; it does not sort by frequency. We therefore extend it and implement the requirement with two jobs: (1) job1 computes the frequency of each word; (2) job2 sorts the words by frequency in descending order.
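To make the division of work concrete: for the sample input above, job1 on its own would produce the following <word, count> pairs, with keys in the default ascending order. job2 then swaps each pair and re-sorts by count to obtain the final result shown earlier.

bye 1
goodbye 1
hadoop 2
hello 2
world 2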

Figure 1 (processing flow of job1)

Figure 2 (processing flow of job2)

1. The processing flow of job1 is shown in Figure 1.
(1) Map function design
The Map function transforms each input record as follows:
<1, hello world bye world> ——> <hello, 1>, <world, 1>, <bye, 1>, <world, 1>
The Mapper is implemented as follows:

    public static class WordCountMapper extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);   // every occurrence of a word counts as 1
        private Text word = new Text();

        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            // StringTokenizer splits on whitespace by default, e.g. "hello world bye world" is split into four words
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);   // emit the <word, 1> pair through the Context object
            }
        }
    }
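As a quick aside, the whitespace splitting the Mapper relies on can be verified with a tiny standalone snippet (the class name TokenizeDemo is purely illustrative):

    import java.util.StringTokenizer;

    public class TokenizeDemo {
        public static void main(String[] args) {
            // StringTokenizer uses whitespace as its default delimiter set,
            // which is how the Mapper above extracts individual words from a line.
            StringTokenizer itr = new StringTokenizer("hello world bye world");
            while (itr.hasMoreTokens()) {
                System.out.println(itr.nextToken());   // prints hello, world, bye, world on separate lines
            }
        }
    }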

(2) Reduce function design
The Reduce function merges the value list of each key:
<hello, {1,1}>, <bye, {1}> ——> <hello, 2>, <bye, 1>

The Reducer is therefore designed as follows:

    public static class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            // sum up the values in the <key, value-list> pair;
            // for the key "hello", values is the list {1, 1}
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }
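Since the summation performed here is associative and commutative, the same class could in principle also be registered as a combiner to reduce the amount of data shuffled between the Map and Reduce phases. This is an optional optimization that the code in this article does not use:

    job.setCombinerClass(WordCountReducer.class);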

MapReduce sorts its output by key in ascending order by default, so the output of the plain WordCount job is as shown in Figure 1.

2. The processing flow of job2 is shown in Figure 2.
(1) Map function design
The goal of the Map function:
<bye, 1>, <hello, 2> ——> <1, bye>, <2, hello>

As can be seen, the Map function simply swaps key and value. MapReduce already ships with a Mapper that does exactly this, InverseMapper, so job2 only needs to set it as the Mapper class. Its implementation is shown here for reference.

    public class InverseMapper<K, V> extends Mapper<K, V, V, K> {

        /** The inverse function. Input keys and values are swapped. */
        @Override
        public void map(K key, V value, Context context) throws IOException, InterruptedException {
            context.write(value, key);
        }
    }

(2) Comparator design
With the Map function in place, MapReduce still sorts by key in ascending order by default, i.e. from the lowest frequency to the highest. We therefore provide a comparator that orders the keys (frequencies) in descending order instead.

The comparator is designed as follows:

    /*
     * A comparator that reverses the default IntWritable ordering,
     * so that keys (word frequencies) are sorted in descending order.
     */
    private static class IntWritableDescComparator extends IntWritable.Comparator {
        public int compare(WritableComparable a, WritableComparable b) {
            return -super.compare(a, b);
        }
        public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
            return -super.compare(b1, s1, l1, b2, s2, l2);
        }
    }
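As a quick sanity check (purely illustrative, e.g. dropped into main() where the private nested class is visible), the negated comparison makes larger counts sort first:

    IntWritable a = new IntWritable(1);
    IntWritable b = new IntWritable(2);
    IntWritableDescComparator cmp = new IntWritableDescComparator();
    System.out.println(cmp.compare(a, b));   // positive, so 2 is ordered before 1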

(3) Reduce function design
Since job2 only re-sorts the records by frequency and does not need to aggregate the value lists, no custom Reduce function is required.
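When no Reducer class is set, Hadoop falls back to the base org.apache.hadoop.mapreduce.Reducer, whose reduce method is essentially an identity pass-through: it writes each value back out with its (already sorted) key, roughly like this:

    protected void reduce(KEYIN key, Iterable<VALUEIN> values, Context context)
            throws IOException, InterruptedException {
        // default behavior: forward every <key, value> pair unchanged
        for (VALUEIN value : values) {
            context.write((KEYOUT) key, (VALUEOUT) value);
        }
    }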

III. Complete Code

package com.walker.mrdemo;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.map.InverseMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class WordCount {

    /*
     * Mapper class
     *
     * type param 1: type of the input key
     * type param 2: type of the input value
     * type param 3: type of the output key
     * type param 4: type of the output value
     */
    public static class WordCountMapper extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);   // every occurrence of a word counts as 1
        private Text word = new Text();

        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            // StringTokenizer splits on whitespace by default, e.g. "hello world bye world"
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);   // emit the <word, 1> pair through the Context object
            }
        }
    }

    /*
     * Reducer class
     */
    public static class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            // sum up the values in the <key, value-list> pair;
            // for the key "hello", values is the list {1, 1}
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    /*
     * A comparator that reverses the default IntWritable ordering,
     * so that keys (word frequencies) are sorted in descending order.
     */
    private static class IntWritableDescComparator extends IntWritable.Comparator {
        public int compare(WritableComparable a, WritableComparable b) {
            return -super.compare(a, b);
        }
        public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
            return -super.compare(b1, s1, l1, b2, s2, l2);
        }
    }

    // input and output paths
    private static final String FILE_IN_PATH = "hdfs://192.168.50.130:9000/mrdemo/WordCount/input";
    private static final String FILE_OUT_PATH = "hdfs://192.168.50.130:9000/mrdemo/WordCount/output";

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // temporary directory for the intermediate output of job1
        Path tempDir = new Path("hdfs://192.168.50.130:9000/mrdemo/tmp");

        /*
         * job1: count the occurrences of each word
         */
        Job job = Job.getInstance(conf, "WordCount");
        job.setJarByClass(WordCount.class);
        // set the Mapper and Reducer classes
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        // set the key and value types of the Reducer output
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        // set the input and output paths
        FileInputFormat.addInputPath(job, new Path(FILE_IN_PATH));
        FileOutputFormat.setOutputPath(job, tempDir);
        // block until job1 has finished
        job.waitForCompletion(true);

        /*
         * job2: sort the words by frequency in descending order
         */
        Job sortjob = Job.getInstance(conf, "sortJob");
        sortjob.setInputFormatClass(SequenceFileInputFormat.class);
        // use the built-in InverseMapper class to swap key and value
        sortjob.setMapperClass(InverseMapper.class);
        sortjob.setNumReduceTasks(1);
        // set the output key and value types
        sortjob.setOutputKeyClass(IntWritable.class);
        sortjob.setOutputValueClass(Text.class);
        // register the descending comparator used for sorting
        sortjob.setSortComparatorClass(IntWritableDescComparator.class);
        // set the input and output paths
        FileInputFormat.addInputPath(sortjob, tempDir);
        FileOutputFormat.setOutputPath(sortjob, new Path(FILE_OUT_PATH));
        sortjob.waitForCompletion(true);

        // delete the temporary directory
        FileSystem.get(conf).delete(tempDir, true);
        System.exit(0);
    }
}
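Assuming the class has been packaged into a jar (the jar name below is illustrative) and the two sample files have been uploaded to the HDFS input directory used above, the two jobs can be run and the result inspected roughly like this:

    hadoop jar wordcount.jar com.walker.mrdemo.WordCount
    hdfs dfs -cat /mrdemo/WordCount/output/part-r-00000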