MapReduce for Beginners: WordCount


Without further ado, here is the complete code; the implementation is explained below it.

package com.whomai.test;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: for each input line, emit (word, 1) for every token.
    public static class mapWordCount extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer st = new StringTokenizer(line);
            while (st.hasMoreTokens()) {
                word.set(st.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer: sum the counts collected for each word.
    public static class reduceWordCount extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args)
            throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        // Job.getInstance replaces the Job constructor, which is deprecated in Hadoop 2.x.
        Job job = Job.getInstance(conf, "word");
        job.setJarByClass(WordCount.class);

        String wordCountInput = "hdfs://192.168.248.133:9000/wordCountInput";
        String WordCountOut = "hdfs://192.168.248.133:9000/WordCountOutPath";

        job.setMapperClass(mapWordCount.class);
        // The reducer doubles as a combiner, pre-aggregating counts on the map side.
        job.setCombinerClass(reduceWordCount.class);
        job.setReducerClass(reduceWordCount.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(wordCountInput));
        FileOutputFormat.setOutputPath(job, new Path(WordCountOut));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
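To try it out, package the class into a jar and submit it with hadoop jar; something like the following, where the jar name is hypothetical and the HDFS paths come from the hardcoded strings in the code:

    hdfs dfs -put input.txt /wordCountInput
    hadoop jar wordcount.jar com.whomai.test.WordCount
    hdfs dfs -cat /WordCountOutPath/part-r-00000

Note that the output path must not already exist, or the job will fail at startup.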


Hadoop is built on HDFS, its distributed file system, and MapReduce, its distributed processing model. As the name suggests, MapReduce consists of a Map operation and a Reduce operation. Map splits a large, heavy job into tasks spread across the cluster's nodes; Reduce then aggregates the results of those map tasks.
In the code we implement map and reduce as static inner classes. The mapper extends the Mapper class from Hadoop 2.x, which takes four type parameters: the key and value types of the input records, and the key and value types of the map output. Here the input key is the position of the line in the file (declared as Object), the input value is one line of Text, and the map emits each word as a Text key with an IntWritable value of 1.
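To make the flow concrete, suppose one input line (a made-up example) is "hello world hello". The map emits (hello, 1), (world, 1), (hello, 1); the framework then groups the values by key, so the reduce receives (hello, [1, 1]) and (world, [1]) and writes out (hello, 2) and (world, 1).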

That is the basic idea of MapReduce, and many programs are variations on the same pattern. Deduplication, for example, works by making each record the map output key: the shuffle groups identical records together, so the reduce only needs to write each distinct key once.
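A minimal sketch of that dedup pattern (class names here are illustrative, not from the original post; it also needs org.apache.hadoop.io.NullWritable in addition to the imports above):

public static class DedupMapper extends Mapper<Object, Text, Text, NullWritable> {
    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // Emit the whole record as the key; the value carries no information.
        context.write(value, NullWritable.get());
    }
}

public static class DedupReducer extends Reducer<Text, NullWritable, Text, NullWritable> {
    @Override
    public void reduce(Text key, Iterable<NullWritable> values, Context context)
            throws IOException, InterruptedException {
        // All duplicates arrive grouped under one key, so writing the key once removes them.
        context.write(key, NullWritable.get());
    }
}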

