Big Data - Hadoop MapReduce (2): WordCount

Source: Internet. Editor: 程序博客网. Date: 2024/06/04 00:32


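In one sentence: scan a large number of text files and count how many times each word occurs.
- Further reading: TF-IDF (term frequency-inverse document frequency)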

WordCount: counting words

Given the following text:

a b c

b b c

c d c

The expected result is:

a 1

b 3

c 4

d 1
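
Before bringing in Hadoop, it can help to have a single-process reference for the expected output. The following is a minimal sketch in plain Java and is not part of the original post; the class name LocalWordCount and the method count are hypothetical names chosen for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical helper for illustration only: counts words in memory, without Hadoop.
final class LocalWordCount {

    // Returns word -> number of occurrences, sorted by word.
    static Map<String, Integer> count(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            // StringTokenizer splits on whitespace by default
            StringTokenizer words = new StringTokenizer(line);
            while (words.hasMoreTokens()) {
                // Start a new word at 1, otherwise add 1 to its running count
                counts.merge(words.nextToken(), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("a b c", "b b c", "c d c");
        // Print one "word count" pair per line
        count(lines).forEach((word, n) -> System.out.println(word + " " + n));
    }
}
```

Run on the three sample lines, this should print the same four pairs listed above. A single loop like this works only while the data fits on one machine; the MapReduce version splits the same counting across many machines.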

How it works:

1) Map: emit every word of each line with a count of 1, i.e. Map<word,1>

// The input arrives line by line: the LongWritable key is the byte offset
// of the line in the file, and the Text value is the text of that line.
// Suppose this line is: b c d e e e e
public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String lineStr = value.toString(); // the line of text
        // StringTokenizer splits on whitespace by default
        StringTokenizer words = new StringTokenizer(lineStr);
        while (words.hasMoreTokens()) {
            String word = words.nextToken(); // the current word
            // Emit (word, 1) for this occurrence
            context.write(new Text(word), new IntWritable(1));
        }
    }
}

2) Shuffling: group the emitted values by word, i.e. Map<word,<1,1>>

3) Reduce: sum the values of each word, e.g. word = 1 + 1

// input:  (e, 1) (e, 1) (e, 1) (e, 1)
// output: (e, 4)
public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int count = 0;
        // Add up every 1 emitted for this word
        for (IntWritable intWritable : values) {
            count += intWritable.get();
        }
        // Emit (word, total)
        context.write(key, new IntWritable(count));
    }
}
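
To see what each of the three phases produces for the sample text, they can be traced in memory with plain Java. This is only a sketch under simplifying assumptions (one process, everything held in lists and maps); it is not how Hadoop itself is implemented, and the class name PhaseTrace and its method names are hypothetical.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Illustration only: mimics map, shuffle and reduce in memory.
final class PhaseTrace {

    // Map phase: one (word, 1) pair per token.
    static List<Map.Entry<String, Integer>> map(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            StringTokenizer words = new StringTokenizer(line);
            while (words.hasMoreTokens()) {
                pairs.add(new SimpleEntry<>(words.nextToken(), 1));
            }
        }
        return pairs;
    }

    // Shuffle phase: collect all values that share a key, sorted by key.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: sum the values of each key.
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            int count = 0;
            for (int one : group.getValue()) {
                count += one;
            }
            totals.put(group.getKey(), count);
        }
        return totals;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("a b c", "b b c", "c d c");
        Map<String, List<Integer>> grouped = shuffle(map(lines));
        System.out.println(grouped);         // state after the shuffle
        System.out.println(reduce(grouped)); // final counts
    }
}
```

For the sample text, the shuffle step groups three 1s under b and four 1s under c, and the reduce step collapses each group into a single total.
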
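4) Running the job

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // The TokenizerMapper and IntSumReducer classes from steps 1) and 3)
    // are nested inside this class.

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String inputPath = "input/wordcount";
        String outputPath = "output/wordcount";
        // String[] otherArgs = (new GenericOptionsParser(conf, args)).getRemainingArgs();
        String[] otherArgs = new String[] { inputPath, outputPath }; // set the arguments directly

        // Delete the output directory first; the job fails if it already exists
        Path outputPath2 = new Path(outputPath);
        outputPath2.getFileSystem(conf).delete(outputPath2, true);

        // Run
        if (otherArgs.length < 2) {
            System.err.println("Usage: wordcount <in> [<in>...] <out>");
            System.exit(2);
        }
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(WordCount.TokenizerMapper.class);
        // job.setCombinerClass(WordCount.IntSumReducer.class);
        job.setReducerClass(WordCount.IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Number of reduce tasks, which is also the number of output files
        // job.setNumReduceTasks(1);
        // Every argument except the last is an input path; the last is the output path
        for (int i = 0; i < otherArgs.length - 1; ++i) {
            FileInputFormat.addInputPath(job, new Path(otherArgs[i]));
        }
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[otherArgs.length - 1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}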
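
The job setup contains a commented-out setCombinerClass call that reuses IntSumReducer. Because addition is associative and commutative, summing part of the data early and summing the partial results later gives the same totals, which is why the reducer can double as a combiner here. The sketch below models that idea in plain Java; it assumes the three sample lines are divided into two input splits, and the class name CombinerSketch and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Illustration only: models a combiner as a per-split partial sum computed before the final merge.
final class CombinerSketch {

    // Partial counts for one input split (roughly what a combiner would emit for one map task).
    static Map<String, Integer> combine(List<String> split) {
        Map<String, Integer> partial = new TreeMap<>();
        for (String line : split) {
            StringTokenizer words = new StringTokenizer(line);
            while (words.hasMoreTokens()) {
                partial.merge(words.nextToken(), 1, Integer::sum);
            }
        }
        return partial;
    }

    // Final reduce: add up the partial counts from every split.
    static Map<String, Integer> reduce(List<Map<String, Integer>> partials) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map<String, Integer> partial : partials) {
            partial.forEach((word, n) -> totals.merge(word, n, Integer::sum));
        }
        return totals;
    }

    public static void main(String[] args) {
        // Assumption: the first two lines land in one split, the third in another.
        List<Map<String, Integer>> partials = new ArrayList<>();
        partials.add(combine(Arrays.asList("a b c", "b b c")));
        partials.add(combine(Arrays.asList("c d c")));
        System.out.println(partials);         // partial counts per split
        System.out.println(reduce(partials)); // same totals as without a combiner
    }
}
```

With this split, only five partial records reach the final merge instead of nine (word, 1) pairs, and the totals are unchanged. In a real job Hadoop decides whether and how often to run the combiner, so the job must produce correct results even if the combiner never runs.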

Please credit the source when reposting. Thanks!
