Hadoop Learning 1: Implementing WordCount

To understand how Hadoop works at its core, WordCount is the place to start. Now let's go.
1. The Mapper class

The Mapper's four generic type parameters describe two key/value pairs: the key/value passed into the map method, and the key/value it emits toward the reduce phase.


import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MapperClass extends Mapper<Object, Text, Text, IntWritable> {
    // Text is the Hadoop counterpart of String; IntWritable of int.
    // Type parameters: input key (byte offset of the line), input value (the line),
    // output key (a word), output value (its count).
    public Text keyText = new Text("key");
    public IntWritable intValue = new IntWritable(1);

    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // step 1: get the line as a String
        String str = value.toString();
        // step 2: split on whitespace (StringTokenizer's default delimiters)
        StringTokenizer stringTokenizer = new StringTokenizer(str);
        while (stringTokenizer.hasMoreTokens()) {
            keyText.set(stringTokenizer.nextToken());
            context.write(keyText, intValue);  // emit key/value, e.g. ("My", 1)
        }
    }
}

2. The Reducer class

The Reducer's four generic type parameters likewise describe two key/value pairs: the key/value that the map phase sends to reduce, and the key/value that reduce writes out.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class ReduceClass extends Reducer<Text, IntWritable, Text, IntWritable> {
    public IntWritable intValue = new IntWritable(0);

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // e.g. the map output ("name", 1), ("name", 1) arrives here as "name", [1, 1]
        int sum = 0;
        for (IntWritable value : values) {  // iterate once over the grouped values
            sum += value.get();
        }
        intValue.set(sum);
        context.write(key, intValue);
    }
}

3. The WordCount driver class

The WordCount class wires the map and reduce classes into a Job and configures the output key/value types.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static void main(String[] args)
            throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        // check the argument count before touching args[0]/args[1]
        if (args.length != 2) {
            System.out.println("Usage: wordcount <in> <out>");
            System.exit(2);
        }
        Path in = new Path(args[0]);
        Path out = new Path(args[1]);
        Job job = new Job(conf, "WordCount");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(MapperClass.class);
        job.setReducerClass(ReduceClass.class);  // the ReduceClass defined above
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, in);
        FileOutputFormat.setOutputPath(job, out);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
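It can help to see what these three classes compute once the framework is stripped away. The following is a purely illustrative plain-Java sketch (no Hadoop involved; LocalWordCount is a made-up name) of the same map, group-by-key, reduce logic, applied to the first line of the input file shown below:

import java.util.Map;
import java.util.TreeMap;

public class LocalWordCount {
    public static void main(String[] args) {
        String line = "ps input key/value pairs to a set of intermediate key/value pairs.";
        // "map" emits (word, 1) for each token; the TreeMap plays the role of the
        // shuffle (grouping by key, sorted), and merge() does the reducer's summing
        Map<String, Integer> counts = new TreeMap<>();
        for (String word : line.split("\\s+")) {
            counts.merge(word, 1, Integer::sum);
        }
        counts.forEach((word, count) -> System.out.println(word + "\t" + count));
    }
}

In the real job, the map calls run in parallel across input splits and the grouping happens in the shuffle between the map and reduce tasks; the computed counts are the same.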


How to run:

At the shell prompt, enter

$ hadoop jar WordCount.jar com.itcast.hadoop.mapreduce.WordCount /user/hadoop/WordCount/WC.txt /user/hadoop/WordCount/output
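
If WC.txt is not yet on HDFS, it can be staged first, and the result read back after the job finishes. A quick sketch using standard HDFS shell commands (the paths match the command above; part-r-00000 is the default name of the first reducer's output file, and the output directory must not exist before the job runs):

$ hadoop fs -mkdir -p /user/hadoop/WordCount
$ hadoop fs -put WC.txt /user/hadoop/WordCount/
$ hadoop fs -cat /user/hadoop/WordCount/output/part-r-00000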

Input file:

ps input key/value pairs to a set of intermediate key/value pairs.

Maps are the individual tasks which transform input records into a intermediate records. The transformed intermediate records need not be of the same type as the input records. A given input pair may map to zero or many output pairs.

Output file:

A 1
Maps 1
The 1
a 2
are 1
as 1
be 1
given 1
individual 1
input 4
intermediate 3
into 1
key/value 2
many 1
map 1
may 1
need 1
not 1
of 2
or 1
output 1
pair 1
pairs 1
pairs. 2
ps 1
records 2
records. 2
same 1
set 1
tasks 1
the 3
to 2
transform 1
transformed 1
type 1
which 1
zero 1

Shortcoming:

Because the program relies on StringTokenizer's default delimiters (whitespace only), it does not recognize a token like "pairs." as the word "pairs", so the two are counted separately in the output above.
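
One way to address this is to strip leading and trailing punctuation from each token before emitting it, so that "pairs." and "pairs" are counted together while case is still preserved. A minimal sketch (NormalizingMapper is a hypothetical name, not part of the program above):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class NormalizingMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final Text keyText = new Text();
    private final IntWritable one = new IntWritable(1);

    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // split on whitespace, then trim non-word characters from both ends,
        // so "pairs." is emitted as "pairs" (interior characters like the
        // slash in "key/value" are left alone)
        for (String token : value.toString().split("\\s+")) {
            String word = token.replaceAll("^\\W+|\\W+$", "");
            if (!word.isEmpty()) {
                keyText.set(word);
                context.write(keyText, one);
            }
        }
    }
}

Swapping it in would only require changing job.setMapperClass(MapperClass.class) to job.setMapperClass(NormalizingMapper.class) in the driver.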