A First Look at MapReduce Programs -------------- WordCount


Program code

package test;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    /*
     * The input key is the byte offset of the line (a LongWritable,
     * declared here as Object); the input value is the line itself (Text).
     * Text-IntWritable is the output key-value pair type.
     */
    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      // Convert the value produced by TextInputFormat to a plain string and tokenize it
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    /*
     * Text-IntWritable is the key-value pair type coming from the map side;
     * the output is a word-frequency Text-IntWritable pair.
     */
    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();       // job configuration
    Job job = Job.getInstance(conf, "word count");  // initialize the Job
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // set the input path
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // set the output path
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
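As an aside: the submission logs further down warn "Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this." Below is a minimal sketch of such a driver. The class name WordCountDriver is made up for illustration; it reuses the TokenizerMapper and IntSumReducer defined above:

package test;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Driver that implements Tool, so ToolRunner parses generic options
// (-D, -files, -libjars, ...) before they reach run().
public class WordCountDriver extends Configured implements Tool {
  @Override
  public int run(String[] args) throws Exception {
    Job job = Job.getInstance(getConf(), "word count");
    job.setJarByClass(WordCountDriver.class);
    job.setMapperClass(WordCount.TokenizerMapper.class);
    job.setCombinerClass(WordCount.IntSumReducer.class);
    job.setReducerClass(WordCount.IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new Configuration(), new WordCountDriver(), args));
  }
}

It is submitted the same way (hadoop jar ... test.WordCountDriver <in> <out>), and the practical gain is that generic options such as -D mapreduce.job.reduces=1 are then handled for you.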

Create the directory structure

[root@Vm90 wxl]# ls
output
[root@Vm90 wxl]# mkdir wordcount01
[root@Vm90 wxl]# cd wordcount01
[root@Vm90 wordcount01]# mkdir src
[root@Vm90 wordcount01]# mkdir classes
[root@Vm90 wordcount01]# ls
classes  src
[root@Vm90 wordcount01]# cd src/
[root@Vm90 src]# vim WordCount.java
# Paste the code above into WordCount.java,
# then compile:
[root@Vm90 src]# cd ..
[root@Vm90 wordcount01]# ls
classes  src
# Compilation needs three jars:
#     hadoop-common-2.6.0.jar
#     hadoop-mapreduce-client-core-2.6.0.jar
#     hadoop-test-1.2.1.jar
# Pick the jars that match your own Hadoop version; the ones used in this experiment are:
#     hadoop-common-2.6.0-cdh5.5.0.jar
#     hadoop-mapreduce-client-core-2.6.0-cdh5.5.0.jar
#     hadoop-test-2.6.0-mr1-cdh5.5.0.jar
[root@Vm90 wordcount01]# javac -Xlint:deprecation -classpath /opt/cloudera/parcels/CDH/jars/hadoop-common-2.6.0-cdh5.5.0.jar:/opt/cloudera/parcels/CDH/jars/hadoop-mapreduce-client-core-2.6.0-cdh5.5.0.jar:/opt/cloudera/parcels/CDH/jars/hadoop-test-2.6.0-mr1-cdh5.5.0.jar -d classes/ src/*.java

Output (the warning can be ignored):
/opt/cloudera/parcels/CDH/jars/hadoop-common-2.6.0-cdh5.5.0.jar(org/apache/hadoop/fs/Path.class): warning: Cannot find annotation method 'value()' in type 'LimitedPrivate': class file for org.apache.hadoop.classification.InterfaceAudience not found
1 warning

# Build the jar
[root@Vm90 wordcount01]# jar -cvf wordcount.jar classes/*
added manifest
adding: classes/test/(in = 0) (out= 0)(stored 0%)
adding: classes/test/WordCount.class(in = 1516) (out= 815)(deflated 46%)
adding: classes/test/WordCount$TokenizerMapper.class(in = 1746) (out= 758)(deflated 56%)
adding: classes/test/WordCount$IntSumReducer.class(in = 1749) (out= 742)(deflated 57%)
[root@Vm90 wordcount01]# ls
classes  src  wordcount.jar

# Upload the test data
[root@Vm90 input]# cat 2.txt
hello hadoop
bye hadoop
good java
great pytho
[root@Vm90 wxl]# ls
input  output  wordcount01
[root@Vm90 wxl]# hadoop fs -put input /hbase/
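Before submitting to the cluster, it helps to know what output to expect for this data. The following standalone Java sketch (no Hadoop involved; LocalWordCount is a hypothetical name, not part of the job) simulates the counting on the four lines of 2.txt:

import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Local, single-process simulation of what the WordCount job computes
// on the four lines of 2.txt shown above.
public class LocalWordCount {
  public static void main(String[] args) {
    String[] lines = {"hello hadoop", "bye hadoop", "good java", "great pytho"};
    Map<String, Integer> counts = new TreeMap<>();  // sorted by key, like reducer output
    for (String line : lines) {
      StringTokenizer itr = new StringTokenizer(line);
      while (itr.hasMoreTokens()) {
        counts.merge(itr.nextToken(), 1, Integer::sum);  // word -> running count
      }
    }
    counts.forEach((w, c) -> System.out.println(w + "\t" + c));
  }
}

It prints seven word-count pairs (hadoop appears twice, every other word once), which matches the Reduce output records=7 counter in the successful run further down.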

Running the program at this point fails with an error

[root@Vm90 wxl]# hadoop jar wordcount01/wordcount.jar test.WordCount /hbase/input/ /hbase/output11
17/06/22 14:29:48 INFO client.RMProxy: Connecting to ResourceManager at Vm90/172.16.2.90:8032
17/06/22 14:29:49 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
17/06/22 14:29:49 WARN mapreduce.JobResourceUploader: No job jar file set.  User classes may not be found. See Job or Job#setJar(String).
17/06/22 14:29:49 INFO input.FileInputFormat: Total input paths to process : 1
17/06/22 14:29:49 INFO mapreduce.JobSubmitter: number of splits:1
17/06/22 14:29:50 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1497340925516_0008
17/06/22 14:29:50 INFO mapred.YARNRunner: Job jar is not present. Not adding any jar to the list of resources.
17/06/22 14:29:50 INFO impl.YarnClientImpl: Submitted application application_1497340925516_0008
17/06/22 14:29:50 INFO mapreduce.Job: The url to track the job: http://Vm90:8088/proxy/application_1497340925516_0008/
17/06/22 14:29:50 INFO mapreduce.Job: Running job: job_1497340925516_0008
17/06/22 14:29:58 INFO mapreduce.Job: Job job_1497340925516_0008 running in uber mode : false
17/06/22 14:29:58 INFO mapreduce.Job:  map 0% reduce 0%
17/06/22 14:30:03 INFO mapreduce.Job: Task Id : attempt_1497340925516_0008_m_000000_0, Status : FAILED
Error: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class test.WordCount$TokenizerMapper not found
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2199)
    at org.apache.hadoop.mapreduce.task.JobContextImpl.getMapperClass(JobContextImpl.java:196)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:745)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: java.lang.ClassNotFoundException: Class test.WordCount$TokenizerMapper not found
    at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2105)
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2197)
    ... 8 more

The cause is a jar path problem.

The error complains that test.WordCount$TokenizerMapper cannot be found, yet the jar was packed with entries rooted at classes/test/ (it was built with jar -cvf wordcount.jar classes/*). The class loader turns the name test.WordCount$TokenizerMapper into the jar entry test/WordCount$TokenizerMapper.class; with the extra classes/ prefix that entry does not exist, so the map tasks fail. Running jar tf wordcount.jar should list test/WordCount.class, not classes/test/WordCount.class.
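Incidentally, the $ in the message is not garbage: the JVM's binary name for a nested class joins the outer and inner names with $, and Hadoop resolves the mapper by passing exactly that string to Class.forName inside each task. A tiny sketch (assuming test.WordCount is on the classpath):

public class BinaryNameDemo {
  public static void main(String[] args) throws Exception {
    // Nested classes use '$' in their binary name; this is the lookup that
    // fails inside the map task when the jar entries have the wrong prefix.
    Class<?> c = Class.forName("test.WordCount$TokenizerMapper");
    System.out.println(c.getName());  // prints test.WordCount$TokenizerMapper
  }
}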
Rebuild the jar, this time from inside the classes directory

[root@Vm90 wordcount01]# cd classes/
[root@Vm90 classes]# ls
test
[root@Vm90 classes]# jar -cvf ../wordcount.jar test
added manifest
adding: test/(in = 0) (out= 0)(stored 0%)
adding: test/WordCount.class(in = 1516) (out= 815)(deflated 46%)
adding: test/WordCount$TokenizerMapper.class(in = 1746) (out= 758)(deflated 56%)
adding: test/WordCount$IntSumReducer.class(in = 1749) (out= 742)(deflated 57%)
[root@Vm90 classes]# ls
test
[root@Vm90 classes]# cd ..
[root@Vm90 wordcount01]# ls
classes  src  wordcount.jar

# Run again
[root@Vm90 wxl]# hadoop jar wordcount01/wordcount.jar test.WordCount /hbase/input/ /hbase/output12
17/06/22 14:38:15 INFO client.RMProxy: Connecting to ResourceManager at Vm90/172.16.2.90:8032
17/06/22 14:38:16 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
17/06/22 14:38:16 INFO input.FileInputFormat: Total input paths to process : 1
17/06/22 14:38:16 INFO mapreduce.JobSubmitter: number of splits:1
17/06/22 14:38:16 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1497340925516_0009
17/06/22 14:38:17 INFO impl.YarnClientImpl: Submitted application application_1497340925516_0009
17/06/22 14:38:17 INFO mapreduce.Job: The url to track the job: http://Vm90:8088/proxy/application_1497340925516_0009/
17/06/22 14:38:17 INFO mapreduce.Job: Running job: job_1497340925516_0009
17/06/22 14:38:24 INFO mapreduce.Job: Job job_1497340925516_0009 running in uber mode : false
17/06/22 14:38:24 INFO mapreduce.Job:  map 0% reduce 0%
17/06/22 14:38:31 INFO mapreduce.Job:  map 100% reduce 0%
17/06/22 14:38:38 INFO mapreduce.Job:  map 100% reduce 50%
17/06/22 14:38:39 INFO mapreduce.Job:  map 100% reduce 100%
17/06/22 14:38:40 INFO mapreduce.Job: Job job_1497340925516_0009 completed successfully
17/06/22 14:38:40 INFO mapreduce.Job: Counters: 49
    File System Counters
        FILE: Number of bytes read=120
        FILE: Number of bytes written=344705
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=146
        HDFS: Number of bytes written=54
        HDFS: Number of read operations=9
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=4
    Job Counters
        Launched map tasks=1
        Launched reduce tasks=2
        Data-local map tasks=1
        Total time spent by all maps in occupied slots (ms)=4784
        Total time spent by all reduces in occupied slots (ms)=11119
        Total time spent by all map tasks (ms)=4784
        Total time spent by all reduce tasks (ms)=11119
        Total vcore-seconds taken by all map tasks=4784
        Total vcore-seconds taken by all reduce tasks=11119
        Total megabyte-seconds taken by all map tasks=4898816
        Total megabyte-seconds taken by all reduce tasks=11385856
    Map-Reduce Framework
        Map input records=4
        Map output records=8
        Map output bytes=79
        Map output materialized bytes=112
        Input split bytes=99
        Combine input records=8
        Combine output records=7
        Reduce input groups=7
        Reduce shuffle bytes=112
        Reduce input records=7
        Reduce output records=7
        Spilled Records=14
        Shuffled Maps =2
        Failed Shuffles=0
        Merged Map outputs=2
        GC time elapsed (ms)=134
        CPU time spent (ms)=3220
        Physical memory (bytes) snapshot=874729472
        Virtual memory (bytes) snapshot=4710256640
        Total committed heap usage (bytes)=860356608
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters
        Bytes Read=47
    File Output Format Counters
        Bytes Written=54
[root@Vm90 wxl]#
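The framework counters line up with the test data: Map input records=4 is the four lines of 2.txt, Map output records=8 is the eight tokens the mapper emits, the combiner folds the two (hadoop, 1) pairs into one (Combine input records=8, output 7), and Reduce output records=7 is the seven distinct words. Because two reduce tasks ran (Launched reduce tasks=2), the result is split across part-r-00000 and part-r-00001 under /hbase/output12; hadoop fs -cat /hbase/output12/part-r-* prints both. The sketch below does the same through the HDFS Java API (PrintOutput is a hypothetical helper, not part of the job):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Print every part-r-* file under the job output directory given as args[0].
public class PrintOutput {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    for (FileStatus st : fs.globStatus(new Path(args[0], "part-r-*"))) {
      try (BufferedReader in = new BufferedReader(
          new InputStreamReader(fs.open(st.getPath())))) {
        String line;
        while ((line = in.readLine()) != null) {
          System.out.println(line);  // one "word<TAB>count" pair per line
        }
      }
    }
  }
}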


Further reading:
The First MapReduce Program: WordCount
MapReduce Programs: the map and reduce phases
