Hadoop study notes (4) -- MapReduce (wordcount)
I won't go over MapReduce theory here; it was covered in an earlier post.
This post walks through writing a wordcount program in Java on the MapReduce model, used to count word occurrences.
The required jar packages are the same as in the previous post.
Coding
TokenizerMapper.java
package com.cwh.mapreduce;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // Get one line of text and convert it to a String
        String line = value.toString();
        // Split the line into words
        String[] words = line.split(" ");
        // Emit (word, 1) for every word on the line
        for (String word : words) {
            context.write(new Text(word), new IntWritable(1));
        }
    }
}
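One caveat (my own note, not from the original post): `split(" ")` splits on a single space, so lines with consecutive spaces produce empty tokens, and the empty string would be counted as a "word". Splitting on a whitespace run with `split("\\s+")` after a trim is more robust. A quick plain-Java check:

```java
public class SplitCheck {
    public static void main(String[] args) {
        // Splitting on a single space leaves an empty token between repeated spaces
        String line = "hello  world";
        String[] naive = line.split(" ");
        System.out.println(naive.length);   // 3: "hello", "", "world"

        // Splitting on a whitespace run avoids the empty tokens
        String[] robust = line.trim().split("\\s+");
        System.out.println(robust.length);  // 2: "hello", "world"
    }
}
```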
IntSumReducer.java
package com.cwh.mapreduce;

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> value, Context context)
            throws IOException, InterruptedException {
        // Sum up the 1s the mapper emitted for this word
        Iterator<IntWritable> values = value.iterator();
        int count = 0;
        while (values.hasNext()) {
            count += values.next().get();
        }
        context.write(key, new IntWritable(count));
    }
}
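To see what the framework does between these two classes without a cluster, here is a plain-Java sketch (my own illustration, JDK only, not the Hadoop API) of the map → shuffle/sort → reduce flow for wordcount:

```java
import java.util.Map;
import java.util.TreeMap;

public class LocalWordCount {
    // Mimics map (emit (word, 1)), shuffle (group by key, sorted) and reduce (sum)
    public static Map<String, Integer> wordCount(String[] lines) {
        // TreeMap keeps keys sorted, like the shuffle phase's sort by key
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            for (String word : line.split(" ")) {    // map: split the line into words
                counts.merge(word, 1, Integer::sum); // reduce: sum the 1s per word
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = wordCount(new String[] {"hello hadoop", "hello world"});
        // Prints word<TAB>count per line, the same shape as the job's result file
        counts.forEach((word, count) -> System.out.println(word + "\t" + count));
    }
}
```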
WordCount.java
package com.cwh.mapreduce;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        Job wordCountJob = Job.getInstance(conf);

        // Important: specify the jar that contains this job
        wordCountJob.setJarByClass(WordCount.class);

        // Set the mapper class for this job
        wordCountJob.setMapperClass(TokenizerMapper.class);
        // Set the reducer class for this job
        wordCountJob.setReducerClass(IntSumReducer.class);

        // Set the key/value types of the map output
        wordCountJob.setMapOutputKeyClass(Text.class);
        wordCountJob.setMapOutputValueClass(IntWritable.class);

        // Set the key/value types of the final output
        wordCountJob.setOutputKeyClass(Text.class);
        wordCountJob.setOutputValueClass(IntWritable.class);

        // Set the path of the input text data
        FileInputFormat.setInputPaths(wordCountJob, "hdfs://192.168.27.131:9000/hdfsTest/");
        // The output directory must not exist yet, or the job will fail
        FileOutputFormat.setOutputPath(wordCountJob, new Path("hdfs://192.168.27.131:9000/hdfsTest/output/"));

        // Submit the job to the Hadoop cluster and wait for completion
        boolean flag = wordCountJob.waitForCompletion(true);
        if (flag) {
            System.out.println("Job succeeded!");
        } else {
            System.out.println("Job failed!");
        }
        // Exit with 0 on success and 1 on failure
        System.exit(flag ? 0 : 1);
    }
}
Running and testing
I developed and ran this in Eclipse on Windows, so I got the following error:

java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries

Just download https://github.com/srccodes/hadoop-common-2.2.0-bin, unzip it, configure the environment variables as below, and reboot.
Add a HADOOP_HOME variable pointing at the unzipped directory
Then append %HADOOP_HOME%\bin to Path
Append %HADOOP_HOME%\bin\winutils.exe; to classpath
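The "null\bin\winutils.exe" in the error message means the JVM saw no HADOOP_HOME at all. A small snippet of my own (not from the original post) to check whether the variable is visible to your Eclipse-launched JVM, and what path would be resolved from it:

```java
public class EnvCheck {
    // Builds the winutils.exe path the way it would be resolved from HADOOP_HOME;
    // returns null when the variable is unset (which yields the "null\bin\..." error)
    static String winutilsPath(String hadoopHome) {
        return hadoopHome == null ? null : hadoopHome + "\\bin\\winutils.exe";
    }

    public static void main(String[] args) {
        String home = System.getenv("HADOOP_HOME");
        System.out.println(home == null
                ? "HADOOP_HOME not set -- restart after configuring it"
                : "would use: " + winutilsPath(home));
    }
}
```

Note that Eclipse inherits the environment from when it was started, so it must be restarted (or the machine rebooted, as above) after setting the variable.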
Running again then reports a permission error; I simply disabled HDFS permission checking.
Edit hdfs-site.xml, add the following, and restart Hadoop afterwards:
<configuration>
    <property>
        <name>dfs.permissions</name>
        <value>false</value>
    </property>
</configuration>
The previous post uploaded a file named text.txt to the hdfsTest directory; we can use it directly here. The contents of text.txt are as follows:
After the job runs, the Hadoop client shows the following:
You can see two files were generated; part-r-00000 is our result file, which can be downloaded and opened to view:
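The downloaded part-r-00000 is plain text: one word and its count per line, separated by a tab and sorted by key (that is what the default TextOutputFormat writes for Text/IntWritable pairs). A small plain-Java sketch of my own, with made-up sample lines, showing how such a file parses back into a map:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ResultParser {
    // Parses "word<TAB>count" lines as found in a part-r-00000 result file
    public static Map<String, Integer> parse(String[] lines) {
        Map<String, Integer> result = new LinkedHashMap<>();
        for (String line : lines) {
            String[] parts = line.split("\t");
            result.put(parts[0], Integer.parseInt(parts[1]));
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical sample lines standing in for a downloaded result file
        Map<String, Integer> result = parse(new String[] {"hadoop\t1", "hello\t2"});
        System.out.println(result.get("hello")); // 2
    }
}
```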
OK! With that we've implemented a simple wordcount.