Hadoop study notes (4) -- MapReduce (wordcount)
I won't go over MapReduce theory here; it was covered in an earlier post.
This post walks through writing a wordcount program in Java on the MapReduce model, used to count word occurrences.
The required jar packages are the same as in the previous post.
Coding
TokenizerMapper.java
package com.cwh.mapreduce;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // Get one line of text and convert it to a String
        String line = value.toString();
        // Split the line into words
        String[] words = line.split(" ");
        // Emit (word, 1) for every word on the line
        for (String word : words) {
            context.write(new Text(word), new IntWritable(1));
        }
    }
}
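One caveat (my own note, not from the original post): `split(" ")` splits on a single space, so lines with consecutive spaces produce empty tokens, and the empty string would be counted as a "word". Splitting on a whitespace run with `split("\\s+")` after a trim is more robust. A quick plain-Java check:

```java
public class SplitCheck {
    public static void main(String[] args) {
        // Splitting on a single space leaves an empty token between repeated spaces
        String line = "hello  world";
        String[] naive = line.split(" ");
        System.out.println(naive.length);   // 3: "hello", "", "world"

        // Splitting on a whitespace run avoids the empty tokens
        String[] robust = line.trim().split("\\s+");
        System.out.println(robust.length);  // 2: "hello", "world"
    }
}
```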
IntSumReducer.java
package com.cwh.mapreduce;

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> value, Context context)
            throws IOException, InterruptedException {
        // Sum up the 1s the mapper emitted for this word
        Iterator<IntWritable> values = value.iterator();
        int count = 0;
        while (values.hasNext()) {
            count += values.next().get();
        }
        context.write(key, new IntWritable(count));
    }
}
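To see what the framework does between these two classes without a cluster, here is a plain-Java sketch (my own illustration, JDK only, not the Hadoop API) of the map → shuffle/sort → reduce flow for wordcount:

```java
import java.util.Map;
import java.util.TreeMap;

public class LocalWordCount {
    // Mimics map (emit (word, 1)), shuffle (group by key, sorted) and reduce (sum)
    public static Map<String, Integer> wordCount(String[] lines) {
        // TreeMap keeps keys sorted, like the shuffle phase's sort by key
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            for (String word : line.split(" ")) {    // map: split the line into words
                counts.merge(word, 1, Integer::sum); // reduce: sum the 1s per word
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = wordCount(new String[] {"hello hadoop", "hello world"});
        // Prints word<TAB>count per line, the same shape as the job's result file
        counts.forEach((word, count) -> System.out.println(word + "\t" + count));
    }
}
```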
WordCount.java
package com.cwh.mapreduce;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        Job wordCountJob = Job.getInstance(conf);

        // Important: specify the jar that contains this job
        wordCountJob.setJarByClass(WordCount.class);

        // Set the mapper class for this job
        wordCountJob.setMapperClass(TokenizerMapper.class);
        // Set the reducer class for this job
        wordCountJob.setReducerClass(IntSumReducer.class);

        // Set the key/value types of the map output
        wordCountJob.setMapOutputKeyClass(Text.class);
        wordCountJob.setMapOutputValueClass(IntWritable.class);

        // Set the key/value types of the final output
        wordCountJob.setOutputKeyClass(Text.class);
        wordCountJob.setOutputValueClass(IntWritable.class);

        // Set the path of the input text data
        FileInputFormat.setInputPaths(wordCountJob, "hdfs://192.168.27.131:9000/hdfsTest/");
        // The output directory must not exist yet, or the job will fail
        FileOutputFormat.setOutputPath(wordCountJob, new Path("hdfs://192.168.27.131:9000/hdfsTest/output/"));

        // Submit the job to the Hadoop cluster and wait for completion
        boolean flag = wordCountJob.waitForCompletion(true);
        if (flag) {
            System.out.println("Job succeeded!");
        } else {
            System.out.println("Job failed!");
        }
        // Exit with 0 on success and 1 on failure
        System.exit(flag ? 0 : 1);
    }
}
Running and testing
I developed and ran this in Eclipse on Windows, so I got the following error:

java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries

Just download https://github.com/srccodes/hadoop-common-2.2.0-bin, unzip it, configure the environment variables as below, and reboot.
Add a HADOOP_HOME variable pointing at the unzipped directory
Then append %HADOOP_HOME%\bin to Path
Append %HADOOP_HOME%\bin\winutils.exe; to classpath
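The "null\bin\winutils.exe" in the error message means the JVM saw no HADOOP_HOME at all. A small snippet of my own (not from the original post) to check whether the variable is visible to your Eclipse-launched JVM, and what path would be resolved from it:

```java
public class EnvCheck {
    // Builds the winutils.exe path the way it would be resolved from HADOOP_HOME;
    // returns null when the variable is unset (which yields the "null\bin\..." error)
    static String winutilsPath(String hadoopHome) {
        return hadoopHome == null ? null : hadoopHome + "\\bin\\winutils.exe";
    }

    public static void main(String[] args) {
        String home = System.getenv("HADOOP_HOME");
        System.out.println(home == null
                ? "HADOOP_HOME not set -- restart after configuring it"
                : "would use: " + winutilsPath(home));
    }
}
```

Note that Eclipse inherits the environment from when it was started, so it must be restarted (or the machine rebooted, as above) after setting the variable.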
Running again then reports a permission error; I simply disabled HDFS permission checking.
Edit hdfs-site.xml, add the following, and restart Hadoop afterwards:
<configuration>
    <property>
        <name>dfs.permissions</name>
        <value>false</value>
    </property>
</configuration>
The previous post uploaded a file named text.txt to the hdfsTest directory; we can use it directly here. The contents of text.txt are as follows:
After the job runs, the Hadoop client shows the following:
You can see two files were generated; part-r-00000 is our result file, which can be downloaded and opened to view:
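The downloaded part-r-00000 is plain text: one word and its count per line, separated by a tab and sorted by key (that is what the default TextOutputFormat writes for Text/IntWritable pairs). A small plain-Java sketch of my own, with made-up sample lines, showing how such a file parses back into a map:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ResultParser {
    // Parses "word<TAB>count" lines as found in a part-r-00000 result file
    public static Map<String, Integer> parse(String[] lines) {
        Map<String, Integer> result = new LinkedHashMap<>();
        for (String line : lines) {
            String[] parts = line.split("\t");
            result.put(parts[0], Integer.parseInt(parts[1]));
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical sample lines standing in for a downloaded result file
        Map<String, Integer> result = parse(new String[] {"hadoop\t1", "hello\t2"});
        System.out.println(result.get("hello")); // 2
    }
}
```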
OK! With that we've implemented a simple wordcount.