Getting Started with the Hadoop Big Data Platform: A First Program, WordCount

Source: Internet. Editor: 程序博客网. Date: 2024/06/03 13:20

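First, we need to install and configure Hadoop. We won't repeat that here; for details, see the blog post "Hadoop安装配置" (Hadoop installation and configuration).

Note that Hadoop needs the right file permissions to run correctly. The simplest approach is to give the hadoop directory, and every file under it, the same owner and group:

chown -R hadoop:hadoop <hadoop directory>

That is all it takes.

Once configuration is done, what else do we need?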

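1. Files stored in HDFS.

2. A program jar. As mentioned earlier, when the JobTracker receives the jar it splits the job into map tasks and reduce tasks, and the map tasks read the files in HDFS. (JobTracker is the Hadoop 1.x name; on Hadoop 2.x with YARN, the ResourceManager and a per-job ApplicationMaster fill this role.)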

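Let's look at the goal.

We provide two input files, file1 and file2. After Hadoop processes them, it returns the count of each word across file1 and file2.

As mentioned, Hadoop returns results as <key, value> pairs.

So the result looks like this, with each word followed by its count: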

school 1
hello  3
world  2
...

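So we start by creating two files:

file1 and file2.

Put any text in them; the contents are what gets counted. Here words are separated by spaces, but that is not a requirement: how words are told apart is decided by the map program in the jar we write later.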
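To make the point about word boundaries concrete, here is a minimal plain-Java sketch (no Hadoop classes) comparing two splitting rules. The class name TokenizeSketch, the sample sentence, and the punctuation set are assumptions made for illustration only; they are not part of the original program.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical helper class for illustration; not part of the WordCount job itself.
class TokenizeSketch {

    // Splits on StringTokenizer's default delimiters: space, tab, newline, CR, form feed.
    static List<String> splitOnWhitespace(String line) {
        return collect(new StringTokenizer(line));
    }

    // Splits on whitespace plus a few punctuation marks (an assumed, illustrative set).
    static List<String> splitOnWhitespaceAndPunctuation(String line) {
        return collect(new StringTokenizer(line, " \t\n\r\f,.;:!?"));
    }

    // Drains the tokenizer into a list, preserving token order.
    private static List<String> collect(StringTokenizer tokens) {
        List<String> words = new ArrayList<>();
        while (tokens.hasMoreTokens()) {
            words.add(tokens.nextToken());
        }
        return words;
    }

    public static void main(String[] args) {
        String line = "hello world, hello hadoop.";
        System.out.println(splitOnWhitespace(line));
        System.out.println(splitOnWhitespaceAndPunctuation(line));
    }
}
```

With the whitespace-only rule, "world," and "world" would be counted as different words; the second rule treats them as the same word. Which rule is right depends on the data.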

With the two files written, we need to upload them to HDFS. How?

Before uploading, make sure Hadoop is running; running jps should list the Hadoop processes.

First, create a directory in HDFS.

hdfs dfs -mkdir input_wordcount

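This creates a directory named input_wordcount. Because the path is relative, it is created under the current user's HDFS home directory (/user/<username>), not under the root directory /.

The HDFS command line is very close to the shell: for many common shell commands, writing hdfs dfs - followed by the command name performs the same operation on the HDFS file system.

For example, hdfs dfs -ls with no path lists the files in the user's HDFS home directory.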

With the directory created, we can upload the two files we wrote.

hdfs dfs -put input/* input_wordcount

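Both of my files are in the local input directory, so a shell wildcard (a glob, not a regular expression) uploads them all into the input_wordcount directory. Then we list input_wordcount to check that the upload completed.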

hdfs dfs -ls input_wordcount
Found 4 items
-rw-r--r--   3 haoye supergroup         71 2017-05-06 20:34 input_wordcount/file1
-rw-r--r--   3 haoye supergroup          0 2017-05-06 20:34 input_wordcount/file1~
-rw-r--r--   3 haoye supergroup         74 2017-05-06 20:34 input_wordcount/file2
-rw-r--r--   3 haoye supergroup          0 2017-05-06 20:34 input_wordcount/file2~

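The upload succeeded. (The zero-byte file1~ and file2~ entries look like editor backup files that the wildcard also matched; being empty, they contribute nothing to the count.)

The first requirement is met; next we need the program jar.

Open your IDE and create a Java project; here I create a Maven project.

First, add the dependencies: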

<!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-common -->
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>2.6.0</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-mapreduce-client-core -->
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-mapreduce-client-core</artifactId>
    <version>2.6.0</version>
</dependency>

Then create a WordCount class.

In this class, first create a map class, which must extend Mapper:

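// Imports needed at the top of WordCount.java for this class
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public static class WordCountMap extends
        Mapper<LongWritable, Text, Text, IntWritable> {
    // Reused for every token: the constant count 1 and the current word
    private final IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        // Split the line on whitespace
        StringTokenizer token = new StringTokenizer(line);
        while (token.hasMoreTokens()) {
            word.set(token.nextToken());
            context.write(word, one); // emit <word, 1>
        }
    }
}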

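What does Mapper<LongWritable, Text, Text, IntWritable> mean?

The first two type parameters are the input key and value types; the last two are the output key and value types.

That is, WordCountMap receives a LongWritable key and a Text value, and emits <Text, IntWritable> pairs.

We override the map method. It does not return a value; instead, output is written through the Context object, which here collects <Text, IntWritable> pairs.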
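The role of the four type parameters can be sketched in plain Java. Everything below (TypeParamsSketch, Collector, SimpleMapper, LineLength) is a hypothetical, simplified stand-in invented for this illustration; none of it is the real Hadoop Mapper or Context API, and Long, String, and Integer merely take the place of LongWritable, Text, and IntWritable.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified shapes; these are not the Hadoop Mapper or Context classes.
class TypeParamsSketch {

    // Collects output pairs, standing in for the role Context plays for a mapper.
    static class Collector<KEYOUT, VALUEOUT> {
        final List<String> written = new ArrayList<>();

        void write(KEYOUT key, VALUEOUT value) {
            written.add(key + "=" + value);
        }
    }

    // Four type parameters: the first two describe the input pair, the last two the output pair.
    interface SimpleMapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
        void map(KEYIN key, VALUEIN value, Collector<KEYOUT, VALUEOUT> out);
    }

    // Same shape as Mapper<LongWritable, Text, Text, IntWritable>, using plain Java types.
    // This toy mapper emits each line together with its length.
    static class LineLength implements SimpleMapper<Long, String, String, Integer> {
        @Override
        public void map(Long key, String value, Collector<String, Integer> out) {
            out.write(value, value.length());
        }
    }

    public static void main(String[] args) {
        Collector<String, Integer> out = new Collector<>();
        new LineLength().map(0L, "hello", out);
        System.out.println(out.written);
    }
}
```

The compiler enforces the contract: map can only accept the declared input types, and the collector only accepts pairs of the declared output types.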

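Note the value argument: with the TextInputFormat used here, each value is one line of the input and the key is that line's byte offset in the file, so map is called once for every line in your files.

So the logic is easy to follow: take one line, use StringTokenizer to split it on whitespace into tokens, then iterate over the tokens and write each one to the context with a count of 1.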
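As a rough illustration of the per-line behavior, the following plain-Java sketch imitates the map step without any Hadoop classes. The class name MapSketch, the sample lines, and the use of a returned list in place of the Context are all assumptions for this example; the offset arithmetic assumes ASCII text with single-character newlines.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;

// Plain-Java stand-in for the map step; no Hadoop classes are used here.
class MapSketch {

    // Mirrors WordCountMap.map: emits one (word, 1) pair per token in the line.
    // The returned list plays the role of the pairs written to the Context.
    static List<Map.Entry<String, Integer>> map(long offset, String line) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        StringTokenizer tokens = new StringTokenizer(line);
        while (tokens.hasMoreTokens()) {
            emitted.add(new SimpleEntry<String, Integer>(tokens.nextToken(), 1));
        }
        return emitted;
    }

    public static void main(String[] args) {
        // Hypothetical file content; map is invoked once per line.
        String[] lines = {"hello world", "hello hadoop"};
        long offset = 0;
        for (String line : lines) {
            System.out.println(offset + " -> " + map(offset, line));
            offset += line.length() + 1; // +1 for the newline (ASCII text assumed)
        }
    }
}
```

Note that the map step does no counting of its own: a word that appears twice is simply emitted twice, each time with the value 1.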

Next we write the reduce class, which likewise extends Reducer:

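// Imports needed at the top of WordCount.java for this class
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public static class WordCountReduce extends
        Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context) throws IOException, InterruptedException {
        // Sum every count emitted for this word
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum)); // emit <word, total>
    }
}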

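WordCountReduce receives <Text, IntWritable> pairs grouped by key, combines the values for each key, and writes the result as a new <Text, IntWritable> pair.

The essential part is overriding the reduce method; as before, output is written through the context.

When is reduce called? Only after all map tasks have finished and the reduce task is running.

So each call to reduce receives one key together with all the values that the map tasks emitted for that key. We simply sum them and write the total to the context.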
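The grouping and summing can be imitated in plain Java to see what a single reduce call works with. This is only a sketch under stated assumptions: ReduceSketch and countWords are hypothetical names, the input list is made-up map output, and a TreeMap stands in for the framework's grouping of map output by key.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java stand-in for the shuffle and reduce steps; no Hadoop classes are used.
class ReduceSketch {

    // Mirrors WordCountReduce.reduce: sums every value that arrived for one key.
    static int reduce(String key, Iterable<Integer> values) {
        int sum = 0;
        for (int val : values) {
            sum += val;
        }
        return sum;
    }

    // Groups words the way the framework groups map output by key, then reduces each group.
    // A TreeMap keeps keys sorted, similar to how reducer input arrives in key order.
    static Map<String, Integer> countWords(List<String> mappedWords) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String word : mappedWords) {
            grouped.computeIfAbsent(word, k -> new ArrayList<>()).add(1);
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            counts.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical map output keys; each carries an implicit value of 1.
        List<String> mapped = Arrays.asList("hello", "world", "hello", "hadoop");
        System.out.println(countWords(mapped));
    }
}
```

Here reduce is called three times, once each for hadoop, hello, and world, and the call for hello sees two values.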

Finally, we write the main method as the entry point, which wires the two classes together.

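// Imports needed at the top of WordCount.java for the driver
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Job.getInstance replaces the deprecated new Job(conf) constructor
    Job job = Job.getInstance(conf);
    job.setJarByClass(WordCount.class);
    job.setJobName("wordcount");
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setMapperClass(WordCountMap.class);
    job.setReducerClass(WordCountReduce.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    // args[0] is the HDFS input path, args[1] the output path (must not exist yet)
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    // Block until the job finishes and report success through the exit code
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}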

Here we are mainly telling the JobTracker which classes to run.

With the classes written, what we need is a jar, so we package the program as a jar.

With the jar in hand, we submit it to Hadoop as a job. How?

hadoop jar WordCount.jar WordCount input_wordcount output_wordcount

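The first part, hadoop jar WordCount.jar WordCount, submits the jar and names the main class.
The last two arguments are our own; they arrive in main as args[0] and args[1], so the input path is input_wordcount and the output path is output_wordcount.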
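Since main reads args[0] and args[1] directly, a missing argument would surface as an ArrayIndexOutOfBoundsException. One possible guard is sketched below; ArgsSketch, parse, and the usage message are hypothetical additions for illustration and are not part of the original WordCount class.

```java
// Hypothetical helper showing how the two trailing arguments reach main as args[0] and args[1].
class ArgsSketch {

    // Returns {inputPath, outputPath}, or throws if the argument count is wrong.
    static String[] parse(String[] args) {
        if (args.length != 2) {
            throw new IllegalArgumentException("Usage: WordCount <input path> <output path>");
        }
        return new String[] {args[0], args[1]};
    }

    public static void main(String[] args) {
        // Simulates the arguments from the hadoop jar command line used in this article.
        String[] paths = parse(new String[] {"input_wordcount", "output_wordcount"});
        System.out.println("input  = " + paths[0]);
        System.out.println("output = " + paths[1]);
    }
}
```

A check like this could run at the start of main, before the job is configured, so that a mistyped command fails with a readable message.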

After the job finishes, we can see:

hdfs dfs -ls
Found 2 items
drwxr-xr-x   - haoye supergroup          0 2017-05-06 20:34 input_wordcount
drwxr-xr-x   - haoye supergroup          0 2017-05-06 20:40 output_wordcount

hdfs dfs -ls output_wordcount
Found 2 items
-rw-r--r--   3 haoye supergroup          0 2017-05-06 20:40 output_wordcount/_SUCCESS
-rw-r--r--   3 haoye supergroup         83 2017-05-06 20:40 output_wordcount/part-r-00000

Here part-r-00000 is the result file.

We can view its contents:

hdfs dfs -cat output_wordcount/part-r-00000
api     1
file    3
free    2
hadoop  7
hello   3
home    1
java    2
new     2
school  1
system  1
world   2

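And there is the result.

To run a job, Hadoop needs data in HDFS and a jar for the job. The jar contains the methods for the map tasks and reduce tasks; hand it to the JobTracker and it runs the job. Very convenient.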

1 0