Hadoop and Spark Algorithm Analysis (1): WordCount


WordCount is the introductory program of big data programming. It counts how many times each word appears in the input files and can be applied to word-frequency retrieval over massive text. The process is shown in the figure below:

[Figure: WordCount processing flow]

1. Hadoop Implementation

The map phase calls the map function with the byte offset of each line's first character as the key and the line itself as the value; it splits the line into words and emits a (word, 1) key-value pair for each word.
The reduce phase receives (word, list(1, 1, …, 1)) key-value pairs collected from the map side, sums each value list to obtain the word's frequency, and outputs (word, count) key-value pairs.
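For example, given a hypothetical two-line input "hello world" and "hello hadoop", the map phase emits (hello, 1), (world, 1), (hello, 1), (hadoop, 1); the shuffle groups these by key into (hello, [1, 1]), (world, [1]), (hadoop, [1]); and the reduce phase sums each list to produce (hello, 2), (world, 1), (hadoop, 1).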
The full Java implementation is as follows:

package org.hadoop.test;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {

    // Mapper: splits each line into words and emits a (word, 1) pair per word
    public static class TokenizerMapper extends
            Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer line = new StringTokenizer(value.toString());
            while (line.hasMoreTokens()) {
                word.set(line.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer: sums the 1s for each word and emits (word, count)
    public static class IntSumReducer extends
            Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length < 2) {
            System.err.println("Usage: wordcount <in> <out>");
            System.exit(2);
        }
        Job job = new Job(conf, "WordCount");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation on the map side
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
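Note that IntSumReducer is also registered as the job's combiner, so partial sums are computed on the map side before the shuffle; this is safe for word counting because integer addition is associative and commutative.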

2. Spark Implementation

The Spark implementation follows the same idea as Hadoop MapReduce, so it is not described again in detail. Thanks to Scala's functional programming style, the code is very concise. The full Scala implementation is as follows:

import org.apache.spark.{SparkConf, SparkContext}

/**
  * Created by rose on 16-4-20.
  */
object WordCount {
  def main(args: Array[String]): Unit = {
    if (args.length < 2) {
      println("Usage: <in> <out>")
      return
    }
    val conf = new SparkConf().setAppName("WordCount")
    val sc = new SparkContext(conf)
    val textRDD = sc.textFile(args(0))      // read the input file(s)
    val result = textRDD
      .flatMap(_.split(" "))                // split each line into words
      .map(word => (word, 1))               // emit (word, 1)
      .reduceByKey(_ + _)                   // sum the counts per word
    result.saveAsTextFile(args(1))          // write (word, count) pairs
  }
}
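As a quick sanity check, the same pipeline can be run interactively in spark-shell against a small in-memory collection (the input strings below are made up for illustration):

// inside spark-shell, where the SparkContext sc is already provided
val lines = sc.parallelize(Seq("hello world", "hello spark"))   // hypothetical input
val counts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
counts.collect().foreach(println)   // prints pairs such as (hello,2), (world,1), (spark,1)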

3. Running the Programs

1) Upload local files to HDFS
Create an input directory on HDFS:

$hadoop fs -mkdir -p wordcount/input

Upload the local files to the cluster's input directory:

$hadoop fs -put ~/file* wordcount/input

List the files in the input directory:

$hadoop fs -ls wordcount/input

2) Run the programs
Package the WordCount program into a jar file named WordCount.jar and change into the directory that contains it (the example below uses a single input file, file, and one output directory, output).
Run the Hadoop job with the following command:

$hadoop jar ~/hadoop/WordCount.jar org.hadoop.test.WordCount wordcount/input/file wordcount/hadoop/output

Run the Spark job with the following command:

$spark-submit --master yarn-client --class WordCount ~/spark/WordCount.jar hdfs://master:9000/wordcount/input/file hdfs://master:9000/wordcount/spark/output
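(On newer Spark releases the yarn-client master string is deprecated; the equivalent invocation is --master yarn --deploy-mode client.)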

3) Check the results
View the Hadoop output directory:

$hadoop fs -ls wordcount/hadoop/output

View the Spark output directory:

$hadoop fs -ls wordcount/spark/output
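The counts themselves are written to part files inside each output directory (by default part-r-00000 for MapReduce and part-00000 for Spark), so a result can be printed with, for example:

$hadoop fs -cat wordcount/hadoop/output/part-r-00000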

4. Benchmark Comparison

[Figure: WordCount execution-time comparison between Hadoop and Spark]

The figure above compares the WordCount benchmark results; the data sets are excerpted from web novels. As the test set grows, the execution time of the WordCount program increases. Overall, because Spark must first load the data from HDFS into an RDD before computing, Hadoop performs slightly better than Spark on this algorithm.