Hadoop and Spark Algorithm Analysis (1): WordCount
WordCount is the classic introductory program of big data processing. It counts how many times each word appears in the input files, and the same idea applies to term-frequency retrieval over massive text collections.
1. Hadoop implementation
In the map phase, the map function is called with the byte offset of the first character of each line as the key and the whole line as the value. It splits the line into words and emits a (word, 1) key-value pair for each word.
In the reduce phase, the reducer collects (word, list(1, 1, …, 1)) pairs gathered from the map tasks, sums the values in each list to obtain that word's frequency, and outputs (word, count) pairs.
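For example, tracing one illustrative input line through the two phases:

input line:     "hello world hello"
map output:     (hello, 1) (world, 1) (hello, 1)
after shuffle:  (hello, [1, 1])  (world, [1])
reduce output:  (hello, 2) (world, 1)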
The full code is as follows:
package org.hadoop.test;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {

    // Mapper: splits each input line into words and emits a (word, 1) pair per word.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer line = new StringTokenizer(value.toString());
            while (line.hasMoreTokens()) {
                word.set(line.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer: sums the counts for each word and emits (word, total).
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length < 2) {
            System.err.println("Usage: wordcount <in> <out>");
            System.exit(2);
        }
        Job job = new Job(conf, "WordCount");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        // Reusing the reducer as a combiner pre-aggregates counts on the map side,
        // which reduces the amount of data shuffled across the network.
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
2. Spark implementation
The Spark version follows the same idea as the Hadoop MapReduce version, so it is not described again in detail. Thanks to Scala's functional programming style, the code is very concise. The full code is as follows:
import org.apache.spark.{SparkConf, SparkContext}

/**
 * Created by rose on 16-4-20.
 */
object WordCount {
  def main(args: Array[String]): Unit = {
    if (args.length < 2) {
      println("Usage: <in> <out>")
      return
    }
    val conf = new SparkConf().setAppName("WordCount")
    val sc = new SparkContext(conf)
    // Read the input file, split each line into words, map every word to (word, 1),
    // and sum the counts per word.
    val textRDD = sc.textFile(args(0))
    val result = textRDD.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
    result.saveAsTextFile(args(1))
  }
}
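For a quick check without a cluster, the same pipeline can be run against an in-memory collection. This is a minimal sketch, assuming Spark is on the classpath; the local[2] master, the object name WordCountLocalTest, and the sample lines are illustrative and not part of the original program:

import org.apache.spark.{SparkConf, SparkContext}

object WordCountLocalTest {
  def main(args: Array[String]): Unit = {
    // Run on a local master so no cluster is needed for the check.
    val conf = new SparkConf().setAppName("WordCountLocalTest").setMaster("local[2]")
    val sc = new SparkContext(conf)
    // A tiny in-memory dataset standing in for the HDFS input file.
    val lines = sc.parallelize(Seq("hello world", "hello spark"))
    val counts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
    counts.collect().foreach(println) // prints (hello,2), (world,1), (spark,1) in some order
    sc.stop()
  }
}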
3. Running the programs
1) Upload local files to HDFS
Create the input directory on HDFS:
$hadoop fs -mkdir -p wordcount/input
Upload the local files to the cluster's input directory:
$hadoop fs -put ~/file* wordcount/input
List the uploaded files:
$hadoop fs -ls wordcount/input
2) Run the programs
Package the WordCount program into a jar archive named WordCount.jar and change into the directory containing it (the example below uses a single input file named file and one output directory).
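For example, assuming the compiled Hadoop classes sit under a local classes/ directory (a hypothetical path), the jar can be built with the JDK's jar tool; if the Spark code lives in an sbt project, sbt package produces its jar:
$jar cvf WordCount.jar -C classes/ .
$sbt package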
Run the Hadoop program with the following command:
$hadoop jar ~/hadoop/WordCount.jar org.hadoop.test.WordCount wordcount/input/file wordcount/hadoop/output
Run the Spark program with the following command:
$spark-submit --master yarn-client --class WordCount ~/spark/WordCount.jar hdfs://master:9000/wordcount/input/file hdfs://master:9000/wordcount/spark/output
3) Check the results
View the Hadoop output:
$hadoop fs -ls wordcount/hadoop/output
View the Spark output:
$hadoop fs -ls wordcount/spark/output
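To see the (word, count) pairs themselves rather than just the directory listing, the part files can be printed (file names vary with the number of reducers or partitions, hence the wildcard):
$hadoop fs -cat wordcount/hadoop/output/part-*
$hadoop fs -cat wordcount/spark/output/part-*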
4. Test comparison
The datasets for the WordCount comparison tests were excerpted from online novels. As the test set grows, the execution time of the WordCount programs increases. Overall, because Spark must first load the data from HDFS into RDDs before computing, Hadoop performs slightly better than Spark on this algorithm.