Hadoop — A WordCount Example
Please credit the original source when reposting: http://blog.csdn.net/l1028386804/article/details/78238100
Recently, many readers interested in big data have written to me asking for articles on big data analysis — something they can use as a reference from getting started, to becoming productive, to mastering the subject. Having worked in this field for many years, I am glad to have earned the recognition of many peers, and I would like to share some of the experience and lessons I have accumulated. So today I bring you a classic Hadoop starter: the WordCount example.
I. Preparation

1. Installing Hadoop

(1) Pseudo-distributed installation
See the post "Hadoop: Setting Up a Hadoop 2.4.1 Pseudo-Distributed Environment".

(2) Cluster installation
See the post "Hadoop: Configuring a Distributed Environment with CentOS + Hadoop 2.5.2".

(3) Highly available cluster installation
See the posts "Hadoop: Building a Hadoop 2.5.2 HA High-Availability Cluster (Hadoop + Zookeeper) — Preparation" and "Hadoop: Building a Hadoop 2.5.2 HA High-Availability Cluster (Hadoop + Zookeeper)".

2. Configuring Eclipse
All of the code in this example is developed and run in Eclipse. See the post "Hadoop: Configuring Windows 7 + Eclipse + Hadoop 2.5.2" to set up your Eclipse so that this example, and the Hadoop examples that follow, can be run directly inside the IDE.
II. Program Development

1. WCMapper: the word-counting Mapper class
package com.lyz.hdfs.mr.worldcount;

import java.io.IOException;

import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 * Mapper that counts words.
 * KEYIN:    the map input key — the byte offset of the current line in the input
 * VALUEIN:  the text of the current line
 * KEYOUT:   a single word
 * VALUEOUT: the count emitted per occurrence — always 1 in this example
 * @author liuyazhuang
 */
public class WCMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        String[] words = StringUtils.split(line, " ");
        for (String word : words) {
            context.write(new Text(word), new LongWritable(1));
        }
    }
}
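To see what a single map() call produces, the tokenize-and-emit-1 step can be exercised in plain Java, outside of Hadoop. This is only an illustrative sketch: the class name MapLogicDemo is invented here, and String.split plus an empty-token filter stands in for commons-lang's StringUtils.split (which collapses repeated separators).

```java
import java.util.ArrayList;
import java.util.List;

public class MapLogicDemo {

    // Mirrors WCMapper.map(): split one line on spaces and emit a (word, "1") pair per token.
    static List<String[]> mapLine(String line) {
        List<String[]> pairs = new ArrayList<>();
        for (String word : line.split(" ")) {
            // Skip empty tokens so repeated spaces behave like StringUtils.split.
            if (!word.isEmpty()) {
                pairs.add(new String[]{word, "1"});
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        for (String[] pair : mapLine("hello hadoop hello")) {
            System.out.println(pair[0] + "\t" + pair[1]);
        }
    }
}
```

Note that the mapper emits one pair per occurrence, not per distinct word — "hello" above appears twice in the output, and it is the reducer's job to collapse those.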
2. WCReducer: the word-counting Reducer class
package com.lyz.hdfs.mr.worldcount;

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

/**
 * Reducer that totals the word counts.
 * KEYIN:    a single word
 * VALUEIN:  the per-occurrence counts passed in from the map phase
 * KEYOUT:   a single word
 * VALUEOUT: the total number of occurrences of that word
 * @author liuyazhuang
 */
public class WCReducer extends Reducer<Text, LongWritable, Text, LongWritable> {

    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
        long count = 0;
        for (LongWritable value : values) {
            count += value.get();
        }
        context.write(key, new LongWritable(count));
    }
}
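The reduce() body is a plain summation over the values grouped under one key, which can likewise be checked without a cluster. A minimal sketch (the class name ReduceLogicDemo is invented for illustration):

```java
import java.util.Arrays;
import java.util.List;

public class ReduceLogicDemo {

    // Mirrors WCReducer.reduce(): sum the per-occurrence counts for a single word.
    static long reduceCounts(List<Long> values) {
        long count = 0;
        for (long value : values) {
            count += value;
        }
        return count;
    }

    public static void main(String[] args) {
        // Simulate the reducer receiving three 1s for the key "hello".
        System.out.println("hello\t" + reduceCounts(Arrays.asList(1L, 1L, 1L)));
    }
}
```

Because this summation is associative and commutative, the same class could also be registered as a combiner on the job to pre-aggregate counts on the map side.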
3. WCRunner: the program entry point
package com.lyz.hdfs.mr.worldcount;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

/**
 * Entry point that configures and runs the word-count MR job.
 * @author liuyazhuang
 */
public class WCRunner extends Configured implements Tool {

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new WCRunner(), args));
    }

    @Override
    public int run(String[] args) throws Exception {
        // Use the Configuration injected by ToolRunner (via Configured) rather than
        // creating a fresh one, so generic command-line options are honored.
        Job job = Job.getInstance(getConf());
        job.setJarByClass(WCRunner.class);
        job.setMapperClass(WCMapper.class);
        job.setReducerClass(WCReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.setInputPaths(job, new Path("D:/hadoop_data/wordcount/src.txt"));
        FileOutputFormat.setOutputPath(job, new Path("D:/hadoop_data/wordcount/dest"));
        return job.waitForCompletion(true) ? 0 : 1;
    }
}
III. Running the Program

Right-click the WCRunner class in Eclipse and choose Run As -> Java Application. The console output is as follows:
2017-10-14 23:52:51,865 INFO [main] Configuration.deprecation (Configuration.java:warnOnceIfDeprecated(1019)) - session.id is deprecated. Instead, use dfs.metrics.session-id
2017-10-14 23:52:51,868 INFO [main] jvm.JvmMetrics (JvmMetrics.java:init(76)) - Initializing JVM Metrics with processName=JobTracker, sessionId=
2017-10-14 23:52:52,665 WARN [main] mapreduce.JobSubmitter (JobSubmitter.java:copyAndConfigureFiles(150)) - Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
2017-10-14 23:52:52,669 WARN [main] mapreduce.JobSubmitter (JobSubmitter.java:copyAndConfigureFiles(259)) - No job jar file set. User classes may not be found. See Job or Job#setJar(String).
2017-10-14 23:52:52,675 INFO [main] input.FileInputFormat (FileInputFormat.java:listStatus(281)) - Total input paths to process : 1
2017-10-14 23:52:52,713 INFO [main] mapreduce.JobSubmitter (JobSubmitter.java:submitJobInternal(396)) - number of splits:1
2017-10-14 23:52:52,788 INFO [main] mapreduce.JobSubmitter (JobSubmitter.java:printTokens(479)) - Submitting tokens for job: job_local994420281_0001
2017-10-14 23:52:52,820 WARN [main] conf.Configuration (Configuration.java:loadProperty(2368)) - file:/tmp/hadoop-liuyazhuang/mapred/staging/liuyazhuang994420281/.staging/job_local994420281_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
2017-10-14 23:52:52,822 WARN [main] conf.Configuration (Configuration.java:loadProperty(2368)) - file:/tmp/hadoop-liuyazhuang/mapred/staging/liuyazhuang994420281/.staging/job_local994420281_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
2017-10-14 23:52:52,908 WARN [main] conf.Configuration (Configuration.java:loadProperty(2368)) - file:/tmp/hadoop-liuyazhuang/mapred/local/localRunner/liuyazhuang/job_local994420281_0001/job_local994420281_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
2017-10-14 23:52:52,909 WARN [main] conf.Configuration (Configuration.java:loadProperty(2368)) - file:/tmp/hadoop-liuyazhuang/mapred/local/localRunner/liuyazhuang/job_local994420281_0001/job_local994420281_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
2017-10-14 23:52:52,913 INFO [main] mapreduce.Job (Job.java:submit(1289)) - The url to track the job: http://localhost:8080/
2017-10-14 23:52:52,914 INFO [main] mapreduce.Job (Job.java:monitorAndPrintJob(1334)) - Running job: job_local994420281_0001
2017-10-14 23:52:52,915 INFO [Thread-2] mapred.LocalJobRunner (LocalJobRunner.java:createOutputCommitter(471)) - OutputCommitter set in config null
2017-10-14 23:52:52,921 INFO [Thread-2] mapred.LocalJobRunner (LocalJobRunner.java:createOutputCommitter(489)) - OutputCommitter is org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
2017-10-14 23:52:52,956 INFO [Thread-2] mapred.LocalJobRunner (LocalJobRunner.java:runTasks(448)) - Waiting for map tasks
2017-10-14 23:52:52,956 INFO [LocalJobRunner Map Task Executor #0] mapred.LocalJobRunner (LocalJobRunner.java:run(224)) - Starting task: attempt_local994420281_0001_m_000000_0
2017-10-14 23:52:52,982 INFO [LocalJobRunner Map Task Executor #0] util.ProcfsBasedProcessTree (ProcfsBasedProcessTree.java:isAvailable(181)) - ProcfsBasedProcessTree currently is supported only on Linux.
2017-10-14 23:52:53,048 INFO [LocalJobRunner Map Task Executor #0] mapred.Task (Task.java:initialize(587)) - Using ResourceCalculatorProcessTree : org.apache.hadoop.yarn.util.WindowsBasedProcessTree@50b84eb3
2017-10-14 23:52:53,051 INFO [LocalJobRunner Map Task Executor #0] mapred.MapTask (MapTask.java:runNewMapper(733)) - Processing split: file:/D:/hadoop_data/wordcount/src.txt:0+173
2017-10-14 23:52:53,060 INFO [LocalJobRunner Map Task Executor #0] mapred.MapTask (MapTask.java:createSortingCollector(388)) - Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
2017-10-14 23:52:53,089 INFO [LocalJobRunner Map Task Executor #0] mapred.MapTask (MapTask.java:setEquator(1182)) - (EQUATOR) 0 kvi 26214396(104857584)
2017-10-14 23:52:53,089 INFO [LocalJobRunner Map Task Executor #0] mapred.MapTask (MapTask.java:init(975)) - mapreduce.task.io.sort.mb: 100
2017-10-14 23:52:53,089 INFO [LocalJobRunner Map Task Executor #0] mapred.MapTask (MapTask.java:init(976)) - soft limit at 83886080
2017-10-14 23:52:53,089 INFO [LocalJobRunner Map Task Executor #0] mapred.MapTask (MapTask.java:init(977)) - bufstart = 0; bufvoid = 104857600
2017-10-14 23:52:53,089 INFO [LocalJobRunner Map Task Executor #0] mapred.MapTask (MapTask.java:init(978)) - kvstart = 26214396; length = 6553600
2017-10-14 23:52:53,096 INFO [LocalJobRunner Map Task Executor #0] mapred.LocalJobRunner (LocalJobRunner.java:statusUpdate(591)) -
2017-10-14 23:52:53,097 INFO [LocalJobRunner Map Task Executor #0] mapred.MapTask (MapTask.java:flush(1437)) - Starting flush of map output
2017-10-14 23:52:53,097 INFO [LocalJobRunner Map Task Executor #0] mapred.MapTask (MapTask.java:flush(1455)) - Spilling map output
2017-10-14 23:52:53,097 INFO [LocalJobRunner Map Task Executor #0] mapred.MapTask (MapTask.java:flush(1456)) - bufstart = 0; bufend = 326; bufvoid = 104857600
2017-10-14 23:52:53,097 INFO [LocalJobRunner Map Task Executor #0] mapred.MapTask (MapTask.java:flush(1458)) - kvstart = 26214396(104857584); kvend = 26214320(104857280); length = 77/6553600
2017-10-14 23:52:53,114 INFO [LocalJobRunner Map Task Executor #0] mapred.MapTask (MapTask.java:sortAndSpill(1641)) - Finished spill 0
2017-10-14 23:52:53,122 INFO [LocalJobRunner Map Task Executor #0] mapred.Task (Task.java:done(1001)) - Task:attempt_local994420281_0001_m_000000_0 is done. And is in the process of committing
2017-10-14 23:52:53,129 INFO [LocalJobRunner Map Task Executor #0] mapred.LocalJobRunner (LocalJobRunner.java:statusUpdate(591)) - map
2017-10-14 23:52:53,129 INFO [LocalJobRunner Map Task Executor #0] mapred.Task (Task.java:sendDone(1121)) - Task 'attempt_local994420281_0001_m_000000_0' done.
2017-10-14 23:52:53,129 INFO [LocalJobRunner Map Task Executor #0] mapred.LocalJobRunner (LocalJobRunner.java:run(249)) - Finishing task: attempt_local994420281_0001_m_000000_0
2017-10-14 23:52:53,129 INFO [Thread-2] mapred.LocalJobRunner (LocalJobRunner.java:runTasks(456)) - map task executor complete.
2017-10-14 23:52:53,132 INFO [Thread-2] mapred.LocalJobRunner (LocalJobRunner.java:runTasks(448)) - Waiting for reduce tasks
2017-10-14 23:52:53,132 INFO [pool-3-thread-1] mapred.LocalJobRunner (LocalJobRunner.java:run(302)) - Starting task: attempt_local994420281_0001_r_000000_0
2017-10-14 23:52:53,138 INFO [pool-3-thread-1] util.ProcfsBasedProcessTree (ProcfsBasedProcessTree.java:isAvailable(181)) - ProcfsBasedProcessTree currently is supported only on Linux.
2017-10-14 23:52:53,178 INFO [pool-3-thread-1] mapred.Task (Task.java:initialize(587)) - Using ResourceCalculatorProcessTree : org.apache.hadoop.yarn.util.WindowsBasedProcessTree@1c3a583d
2017-10-14 23:52:53,182 INFO [pool-3-thread-1] mapred.ReduceTask (ReduceTask.java:run(362)) - Using ShuffleConsumerPlugin: org.apache.hadoop.mapreduce.task.reduce.Shuffle@4a7f319c
2017-10-14 23:52:53,192 INFO [pool-3-thread-1] reduce.MergeManagerImpl (MergeManagerImpl.java:<init>(193)) - MergerManager: memoryLimit=1503238528, maxSingleShuffleLimit=375809632, mergeThreshold=992137472, ioSortFactor=10, memToMemMergeOutputsThreshold=10
2017-10-14 23:52:53,194 INFO [EventFetcher for fetching Map Completion Events] reduce.EventFetcher (EventFetcher.java:run(61)) - attempt_local994420281_0001_r_000000_0 Thread started: EventFetcher for fetching Map Completion Events
2017-10-14 23:52:53,216 INFO [localfetcher#1] reduce.LocalFetcher (LocalFetcher.java:copyMapOutput(140)) - localfetcher#1 about to shuffle output of map attempt_local994420281_0001_m_000000_0 decomp: 368 len: 372 to MEMORY
2017-10-14 23:52:53,220 INFO [localfetcher#1] reduce.InMemoryMapOutput (InMemoryMapOutput.java:shuffle(100)) - Read 368 bytes from map-output for attempt_local994420281_0001_m_000000_0
2017-10-14 23:52:53,239 INFO [localfetcher#1] reduce.MergeManagerImpl (MergeManagerImpl.java:closeInMemoryFile(307)) - closeInMemoryFile -> map-output of size: 368, inMemoryMapOutputs.size() -> 1, commitMemory -> 0, usedMemory ->368
2017-10-14 23:52:53,239 INFO [EventFetcher for fetching Map Completion Events] reduce.EventFetcher (EventFetcher.java:run(76)) - EventFetcher is interrupted.. Returning
2017-10-14 23:52:53,240 INFO [pool-3-thread-1] mapred.LocalJobRunner (LocalJobRunner.java:statusUpdate(591)) - 1 / 1 copied.
2017-10-14 23:52:53,240 INFO [pool-3-thread-1] reduce.MergeManagerImpl (MergeManagerImpl.java:finalMerge(667)) - finalMerge called with 1 in-memory map-outputs and 0 on-disk map-outputs
2017-10-14 23:52:53,251 INFO [pool-3-thread-1] mapred.Merger (Merger.java:merge(591)) - Merging 1 sorted segments
2017-10-14 23:52:53,251 INFO [pool-3-thread-1] mapred.Merger (Merger.java:merge(690)) - Down to the last merge-pass, with 1 segments left of total size: 359 bytes
2017-10-14 23:52:53,254 INFO [pool-3-thread-1] reduce.MergeManagerImpl (MergeManagerImpl.java:finalMerge(742)) - Merged 1 segments, 368 bytes to disk to satisfy reduce memory limit
2017-10-14 23:52:53,255 INFO [pool-3-thread-1] reduce.MergeManagerImpl (MergeManagerImpl.java:finalMerge(772)) - Merging 1 files, 372 bytes from disk
2017-10-14 23:52:53,255 INFO [pool-3-thread-1] reduce.MergeManagerImpl (MergeManagerImpl.java:finalMerge(787)) - Merging 0 segments, 0 bytes from memory into reduce
2017-10-14 23:52:53,256 INFO [pool-3-thread-1] mapred.Merger (Merger.java:merge(591)) - Merging 1 sorted segments
2017-10-14 23:52:53,256 INFO [pool-3-thread-1] mapred.Merger (Merger.java:merge(690)) - Down to the last merge-pass, with 1 segments left of total size: 359 bytes
2017-10-14 23:52:53,257 INFO [pool-3-thread-1] mapred.LocalJobRunner (LocalJobRunner.java:statusUpdate(591)) - 1 / 1 copied.
2017-10-14 23:52:53,264 INFO [pool-3-thread-1] Configuration.deprecation (Configuration.java:warnOnceIfDeprecated(1019)) - mapred.skip.on is deprecated. Instead, use mapreduce.job.skiprecords
2017-10-14 23:52:53,269 INFO [pool-3-thread-1] mapred.Task (Task.java:done(1001)) - Task:attempt_local994420281_0001_r_000000_0 is done. And is in the process of committing
2017-10-14 23:52:53,270 INFO [pool-3-thread-1] mapred.LocalJobRunner (LocalJobRunner.java:statusUpdate(591)) - 1 / 1 copied.
2017-10-14 23:52:53,270 INFO [pool-3-thread-1] mapred.Task (Task.java:commit(1162)) - Task attempt_local994420281_0001_r_000000_0 is allowed to commit now
2017-10-14 23:52:53,272 INFO [pool-3-thread-1] output.FileOutputCommitter (FileOutputCommitter.java:commitTask(439)) - Saved output of task 'attempt_local994420281_0001_r_000000_0' to file:/D:/hadoop_data/wordcount/dest/_temporary/0/task_local994420281_0001_r_000000
2017-10-14 23:52:53,273 INFO [pool-3-thread-1] mapred.LocalJobRunner (LocalJobRunner.java:statusUpdate(591)) - reduce > reduce
2017-10-14 23:52:53,273 INFO [pool-3-thread-1] mapred.Task (Task.java:sendDone(1121)) - Task 'attempt_local994420281_0001_r_000000_0' done.
2017-10-14 23:52:53,273 INFO [pool-3-thread-1] mapred.LocalJobRunner (LocalJobRunner.java:run(325)) - Finishing task: attempt_local994420281_0001_r_000000_0
2017-10-14 23:52:53,274 INFO [Thread-2] mapred.LocalJobRunner (LocalJobRunner.java:runTasks(456)) - reduce task executor complete.
2017-10-14 23:52:53,916 INFO [main] mapreduce.Job (Job.java:monitorAndPrintJob(1355)) - Job job_local994420281_0001 running in uber mode : false
2017-10-14 23:52:53,917 INFO [main] mapreduce.Job (Job.java:monitorAndPrintJob(1362)) - map 100% reduce 100%
2017-10-14 23:52:53,918 INFO [main] mapreduce.Job (Job.java:monitorAndPrintJob(1373)) - Job job_local994420281_0001 completed successfully
2017-10-14 23:52:53,924 INFO [main] mapreduce.Job (Job.java:monitorAndPrintJob(1380)) - Counters: 33
	File System Counters
		FILE: Number of bytes read=1438
		FILE: Number of bytes written=476268
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
	Map-Reduce Framework
		Map input records=9
		Map output records=20
		Map output bytes=326
		Map output materialized bytes=372
		Input split bytes=103
		Combine input records=0
		Combine output records=0
		Reduce input groups=17
		Reduce shuffle bytes=372
		Reduce input records=20
		Reduce output records=17
		Spilled Records=40
		Shuffled Maps =1
		Failed Shuffles=0
		Merged Map outputs=1
		GC time elapsed (ms)=0
		CPU time spent (ms)=0
		Physical memory (bytes) snapshot=0
		Virtual memory (bytes) snapshot=0
		Total committed heap usage (bytes)=385875968
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	File Input Format Counters
		Bytes Read=173
	File Output Format Counters
		Bytes Written=176
IV. Appendix

The input file used in this example is src.txt, with the following content:

dysdgy ubdh shdh
ssusdfy sdusf duyfu
fuyfuyfys sydfyusd sydufyus
dhfdf fyudyfu dyuefyue
dfhusf fyueyf dyiefyu sudiufi
liuyazhuang
liuyazhuang
liuyazhuang
liuyazhuang

The resulting output file is part-r-00000, with the following content:

dfhusf	1
dhfdf	1
duyfu	1
dyiefyu	1
dysdgy	1
dyuefyue	1
fuyfuyfys	1
fyudyfu	1
fyueyf	1
liuyazhuang	4
sdusf	1
shdh	1
ssusdfy	1
sudiufi	1
sydfyusd	1
sydufyus	1
ubdh	1
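The expected output can be cross-checked with a small in-memory word count in plain Java, no Hadoop required. This is only a sketch for verification: the class name WordCountCheck is invented here, and a TreeMap is used so the keys come out in sorted order, matching part-r-00000.

```java
import java.util.Map;
import java.util.TreeMap;

public class WordCountCheck {

    // In-memory equivalent of the map + reduce phases: tally words across all lines.
    static Map<String, Long> count(String[] lines) {
        Map<String, Long> totals = new TreeMap<>();  // sorted keys, like the MR output
        for (String line : lines) {
            for (String word : line.split(" ")) {
                if (!word.isEmpty()) {
                    totals.merge(word, 1L, Long::sum);
                }
            }
        }
        return totals;
    }

    public static void main(String[] args) {
        // Same lines as src.txt in the appendix above.
        String[] src = {
            "dysdgy ubdh shdh",
            "ssusdfy sdusf duyfu",
            "fuyfuyfys sydfyusd sydufyus",
            "dhfdf fyudyfu dyuefyue",
            "dfhusf fyueyf dyiefyu sudiufi",
            "liuyazhuang",
            "liuyazhuang",
            "liuyazhuang",
            "liuyazhuang"
        };
        Map<String, Long> totals = count(src);
        System.out.println("distinct words = " + totals.size());          // 17
        System.out.println("liuyazhuang = " + totals.get("liuyazhuang")); // 4
    }
}
```

The distinct-word count (17) and the total for "liuyazhuang" (4) agree with the "Reduce output records=17" counter and the part-r-00000 listing above.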
At this point, the word-counting MapReduce program on Hadoop is complete.
V. A Friendly Reminder

If you run into problems during development, please consult the other posts in the Hadoop column to resolve them.