Hadoop Word Count (Part 1): Running in Cluster Mode
Source: Internet · Editor: 程序博客网 · Posted: 2024/05/29 19:34
Maven pom.xml:
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>HadoopStu</groupId>
  <artifactId>HadoopStu</artifactId>
  <version>0.0.1-SNAPSHOT</version>
  <build>
    <sourceDirectory>src</sourceDirectory>
    <resources>
      <resource>
        <directory>src</directory>
        <excludes>
          <exclude>**/*.java</exclude>
        </excludes>
      </resource>
    </resources>
    <plugins>
      <plugin>
        <artifactId>maven-compiler-plugin</artifactId>
        <version>3.3</version>
        <configuration>
          <source>1.8</source>
          <target>1.8</target>
        </configuration>
      </plugin>
    </plugins>
  </build>
  <dependencies>
    <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-common -->
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-common</artifactId>
      <version>2.6.0</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-core -->
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-core</artifactId>
      <version>1.2.1</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/junit/junit -->
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.11</version>
    </dependency>
  </dependencies>
</project>
map:
package cn.hadoop.mr;

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.util.StringUtils;

public class WCMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Split each input line on spaces and emit (word, 1) for every word.
        String line = value.toString();
        String[] words = StringUtils.split(line, ' ');
        for (String word : words) {
            context.write(new Text(word), new LongWritable(1));
        }
    }
}
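The map step can be sanity-checked without a cluster: for one input line it should emit a (word, 1) pair per word. A minimal plain-Java sketch of that behavior (the class and method names `MapSketch`/`mapLine` are ours, not Hadoop API; plain `String`/`Long` stand in for `Text`/`LongWritable`):

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map.Entry;

public class MapSketch {
    // Mirrors WCMapper's logic: split a line on spaces, emit (word, 1) pairs.
    static List<Entry<String, Long>> mapLine(String line) {
        List<Entry<String, Long>> pairs = new ArrayList<>();
        for (String word : line.split(" ")) {
            if (!word.isEmpty()) {
                pairs.add(new SimpleEntry<>(word, 1L));
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        System.out.println(mapLine("haha lalala"));
        // prints [haha=1, lalala=1]
    }
}
```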
reduce:
package cn.hadoop.mr;

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WCReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum all the 1s emitted by the mappers for this word.
        long count = 0;
        for (LongWritable value : values) {
            count += value.get();
        }
        context.write(key, new LongWritable(count));
    }
}
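The reduce step is just a sum: the framework groups the map output by key, so the reducer for "haha" receives all of its 1s in one call. A plain-Java sketch of that aggregation (the class and method names `ReduceSketch`/`reduceCounts` are ours):

```java
import java.util.Arrays;
import java.util.List;

public class ReduceSketch {
    // Mirrors WCReducer's logic: sum all counts received for one key.
    static long reduceCounts(List<Long> values) {
        long count = 0;
        for (long v : values) {
            count += v;
        }
        return count;
    }

    public static void main(String[] args) {
        // "haha" appears four times in the input, so its reducer sees [1, 1, 1, 1].
        System.out.println(reduceCounts(Arrays.asList(1L, 1L, 1L, 1L)));
        // prints 4
    }
}
```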
run:
package cn.hadoop.mr;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WCRunner {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        Job wcjob = Job.getInstance(conf);

        // Tell Hadoop which jar carries the job's classes.
        wcjob.setJarByClass(WCRunner.class);
        wcjob.setMapperClass(WCMapper.class);
        wcjob.setReducerClass(WCReducer.class);

        wcjob.setOutputKeyClass(Text.class);
        wcjob.setOutputValueClass(LongWritable.class);
        wcjob.setMapOutputKeyClass(Text.class);
        wcjob.setMapOutputValueClass(LongWritable.class);

        // HDFS input and output paths; the output directory must not already exist.
        FileInputFormat.setInputPaths(wcjob, "/wc/inputdata/");
        FileOutputFormat.setOutputPath(wcjob, new Path("/output/"));

        wcjob.waitForCompletion(true);
    }
}

Generate the input data:
[hadoop@hadoop01 ~]$ cat in.dat
haha lalala
hehe heiheihei
heiheihei lololo
lololo haha
haha haha
hehe lololo
Create the corresponding path on HDFS:
[hadoop@hadoop01 ~]$ hadoop fs -mkdir -p /wc/inputdata
Upload the in.dat text file to that HDFS path:
[hadoop@hadoop01 ~]$ hadoop fs -put in.dat /wc/inputdata/
Package the Java classes above into a jar (named wc.jar here), upload it to the server, and submit it through Hadoop:
hadoop jar wc.jar cn.hadoop.mr.WCRunner
[hadoop@hadoop01 ~]$ hadoop jar wc.jar cn.hadoop.mr.WCRunner
16/07/25 15:25:05 INFO client.RMProxy: Connecting to ResourceManager at hadoop01/192.168.56.200:8032
16/07/25 15:25:06 WARN mapreduce.JobSubmitter: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
16/07/25 15:25:06 INFO input.FileInputFormat: Total input paths to process : 1
16/07/25 15:25:06 INFO mapreduce.JobSubmitter: number of splits:1
16/07/25 15:25:07 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1469431467769_0001
16/07/25 15:25:07 INFO impl.YarnClientImpl: Submitted application application_1469431467769_0001
16/07/25 15:25:07 INFO mapreduce.Job: The url to track the job: http://hadoop01:8088/proxy/application_1469431467769_0001/
16/07/25 15:25:07 INFO mapreduce.Job: Running job: job_1469431467769_0001
16/07/25 15:25:16 INFO mapreduce.Job: Job job_1469431467769_0001 running in uber mode : false
16/07/25 15:25:16 INFO mapreduce.Job:  map 0% reduce 0%
16/07/25 15:25:23 INFO mapreduce.Job:  map 100% reduce 0%
16/07/25 15:25:30 INFO mapreduce.Job:  map 100% reduce 100%
16/07/25 15:25:31 INFO mapreduce.Job: Job job_1469431467769_0001 completed successfully
16/07/25 15:25:31 INFO mapreduce.Job: Counters: 49
	File System Counters
		FILE: Number of bytes read=204
		FILE: Number of bytes written=211397
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=183
		HDFS: Number of bytes written=44
		HDFS: Number of read operations=6
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=2
	Job Counters
		Launched map tasks=1
		Launched reduce tasks=1
		Data-local map tasks=1
		Total time spent by all maps in occupied slots (ms)=4219
		Total time spent by all reduces in occupied slots (ms)=4519
		Total time spent by all map tasks (ms)=4219
		Total time spent by all reduce tasks (ms)=4519
		Total vcore-seconds taken by all map tasks=4219
		Total vcore-seconds taken by all reduce tasks=4519
		Total megabyte-seconds taken by all map tasks=4320256
		Total megabyte-seconds taken by all reduce tasks=4627456
	Map-Reduce Framework
		Map input records=6
		Map output records=12
		Map output bytes=174
		Map output materialized bytes=204
		Input split bytes=105
		Combine input records=0
		Combine output records=0
		Reduce input groups=5
		Reduce shuffle bytes=204
		Reduce input records=12
		Reduce output records=5
		Spilled Records=24
		Shuffled Maps =1
		Failed Shuffles=0
		Merged Map outputs=1
		GC time elapsed (ms)=93
		CPU time spent (ms)=1100
		Physical memory (bytes) snapshot=348495872
		Virtual memory (bytes) snapshot=1864597504
		Total committed heap usage (bytes)=219480064
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	File Input Format Counters
		Bytes Read=78
	File Output Format Counters
		Bytes Written=44
The counters are consistent with the input: 6 map input records (the six lines of in.dat), 12 map output records (one per word), and 5 reduce output records (one per distinct word). The output is as follows:
haha 4
hehe 2
heiheihei 2
lalala 1
lololo 3
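These counts can be cross-checked against in.dat without a cluster. A plain-Java word count over the same six lines (the class name `LocalCheck` is ours; a `TreeMap` reproduces the sorted key order the job emits):

```java
import java.util.Map;
import java.util.TreeMap;

public class LocalCheck {
    // Count words across all lines; TreeMap keeps keys sorted like the job output.
    static Map<String, Long> wordCount(String[] lines) {
        Map<String, Long> counts = new TreeMap<>();
        for (String line : lines) {
            for (String word : line.split(" ")) {
                counts.merge(word, 1L, Long::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] lines = {
            "haha lalala", "hehe heiheihei", "heiheihei lololo",
            "lololo haha", "haha haha", "hehe lololo"
        };
        for (Map.Entry<String, Long> e : wordCount(lines).entrySet()) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
        // prints the same five lines as the job: haha 4 ... lololo 3
    }
}
```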