Hadoop Word Count (Part 1): Running in Cluster Mode


maven pom.xml:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
	xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
	<modelVersion>4.0.0</modelVersion>
	<groupId>HadoopStu</groupId>
	<artifactId>HadoopStu</artifactId>
	<version>0.0.1-SNAPSHOT</version>
	<build>
		<sourceDirectory>src</sourceDirectory>
		<resources>
			<resource>
				<directory>src</directory>
				<excludes>
					<exclude>**/*.java</exclude>
				</excludes>
			</resource>
		</resources>
		<plugins>
			<plugin>
				<artifactId>maven-compiler-plugin</artifactId>
				<version>3.3</version>
				<configuration>
					<source>1.8</source>
					<target>1.8</target>
				</configuration>
			</plugin>
		</plugins>
	</build>
	<dependencies>
		<!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-common -->
		<dependency>
			<groupId>org.apache.hadoop</groupId>
			<artifactId>hadoop-common</artifactId>
			<version>2.6.0</version>
		</dependency>
		<!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-core -->
		<dependency>
			<groupId>org.apache.hadoop</groupId>
			<artifactId>hadoop-core</artifactId>
			<version>1.2.1</version>
		</dependency>
		<!-- https://mvnrepository.com/artifact/junit/junit -->
		<dependency>
			<groupId>junit</groupId>
			<artifactId>junit</artifactId>
			<version>4.11</version>
		</dependency>
	</dependencies>
</project>
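
For reference, with this pom the job jar can be built from the project root with a plain Maven invocation (the original author may equally well have exported the jar from an IDE). By default Maven writes it to target/HadoopStu-0.0.1-SNAPSHOT.jar, given the artifactId and version above; the jar used in the run further down was evidently renamed (to wc.jar) before upload:

mvn clean package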



map:

package cn.hadoop.mr;

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.util.StringUtils;

public class WCMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

	@Override
	protected void map(LongWritable key, Text value, Context context)
			throws IOException, InterruptedException {
		// Split each input line on spaces and emit a (word, 1) pair per word.
		String line = value.toString();
		String[] words = StringUtils.split(line, ' ');
		for (String word : words) {
			context.write(new Text(word), new LongWritable(1));
		}
	}
}
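
As a side note, a common MapReduce idiom (not used in the code above) is to reuse the Writable instances across calls instead of allocating a new Text and LongWritable per word; this is safe because the framework serializes each pair during context.write(). A minimal sketch of the same map method with that change:

private final Text outKey = new Text();
private final LongWritable one = new LongWritable(1);

@Override
protected void map(LongWritable key, Text value, Context context)
		throws IOException, InterruptedException {
	for (String word : StringUtils.split(value.toString(), ' ')) {
		outKey.set(word);   // reuse the same Text object for every word
		context.write(outKey, one);
	}
}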

reduce:

package cn.hadoop.mr;

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WCReducer extends Reducer<Text, LongWritable, Text, LongWritable> {

	@Override
	protected void reduce(Text key, Iterable<LongWritable> values, Context context)
			throws IOException, InterruptedException {
		// Sum the counts emitted by the mappers for this word.
		long count = 0;
		for (LongWritable value : values) {
			count += value.get();
		}
		context.write(key, new LongWritable(count));
	}
}

run:

package cn.hadoop.mr;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WCRunner {
	public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
		Configuration conf = new Configuration();
		Job wcjob = Job.getInstance(conf);

		// Tell Hadoop which jar carries the job and which classes implement it.
		wcjob.setJarByClass(WCRunner.class);
		wcjob.setMapperClass(WCMapper.class);
		wcjob.setReducerClass(WCReducer.class);

		// Output types of the reducer, and of the mapper separately.
		wcjob.setOutputKeyClass(Text.class);
		wcjob.setOutputValueClass(LongWritable.class);
		wcjob.setMapOutputKeyClass(Text.class);
		wcjob.setMapOutputValueClass(LongWritable.class);

		// HDFS input directory and output directory (the latter must not exist yet).
		FileInputFormat.setInputPaths(wcjob, "/wc/inputdata/");
		FileOutputFormat.setOutputPath(wcjob, new Path("/output/"));

		// Submit the job and block until it finishes.
		wcjob.waitForCompletion(true);
	}
}
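
Since the reduce logic is a plain sum, which is both associative and commutative, the same WCReducer class could also be registered as a combiner so counts are pre-aggregated on the map side before the shuffle. This is an optional tweak, not part of the original runner:

// Optional: pre-aggregate (word, 1) pairs map-side to shrink shuffle traffic.
wcjob.setCombinerClass(WCReducer.class);

In the job counters of the run below, Combine input records=0 shows the original job ran without one.
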
Generate some input data:

[hadoop@hadoop01 ~]$ cat in.dat
haha lalala
hehe heiheihei
heiheihei lololo
lololo haha
haha haha
hehe lololo


Create the corresponding directory on HDFS:
[hadoop@hadoop01 ~]$ hadoop fs -mkdir -p /wc/inputdata


Upload the in.dat text file to that HDFS directory:

[hadoop@hadoop01 ~]$ hadoop fs -put in.dat /wc/inputdata/
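
A quick listing confirms the file landed where the job expects it (the exact listing will vary by environment):

[hadoop@hadoop01 ~]$ hadoop fs -ls /wc/inputdata/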


Package the Java program above into a jar, upload it to the server, and invoke it through Hadoop:

hadoop jar wc.jar cn.hadoop.mr.WCRunner

[hadoop@hadoop01 ~]$ hadoop jar wc.jar cn.hadoop.mr.WCRunner
16/07/25 15:25:05 INFO client.RMProxy: Connecting to ResourceManager at hadoop01/192.168.56.200:8032
16/07/25 15:25:06 WARN mapreduce.JobSubmitter: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
16/07/25 15:25:06 INFO input.FileInputFormat: Total input paths to process : 1
16/07/25 15:25:06 INFO mapreduce.JobSubmitter: number of splits:1
16/07/25 15:25:07 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1469431467769_0001
16/07/25 15:25:07 INFO impl.YarnClientImpl: Submitted application application_1469431467769_0001
16/07/25 15:25:07 INFO mapreduce.Job: The url to track the job: http://hadoop01:8088/proxy/application_1469431467769_0001/
16/07/25 15:25:07 INFO mapreduce.Job: Running job: job_1469431467769_0001
16/07/25 15:25:16 INFO mapreduce.Job: Job job_1469431467769_0001 running in uber mode : false
16/07/25 15:25:16 INFO mapreduce.Job:  map 0% reduce 0%
16/07/25 15:25:23 INFO mapreduce.Job:  map 100% reduce 0%
16/07/25 15:25:30 INFO mapreduce.Job:  map 100% reduce 100%
16/07/25 15:25:31 INFO mapreduce.Job: Job job_1469431467769_0001 completed successfully
16/07/25 15:25:31 INFO mapreduce.Job: Counters: 49
	File System Counters
		FILE: Number of bytes read=204
		FILE: Number of bytes written=211397
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=183
		HDFS: Number of bytes written=44
		HDFS: Number of read operations=6
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=2
	Job Counters
		Launched map tasks=1
		Launched reduce tasks=1
		Data-local map tasks=1
		Total time spent by all maps in occupied slots (ms)=4219
		Total time spent by all reduces in occupied slots (ms)=4519
		Total time spent by all map tasks (ms)=4219
		Total time spent by all reduce tasks (ms)=4519
		Total vcore-seconds taken by all map tasks=4219
		Total vcore-seconds taken by all reduce tasks=4519
		Total megabyte-seconds taken by all map tasks=4320256
		Total megabyte-seconds taken by all reduce tasks=4627456
	Map-Reduce Framework
		Map input records=6
		Map output records=12
		Map output bytes=174
		Map output materialized bytes=204
		Input split bytes=105
		Combine input records=0
		Combine output records=0
		Reduce input groups=5
		Reduce shuffle bytes=204
		Reduce input records=12
		Reduce output records=5
		Spilled Records=24
		Shuffled Maps =1
		Failed Shuffles=0
		Merged Map outputs=1
		GC time elapsed (ms)=93
		CPU time spent (ms)=1100
		Physical memory (bytes) snapshot=348495872
		Virtual memory (bytes) snapshot=1864597504
		Total committed heap usage (bytes)=219480064
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	File Input Format Counters
		Bytes Read=78
	File Output Format Counters
		Bytes Written=44


The output is as follows:

haha    4
hehe    2
heiheihei    2
lalala    1
lololo    3
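
The result lives in HDFS rather than on the local disk; a listing like the one above is typically read back with something such as the following, where the part file name (here assumed to be part-r-00000, the default for a single reducer) may differ:

[hadoop@hadoop01 ~]$ hadoop fs -cat /output/part-r-00000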

