Hadoop Word Count (Part 1): Running in Cluster Mode
Source: Internet · Editor: 程序博客网 · Posted: 2024/05/29 19:34
Maven pom.xml:
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>HadoopStu</groupId>
  <artifactId>HadoopStu</artifactId>
  <version>0.0.1-SNAPSHOT</version>
  <build>
    <sourceDirectory>src</sourceDirectory>
    <resources>
      <resource>
        <directory>src</directory>
        <excludes>
          <exclude>**/*.java</exclude>
        </excludes>
      </resource>
    </resources>
    <plugins>
      <plugin>
        <artifactId>maven-compiler-plugin</artifactId>
        <version>3.3</version>
        <configuration>
          <source>1.8</source>
          <target>1.8</target>
        </configuration>
      </plugin>
    </plugins>
  </build>
  <dependencies>
    <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-common -->
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-common</artifactId>
      <version>2.6.0</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-core -->
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-core</artifactId>
      <version>1.2.1</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/junit/junit -->
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.11</version>
    </dependency>
  </dependencies>
</project>
map:
package cn.hadoop.mr;

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.util.StringUtils;

public class WCMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Split each input line on spaces and emit (word, 1) for every word.
        String line = value.toString();
        String[] words = StringUtils.split(line, ' ');
        for (String word : words) {
            context.write(new Text(word), new LongWritable(1));
        }
    }
}
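The map step can be sanity-checked without a cluster: for one input line it should emit a (word, 1) pair per word. A minimal plain-Java sketch of that behavior (the class and method names `MapSketch`/`mapLine` are ours, not Hadoop API; plain `String`/`Long` stand in for `Text`/`LongWritable`):

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map.Entry;

public class MapSketch {
    // Mirrors WCMapper's logic: split a line on spaces, emit (word, 1) pairs.
    static List<Entry<String, Long>> mapLine(String line) {
        List<Entry<String, Long>> pairs = new ArrayList<>();
        for (String word : line.split(" ")) {
            if (!word.isEmpty()) {
                pairs.add(new SimpleEntry<>(word, 1L));
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        System.out.println(mapLine("haha lalala"));
        // prints [haha=1, lalala=1]
    }
}
```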
reduce:
package cn.hadoop.mr;

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WCReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum all the 1s emitted by the mappers for this word.
        long count = 0;
        for (LongWritable value : values) {
            count += value.get();
        }
        context.write(key, new LongWritable(count));
    }
}
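The reduce step is just a sum: the framework groups the map output by key, so the reducer for "haha" receives all of its 1s in one call. A plain-Java sketch of that aggregation (the class and method names `ReduceSketch`/`reduceCounts` are ours):

```java
import java.util.Arrays;
import java.util.List;

public class ReduceSketch {
    // Mirrors WCReducer's logic: sum all counts received for one key.
    static long reduceCounts(List<Long> values) {
        long count = 0;
        for (long v : values) {
            count += v;
        }
        return count;
    }

    public static void main(String[] args) {
        // "haha" appears four times in the input, so its reducer sees [1, 1, 1, 1].
        System.out.println(reduceCounts(Arrays.asList(1L, 1L, 1L, 1L)));
        // prints 4
    }
}
```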
run:
package cn.hadoop.mr;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WCRunner {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        Job wcjob = Job.getInstance(conf);

        // Tell Hadoop which jar carries the job's classes.
        wcjob.setJarByClass(WCRunner.class);
        wcjob.setMapperClass(WCMapper.class);
        wcjob.setReducerClass(WCReducer.class);

        wcjob.setOutputKeyClass(Text.class);
        wcjob.setOutputValueClass(LongWritable.class);
        wcjob.setMapOutputKeyClass(Text.class);
        wcjob.setMapOutputValueClass(LongWritable.class);

        // HDFS input and output paths; the output directory must not already exist.
        FileInputFormat.setInputPaths(wcjob, "/wc/inputdata/");
        FileOutputFormat.setOutputPath(wcjob, new Path("/output/"));

        wcjob.waitForCompletion(true);
    }
}

Generate the input data:
[hadoop@hadoop01 ~]$ cat in.dat
haha lalala
hehe heiheihei
heiheihei lololo
lololo haha
haha haha
hehe lololo
Create the corresponding path on HDFS:
[hadoop@hadoop01 ~]$ hadoop fs -mkdir -p /wc/inputdata
Upload the in.dat text file to that HDFS path:
[hadoop@hadoop01 ~]$ hadoop fs -put in.dat /wc/inputdata/
Package the Java classes above into a jar (named wc.jar here), upload it to the server, and submit it through Hadoop:
hadoop jar wc.jar cn.hadoop.mr.WCRunner
[hadoop@hadoop01 ~]$ hadoop jar wc.jar cn.hadoop.mr.WCRunner
16/07/25 15:25:05 INFO client.RMProxy: Connecting to ResourceManager at hadoop01/192.168.56.200:8032
16/07/25 15:25:06 WARN mapreduce.JobSubmitter: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
16/07/25 15:25:06 INFO input.FileInputFormat: Total input paths to process : 1
16/07/25 15:25:06 INFO mapreduce.JobSubmitter: number of splits:1
16/07/25 15:25:07 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1469431467769_0001
16/07/25 15:25:07 INFO impl.YarnClientImpl: Submitted application application_1469431467769_0001
16/07/25 15:25:07 INFO mapreduce.Job: The url to track the job: http://hadoop01:8088/proxy/application_1469431467769_0001/
16/07/25 15:25:07 INFO mapreduce.Job: Running job: job_1469431467769_0001
16/07/25 15:25:16 INFO mapreduce.Job: Job job_1469431467769_0001 running in uber mode : false
16/07/25 15:25:16 INFO mapreduce.Job:  map 0% reduce 0%
16/07/25 15:25:23 INFO mapreduce.Job:  map 100% reduce 0%
16/07/25 15:25:30 INFO mapreduce.Job:  map 100% reduce 100%
16/07/25 15:25:31 INFO mapreduce.Job: Job job_1469431467769_0001 completed successfully
16/07/25 15:25:31 INFO mapreduce.Job: Counters: 49
	File System Counters
		FILE: Number of bytes read=204
		FILE: Number of bytes written=211397
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=183
		HDFS: Number of bytes written=44
		HDFS: Number of read operations=6
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=2
	Job Counters
		Launched map tasks=1
		Launched reduce tasks=1
		Data-local map tasks=1
		Total time spent by all maps in occupied slots (ms)=4219
		Total time spent by all reduces in occupied slots (ms)=4519
		Total time spent by all map tasks (ms)=4219
		Total time spent by all reduce tasks (ms)=4519
		Total vcore-seconds taken by all map tasks=4219
		Total vcore-seconds taken by all reduce tasks=4519
		Total megabyte-seconds taken by all map tasks=4320256
		Total megabyte-seconds taken by all reduce tasks=4627456
	Map-Reduce Framework
		Map input records=6
		Map output records=12
		Map output bytes=174
		Map output materialized bytes=204
		Input split bytes=105
		Combine input records=0
		Combine output records=0
		Reduce input groups=5
		Reduce shuffle bytes=204
		Reduce input records=12
		Reduce output records=5
		Spilled Records=24
		Shuffled Maps =1
		Failed Shuffles=0
		Merged Map outputs=1
		GC time elapsed (ms)=93
		CPU time spent (ms)=1100
		Physical memory (bytes) snapshot=348495872
		Virtual memory (bytes) snapshot=1864597504
		Total committed heap usage (bytes)=219480064
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	File Input Format Counters
		Bytes Read=78
	File Output Format Counters
		Bytes Written=44
The counters are consistent with the input: 6 map input records (the six lines of in.dat), 12 map output records (one per word), and 5 reduce output records (one per distinct word). The output is as follows:
haha 4
hehe 2
heiheihei 2
lalala 1
lololo 3
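These counts can be cross-checked against in.dat without a cluster. A plain-Java word count over the same six lines (the class name `LocalCheck` is ours; a `TreeMap` reproduces the sorted key order the job emits):

```java
import java.util.Map;
import java.util.TreeMap;

public class LocalCheck {
    // Count words across all lines; TreeMap keeps keys sorted like the job output.
    static Map<String, Long> wordCount(String[] lines) {
        Map<String, Long> counts = new TreeMap<>();
        for (String line : lines) {
            for (String word : line.split(" ")) {
                counts.merge(word, 1L, Long::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] lines = {
            "haha lalala", "hehe heiheihei", "heiheihei lololo",
            "lololo haha", "haha haha", "hehe lololo"
        };
        for (Map.Entry<String, Long> e : wordCount(lines).entrySet()) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
        // prints the same five lines as the job: haha 4 ... lololo 3
    }
}
```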