Implementing a WordCount program with Hadoop + IntelliJ + Maven


Preface: Because of work, this was my first time touching Maven and Hadoop. While learning I read a lot of blog posts, stepped into quite a few pitfalls, and picked up some experience along the way, which I've roughly written down here. Feel free to use it if you need it.


1. What is Maven
For details, see these two blog posts:
Getting started with Maven: http://www.cnblogs.com/now-fighting/p/4857625.html
Maven architecture: https://www.cnblogs.com/now-fighting/p/4858982.html

2. Hadoop cluster setup

For details, see this series: http://www.powerxing.com/install-hadoop/

Note that my environment is CentOS 7, which came with only a JRE and no JDK, so I downloaded a JDK myself. Strictly speaking, a JRE is enough if you only run Java programs (a JDK is needed to compile them), but I installed one anyway.
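
A quick way to tell whether a full JDK (rather than just a JRE) is present is to check for javac; installing OpenJDK through yum is one option on CentOS 7 (the package name below is only an example, not what I originally did):

java -version     # works with a JRE alone
javac -version    # only works when a JDK is installed
sudo yum install -y java-1.8.0-openjdk-devel   # example: OpenJDK 8 including the compiler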

3. Building a Maven project in IntelliJ

See this post for reference: http://blog.csdn.net/qq_32588349/article/details/51461182

My environment:

The Maven project runs on Windows, while the Hadoop cluster is a pseudo-distributed setup on CentOS 7.

My pom.xml is as follows:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.wordcount_1</groupId>
    <artifactId>wordcount_1</artifactId>
    <version>1.0-SNAPSHOT</version>
    <packaging>jar</packaging>

    <repositories>
        <repository>
            <id>apache</id>
            <url>http://maven.apache.org</url>
        </repository>
    </repositories>

    <dependencies>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>2.6.5</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-hdfs</artifactId>
            <version>2.6.5</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-mapreduce-client-core</artifactId>
            <version>2.6.5</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-mapreduce-client-jobclient</artifactId>
            <version>2.6.5</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-mapreduce-client-common</artifactId>
            <version>2.6.5</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>2.6.5</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-core</artifactId>
            <version>1.2.0</version>
        </dependency>
    </dependencies>
</project>

**The following dependencies are needed for Hadoop:
the base dependencies hadoop-core and hadoop-common;
hadoop-hdfs and hadoop-client if you need to read and write HDFS; and hbase-client if you need to read and write HBase.**
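
Before writing any code, it is worth checking that these dependencies actually resolve from your configured repositories; a plain Maven build from the project root is enough (standard Maven commands, nothing project-specific assumed):

mvn clean compile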

After configuring pom.xml, write the code. The logical structure of my project looks like this:

(screenshot: project structure)

The WordCount program:

package job;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {

    // Mapper: splits each input line into tokens and emits (word, 1) for every token.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer (also used as the combiner): sums the counts for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: wordcount <in> <out>");
            System.exit(2);
        }
        Job job = new Job(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Note that three files need to be placed in the resources directory:
hdfs-site.xml
core-site.xml
log4j.properties

With these in place you control the log output format, and HDFS is used for input and output by default; otherwise the job falls back to the local filesystem. These files come from the configuration directory of the Hadoop installation (etc/hadoop in Hadoop 2.x).
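
For reference, the setting that makes the program use HDFS by default is fs.defaultFS in core-site.xml. A minimal sketch, assuming the pseudo-distributed NameNode address used in the jar example later in this post:

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://192.168.179.128:9000</value>
    </property>
</configuration>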

4. Compiling and running in IntelliJ

(screenshot: Run/Debug configuration)

Choose Run -> Edit Configurations -> add an Application configuration, and fill in the Main class.
Set the two parameters under Program arguments; they are the input and output paths, both on HDFS.
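
As a concrete illustration, using the same HDFS paths that appear in the jar example further down, the Program arguments field would hold something like:

hdfs://192.168.179.128:9000/user/pangmingyu/input hdfs://192.168.179.128:9000/user/pangmingyu/output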

5. Checking the input and output

Input:

[pangmingyu@Centos1 opt]$ hdfs dfs -ls /user/pangmingyu/input/
17/12/21 12:09:57 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 8 items
-rw-r--r--   1 pangmingyu supergroup       4436 2017-12-18 14:52 /user/pangmingyu/input/capacity-scheduler.xml
-rw-r--r--   1 pangmingyu supergroup        997 2017-12-18 14:52 /user/pangmingyu/input/core-site.xml
-rw-r--r--   1 pangmingyu supergroup       9683 2017-12-18 14:52 /user/pangmingyu/input/hadoop-policy.xml
-rw-r--r--   1 pangmingyu supergroup       1346 2017-12-18 14:52 /user/pangmingyu/input/hdfs-site.xml
-rw-r--r--   1 pangmingyu supergroup        620 2017-12-18 14:52 /user/pangmingyu/input/httpfs-site.xml
-rw-r--r--   1 pangmingyu supergroup       3523 2017-12-18 14:52 /user/pangmingyu/input/kms-acls.xml
-rw-r--r--   1 pangmingyu supergroup       5511 2017-12-18 14:52 /user/pangmingyu/input/kms-site.xml
-rw-r--r--   1 pangmingyu supergroup        690 2017-12-18 14:52 /user/pangmingyu/input/yarn-site.xml
[pangmingyu@Centos1 opt]$

These files were copied in from elsewhere, for example:

./bin/hdfs dfs -mkdir input
./bin/hdfs dfs -put ./etc/hadoop/*.xml input

Output:

[pangmingyu@Centos1 opt]$ hdfs dfs -ls /opt/output
17/12/21 12:11:41 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 2 items
-rw-r--r--   1 yangzhenyu supergroup          0 2017-12-21 12:02 /opt/output/_SUCCESS
-rw-r--r--   1 yangzhenyu supergroup      10426 2017-12-21 12:02 /opt/output/part-r-00000
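
To look at the actual word counts, dump the result file with the standard HDFS shell (a quick sketch):

hdfs dfs -cat /opt/output/part-r-00000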

6. Packaging and running the jar

File -> Project Structure -> Artifacts -> click "+"

(screenshot: Project Structure -> Artifacts)

Then fill in the parameters step by step.
(screenshot: artifact settings)

Once everything is filled in, close the dialog and carry out the build step shown below.
(screenshot: building the artifact)

Then look in the jar output directory and you will find the jar: wordcount_1.jar
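
As an alternative to the IntelliJ artifact dialog, a jar can also be built from the command line with Maven, since the pom above already declares jar packaging; the hadoop jar command below names the main class explicitly, so no manifest entry is required:

mvn clean package
# the jar is written under target/, e.g. target/wordcount_1-1.0-SNAPSHOT.jar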

After that, copy the jar onto the Linux machine and run:

hadoop jar wordcount_1.jar job.WordCount hdfs://192.168.179.128:9000/user/pangmingyu/input hdfs://192.168.179.128:9000/user/pangmingyu/output

job.WordCount is the class to run; job is the package the class lives in.
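
One caveat: MapReduce refuses to start if the output directory already exists, so before re-running the job, remove the previous output first (standard HDFS shell command):

hdfs dfs -rm -r hdfs://192.168.179.128:9000/user/pangmingyu/output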


Finally, here is another blog post on writing WordCount for Hadoop that is also well worth reading:
https://www.polarxiong.com/archives/Hadoop-Intellij%E7%BB%93%E5%90%88Maven%E6%9C%AC%E5%9C%B0%E8%BF%90%E8%A1%8C%E5%92%8C%E8%B0%83%E8%AF%95MapReduce%E7%A8%8B%E5%BA%8F-%E6%97%A0%E9%9C%80%E6%90%AD%E8%BD%BDHadoop%E5%92%8CHDFS%E7%8E%AF%E5%A2%83.html
