Implementing WordCount with Hadoop + IntelliJ + Maven
Preface: Work recently brought me to Maven and Hadoop for the first time. While learning, I read a lot of blog posts, stepped into plenty of pitfalls, and collected some notes, which I've written up below. Take whatever is useful to you.
1. What is Maven
See these two posts for details:
Maven introduction: http://www.cnblogs.com/now-fighting/p/4857625.html
Maven architecture: https://www.cnblogs.com/now-fighting/p/4858982.html
2. Hadoop cluster configuration
See this series: http://www.powerxing.com/install-hadoop/
Note that my environment is CentOS 7, which came with only a JRE and no JDK, so I installed a JDK myself. A JRE is enough if you only need to run Java programs, but compiling them requires a JDK.
3. Building a Maven project in IntelliJ
See this post: http://blog.csdn.net/qq_32588349/article/details/51461182
My setup: I run the Maven project on Windows, while the Hadoop cluster (pseudo-distributed) runs on CentOS 7.
My pom.xml is as follows:
```xml
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.wordcount_1</groupId>
    <artifactId>wordcount_1</artifactId>
    <version>1.0-SNAPSHOT</version>
    <packaging>jar</packaging>

    <repositories>
        <repository>
            <id>apache</id>
            <url>http://maven.apache.org</url>
        </repository>
    </repositories>

    <dependencies>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>2.6.5</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-hdfs</artifactId>
            <version>2.6.5</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-mapreduce-client-core</artifactId>
            <version>2.6.5</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-mapreduce-client-jobclient</artifactId>
            <version>2.6.5</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-mapreduce-client-common</artifactId>
            <version>2.6.5</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>2.6.5</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-core</artifactId>
            <version>1.2.0</version>
        </dependency>
    </dependencies>
</project>
```
**A Hadoop project needs the following dependencies: the basics are hadoop-core and hadoop-common; to read and write HDFS, also add hadoop-hdfs and hadoop-client; to read and write HBase, also add hbase-client.**
Once pom.xml is configured, write the code. (The original post shows a screenshot of the project layout here.) The WordCount program:
```java
package job;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {

    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: wordcount <in> <out>");
            System.exit(2);
        }
        Job job = new Job(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```
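To see what the mapper and reducer compute without a cluster, here is a minimal plain-Java sketch (my own illustration, not part of the Hadoop job): it tokenizes the input the same way the mapper does, and sums counts per word the same way the reducer does.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringTokenizer;

// Plain-Java sketch of the WordCount logic: the "map" step emits (word, 1)
// for each token, and the "reduce" step sums the 1s per word.
public class WordCountSketch {

    public static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        // Same tokenization as TokenizerMapper: whitespace-delimited tokens.
        StringTokenizer itr = new StringTokenizer(text);
        while (itr.hasMoreTokens()) {
            // merge() does what IntSumReducer does: accumulate a sum per key.
            counts.merge(itr.nextToken(), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(count("hello hadoop hello maven"));
    }
}
```

In the real job the grouping of values by key between map and reduce is done by Hadoop's shuffle phase; here the map handles it directly.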
Note: place three files in the resources directory:
hdfs-site.xml
core-site.xml
log4j.properties
These control the log output format and make the job read and write HDFS by default; without them, input and output default to the local filesystem. All three can be copied from the configuration directory of your Hadoop installation.
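As a sketch of what matters in core-site.xml: it mainly needs fs.defaultFS pointing at the NameNode. The address below matches the `hadoop jar` command at the end of this post; adjust it to your own cluster.

```xml
<configuration>
    <!-- Default filesystem: makes relative paths resolve to HDFS -->
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://192.168.179.128:9000</value>
    </property>
</configuration>
```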
4. Compiling and running in IntelliJ
Choose Run → Edit Configurations, add an Application configuration, and fill in the Main class.
Then fill in Program arguments: the two arguments are the input and output paths, both on HDFS.
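For example, using the same HDFS URIs as the `hadoop jar` command at the end of this post, the Program arguments field could be:

```
hdfs://192.168.179.128:9000/user/pangmingyu/input hdfs://192.168.179.128:9000/user/pangmingyu/output
```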
5. Checking the input and output
The input:
```
[pangmingyu@Centos1 opt]$ hdfs dfs -ls /user/pangmingyu/input
17/12/21 12:09:57 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 8 items
-rw-r--r--   1 pangmingyu supergroup       4436 2017-12-18 14:52 /user/pangmingyu/input/capacity-scheduler.xml
-rw-r--r--   1 pangmingyu supergroup        997 2017-12-18 14:52 /user/pangmingyu/input/core-site.xml
-rw-r--r--   1 pangmingyu supergroup       9683 2017-12-18 14:52 /user/pangmingyu/input/hadoop-policy.xml
-rw-r--r--   1 pangmingyu supergroup       1346 2017-12-18 14:52 /user/pangmingyu/input/hdfs-site.xml
-rw-r--r--   1 pangmingyu supergroup        620 2017-12-18 14:52 /user/pangmingyu/input/httpfs-site.xml
-rw-r--r--   1 pangmingyu supergroup       3523 2017-12-18 14:52 /user/pangmingyu/input/kms-acls.xml
-rw-r--r--   1 pangmingyu supergroup       5511 2017-12-18 14:52 /user/pangmingyu/input/kms-site.xml
-rw-r--r--   1 pangmingyu supergroup        690 2017-12-18 14:52 /user/pangmingyu/input/yarn-site.xml
[pangmingyu@Centos1 opt]$
```
These files were uploaded into HDFS beforehand, for example:

```
./bin/hdfs dfs -mkdir input
./bin/hdfs dfs -put ./etc/hadoop/*.xml input
```
The output:

```
[pangmingyu@Centos1 opt]$ hdfs dfs -ls /opt/output
17/12/21 12:11:41 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 2 items
-rw-r--r--   1 yangzhenyu supergroup          0 2017-12-21 12:02 /opt/output/_SUCCESS
-rw-r--r--   1 yangzhenyu supergroup      10426 2017-12-21 12:02 /opt/output/part-r-00000
```
6. Packaging and running the jar
Go to File → Project Structure → Artifacts, click "+", and fill in the parameters step by step. (The original post shows screenshots of these dialogs.)
After building the artifact, look in its output directory and you will find the jar: wordcount_1.jar.
Copy the jar to the Linux machine and run:
```
hadoop jar wordcount_1.jar job.WordCount hdfs://192.168.179.128:9000/user/pangmingyu/input hdfs://192.168.179.128:9000/user/pangmingyu/output
```
Here job.WordCount is the class to execute; job is the package the class lives in. When the job finishes, you can inspect the result with `hdfs dfs -cat` on the part-r-00000 file in the output directory.
Finally, here is one more well-written post on the same topic, WordCount with Hadoop:
https://www.polarxiong.com/archives/Hadoop-Intellij%E7%BB%93%E5%90%88Maven%E6%9C%AC%E5%9C%B0%E8%BF%90%E8%A1%8C%E5%92%8C%E8%B0%83%E8%AF%95MapReduce%E7%A8%8B%E5%BA%8F-%E6%97%A0%E9%9C%80%E6%90%AD%E8%BD%BDHadoop%E5%92%8CHDFS%E7%8E%AF%E5%A2%83.html