spark(2) - Getting started with Spark: Java Maven WordCount experiment
- Java implementation of the Spark WordCount experiment
- 1 The Spark website WordCount Example
- 2 Other versions
Building on the previous post, "spark(1)-入门spark之scala sbt wordcount实验" (Scala sbt WordCount), we continue with a Java version of the WordCount experiment.
1. Java implementation of the Spark WordCount experiment
1 In Eclipse, create a new Maven project using the quickstart archetype.
Group Id: cn.whbing.spark, Artifact Id: SparkApps
2 Change the JRE System Library
to the workspace default JRE (Java 8).
3 Create a package and a WordCount class in it.
4 pom.xml:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>cn.whbing.spark</groupId>
  <artifactId>SparkApps</artifactId>
  <version>0.0.1-SNAPSHOT</version>
  <packaging>jar</packaging>
  <name>SparkApps</name>
  <url>http://maven.apache.org</url>

  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
  </properties>

  <dependencies>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>3.8.1</version>
      <scope>test</scope>
    </dependency>
    <!-- In spark-core_2.11 version 2.1.2 the way FlatMapFunction is implemented differs from 1.x -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.11</artifactId>
      <version>2.1.2</version>
    </dependency>
    <!-- The following dependencies are not needed for now:
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-sql_2.11</artifactId>
      <version>2.1.2</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-hive_2.11</artifactId>
      <version>2.1.2</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming_2.11</artifactId>
      <version>2.1.2</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>2.7.1</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming-kafka_2.11</artifactId>
      <version>1.6.3</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-graphx_2.11</artifactId>
      <version>2.1.2</version>
    </dependency>
    -->
  </dependencies>

  <build>
    <sourceDirectory>src/main/java</sourceDirectory>
    <testSourceDirectory>src/test/java</testSourceDirectory>
    <plugins>
      <plugin>
        <artifactId>maven-assembly-plugin</artifactId>
        <configuration>
          <descriptorRefs>
            <descriptorRef>jar-with-dependencies</descriptorRef>
          </descriptorRefs>
          <archive>
            <manifest>
              <mainClass></mainClass>
            </manifest>
          </archive>
        </configuration>
        <executions>
          <execution>
            <id>make-assembly</id>
            <phase>package</phase>
            <goals>
              <goal>single</goal>
            </goals>
          </execution>
        </executions>
      </plugin>
      <plugin>
        <groupId>org.codehaus.mojo</groupId>
        <artifactId>exec-maven-plugin</artifactId>
        <version>1.2.1</version>
        <executions>
          <execution>
            <goals>
              <goal>exec</goal>
            </goals>
          </execution>
        </executions>
        <configuration>
          <executable>java</executable>
          <includeProjectDependencies>false</includeProjectDependencies>
          <includePluginDependencies>false</includePluginDependencies>
          <classpathScope>compile</classpathScope>
          <!-- <mainClass>cn.whbing.spark.App</mainClass> -->
          <mainClass>cn.whbing.spark.SparkApps.WordCount2</mainClass>
        </configuration>
      </plugin>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-compiler-plugin</artifactId>
        <configuration>
          <source>1.8</source>
          <target>1.8</target>
        </configuration>
      </plugin>
    </plugins>
  </build>
</project>
Notes (some common Maven issues and how to resolve them):
A) When using Maven in Eclipse, configure the settings.xml and the local repository (repo) location under Windows => Preferences => Maven. (The settings.xml specified here does not have to be the one in Maven's conf directory, but the one in conf is what the command-line mvn uses.) Downloading dependencies through a third-party mirror can fail; with a working network connection, the default central repository (just comment out the mirror) is usually enough. If a download still fails, check whether the central repository actually has the requested version.
B) If a download is interrupted partway, Maven will not retry it automatically; delete the corresponding artifact directory under the local repository and update again.
(The pom.xml above includes several artifacts not used in this first experiment, but they are all given here for completeness.)
C) If a Maven plugin itself breaks (e.g. mvn clean, assembly, or compile fails), find the corresponding folder under repo\org\apache\maven\plugins, delete it, and let Maven re-download it.
D) To skip tests when packaging:
mvn package -Dmaven.test.skip=true
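To make note A concrete, here is a minimal settings.xml sketch. It is an assumption for illustration only: the localRepository path is made up, and the mirror entry is shown commented out so that Maven falls back to the default central repository, as the note suggests.

```xml
<!-- Hypothetical settings.xml sketch; the localRepository path is an example, not a requirement -->
<settings xmlns="http://maven.apache.org/SETTINGS/1.0.0">
  <!-- Where downloaded artifacts are stored; delete a subdirectory here to force a re-download (note B) -->
  <localRepository>D:/maven/repo</localRepository>
  <mirrors>
    <!-- A third-party mirror, commented out so the default central repository is used instead:
    <mirror>
      <id>some-mirror</id>
      <mirrorOf>central</mirrorOf>
      <url>http://some-mirror.example.com/repository/maven-public/</url>
    </mirror>
    -->
  </mirrors>
</settings>
```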
1.1 The Spark website WordCount Example
5 Spark website example
url: http://spark.apache.org/examples.html
WordCountExample.java
package cn.whbing.spark.SparkApps.cores;

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

/**
 * Version based on the official example.
 */
public class WordCountExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("Spark WordCount written by java!");
        JavaSparkContext sc = new JavaSparkContext(conf);
        JavaRDD<String> textFile = sc.textFile("hdfs:///whbing/HelloSpark.txt");
        JavaPairRDD<String, Integer> counts = textFile
            .flatMap(s -> Arrays.asList(s.split(" ")).iterator()) // split each line into words
            .mapToPair(word -> new Tuple2<>(word, 1))             // pair each word with a count of 1
            .reduceByKey((a, b) -> a + b);                        // sum the counts per word
        counts.saveAsTextFile("hdfs:///home/whbing/HelloSpark2");
        sc.close();
    }
}
6 Build the jar with mvn package and upload it to the cluster.
7 Run it:
# 5. Run the java wordcount official Example; the output is saved to a file
./bin/spark-submit --class cn.whbing.spark.SparkApps.cores.WordCountExample --master spark://master-1a:7077 /home/whbing/SparkApps-0.0.1-SNAPSHOT.jar
8 Check the result in HDFS (web UI on port 50070).
1.2 Other versions
9 WordCount2.java: the input file is passed as a program argument
package cn.whbing.spark.SparkApps.cores;

import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.regex.Pattern;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;

import scala.Tuple2;

public final class WordCount2 {
    private static final Pattern SPACE = Pattern.compile(" ");

    public static void main(String[] args) throws Exception {
        // The file to read is passed in as a program argument.
        if (args.length < 1) {
            System.err.println("Usage: JavaWordCount <file>");
            System.exit(1);
        }

        SparkConf conf = new SparkConf().setAppName("JavaWordCount");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> lines = sc.textFile(args[0], 1);

        // Split each line into words.
        JavaRDD<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
            private static final long serialVersionUID = 1L;
            @Override
            public Iterator<String> call(String s) {
                return Arrays.asList(SPACE.split(s)).iterator();
            }
        });

        // Pair each word with a count of 1.
        JavaPairRDD<String, Integer> ones = words.mapToPair(new PairFunction<String, String, Integer>() {
            private static final long serialVersionUID = 1L;
            @Override
            public Tuple2<String, Integer> call(String s) {
                return new Tuple2<String, Integer>(s, 1);
            }
        });

        // Sum the counts per word.
        JavaPairRDD<String, Integer> counts = ones.reduceByKey(new Function2<Integer, Integer, Integer>() {
            private static final long serialVersionUID = 1L;
            @Override
            public Integer call(Integer i1, Integer i2) {
                return i1 + i2;
            }
        });

        List<Tuple2<String, Integer>> output = counts.collect();
        for (Tuple2<?, ?> tuple : output) {
            System.out.println(tuple._1() + ": " + tuple._2());
        }
        sc.stop();
    }
}
10 Build with mvn clean package and upload the jar.
11 Run it:
# 4. Run the java wordcount example; the input file is passed as an argument
./bin/spark-submit --class cn.whbing.spark.SparkApps.cores.WordCount2 --master spark://master-1a:7077 /home/whbing/SparkApps-0.0.1-SNAPSHOT.jar hdfs:///whbing/HelloSpark.txt
12 Console output:
... ...
17/12/05 21:19:04 INFO DAGScheduler: Job 0 finished: collect at WordCount2.java:64, took 1.549826 s
spark: 1
whut: 1
hello: 3
world: 1
17/12/05 21:19:04 INFO ServerConnector: Stopped Spark@2f63dff2{HTTP/1.1}{0.0.0.0:4040}
... ...
This completes the experiment.
13 Appendix: workflow
step1: Create a SparkConf object, which configures the Spark application at runtime.
setMaster sets the master URL of the Spark cluster. "local" means local mode; this is usually left unset in code and chosen at submit time instead.
step2: Create a SparkContext object.
SparkContext is the sole entry point of a Spark program; whether you use Scala, Java, Python, or R, there must be a SparkContext.
Core role of SparkContext: it initializes the core runtime components of the Spark program, including DAGScheduler, TaskScheduler, and SchedulerBackend, and it also registers the program with the master.
SparkContext is the most important object in a Spark program.
step3: Create RDDs through the SparkContext from the concrete data source (HDFS, HBase, local FS, DB, S3, etc.).
There are three basic ways to create an RDD: from an external data source (such as HDFS), from a Scala collection, or by transforming another RDD.
The data is divided by the RDD into a series of partitions; the data assigned to each partition falls under the scope of one task.
step4: Apply transformation-level operations, such as the higher-order functions map and filter, to the initial JavaRDD.
step4.1:
Split each line of text into individual words.
In Scala this can be done directly via a SAM conversion: val words = line.flatMap{line => line.split(" ")}
step4.2:
On top of the split words, count each word instance as 1, i.e. word => (word, 1).
step4.3:
On top of the per-instance counts of 1, sum up the total number of occurrences of each word in the text.
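For intuition, steps 4.1-4.3 can be mirrored with plain Java 8 streams, without Spark. This is only a local sketch of the same pipeline shape (the class and method names here are made up for illustration, not part of the Spark API):

```java
import java.util.Arrays;
import java.util.Map;
import java.util.stream.Collectors;

public class LocalWordCount {
    // Same pipeline as the Spark job, applied to a local array instead of an RDD.
    public static Map<String, Integer> count(String[] lines) {
        return Arrays.stream(lines)
                // step 4.1: split each line into words (the flatMap stage)
                .flatMap(line -> Arrays.stream(line.split(" ")))
                // steps 4.2 + 4.3: map each word to 1, then sum per key
                // (the mapToPair + reduceByKey stages)
                .collect(Collectors.toMap(w -> w, w -> 1, Integer::sum));
    }

    public static void main(String[] args) {
        LocalWordCount.count(new String[]{"hello world", "hello spark"})
                .forEach((w, c) -> System.out.println(w + ": " + c));
    }
}
```

Unlike the Spark version, everything here runs in one JVM; in Spark, the flatMap/mapToPair stages run per partition and reduceByKey shuffles the pairs across the cluster.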