Developing Spark applications with mixed Java and Scala in a Maven project
Developers who are comfortable with Java often find, when writing Spark applications, that Spark's Java API documentation is incomplete or that no Java counterpart of an API is provided at all. In those cases, being able to write the Spark-facing code in Scala directly inside a Java project, while still using Java for the rest of the project, lowers the difficulty of Spark development considerably. This article walks through how to set up such a Java + Scala + Spark + Maven development environment.
1. Download the Scala SDK
Download the SDK directly from http://www.scala-lang.org/download/ ; the latest stable release at the time of writing was 2.11.7. Just unpack the archive after downloading.
(Later, when you create a source file with the .scala suffix in IntelliJ IDEA, the IDE will detect it and prompt you to set up a Scala SDK; follow the prompt and point the SDK at the directory you unpacked.)
You can also configure the Scala SDK manually in IntelliJ IDEA: File => Project Structure... => Libraries => + ...
2. Download the Scala plugin for IntelliJ IDEA
In IntelliJ IDEA's Plugins settings, search for Scala and install it directly. If you cannot access the Internet, or the connection is too slow, you can also download the plugin zip manually from http://plugins.jetbrains.com/plugin/?idea&id=1347. When downloading manually, pay close attention to the version number: it must match your local IntelliJ IDEA version, otherwise the plugin cannot be installed. After downloading, click "Install plugin from disk..." in the Plugins dialog and select the plugin zip.
3. Integrating with Maven
To build the project with Maven, configure the scala-maven-plugin in the pom file. Because this is a Spark project and the jar needs to be packaged as an executable Java jar, you also need to configure the maven-assembly-plugin and maven-shade-plugin and set the mainClass. After some experimentation, a working pom file is shown below; to reuse it, just add or remove dependencies as needed.
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>my-project-groupid</groupId>
    <artifactId>sparkTest</artifactId>
    <packaging>jar</packaging>
    <version>1.0-SNAPSHOT</version>
    <name>sparkTest</name>
    <url>http://maven.apache.org</url>

    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <hbase.version>0.98.3</hbase.version>
        <!--<spark.version>1.3.1</spark.version>-->
        <spark.version>1.6.0</spark.version>
        <jdk.version>1.7</jdk.version>
        <scala.version>2.10.5</scala.version>
        <!--<scala.maven.version>2.11.1</scala.maven.version>-->
    </properties>

    <repositories>
        <repository>
            <id>repo1.maven.org</id>
            <url>http://repo1.maven.org/maven2</url>
            <releases><enabled>true</enabled></releases>
            <snapshots><enabled>false</enabled></snapshots>
        </repository>
        <repository>
            <id>repository.jboss.org</id>
            <url>http://repository.jboss.org/nexus/content/groups/public/</url>
            <snapshots><enabled>false</enabled></snapshots>
        </repository>
        <repository>
            <id>cloudhopper</id>
            <name>Repository for Cloudhopper</name>
            <url>http://maven.cloudhopper.com/repos/third-party/</url>
            <releases><enabled>true</enabled></releases>
            <snapshots><enabled>false</enabled></snapshots>
        </repository>
        <repository>
            <id>mvnr</id>
            <name>Repository maven</name>
            <url>http://mvnrepository.com/</url>
            <releases><enabled>true</enabled></releases>
            <snapshots><enabled>false</enabled></snapshots>
        </repository>
        <repository>
            <id>scala</id>
            <name>Scala Tools</name>
            <url>https://mvnrepository.com/</url>
            <releases><enabled>true</enabled></releases>
            <snapshots><enabled>false</enabled></snapshots>
        </repository>
    </repositories>

    <pluginRepositories>
        <pluginRepository>
            <id>scala</id>
            <name>Scala Tools</name>
            <url>https://mvnrepository.com/</url>
            <releases><enabled>true</enabled></releases>
            <snapshots><enabled>false</enabled></snapshots>
        </pluginRepository>
    </pluginRepositories>

    <dependencies>
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>${scala.version}</version>
            <scope>compile</scope>
        </dependency>
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-compiler</artifactId>
            <version>${scala.version}</version>
            <scope>compile</scope>
        </dependency>
        <!-- https://mvnrepository.com/artifact/javax.mail/javax.mail-api -->
        <dependency>
            <groupId>javax.mail</groupId>
            <artifactId>javax.mail-api</artifactId>
            <version>1.4.7</version>
        </dependency>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>3.8.1</version>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.10</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.10</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming_2.10</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-mllib_2.10</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-hive_2.10</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-graphx_2.10</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <version>5.1.30</version>
        </dependency>
        <!--<dependency>-->
        <!--<groupId>org.spark-project.akka</groupId>-->
        <!--<artifactId>akka-actor_2.10</artifactId>-->
        <!--<version>2.3.4-spark</version>-->
        <!--</dependency>-->
        <!--<dependency>-->
        <!--<groupId>org.spark-project.akka</groupId>-->
        <!--<artifactId>akka-remote_2.10</artifactId>-->
        <!--<version>2.3.4-spark</version>-->
        <!--</dependency>-->
        <dependency>
            <groupId>com.google.guava</groupId>
            <artifactId>guava</artifactId>
            <version>14.0.1</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>2.6.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>2.6.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-hdfs</artifactId>
            <version>2.6.0</version>
        </dependency>
        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>fastjson</artifactId>
            <version>1.2.3</version>
        </dependency>
        <dependency>
            <groupId>p6spy</groupId>
            <artifactId>p6spy</artifactId>
            <version>1.3</version>
        </dependency>
        <dependency>
            <groupId>org.apache.commons</groupId>
            <artifactId>commons-math3</artifactId>
            <version>3.3</version>
        </dependency>
        <dependency>
            <groupId>org.jdom</groupId>
            <artifactId>jdom</artifactId>
            <version>2.0.2</version>
        </dependency>
        <dependency>
            <groupId>redis.clients</groupId>
            <artifactId>jedis</artifactId>
            <version>2.6.0</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.hbase/hbase-client -->
        <dependency>
            <groupId>org.apache.hbase</groupId>
            <artifactId>hbase-client</artifactId>
            <version>0.98.6-hadoop2</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hbase</groupId>
            <artifactId>hbase</artifactId>
            <version>0.98.6-hadoop2</version>
            <type>pom</type>
        </dependency>
        <dependency>
            <groupId>org.apache.hbase</groupId>
            <artifactId>hbase-common</artifactId>
            <version>0.98.6-hadoop2</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hbase</groupId>
            <artifactId>hbase-server</artifactId>
            <version>0.98.6-hadoop2</version>
        </dependency>
        <dependency>
            <groupId>org.testng</groupId>
            <artifactId>testng</artifactId>
            <version>6.8.8</version>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>com.fasterxml.jackson.jaxrs</groupId>
            <artifactId>jackson-jaxrs-json-provider</artifactId>
            <version>2.4.4</version>
        </dependency>
        <dependency>
            <groupId>com.fasterxml.jackson.core</groupId>
            <artifactId>jackson-databind</artifactId>
            <version>2.4.4</version>
        </dependency>
        <dependency>
            <groupId>net.sf.json-lib</groupId>
            <artifactId>json-lib</artifactId>
            <version>2.4</version>
            <classifier>jdk15</classifier>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <!-- The packaged jar must be submitted to Spark with spark-submit; do not run it with java -jar -->
            <plugin>
                <artifactId>maven-assembly-plugin</artifactId>
                <configuration>
                    <appendAssemblyId>false</appendAssemblyId>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                    <archive>
                        <manifest>
                            <mainClass>rrkd.dt.sparkTest.HelloWorld</mainClass>
                        </manifest>
                    </archive>
                </configuration>
                <executions>
                    <execution>
                        <id>make-assembly</id>
                        <phase>package</phase>
                        <goals>
                            <goal>assembly</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.1</version>
                <configuration>
                    <source>${jdk.version}</source>
                    <target>${jdk.version}</target>
                    <encoding>${project.build.sourceEncoding}</encoding>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>2.1</version>
                <configuration>
                    <createDependencyReducedPom>false</createDependencyReducedPom>
                </configuration>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                        <configuration>
                            <shadedArtifactAttached>true</shadedArtifactAttached>
                            <shadedClassifierName>allinone</shadedClassifierName>
                            <artifactSet>
                                <includes>
                                    <include>*:*</include>
                                </includes>
                            </artifactSet>
                            <filters>
                                <filter>
                                    <artifact>*:*</artifact>
                                    <excludes>
                                        <exclude>META-INF/*.SF</exclude>
                                        <exclude>META-INF/*.DSA</exclude>
                                        <exclude>META-INF/*.RSA</exclude>
                                    </excludes>
                                </filter>
                            </filters>
                            <transformers>
                                <transformer implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
                                    <resource>reference.conf</resource>
                                </transformer>
                                <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                                    <mainClass>rrkd.dt.sparkTest.HelloWorld</mainClass>
                                </transformer>
                            </transformers>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
            <!-- scala-maven-plugin: handles the mutual (circular) dependencies between the Java and Scala sources -->
            <plugin>
                <groupId>net.alchim31.maven</groupId>
                <artifactId>scala-maven-plugin</artifactId>
                <version>3.2.0</version>
                <executions>
                    <execution>
                        <id>compile-scala</id>
                        <phase>compile</phase>
                        <goals>
                            <goal>add-source</goal>
                            <goal>compile</goal>
                        </goals>
                    </execution>
                    <execution>
                        <id>test-compile-scala</id>
                        <phase>test-compile</phase>
                        <goals>
                            <goal>add-source</goal>
                            <goal>testCompile</goal>
                        </goals>
                    </execution>
                </executions>
                <configuration>
                    <scalaVersion>${scala.version}</scalaVersion>
                </configuration>
            </plugin>
        </plugins>
    </build>
</project>

The important part is the <build> section; the rest of the file needs little attention.
The project's directory layout largely follows Maven's default conventions, except that there is an extra scala directory under src, mainly to keep the Java sources and the Scala sources organized separately:
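Assuming the default Maven conventions plus the extra scala source root (the src/main/scala default used by the scala-maven-plugin's add-source goal), the tree looks roughly like this; the test package matches the example classes below:

sparkTest
├── pom.xml
└── src
    ├── main
    │   ├── java
    │   │   └── test
    │   │       └── HelloWorld.java
    │   ├── resources
    │   └── scala
    │       └── test
    │           └── Hello.scala
    └── test
        ├── java
        └── scala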
Under the java directory, create the HelloWorld class in HelloWorld.java:
package test;

import test.Hello;

/**
 * Created by L on 2017/1/5.
 */
public class HelloWorld {
    public static void main(String[] args) {
        System.out.print("test");
        Hello.sayHello("scala");
        Hello.runSpark();
    }
}
Under the scala directory, create the Hello object in Hello.scala:
package test

import org.apache.spark.graphx.{Graph, Edge, VertexId, GraphLoader}
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkContext, SparkConf}
import breeze.linalg.{Vector, DenseVector, squaredDistance}

/**
 * Created by L on 2017/1/5.
 */
object Hello {
  def sayHello(x: String): Unit = {
    println("hello," + x)
  }

  // def main(args: Array[String]) {
  def runSpark() {
    val sparkConf = new SparkConf().setAppName("SparkKMeans").setMaster("local[*]")
    val sc = new SparkContext(sparkConf)

    // Create an RDD for the vertices
    val users: RDD[(VertexId, (String, String))] =
      sc.parallelize(Array((3L, ("rxin", "student")), (7L, ("jgonzal", "postdoc")),
        (5L, ("franklin", "prof")), (2L, ("istoica", "prof")),
        (4L, ("peter", "student"))))
    // Create an RDD for edges
    val relationships: RDD[Edge[String]] =
      sc.parallelize(Array(Edge(3L, 7L, "collab"), Edge(5L, 3L, "advisor"),
        Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi"),
        Edge(4L, 0L, "student"), Edge(5L, 0L, "colleague")))
    // Define a default user in case there are relationships with a missing user
    val defaultUser = ("John Doe", "Missing")
    // Build the initial Graph
    val graph = Graph(users, relationships, defaultUser)
    // Notice that there is a user 0 (for which we have no information) connected to users
    // 4 (peter) and 5 (franklin).
    graph.triplets.map(
      triplet => triplet.srcAttr._1 + " is the " + triplet.attr + " of " + triplet.dstAttr._1
    ).collect.foreach(println(_))
    // Remove missing vertices as well as the edges connected to them
    val validGraph = graph.subgraph(vpred = (id, attr) => attr._2 != "Missing")
    // The valid subgraph will disconnect users 4 and 5 by removing user 0
    validGraph.vertices.collect.foreach(println(_))
    validGraph.triplets.map(
      triplet => triplet.srcAttr._1 + " is the " + triplet.attr + " of " + triplet.dstAttr._1
    ).collect.foreach(println(_))

    sc.stop()
  }
}
In this way, the Scala code calls the Spark API to implement the Spark logic, and the Java code then calls the Scala code.
4. Compiling and packaging the project with Maven
In a mixed Java/Scala project, the Scala sources need to be compiled before the Java sources. The following Maven command compiles and packages the project in that order:
mvn clean scala:compile assembly:assembly
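With the pom above, the packaged jars should end up under target/. The exact file names follow from the artifactId, the version, and the allinone classifier configured for the shade plugin, so for this example project they would look roughly like:

target/sparkTest-1.0-SNAPSHOT.jar            (assembly jar with dependencies; appendAssemblyId is false)
target/sparkTest-1.0-SNAPSHOT-allinone.jar   (shaded jar attached by maven-shade-plugin)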
5. Running the packaged Spark jar
During development you will typically run the application in local mode from within IDEA, and then package it with the command above. The packaged Spark jar must be copied to the Spark cluster and submitted with spark-submit; do not run it with java -jar.
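A minimal spark-submit invocation for this example might look like the following; the master URL and the path to the jar are placeholders that depend on your cluster, and the main class is the example HelloWorld above:

spark-submit \
  --class test.HelloWorld \
  --master spark://your-master-host:7077 \
  /path/to/sparkTest-1.0-SNAPSHOT-allinone.jar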