Getting Spark Setup in Eclipse


http://syndeticlogic.net/?p=311


Spark is a new distributed programming framework for analyzing large data sets.  It took me a few steps to get the system set up in Eclipse, so I thought I’d write them down.  Hopefully this post saves someone a few minutes.

Fair warning, the Spark project seems to be moving fast, so this could get out of date quickly…

Building from Source

First, download the sources from the Git repository, then try to build them.  To build, you need to specify a profile.  Below are the commands I used to accomplish these steps.

$ git clone git@github.com:mesos/spark.git
$ cd spark
$ mvn -U -Phadoop2 clean install -DskipTests

Unfortunately, that didn’t just work for me.  I have reason to believe the issue is environmental (see below), so it might work for you.

If this step works for you, then move on to the next section.  Below is the build error I received.

[ERROR] Failed to execute goal on project spark-core: Could not resolve dependencies for project org.spark-project:spark-core:jar:0.7.1-SNAPSHOT: The following artifacts could not be resolved: cc.spray:spray-can:jar:1.0-M2.1, cc.spray:spray-server:jar:1.0-M2.1, cc.spray:spray-json_2.9.2:jar:1.1.1: Could not find artifact cc.spray:spray-can:jar:1.0-M2.1 in jboss-repo (http://repository.jboss.org/nexus/content/repositories/releases/)

This error is a bit misleading.  repository.jboss.org is just the last repository that was checked for the missing artifacts.  After inspecting spark/pom.xml, the real problem is that mvn cannot download the jars from repo.spray.cc.  The spark/pom.xml seems to be correct, and, surprisingly, repo.spray.cc seems to be okay too.

The spray docs indicate that repo.spray.io is the Maven repo, but both domains point to the same IP address.  For sanity I tried it anyway, and had the same problem.
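If you want to confirm that both names really resolve to the same address, a quick DNS lookup does it:

$ dig +short repo.spray.cc
$ dig +short repo.spray.io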

The workaround is to put the files into the local .m2 repository manually.  Below is the script I used.

for k in can io util server base; do
    dir="cc/spray/spray-$k/1.0-M2.1/"
    mkdir -p ~/.m2/repository/$dir
    cd ~/.m2/repository/$dir
    wget http://repo.spray.io/$dir/spray-$k-1.0-M2.1.pom
    wget http://repo.spray.io/$dir/spray-$k-1.0-M2.1.jar
done

dir="cc/spray/spray-json_2.9.2/1.1.1"
mkdir -p ~/.m2/repository/$dir
cd ~/.m2/repository/$dir
wget http://repo.spray.io/$dir/spray-json_2.9.2-1.1.1.jar
wget http://repo.spray.io/$dir/spray-json_2.9.2-1.1.1.pom

dir="cc/spray/twirl-api_2.9.2/0.5.2"
mkdir -p ~/.m2/repository/$dir
cd ~/.m2/repository/$dir
wget http://repo.spray.io/$dir/twirl-api_2.9.2-0.5.2.jar
wget http://repo.spray.io/$dir/twirl-api_2.9.2-0.5.2.pom
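If copying files under ~/.m2 by hand feels too fragile, a slightly cleaner sketch of the same workaround is to fetch each jar/pom pair and register it with mvn install:install-file (shown here for spray-can only; the other artifacts follow the same pattern):

wget http://repo.spray.io/cc/spray/spray-can/1.0-M2.1/spray-can-1.0-M2.1.jar
wget http://repo.spray.io/cc/spray/spray-can/1.0-M2.1/spray-can-1.0-M2.1.pom
# coordinates (groupId/artifactId/version) are read from the downloaded pom
mvn install:install-file -Dfile=spray-can-1.0-M2.1.jar -DpomFile=spray-can-1.0-M2.1.pom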

This really sucks, but it works around the error.  I found a Stack Overflow question about a similar mvn issue – one poster claimed that downgrading to Java 6 fixed it.  It seems strange that this would be a Java 7 issue, but I’ve encountered stranger things.  I didn’t test downgrading.
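If you do want to experiment with that, Maven builds with whatever JDK JAVA_HOME points at, so a minimal sketch would be the following (the Java 6 path below is just an example – adjust it to your install):

export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-amd64   # example path, adjust to your machine
mvn -version                                         # confirm Maven now reports the older JDK
mvn -U -Phadoop2 clean install -DskipTests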

For reference, below is my environment.

james@minerva:~/spark$ mvn -version
Apache Maven 3.0.4
Maven home: /usr/share/maven
Java version: 1.7.0_17, vendor: Oracle Corporation
Java home: /usr/lib/jvm/java-7-oracle/jre
Default locale: en_US, platform encoding: UTF-8
OS name: "linux", version: "3.2.0-38-generic", arch: "amd64", family: "unix"

Eclipse Setup

The Eclipse setup is pretty straightforward, but if you’ve never done a Java/Scala Eclipse setup it can take a couple of hours to figure out what needs to happen.

From within Eclipse, install EGit and the Scala IDE plugin.  Pay attention to the versions of Eclipse and Scala.  At the time of this writing Spark is based on Scala 2.9.2 and I was running Juno.
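If you’re unsure which Scala version your checkout expects, grepping the top-level pom.xml is a quick check (the property name here is an assumption – it may differ between Spark versions):

$ grep -m1 '<scala.version>' pom.xml    # should print something like <scala.version>2.9.2</scala.version>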

I never, ever use the m2eclipse plugin.  Some people I know use it successfully, but not me.  I use mvn to generate the .project and .classpath files.  I don’t know anyone that mixes these approaches.

Below is the command that I used to generate the project files.

$ mvn -Phadoop2 eclipse:clean eclipse:eclipse

Next, import the projects from Git (at this time that includes spark-core, spark-bagel, spark-repl, spark-streaming and spark-examples).  To do this, select File->Import->Projects from Git.

Next, we need to connect the Scala IDE plugin with each project that has Scala source files (spark-core, spark-bagel, spark-repl and spark-streaming).  To do so, right-click on the project and select Configure->Add Scala Nature.
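If you prefer to script this step, Add Scala Nature just edits the project’s .project file.  Treat the fragment below as a sketch of what the plugin writes rather than an exact dump – the builder and natures should end up looking roughly like this:

<!-- inside .project: Scala builder replaces the plain Java builder -->
<buildSpec>
  <buildCommand>
    <name>org.scala-ide.sdt.core.scalabuilder</name>
  </buildCommand>
</buildSpec>
<natures>
  <nature>org.scala-ide.sdt.core.scalanature</nature>
  <nature>org.eclipse.jdt.core.javanature</nature>
</natures>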

Next, we need to add the Scala source folders to the build path (each src/main/scala and src/test/scala folder).   To accomplish this, right-click on the folder and select Add to Build Path->Use As Source Folder.

Spark mixes .java and .scala files in a non-standard way that can confuse the Scala IDE plugin, so we need to make sure that every source folder includes .scala files in the classpath.  To check whether this is the case, look at the .classpath file.  It should have an entry like the following for each Scala source folder.

 <classpathentry including="**/*.java|**/*.scala" kind="src"  path="src/main/scala"/>

If there is no **/*.scala in the classpathentry for a source folder with Scala code in it, then we need to add it.  It can be added through the Eclipse GUI, or we can edit the .classpath file directly.

Inclusion filters can be added from the Eclipse GUI by right-clicking on the source folder, selecting Build Path->Configure Inclusion/Exclusion Filters, and adding **/*.scala.
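A quick way to spot projects that are still missing the filter is to grep the generated .classpath files from the top of the checkout (this lists the files that don’t mention the filter at all):

$ grep -L -F '**/*.scala' */.classpath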

Finally, add spark-core to the build path of spark-repl and spark-streaming.  To do this, right-click on the project and select Build Path->Configure Build Path->Projects->Add, then select spark-core.
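For reference, after that step each dependent project’s .classpath should contain a project reference entry roughly like the following (assuming the imported project is named spark-core):

<classpathentry combineaccessrules="false" kind="src" path="/spark-core"/>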

