Distributed configure (hadoop 2.7.2 & spark 2.1.0)


1. environment

Hadoop 2.7.2, spark 2.1.0, scala 2.11.8, sbt 0.13.15, java 1.8, maven 3.3.9, protobuf 2.5.0, findbugs 2.0.2

2. configure details

2.1 download the required software

  1. download the hadoop source code from https://dist.apache.org/repos/dist/release/hadoop/common/
  2. download the spark source code from http://spark.apache.org/downloads.html
  3. download scala from http://www.scala-lang.org/download/all.html
  4. download sbt from http://www.scala-sbt.org/download.html
  5. download java from http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html
  6. download maven from http://maven.apache.org/download.cgi
  7. download protobuf from https://github.com/google/protobuf/tree/master/src
  8. download findbugs from https://sourceforge.net/projects/findbugs/?source=typ_redirect

2.1.1 configure the requirements

java8 environment

First, remove any Java packages that are already installed on the system.

    # list all the installed Java packages
    rpm -qa | grep java
    # then remove each of them
    rpm -e --nodeps XXXXX   # XXXXX is a package name printed by 'rpm -qa | grep java'
    # upload the JDK 8 archive 'jdk-8u131-linux-x64.tar.gz', which can be downloaded from the official Oracle website
    tar -zxvf jdk-8u131-linux-x64.tar.gz
    vim /etc/profile
    # add the following lines to the file
    JAVA_HOME=/usr/local/java/jdk1.8.0_131             # check that this matches where the JDK was extracted
    JRE_HOME=$JAVA_HOME/jre
    PATH=$JAVA_HOME/bin:$PATH
    CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar:$JRE_HOME/lib
    export PATH JAVA_HOME CLASSPATH
    source /etc/profile

Type java -version and javac in the console. If you see output like the following, Java is installed correctly.

    java version "1.8.0_131"
    Java(TM) SE Runtime Environment (build 1.8.0_131-b11)
    Java HotSpot(TM) 64-Bit Server VM (build 25.131-b11, mixed mode)
    Usage: javac <options> <source files>
    where possible options include:
      -g                         Generate all debugging info
      -g:none                    Generate no debugging info
      -g:{lines,vars,source}     Generate only some debugging info
      -nowarn                    Generate no warnings
      -verbose                   Output messages about what the compiler is doing
      -deprecation               Output source locations where deprecated APIs are used
      -classpath <path>          Specify where to find user class files and annotation processors
      -cp <path>                 Specify where to find user class files and annotation processors
      -sourcepath <path>         Specify where to find input source files
      -bootclasspath <path>      Override location of bootstrap class files
      -extdirs <dirs>            Override location of installed extensions
      -endorseddirs <dirs>       Override location of endorsed standards path

scala-2.11.8 environment

    tar -zxvf scala-2.11.8.tgz
    vim /etc/profile
    # add the following lines to the file
    export SCALA_HOME=/usr/local/scala/scala-2.11.8
    export PATH=$PATH:$SCALA_HOME/bin
    source /etc/profile

Type scala -version in the console. If you see output like the following, Scala is installed correctly.

    Scala code runner version 2.11.8 -- Copyright 2002-2016, LAMP/EPFL

sbt-0.13.15 environment

    curl https://bintray.com/sbt/rpm/rpm | sudo tee /etc/yum.repos.d/bintray-sbt-rpm.repo
    yum install sbt
    sbt

Type sbt sbt-version in the console. If you see output like the following, sbt is installed correctly.

    [info] Set current project to sbt (in build file:/usr/local/sbt/)
    [info] 0.13.15

maven-3.3.9 environment

    tar -zxvf apache-maven-3.3.9-bin.tar.gz
    vim /etc/profile
    # add the following lines to the file
    export MAVEN_HOME=/usr/local/maven/apache-maven-3.3.9
    export PATH=$PATH:$MAVEN_HOME/bin
    source /etc/profile

Type mvn -v in the console. If you see output like the following, Maven is installed correctly.

    Apache Maven 3.3.9 (bb52d8502b132ec0a5a3f4c09453c07478323dc5; 2015-11-10T11:41:47-05:00)
    Maven home: /usr/local/maven/apache-maven-3.3.9
    Java version: 1.8.0_131, vendor: Oracle Corporation
    Java home: /usr/local/java/jdk1.8.0_131/jre
    Default locale: en_US, platform encoding: UTF-8
    OS name: "linux", version: "2.6.32-573.el6.x86_64", arch: "amd64", family: "unix"
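The four version checks above can also be done in one pass. The following is a minimal sketch, not part of the original setup: check_tools is a hypothetical helper name, and it only tests that each command is reachable through PATH, not that the version is the expected one.

```shell
#!/bin/sh
# Report whether each named command is reachable through PATH.
# Prints one "name: found" or "name: MISSING" line per tool and
# returns the number of missing tools as the exit status.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "$tool: found"
        else
            echo "$tool: MISSING"
            missing=$((missing + 1))
        fi
    done
    return "$missing"
}

# Typical call for this guide (commented out so the file can be sourced safely):
# check_tools java javac scala sbt mvn
```

If a tool is reported as MISSING right after editing /etc/profile, the usual cause is that source /etc/profile has not been run in the current shell.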

2.2 configure the distributed system

2.2.1 configure Hadoop

compile Hadoop

First, uncompress hadoop-2.7.2-src.tar.gz.

tar -zxvf hadoop-2.7.2-src.tar.gz

Second, download and install Maven and protobuf.

    tar -zxvf apache-maven-3.3.9-bin.tar.gz
    cd apache-maven-3.3.9
    vim /etc/profile
    export MAVEN_HOME=/your directory/apache-maven-3.3.9
    export PATH=.:$PATH:$JAVA_HOME/bin:$MAVEN_HOME/bin
    source /etc/profile
    ln -s /your directory/apache-maven-3.3.9/bin/mvn /usr/bin/mvn
    tar -zxvf protobuf-2.5.0.tar.gz
    cd protobuf-2.5.0
    ./configure --prefix=/your directory/protobuf-2.5.0
    make
    make install
    vim /etc/profile
    # add the following lines to the file (they assume the prefix above was /usr/local/protobuf/protobuf-2.5.0)
    export PATH=$PATH:/usr/local/protobuf/protobuf-2.5.0/bin/
    export PKG_CONFIG_PATH=/usr/local/protobuf/protobuf-2.5.0/lib/pkgconfig/
    source /etc/profile

Third, download the Hadoop dependencies with Maven and compile the source.

    cd hadoop-2.7.2-src
    mvn clean package -Pdist,native -DskipTests -Dtar

While compiling the Hadoop source code, the following error may appear:

    [ERROR] Failed to execute goal org.apache.maven.plugins:maven-antrun-plugin:1.7:run (dist) on project hadoop-dist: An Ant BuildException has occured: exec returned: 1
    [ERROR] around Ant part ...<exec failonerror="true" dir="/usr/local/hadoop/hadoop-2.7.2-src/hadoop-dist/target" executable="sh">... @ 38:104 in /usr/local/hadoop/hadoop-2.7.2-src/hadoop-dist/target/antrun/build-main.xml

In that case, install the following four dependencies:

  1. cmake : yum install cmake (on CentOS)

  2. findbugs : download it from https://sourceforge.net/projects/findbugs/?source=typ_redirect, uncompress it, and then build it:

    yum install ant
    unzip findbugs-2.0.2-source.zip
    cd /your findbugs directory
    ant
    export FINDBUGS_HOME=/usr/local/findbugs/findbugs-2.0.2

  3. openssl-dev : yum install openssl-devel (on CentOS)

  4. zlib-dev : yum install zlib-devel (on CentOS)

Finally, when the following messages appear, Hadoop has compiled successfully.

    [INFO] ------------------------------------------------------------------------
    [INFO] BUILD SUCCESS
    [INFO] ------------------------------------------------------------------------
    [INFO] Total time: 21:45 min
    [INFO] Finished at: 2017-04-24T19:20:15+08:00
    [INFO] Final Memory: 119M/419M
    [INFO] ------------------------------------------------------------------------

Now we can get the compiled code from:

    /hadoop-2.7.2-src/hadoop-dist/target/hadoop-2.7.2.tar.gz

configure Hadoop in Standalone Operation

By default, Hadoop is configured to run in a non-distributed mode, as a single Java process. And we can test it right now:

    # run these from /usr/local so that the relative and absolute paths refer to the same directories
    mkdir distributed_input
    cp /your directory/hadoop-2.7.2/etc/hadoop/*.xml distributed_input
    /your directory/hadoop-2.7.2/bin/hadoop jar /your directory/hadoop-2.7.2/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar grep /usr/local/distributed_input /usr/local/distributed_output 'dfs[a-z.]+'
    cat distributed_output/*

After a few seconds, the console shows counters like these:

    File System Counters
        FILE: Number of bytes read=1153568
        FILE: Number of bytes written=2210810
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
    Map-Reduce Framework
        Map input records=1
        Map output records=1
        Map output bytes=17
        Map output materialized bytes=25
        Input split bytes=134
        Combine input records=0
        Combine output records=0
        Reduce input groups=1
        Reduce shuffle bytes=25
        Reduce input records=1
        Reduce output records=1
        Spilled Records=2
        Shuffled Maps =1
        Failed Shuffles=0
        Merged Map outputs=1
        GC time elapsed (ms)=0
        Total committed heap usage (bytes)=638582784
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters
        Bytes Read=123
    File Output Format Counters
        Bytes Written=23

That means we have run the example successfully!
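To sanity-check what the grep example computes, the same counts can be reproduced locally without Hadoop. The sketch below is an illustration only: local_grep_count is a hypothetical helper, and it assumes a grep that supports the -o and -h options (GNU and BSD grep both do). It counts every match of the regular expression across the files in a directory and lists the most frequent first, which is the shape of the result the example job writes to its output directory.

```shell
#!/bin/sh
# Count every match of an extended regex across the files in a directory,
# most frequent first, printing "count match" on each line.
local_grep_count() {
    dir=$1
    pattern=$2
    # -o prints only the matching text, -h hides file names, -E enables '+'
    grep -ohE "$pattern" "$dir"/* | sort | uniq -c | sort -rn | awk '{ print $1, $2 }'
}

# Usage with the directory from this section (commented out; needs the input files):
# local_grep_count /usr/local/distributed_input 'dfs[a-z.]+'
```

If the local counts and the contents of distributed_output disagree, the input directory passed to the job is the first thing to check.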

To invoke the hadoop command from any directory, add the following lines to /etc/profile and reload it:

    export HADOOP_HOME=/your directory/hadoop-2.7.2
    export PATH=$PATH:$HADOOP_HOME/bin
    source /etc/profile

configure Hadoop in Pseudo-Distributed Operation

Pseudo-distributed mode also uses a single node, but each Hadoop daemon runs in a separate Java process.

Several files need to be modified:

hosts, ssh, network, ifcfg-eth0, resolv.conf

    # modify the hosts file.
    vim /etc/hosts
    # add the IP address and hostname of the master and of each worker
    # here is my example.
    127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
    ::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
    192.168.159.132 master
    192.168.159.134 worker2
    192.168.159.133 worker3
    # modify the hostname.
    vim /etc/sysconfig/network
    # add the proper settings, here is my example.
    NETWORKING=yes
    HOSTNAME=master # on a worker, use that worker's hostname instead
    # restart the network
    service network restart
    # set up the SSH keys, because every machine must be able to connect to the others without a password.
    ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
    cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
    chmod 0600 ~/.ssh/authorized_keys
    # the steps above should be done on each machine; then copy each worker's key to the master and finally send the combined authorized_keys file to all workers.
    # log in to worker2 and then:
    ssh-copy-id -i master # copy the key from worker2 to the master
    # log in to worker3 and then:
    ssh-copy-id -i master # copy the key from worker3 to the master
    # back on the master:
    scp /root/.ssh/authorized_keys worker2:/root/.ssh/ # send authorized_keys to worker2
    scp /root/.ssh/authorized_keys worker3:/root/.ssh/ # send authorized_keys to worker3
    # make the directories for the hadoop data, hdfs, logs and so on; the directory tree looks like:
    distribute_data
    ├── hadoop
    │   ├── data
    │   ├── hdfs
    │   ├── logs
    │   ├── name
    │   └── temp
    └── spark
    ###################################################################################################
    # the steps below are optional.
    ###################################################################################################
    # modify the network adapter.
    vim /etc/sysconfig/network-scripts/ifcfg-eth0
    # add the following settings.
    DEVICE=eth0
    TYPE=Ethernet
    ONBOOT=yes
    NM_CONTROLLED=yes
    BOOTPROTO=static
    DEFROUTE=yes
    IPV4_FAILURE_FATAL=yes
    IPV6INIT=no
    NAME="System eth0"
    HWADDR=00:02:c9:03:00:31:78:f2
    PEERDNS=yes
    PEERROUTES=yes
    IPADDR=
    # modify the DNS
    # vim /etc/resolv.conf
    # add the following setting (it depends on your system and network).
    nameserver


hadoop-env.sh:

    # modify the JAVA_HOME variable.
    export JAVA_HOME=/your JAVA_HOME directory.
    # here is mine.
    # export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-


slaves:

    # add the hostname of every worker machine to the slaves file.
    worker2
    worker3


core-site.xml:

    <configuration>
        <property>
            <name>fs.defaultFS</name>
            <value>hdfs://master:9000</value>
        </property>
        <!-- Size of read/write buffer used in SequenceFiles. -->
        <property>
            <name>io.file.buffer.size</name>
            <value>131072</value>
        </property>
        <!-- hadoop temp directory, it depends on you -->
        <property>
            <name>hadoop.tmp.dir</name>
            <value>/usr/local/distribute_data/hadoop/temp</value>
        </property>
    </configuration>


hdfs-site.xml:

    <configuration>
        <property>
            <name>dfs.namenode.secondary.http-address</name>
            <value>master:50090</value>
        </property>
        <property>
            <name>dfs.replication</name>
            <value>2</value>
        </property>
        <property>
            <name>dfs.namenode.name.dir</name>
            <value>file:/usr/local/distribute_data/hadoop/hdfs/name</value>
        </property>
        <property>
            <name>dfs.datanode.data.dir</name>
            <value>file:/usr/local/distribute_data/hadoop/hdfs/data</value>
        </property>
    </configuration>


mapred-site.xml:

    <configuration>
        <property>
            <name>mapreduce.framework.name</name>
            <value>yarn</value>
        </property>
        <property>
            <name>mapreduce.jobhistory.address</name>
            <value>master:10020</value>
        </property>
        <property>
            <name>mapreduce.jobhistory.webapp.address</name>
            <value>master:19888</value>
        </property>
    </configuration>


yarn-site.xml:

    <configuration>
        <!-- Site specific YARN configuration properties -->
        <property>
            <name>yarn.nodemanager.aux-services</name>
            <value>mapreduce_shuffle</value>
        </property>
        <property>
            <name>yarn.resourcemanager.address</name>
            <value>master:8032</value>
        </property>
        <property>
            <name>yarn.resourcemanager.scheduler.address</name>
            <value>master:8030</value>
        </property>
        <property>
            <name>yarn.resourcemanager.resource-tracker.address</name>
            <value>master:8031</value>
        </property>
        <property>
            <name>yarn.resourcemanager.admin.address</name>
            <value>master:8033</value>
        </property>
        <property>
            <name>yarn.resourcemanager.webapp.address</name>
            <value>master:8088</value>
        </property>
    </configuration>
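When several property blocks are copied and edited by hand, it is easy to leave the same name in a file twice with different values. The sketch below is one way to catch that before starting the cluster. duplicate_property_names is a hypothetical helper, and it is a plain-text scan rather than a real XML parser: it assumes no line break occurs inside a name element.

```shell
#!/bin/sh
# Print every <name> value that occurs more than once in a Hadoop *-site.xml file.
# Plain-text scan, not an XML parser: assumes each <name>...</name> element
# is not split across lines.
duplicate_property_names() {
    grep -o '<name>[^<]*</name>' "$1" |
        sed -e 's/<name>//' -e 's/<\/name>//' |
        sort | uniq -d
}

# Usage (commented out; needs the real config directory):
# for f in /usr/local/hadoop/hadoop-2.7.2/etc/hadoop/*-site.xml; do
#     duplicate_property_names "$f"
# done
```

An empty result means no property name is repeated in that file.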

After modifying the slaves, core-site.xml, hdfs-site.xml, mapred-site.xml and yarn-site.xml files on the master, copy them to the workers like this:

    scp core-site.xml worker2:/usr/local/hadoop/hadoop-2.7.2/etc/hadoop/
    scp core-site.xml worker3:/usr/local/hadoop/hadoop-2.7.2/etc/hadoop/
    scp hdfs-site.xml worker2:/usr/local/hadoop/hadoop-2.7.2/etc/hadoop/
    scp hdfs-site.xml worker3:/usr/local/hadoop/hadoop-2.7.2/etc/hadoop/
    scp yarn-site.xml worker2:/usr/local/hadoop/hadoop-2.7.2/etc/hadoop/
    scp yarn-site.xml worker3:/usr/local/hadoop/hadoop-2.7.2/etc/hadoop/
    # Hadoop reads mapred-site.xml, not mapred-site.xml.template, so make sure the edited file has that name
    scp mapred-site.xml worker2:/usr/local/hadoop/hadoop-2.7.2/etc/hadoop/
    scp mapred-site.xml worker3:/usr/local/hadoop/hadoop-2.7.2/etc/hadoop/
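With more workers or more files, the copy step is easier to keep consistent as a loop. The following is a sketch under stated assumptions: push_hadoop_conf and the DRY_RUN variable are hypothetical names, the file list (including slaves) is taken from the files named in this section, and the config directory is assumed to be the same path on every machine. By default it only prints the scp commands so they can be reviewed first.

```shell
#!/bin/sh
# Print (or run) the scp commands that copy each Hadoop config file to each worker.
# DRY_RUN=1 (the default here) only prints the commands; DRY_RUN=0 runs them.
push_hadoop_conf() {
    conf_dir=$1
    shift
    for worker in "$@"; do
        for f in slaves core-site.xml hdfs-site.xml mapred-site.xml yarn-site.xml; do
            if [ "${DRY_RUN:-1}" = "1" ]; then
                echo "scp $conf_dir/$f $worker:$conf_dir/"
            else
                scp "$conf_dir/$f" "$worker:$conf_dir/"
            fi
        done
    done
}

# Usage (commented out; DRY_RUN=0 needs passwordless SSH to the workers):
# push_hadoop_conf /usr/local/hadoop/hadoop-2.7.2/etc/hadoop worker2 worker3
```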

Finally, format the HDFS file system and start the cluster.

    cd /your directory/hadoop-2.7.2
    ./bin/hdfs namenode -format

If the following messages appear, the file system was formatted successfully.

    17/04/26 16:02:20 INFO common.Storage: Storage directory /usr/local/distribute_data/hadoop/hdfs/name has been successfully formatted.
    17/04/26 16:02:20 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
    17/04/26 16:02:20 INFO util.ExitUtil: Exiting with status 0
    17/04/26 16:02:20 INFO namenode.NameNode: SHUTDOWN_MSG: 
    /************************************************************
    SHUTDOWN_MSG: Shutting down NameNode at master/
    ************************************************************/

Then start the cluster:

    ./sbin/start-all.sh  # the Hadoop team actually recommends using start-dfs.sh and start-yarn.sh instead

If the following messages appear, the cluster started successfully.

    Starting namenodes on [master]
    master: starting namenode, logging to /usr/local/hadoop/hadoop-2.7.2/logs/hadoop-root-namenode-master.out
    localhost: starting datanode, logging to /usr/local/hadoop/hadoop-2.7.2/logs/hadoop-root-datanode-master.out
    worker2: starting datanode, logging to /usr/local/hadoop/hadoop-2.7.2/logs/hadoop-root-datanode-worker2.out
    worker3: starting datanode, logging to /usr/local/hadoop/hadoop-2.7.2/logs/hadoop-root-datanode-worker3.out
    Starting secondary namenodes [master]
    master: starting secondarynamenode, logging to /usr/local/hadoop/hadoop-2.7.2/logs/hadoop-root-secondarynamenode-master.out
    starting yarn daemons
    starting resourcemanager, logging to /usr/local/hadoop/hadoop-2.7.2/logs/yarn-root-resourcemanager-master.out
    worker2: starting nodemanager, logging to /usr/local/hadoop/hadoop-2.7.2/logs/yarn-root-nodemanager-worker2.out
    localhost: starting nodemanager, logging to /usr/local/hadoop/hadoop-2.7.2/logs/yarn-root-nodemanager-master.out
    worker3: starting nodemanager, logging to /usr/local/hadoop/hadoop-2.7.2/logs/yarn-root-nodemanager-worker3.out

Type jps in the master's console to list the running daemons:

    18544 NodeManager
    17540 NameNode
    17764 DataNode
    18437 ResourceManager
    19557 Jps
    18092 SecondaryNameNode

configure Hadoop in Fully-Distributed Operation

Fully-distributed mode is very similar to pseudo-distributed mode; the configuration steps are the same as above. To test the distributed setup, upload some files to HDFS:

    cp /usr/local/hadoop/hadoop-2.7.2/etc/hadoop/*.xml /usr/local/distribute_data/hadoop/data/
    /your directory/hadoop-2.7.2/bin/hdfs dfs -mkdir /in
    hdfs dfs -put /usr/local/distribute_data/hadoop/data/* /in
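Before uploading more data or submitting jobs, it can help to confirm that every expected daemon appears in the jps listing on each machine. The sketch below reads jps output on standard input and reports which names are absent. check_daemons is a hypothetical helper, and the list of daemons to expect on a given node is an assumption to adjust for your cluster.

```shell
#!/bin/sh
# Read `jps` output on stdin and report which of the expected daemons are absent.
# Prints one "missing: NAME" line per absent daemon; returns 0 only if none are missing.
check_daemons() {
    listing=$(cat)
    rc=0
    for daemon in "$@"; do
        # jps prints "<pid> <name>", so compare the name against the second field
        if ! printf '%s\n' "$listing" |
            awk -v d="$daemon" '$2 == d { found = 1 } END { exit !found }'; then
            echo "missing: $daemon"
            rc=1
        fi
    done
    return "$rc"
}

# Usage on the master node (commented out; requires a running cluster):
# jps | check_daemons NameNode DataNode SecondaryNameNode ResourceManager NodeManager
```

On a worker, the same function can be called with only DataNode and NodeManager.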

2.2.2 configure spark

compile spark

First, add MAVEN_OPTS to the profile. This helps avoid heap space errors during the build.

    vim /etc/profile
    export MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=512m"
    source /etc/profile

Second, uncompress the Spark source code, then download the Spark dependencies with Maven and compile it.

    tar -zxvf spark-2.1.0.tgz
    cd spark-2.1.0
    # declare the hadoop version; it must match the version built above.
    ./build/mvn -Pyarn -Phadoop-2.7 -Dhadoop.version=2.7.2 -DskipTests clean package  # this can take a long time

If an error message like this appears:

    [INFO] ------------------------------------------------------------------------
    [INFO] BUILD FAILURE
    [INFO] ------------------------------------------------------------------------
    [INFO] Total time: 20.189 s
    [INFO] Finished at: 2017-04-27T14:58:14+08:00
    [INFO] Final Memory: 41M/211M
    [INFO] ------------------------------------------------------------------------
    [ERROR] Failed to execute goal net.alchim31.maven:scala-maven-plugin:3.2.2:compile (scala-compile-first) on project spark-tags_2.11: Execution scala-compile-first of goal net.alchim31.maven:scala-maven-plugin:3.2.2:compile failed.: CompileFailed -> [Help 1]
    [ERROR] 
    [ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
    [ERROR] Re-run Maven using the -X switch to enable full debug logging.
    [ERROR] 
    [ERROR] For more information about the errors and possible solutions, please read the following articles:
    [ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/PluginExecutionException
    [ERROR] 
    [ERROR] After correcting the problems, you can resume the build with the command
    [ERROR]   mvn <goals> -rf :spark-tags_2.11

make sure the right versions of Scala (2.11.8) and Maven (3.3.9) are installed, then reboot the system and run the same command again. In my case, the reboot fixed the error.

If the following messages appear, the Spark libraries have been downloaded and compiled successfully.

    [INFO] Spark Project Parent POM ........................... SUCCESS [ 13.441 s]
    [INFO] Spark Project Tags ................................. SUCCESS [ 17.570 s]
    [INFO] Spark Project Sketch ............................... SUCCESS [ 17.121 s]
    [INFO] Spark Project Networking ........................... SUCCESS [ 22.711 s]
    [INFO] Spark Project Shuffle Streaming Service ............ SUCCESS [ 14.905 s]
    [INFO] Spark Project Unsafe ............................... SUCCESS [ 22.380 s]
    [INFO] Spark Project Launcher ............................. SUCCESS [ 22.645 s]
    [INFO] Spark Project Core ................................. SUCCESS [10:59 min]
    [INFO] Spark Project ML Local Library ..................... SUCCESS [04:20 min]
    [INFO] Spark Project GraphX ............................... SUCCESS [02:17 min]
    [INFO] Spark Project Streaming ............................ SUCCESS [04:35 min]
    [INFO] Spark Project Catalyst ............................. SUCCESS [08:20 min]
    [INFO] Spark Project SQL .................................. SUCCESS [14:59 min]
    [INFO] Spark Project ML Library ........................... SUCCESS [09:13 min]
    [INFO] Spark Project Tools ................................ SUCCESS [01:12 min]
    [INFO] Spark Project Hive ................................. SUCCESS [11:40 min]
    [INFO] Spark Project REPL ................................. SUCCESS [03:23 min]
    [INFO] Spark Project YARN Shuffle Service ................. SUCCESS [02:26 min]
    [INFO] Spark Project YARN ................................. SUCCESS [05:34 min]
    [INFO] Spark Project Assembly ............................. SUCCESS [01:25 min]
    [INFO] Spark Project External Flume Sink .................. SUCCESS [02:42 min]
    [INFO] Spark Project External Flume ....................... SUCCESS [03:03 min]
    [INFO] Spark Project External Flume Assembly .............. SUCCESS [ 40.208 s]
    [INFO] Spark Integration for Kafka 0.8 .................... SUCCESS [02:38 min]
    [INFO] Spark Project Examples ............................. SUCCESS [08:19 min]
    [INFO] Spark Project External Kafka Assembly .............. SUCCESS [01:10 min]
    [INFO] Spark Integration for Kafka 0.10 ................... SUCCESS [04:44 min]
    [INFO] Spark Integration for Kafka 0.10 Assembly .......... SUCCESS [02:35 min]
    [INFO] Kafka 0.10 Source for Structured Streaming ......... SUCCESS [03:42 min]
    [INFO] Spark Project Java 8 Tests ......................... SUCCESS [06:04 min]
    [INFO] ------------------------------------------------------------------------
    [INFO] BUILD SUCCESS
    [INFO] ------------------------------------------------------------------------
    [INFO] Total time: 01:58 h
    [INFO] Finished at: 2017-04-27T19:44:19+08:00
    [INFO] Final Memory: 88M/852M
    [INFO] ------------------------------------------------------------------------

configure spark distributed in hadoop.

Once Spark has compiled successfully, a conf directory is available. Add the hostname of every machine to the slaves file and configure the spark-env.sh file.

    # add these lines to the slaves file.
    master
    worker2
    worker3
    # add these lines to the spark-env.sh file
    export SPARK_MASTER_IP=master
    export SPARK_MASTER_PORT=7077
    export SPARK_WORKER_CORES=1
    export SPARK_WORKER_INSTANCES=1
    export SPARK_WORKER_MEMORY=512M
    export HADOOP_CONF_DIR=/usr/local/hadoop/hadoop-2.7.2/etc/hadoop

And for convenience, we can configure the profile for spark environment.

    vim /etc/profile
    # add the following lines.
    export SPARK_HOME=/your directory/spark-2.1.0
    export PATH=$PATH:$SPARK_HOME/bin
    source /etc/profile

Finally, send the compiled Spark directory to the worker machines.

    scp -r spark-2.1.0 worker2:/usr/local/spark/
    scp -r spark-2.1.0 worker3:/usr/local/spark/

Now we can start spark-2.1.0 on hadoop-2.7.2 and run some tests.

    cd /your directory/spark-2.1.0
    ./sbin/start-all.sh

Then start the spark-shell. The following messages appear:

    ./bin/spark-shell
    Setting default log level to "WARN".
    To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
    17/04/28 14:01:14 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    17/04/28 14:01:14 WARN spark.SparkConf: 
    SPARK_WORKER_INSTANCES was detected (set to '1').
    This is deprecated in Spark 1.0+.

    Please instead use:
     - ./spark-submit with --num-executors to specify the number of executors
     - Or set SPARK_EXECUTOR_INSTANCES
     - spark.executor.instances to configure the number of instances in the spark config.

    Spark context Web UI available at
    Spark context available as 'sc' (master = local[*], app id = local-1493402474643).
    Spark session available as 'spark'.
    Welcome to
          ____              __
         / __/__  ___ _____/ /__
        _\ \/ _ \/ _ `/ __/  '_/
       /___/ .__/\_,_/_/ /_/\_\   version 2.1.0
          /_/

    Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_131)
    Type in expressions to have them evaluated.
    Type :help for more information.

2.3 develop spark in IDEA(intellij)

2.3.1 configure java environment in Windows.

    JAVA_HOME='your java directory'  # running java -verbose shows the Java directory on your PC.
    CLASSPATH=%JAVA_HOME%\lib\dt.jar;%JAVA_HOME%\lib\tools.jar;
    PATH=%JAVA_HOME%\bin;            # add this to the existing Path value.

After that, type java -version, javac and java. If you see output like the following, Java is installed correctly.

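    Microsoft Windows [Version 6.1.7601]
    Copyright (c) 2009 Microsoft Corporation.  All rights reserved.

    C:\Users\user>java -version
    java version "1.8.0_112"
    Java(TM) SE Runtime Environment (build 1.8.0_112-b15)
    Java HotSpot(TM) 64-Bit Server VM (build 25.112-b15, mixed mode)

    C:\Users\user>javac
    Usage: javac <options> <source files>
    where possible options include:
      -g                         Generate all debugging info
      -g:none                    Generate no debugging info
      -g:{lines,vars,source}     Generate only some debugging info
      -nowarn                    Generate no warnings
      -verbose                   Output messages about what the compiler is doing
      -deprecation               Output source locations where deprecated APIs are used
      -classpath <path>          Specify where to find user class files and annotation processors
      -cp <path>                 Specify where to find user class files and annotation processors
      -sourcepath <path>         Specify where to find input source files
      -bootclasspath <path>      Override location of bootstrap class files
      -extdirs <dirs>            Override location of installed extensions
      -endorseddirs <dirs>       Override location of endorsed standards path
      -proc:{none,only}          Control whether annotation processing and/or compilation is done.
      -processor <class1>[,<class2>,<class3>...] Names of the annotation processors to run; bypasses default discovery process
      -processorpath <path>      Specify where to find annotation processors
      -parameters                Generate metadata for reflection on method parameters
      -d <directory>             Specify where to place generated class files
      -s <directory>             Specify where to place generated source files
      -h <directory>             Specify where to place generated native header files
      -implicit:{none,class}     Specify whether or not to generate class files for implicitly referenced files
      -encoding <encoding>       Specify character encoding used by source files
      -source <release>          Provide source compatibility with specified release
      -target <release>          Generate class files for specific VM version
      -profile <profile>         Check that API used is available in the specified profile
      -version                   Version information
      -help                      Print a synopsis of standard options
      -Akey[=value]              Options to pass to annotation processors
      -X                         Print a synopsis of nonstandard options
      -J<flag>                   Pass <flag> directly to the runtime system
      -Werror                    Terminate compilation if warnings occur
      @<filename>                Read options and filenames from file

    C:\Users\user>java
    Usage: java [-options] class [args...]
               (to execute a class)
       or  java [-options] -jar jarfile [args...]
               (to execute a jar file)
    where options include:
        -d32          use a 32-bit data model if available
        -d64          use a 64-bit data model if available
        -server       to select the "server" VM
                      The default VM is server.

        -cp <class search path of directories and zip/jar files>
        -classpath <class search path of directories and zip/jar files>
                      A ; separated list of directories, JAR archives,
                      and ZIP archives to search for class files.
        -D<name>=<value>
                      set a system property
        -verbose:[class|gc|jni]
                      enable verbose output
        -version      print product version and exit
        -version:<value>
                      Warning: this feature is deprecated and will be removed
                      in a future release.
                      require the specified version to run
        -showversion  print product version and continue
        -jre-restrict-search | -no-jre-restrict-search
                      Warning: this feature is deprecated and will be removed
                      in a future release.
                      include/exclude user private JREs in the version search
        -? -help      print this help message
        -X            print help on non-standard options
        -ea[:<packagename>...|:<classname>]
        -enableassertions[:<packagename>...|:<classname>]
                      enable assertions with specified granularity
        -da[:<packagename>...|:<classname>]
        -disableassertions[:<packagename>...|:<classname>]
                      disable assertions with specified granularity
        -esa | -enablesystemassertions
                      enable system assertions
        -dsa | -disablesystemassertions
                      disable system assertions
        -agentlib:<libname>[=<options>]
                      load native agent library <libname>, e.g. -agentlib:hprof
                      see also, -agentlib:jdwp=help and -agentlib:hprof=help
        -agentpath:<pathname>[=<options>]
                      load native agent library by full pathname
        -javaagent:<jarpath>[=<options>]
                      load Java programming language agent, see java.lang.instrument
        -splash:<imagepath>
                      show splash screen with specified image
    See http://www.oracle.com/technetwork/java/javase/documentation/index.html for more details.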

2.3.2 configure build.sbt

    // add the following line
    libraryDependencies += "org.apache.spark" % "spark-core_2.11" % "2.1.0"

2.3.3 example code

    import org.apache.spark.{SparkConf, SparkContext}

    object test {
        def main(args: Array[String]): Unit = {
            System.setProperty("hadoop.home.dir", "D:\\spark\\IntelliJ IDEA\\work sheet\\hadoop-2.7.2")
            if (args.length < 1) {
                System.err.println("Usage: <file>")
                System.exit(1)
            }
            val conf = new SparkConf()
            val sc = new SparkContext("local","wordcount",conf)
            val line = sc.textFile(args(0))
            line.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_+_).collect().foreach(println)
            sc.stop()
        }
    }

The input file is passed as a program argument, which is set under 'Edit configuration' in IntelliJ:

Program argument: E:\Distributed\Configuration\Distributed_configure.md


Running the program prints output like this:

    ....
    17/05/08 17:55:41 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool 
    17/05/08 17:55:41 INFO DAGScheduler: ResultStage 1 (collect at test.scala:20) finished in 0.100 s
    17/05/08 17:55:41 INFO DAGScheduler: Job 0 finished: collect at test.scala:20, took 1.311414 s
    (,1977)
    (the,95)
    (```,88)
    ([INFO],51)
    (to,47)
    (#,43)
    (Spark,33)
    (SUCCESS,32)
    (and,27)
    (Project,26)
    (in,25)
    (export,23)
    (min],22)
    (it,19)
    ....
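The word counts can be cross-checked outside Spark with a short shell pipeline. This is a rough sketch, not an exact equivalent of the Scala job: local_wordcount is a hypothetical helper, it splits on any whitespace rather than on single spaces, and it drops empty tokens, so the entry with an empty word is not reproduced and counts for other words may differ slightly.

```shell
#!/bin/sh
# Count whitespace-separated words in a file, most frequent first,
# printing "(word,count)" lines in the same shape as the Spark output.
# Empty tokens are dropped, unlike the Scala job's split on single spaces.
local_wordcount() {
    tr -s '[:space:]' '\n' < "$1" |
        grep -v '^$' |
        sort | uniq -c | sort -rn |
        awk '{ printf "(%s,%d)\n", $2, $1 }'
}

# Usage (commented out; pass the same file that was given to the Spark job):
# local_wordcount Distributed_configure.md | head
```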