一、Create the hadoop user and directories
1 Create the user
```
[root@hadoop000 ~]# useradd hadoop
```
2 Set a password
```
[root@hadoop000 ~]# passwd hadoop
```
3 Switch to the hadoop user
```
[root@hadoop000 ~]# su - hadoop
[hadoop@hadoop000 ~]$ pwd
/home/hadoop
```
4 Create the working directories
```
[hadoop@hadoop000 ~]$ mkdir app data software lib source
[hadoop@hadoop000 ~]$ ll
total 20
drwxrwxr-x 2 hadoop hadoop 4096 Sep  6 09:59 app
drwxrwxr-x 2 hadoop hadoop 4096 Sep  6 09:59 data
drwxrwxr-x 2 hadoop hadoop 4096 Sep  6 09:59 lib
drwxrwxr-x 2 hadoop hadoop 4096 Sep  6 09:59 software
drwxrwxr-x 2 hadoop hadoop 4096 Sep  6 09:59 source
```
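To confirm the account was created, you can check it with `id`; a minimal sketch (the uid/gid values shown are illustrative, not taken from the machine above):

```
[root@hadoop000 ~]# id hadoop
uid=500(hadoop) gid=500(hadoop) groups=500(hadoop)
```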
二、Install the software needed for the build (JDK/Maven/Scala/Git)
From the official build documentation (Building Apache Spark, "Apache Maven" section): "The Maven-based build is the build of reference for Apache Spark. Building Spark using Maven requires Maven 3.3.9 or newer and Java 8+. Note that support for Java 7 was removed as of Spark 2.2.0." In other words, building Spark 2.2.0 requires JDK 1.8+ and Maven 3.3.9 or newer.
**Note: the JDK, Maven, Scala and other packages uploaded to the server are installed as the hadoop user, so run `chown -R hadoop:hadoop xxx.tar.gz` (as root) to give the hadoop user full ownership of them.**
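For example, a minimal sketch assuming the archives were uploaded to /home/hadoop/software (adjust the path to wherever you actually put them):

```
# run as root; give the hadoop user ownership of everything it will install
chown -R hadoop:hadoop /home/hadoop/software
```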
1、Install JDK 1.8
JDK 1.8 download:
http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html
Extract the JDK into the app directory under the hadoop user's home:
```
[hadoop@hadoop000 software]$ tar -zxvf jdk-8u144-linux-x64.tar.gz -C ~/app/
[hadoop@hadoop000 jdk1.8.0_144]$ pwd
/home/hadoop/app/jdk1.8.0_144
```
Edit ~/.bash_profile and add the following two lines:
```
export JAVA_HOME=/home/hadoop/app/jdk1.8.0_144
export PATH=$JAVA_HOME/bin:$PATH
```
Run `source ~/.bash_profile` to make the environment variables take effect, then verify the configuration:
```
[hadoop@hadoop000 jdk1.8.0_144]$ which java
~/app/jdk1.8.0_144/bin/java
[hadoop@hadoop000 jdk1.8.0_144]$ java -version
java version "1.8.0_144"
Java(TM) SE Runtime Environment (build 1.8.0_144-b01)
Java HotSpot(TM) 64-Bit Server VM (build 25.144-b01, mixed mode)
```
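The Spark build needs the full JDK (i.e. the `javac` compiler), not just a JRE, so it is also worth a quick check that the compiler is on the PATH; a small sketch (the version string shown is what 8u144 typically prints):

```
[hadoop@hadoop000 ~]$ javac -version
javac 1.8.0_144
```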
2、Install Maven 3.3.9
Maven 3.3.9 download:
https://mirrors.tuna.tsinghua.edu.cn/apache//maven/maven-3/3.3.9/binaries/
Unzip the archive:
```
[hadoop@hadoop000 maven-3.3.9]$ unzip apache-maven-3.3.9-bin.zip
```
Move it to the app directory:
```
[hadoop@hadoop000 maven-3.3.9]$ mv apache-maven-3.3.9 ~/app/
```
Rename it:
```
[hadoop@hadoop000 maven-3.3.9]$ mv apache-maven-3.3.9/ maven-3.3.9/
```
Check the path:
```
[hadoop@hadoop000 maven-3.3.9]$ pwd
/home/hadoop/app/maven-3.3.9
```
Edit the environment variables (remember to `source ~/.bash_profile` afterwards):
```
[hadoop@hadoop000 maven-3.3.9]$ vim ~/.bash_profile
export MAVEN_HOME=/home/hadoop/app/maven-3.3.9
export PATH=$MAVEN_HOME/bin:$PATH
```
Verify:
```
[hadoop@hadoop000 maven-3.3.9]$ which mvn
~/app/maven-3.3.9/bin/mvn
[hadoop@hadoop000 maven-3.3.9]$ mvn -version
Apache Maven 3.3.9 (bb52d8502b132ec0a5a3f4c09453c07478323dc5; 2015-11-11T00:41:47+08:00)
Maven home: /home/hadoop/app/maven-3.3.9
Java version: 1.8.0_144, vendor: Oracle Corporation
Java home: /home/hadoop/app/jdk1.8.0_144/jre
Default locale: zh_CN, platform encoding: UTF-8
OS name: "linux", version: "2.6.32-431.el6.x86_64", arch: "amd64", family: "unix"
```
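Before a build as large as Spark, the official build docs also recommend giving Maven extra memory. A sketch along those lines (the values can be tuned, and the make-distribution.sh script may set a similar default itself if MAVEN_OPTS is unset):

```
# optional: give Maven more heap and code cache for the Spark build
export MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=512m"
```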
3、Install Scala
Scala will be used when working with Spark later, so install it now. Scala 2.11.7 download:
http://www.scala-lang.org/download/2.11.7.html
Extract it into the app directory:
```
[hadoop@hadoop000 software]$ tar -zxvf scala-2.11.7.tgz -C ~/app/
```
Check the path:
```
[hadoop@hadoop000 app]$ cd scala-2.11.7/
[hadoop@hadoop000 scala-2.11.7]$ pwd
/home/hadoop/app/scala-2.11.7
```
Configure the environment variables (remember to `source ~/.bash_profile` afterwards):
```
[hadoop@hadoop000 scala-2.11.7]$ vim ~/.bash_profile
export SCALA_HOME=/home/hadoop/app/scala-2.11.7
export PATH=$SCALA_HOME/bin:$PATH
```
Verify:
```
[hadoop@hadoop000 scala-2.11.7]$ scala -version
Scala code runner version 2.11.7 -- Copyright 2002-2013, LAMP/EPFL
[hadoop@hadoop000 scala-2.11.7]$ scala
Welcome to Scala version 2.11.7 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_144).
Type in expressions to have them evaluated.
Type :help for more information.

scala>
```
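If you do not want to open the interactive REPL, a one-line sanity check works as well; a small sketch (the expression is arbitrary):

```
[hadoop@hadoop000 ~]$ scala -e 'println(1 + 1)'
2
```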
4、Install Git (optional)
As root, add `hadoop ALL=(ALL) ALL` to /etc/sudoers:
```
[root@hadoop000 bin]# vim /etc/sudoers
#Allow root to run any commands anywhere
root    ALL=(ALL)       ALL
hadoop  ALL=(ALL)       ALL
```
Save with `:wq!` (the file is read-only, so a forced write is needed). Then install Git with sudo as the hadoop user. The first attempt below fails because a local `centos-iso` repository is unreachable; simply re-running the command succeeds using the online mirrors:
```
[hadoop@hadoop000 scala-2.11.7]$ sudo yum -y install git
Loaded plugins: fastestmirror, refresh-packagekit, security
Loading mirror speeds from cached hostfile
 * base: mirrors.sohu.com
 * extras: mirrors.sohu.com
 * updates: mirrors.sohu.com
base                                                     | 3.7 kB     00:00
base/primary_db                                          | 4.7 MB     00:07
http://192.168.95.10/iso/repodata/repomd.xml: [Errno 14] PYCURL ERROR 7 - "couldn't connect to host"
Trying other mirror.
Error: Cannot retrieve repository metadata (repomd.xml) for repository: centos-iso. Please verify its path and try again
[hadoop@hadoop000 scala-2.11.7]$ sudo yum -y install git
Loaded plugins: fastestmirror, refresh-packagekit, security
Loading mirror speeds from cached hostfile
 * base: mirrors.sohu.com
 * extras: mirrors.sohu.com
 * updates: mirrors.sohu.com
extras                                                   | 3.4 kB     00:00
extras/primary_db                                        |  29 kB     00:00
updates                                                  | 3.4 kB     00:00
updates/primary_db                                       | 3.1 MB     00:07
Setting up Install Process
Resolving Dependencies
--> Running transaction check
---> Package git.x86_64 0:1.7.1-9.el6_9 will be installed
--> Processing Dependency: perl-Git = 1.7.1-9.el6_9 for package: git-1.7.1-9.el6_9.x86_64
--> Processing Dependency: perl(Git) for package: git-1.7.1-9.el6_9.x86_64
--> Processing Dependency: perl(Error) for package: git-1.7.1-9.el6_9.x86_64
--> Running transaction check
---> Package perl-Error.noarch 1:0.17015-4.el6 will be installed
---> Package perl-Git.noarch 0:1.7.1-9.el6_9 will be installed
--> Finished Dependency Resolution

Dependencies Resolved
================================================================================
 Package          Arch        Version                 Repository        Size
================================================================================
Installing:
 git              x86_64      1.7.1-9.el6_9           updates          4.6 M
Installing for dependencies:
 perl-Error       noarch      1:0.17015-4.el6         base              29 k
 perl-Git         noarch      1.7.1-9.el6_9           updates           29 k

Transaction Summary
================================================================================
Install       3 Package(s)

Total download size: 4.7 M
Installed size: 15 M
Downloading Packages:
(1/3): git-1.7.1-9.el6_9.x86_64.rpm                      | 4.6 MB     00:08
(2/3): perl-Error-0.17015-4.el6.noarch.rpm               |  29 kB     00:00
(3/3): perl-Git-1.7.1-9.el6_9.noarch.rpm                 |  29 kB     00:00
--------------------------------------------------------------------------------
Total                                           560 kB/s | 4.7 MB     00:08
warning: rpmts_HdrFromFdno: Header V3 RSA/SHA256 Signature, key ID c105b9de: NOKEY
Retrieving key from file:///etc/pki/rpm-gpg/RPM-GPG-KEY-CentOS-6
Importing GPG key 0xC105B9DE:
 Userid : CentOS-6 Key (CentOS 6 Official Signing Key) <centos-6-key@centos.org>
 Package: centos-release-6-5.el6.centos.11.1.x86_64 (@anaconda-CentOS-201311272149.x86_64/6.5)
 From   : /etc/pki/rpm-gpg/RPM-GPG-KEY-CentOS-6
Running rpm_check_debug
Running Transaction Test
Transaction Test Succeeded
Running Transaction
  Installing : 1:perl-Error-0.17015-4.el6.noarch                          1/3
  Installing : git-1.7.1-9.el6_9.x86_64                                   2/3
  Installing : perl-Git-1.7.1-9.el6_9.noarch                              3/3
  Verifying  : 1:perl-Error-0.17015-4.el6.noarch                          1/3
  Verifying  : git-1.7.1-9.el6_9.x86_64                                   2/3
  Verifying  : perl-Git-1.7.1-9.el6_9.noarch                              3/3

Installed:
  git.x86_64 0:1.7.1-9.el6_9

Dependency Installed:
  perl-Error.noarch 1:0.17015-4.el6          perl-Git.noarch 0:1.7.1-9.el6_9

Complete!
[hadoop@hadoop000 scala-2.11.7]$
```
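A quick check that the installation worked; a small sketch (the version string matches the git package installed above):

```
[hadoop@hadoop000 ~]$ git --version
git version 1.7.1
```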
三、Start the build
1、Before starting, ask yourself: why compile at all? Doesn't the official site already provide Spark packages?
The official site does provide Spark packages pre-built against a few Hadoop versions, but production environments differ, and the pre-built packages usually cannot match them exactly. So we build a Spark package against the Hadoop version that is actually used in our own environment.
2、Download the spark-2.2.0.tgz source package
http://spark.apache.org/downloads.html
Extract the downloaded package into the source directory under the hadoop user's home:
```
[hadoop@hadoop000 software]$ tar -zxvf spark-2.2.0.tgz -C ~/source/
```
After extraction you should have the Spark source tree at ~/source/spark-2.2.0.
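As a quick check, the source root should contain the build files; a minimal sketch (only a few representative entries of the Spark 2.2.0 source tree are listed):

```
[hadoop@hadoop000 source]$ ls spark-2.2.0
LICENSE  README.md  bin  core  dev  pom.xml  sql  ...
```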
3、According to the official docs (http://spark.apache.org/docs/latest/building-spark.html), the build command is:
```
./dev/make-distribution.sh --name custom-spark --pip --r --tgz -Psparkr -Phadoop-2.7 -Phive -Phive-thriftserver -Pmesos -Pyarn
```
Parameter notes:
--name: the name appended to the resulting Spark distribution package
--tgz: package the distribution as a .tgz archive
-Psparkr: build with R (SparkR) support
-Phadoop-2.x (e.g. -Phadoop-2.7): build with the corresponding Hadoop profile; the available profiles are listed in pom.xml in the source root
-Phive and -Phive-thriftserver: build with support for Hive (and the Hive Thrift server)
-Pmesos: build with support for running on Mesos
-Pyarn: build with support for running on YARN
We can then adapt the build to our own environment. For example, if our Hadoop version is 2.6.0-cdh5.7.0 and we need Spark to run on YARN and work with Hive, the build command becomes:
```
./dev/make-distribution.sh --name 2.6.0-cdh5.7.0 --tgz -Pyarn -Phadoop-2.6 -Phive -Phive-thriftserver -Dhadoop.version=2.6.0-cdh5.7.0
```
After the build succeeds, a spark-2.2.0-bin-2.6.0-cdh5.7.0.tgz package appears in the root of the Spark source tree, and that is the package we will use to install Spark.
4、Changes to make before building
a. In pom.xml in the spark-2.2.0 root directory, add the following <repository> entry inside the <repositories> section (around line 219):
```xml
<repository>
  <id>cloudera</id>
  <name>cloudera repository</name>
  <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
</repository>
```
The default repository URL in pom.xml is https://repo1.maven.org/maven2, i.e. Maven Central, which hosts the Apache Hadoop artifacts. Our build passes --name 2.6.0-cdh5.7.0 and -Dhadoop.version=2.6.0-cdh5.7.0 so that the resulting Spark package can be integrated into a CDH 5.7.0 cluster, but Maven Central does not host the CDH Hadoop artifacts (you can confirm this by browsing the URL yourself). Without the extra repository the build therefore fails; with it, whenever an artifact cannot be found in the default repository, Maven falls back to the Cloudera repository we added.
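If you want to confirm the Cloudera repository really serves the CDH artifacts before starting a long build, a quick check along these lines should work (the exact artifact path is an illustrative guess following the standard Maven repository layout):

```
# a 200 response means the CDH hadoop-client POM is available from the Cloudera repo
curl -I https://repository.cloudera.com/artifactory/cloudera-repos/org/apache/hadoop/hadoop-client/2.6.0-cdh5.7.0/hadoop-client-2.6.0-cdh5.7.0.pom
```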
b. Comment out lines 120-136 of ./dev/make-distribution.sh and add the following in their place:
```
VERSION=2.2.0
SCALA_VERSION=2.11
SPARK_HADOOP_VERSION=2.6.0-cdh5.7.0
SPARK_HIVE=1
```
Lines 120-136 of the script detect these values (the Spark version, Scala binary version, Hadoop version, and whether Hive is enabled) by invoking Maven. Commenting them out and hard-coding values we already know skips those Maven invocations and speeds up the build.
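For reference, the lines being commented out look roughly like this, each one shelling out to Maven's help:evaluate (paraphrased, not an exact copy of the Spark 2.2.0 script):

```
# roughly what lines 120-136 of ./dev/make-distribution.sh do before the edit
VERSION=$("$MVN" help:evaluate -Dexpression=project.version $@ 2>/dev/null | grep -v "INFO" | tail -n 1)
SCALA_VERSION=$("$MVN" help:evaluate -Dexpression=scala.binary.version $@ 2>/dev/null | grep -v "INFO" | tail -n 1)
SPARK_HADOOP_VERSION=$("$MVN" help:evaluate -Dexpression=hadoop.version $@ 2>/dev/null | grep -v "INFO" | tail -n 1)
SPARK_HIVE=$("$MVN" help:evaluate -Dexpression=project.activeProfiles -pl sql/hive $@ 2>/dev/null | grep -v "INFO" | fgrep --count "<id>hive</id>")
```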
5、Run the build
```
./dev/make-distribution.sh --name 2.6.0-cdh5.7.0 --tgz -Pyarn -Phadoop-2.6 -Phive -Phive-thriftserver -Dhadoop.version=2.6.0-cdh5.7.0
```
After a successful build there will be a spark-2.2.0-bin-2.6.0-cdh5.7.0.tgz file in the spark-2.2.0 root directory.
Note: during the build, downloading a particular dependency can stall for a long time because of network issues. You can press Ctrl+C to stop the build and re-run the build command; retrying a few times during the process is usually enough. If you have the option, building behind a VPN makes the whole process go much more smoothly.
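Once the tarball is produced, a hypothetical next step is to unpack it into the app directory used throughout this post and install Spark from it:

```
# unpack the freshly built distribution (paths follow the directory layout used above)
[hadoop@hadoop000 spark-2.2.0]$ tar -zxvf spark-2.2.0-bin-2.6.0-cdh5.7.0.tgz -C ~/app/
```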