Spark-2.x Compilation, Build, Configuration, and Installation
Source: Internet | Editor: 程序博客网 | Date: 2024/06/03 16:18
0. Spark-2.x Build Environment Preparation
Build server: ip; build directory: /data10/spark/
1. Compiling Spark-2.x
a. Note: increase Maven's heap size to avoid OOM errors during compilation:
export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
b. Compile:
cd $SPARK_HOME   (the Spark source directory)
mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phadoop-provided -Phive -Phive-thriftserver -Pnetlib-lgpl -DskipTests clean package
2. Building the Spark-2.x Distribution
a. After a successful compile, build the Spark runtime distribution (note: the command differs slightly between spark-2.0.0 and spark-2.0.1):
spark-2.0.0:
./dev/make-distribution.sh --name dev --tgz -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phadoop-provided -Phive -Phive-thriftserver -Pnetlib-lgpl
spark-2.0.1:
./dev/make-distribution.sh --name custom-spark --tgz -Psparkr -Phadoop-2.4 -Phive -Phive-thriftserver -Pyarn
b. On success, the runtime environment is generated under $SPARK_HOME/dist.
c. Rename it (mv $SPARK_HOME/dist $SPARK_HOME/spark-2.x.x-bin); $SPARK_HOME/spark-2.x.x-bin then serves as the release.
3. Configuring Spark-2.x
a. Edit spark-2.x.x-bin/conf/spark-defaults.conf. Reference configuration below.
Note:
1. The value of spark.yarn.archive is a directory whose jars must match the contents of spark-2.x.x-bin/jars; that is, upload spark-2.x.x-bin/jars to a designated HDFS directory such as hdfs://ns1/spark/jars/spark2.x.x_jars.
2. After releasing a new version, be sure to update the value of spark.yarn.archive; the other settings can follow the spark-2.0.0 configuration.
spark-defaults.conf:
##################### common for yarn #####################
spark.yarn.archive                        hdfs://ns1/spark/jars/spark2.0.1_jars
spark.yarn.historyServer.address          yz724.hadoop.data.sina.com.cn:18080
spark.eventLog.enabled                    true
spark.eventLog.dir                        hdfs://ns1/spark/logs
spark.yarn.report.interval                3000
spark.yarn.maxAppAttempts                 2
spark.yarn.submit.file.replication        10
spark.rdd.compress                        true
spark.dynamicAllocation.enabled           false
spark.ui.port                             4050
spark.kryoserializer.buffer.max           128m
spark.task.maxFailures                    10
### common shuffle ###
spark.shuffle.service.enabled             false
spark.shuffle.io.maxRetries               20
spark.shuffle.io.retryWait                5
##################### driver #####################
spark.driver.extraJavaOptions=-XX:+UseCompressedOops -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:+CMSParallelRemarkEnabled -XX:+CMSScavengeBeforeRemark -XX:MaxDirectMemorySize=1g -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
spark.driver.extraLibraryPath=/usr/local/hadoop-2.4.0/lib/native
spark.driver.maxResultSize=1g
#spark.jars=/data0/spark/spark-2.0.0-bin/jars/javax.servlet-api-3.1.0.jar
##################### executor #####################
spark.executor.extraJavaOptions=-XX:+UseCompressedOops -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:+CMSParallelRemarkEnabled -XX:+CMSScavengeBeforeRemark -XX:MaxDirectMemorySize=300m -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
spark.executor.extraLibraryPath           /usr/local/hadoop-2.4.0/lib/native
spark.yarn.executor.memoryOverhead        400
spark.executor.logs.rolling.strategy      time
spark.executor.logs.rolling.time.interval daily
spark.executor.logs.rolling.maxRetainedFiles 7
##################### spark sql #####################
spark.sql.hive.metastorePartitionPruning  true
##################### spark streaming #####################
spark.streaming.kafka.maxRetries          20
##################### spark tasks #####################
spark.scheduler.executorTaskBlacklistTime 100000
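The jar upload described in the note above (publishing spark-2.x.x-bin/jars to the spark.yarn.archive directory) can be sketched as a small shell script. The paths are the example values used in this document; by default the script only prints the commands (RUN=echo) so they can be reviewed before running against a real cluster with an HDFS client:

```shell
#!/bin/sh
# Publish the distribution's jars to HDFS for spark.yarn.archive.
# Paths below are this document's example values -- adjust for your cluster.
SPARK_BIN=${SPARK_BIN:-/data0/spark/spark-2.0.1-bin}
ARCHIVE_DIR=${ARCHIVE_DIR:-hdfs://ns1/spark/jars/spark2.0.1_jars}

# RUN=echo (the default here) prints each command instead of executing it;
# set RUN= on a machine with an HDFS client to actually upload.
RUN=${RUN:-echo}

$RUN hdfs dfs -mkdir -p "$ARCHIVE_DIR"
$RUN hdfs dfs -put -f "$SPARK_BIN"/jars/*.jar "$ARCHIVE_DIR"/
```

Remember that every release must go to a fresh directory (and spark.yarn.archive must be updated to point at it), per note 2 above.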
b. Edit spark-2.x.x-bin/conf/spark-env.sh. Reference configuration:
HADOOP_CONF_DIR=/usr/local/hadoop-2.4.0/etc/hadoop
SPARK_PRINT_LAUNCH_COMMAND=true
LD_LIBRARY_PATH=/usr/local/hadoop-2.4.0/lib/native
c. Copy /usr/local/hive-0.13.0/conf/hive-site.xml to spark-2.x.x-bin/conf/; otherwise you will hit an exception such as:
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException): Permission denied: user=chaochao1, access=WRITE, inode="/tmp":hadoop:supergroup:drwxr-xr-x
d. Edit spark-2.x.x-bin/conf/hive-site.xml as follows.
Before:
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>16B78E6FED30A530</value>
  <description>password to use against metastore database</description>
</property>
After:
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hive2</value>
  <description>password to use against metastore database</description>
</property>
Note: without this change you will see an exception such as:
java.sql.SQLException: Access denied for user 'hive2'@'10.39.3.142' (using password: YES)
e. Configure the MySQL JDBC driver; otherwise an exception occurs:
1. Copy mysql-connector-java-5.1.15-bin.jar to spark-2.x.x-bin/jars/
2. Upload mysql-connector-java-5.1.15-bin.jar to the directory configured in spark.yarn.archive
The exception:
Caused by: org.datanucleus.store.rdbms.connectionpool.DatastoreDriverNotFoundException: The specified datastore driver ("com.mysql.jdbc.Driver") was not found in the CLASSPATH. Please check your CLASSPATH specification, and the name of the driver.
    at org.datanucleus.store.rdbms.connectionpool.AbstractConnectionPoolFactory.loadDriver(AbstractConnectionPoolFactory.java:58)
    at org.datanucleus.store.rdbms.connectionpool.BoneCPConnectionPoolFactory.createConnectionPool(BoneCPConnectionPoolFactory.java:54)
    at org.datanucleus.store.rdbms.ConnectionFactoryImpl.generateDataSources(ConnectionFactoryImpl.java:238)
    ... 88 more
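The two driver-installation steps above can be sketched as follows. The jar name and paths are the examples given in this document; as in the earlier sketch, the script is a dry run by default (RUN=echo):

```shell
#!/bin/sh
# Install the MySQL JDBC driver into the local jars directory and into the
# spark.yarn.archive directory on HDFS. Paths are this document's examples.
JAR=${JAR:-mysql-connector-java-5.1.15-bin.jar}
SPARK_BIN=${SPARK_BIN:-/data0/spark/spark-2.0.1-bin}
ARCHIVE_DIR=${ARCHIVE_DIR:-hdfs://ns1/spark/jars/spark2.0.1_jars}

RUN=${RUN:-echo}   # echo = dry run; set RUN= to execute for real

$RUN cp -f "$JAR" "$SPARK_BIN/jars/"
$RUN hdfs dfs -put -f "$JAR" "$ARCHIVE_DIR/"
```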
4. Installing Spark-2.x
a. Upgrade the JDK to 1.7.
b. Add the following to /usr/local/hadoop-2.4.0/etc/hadoop/yarn-site.xml; otherwise the Spark UI cannot be displayed:
<property>
  <name>yarn.resourcemanager.webapp.address.rm1</name>
  <value>rm1.hadoop.data.sina.com.cn:9008</value>
  <final>true</final>
</property>
The rm2 entry is optional:
<property>
  <name>yarn.resourcemanager.webapp.address.rm2</name>
  <value>rm2.hadoop.data.sina.com.cn:9008</value>
  <final>true</final>
</property>
c. Update the Hive jars that Spark depends on.
Since version 1.5.0, Spark bundles Hive 1.2.1 by default, and starting with Hive 0.14 the permission applied to the scratchdir changed (from 755 to 733). So if hive-0.13 is already deployed in production, submitting HiveQL through Spark 1.6 raises permission-mismatch exceptions. The current workaround is to modify Hive's org.apache.hadoop.hive.ql.session.SessionState class (community 1.2.1 version) so that createRootHDFSDir assigns 755 permissions, preserving backward compatibility, and then copy the recompiled SessionState class into Spark's assembly and examples jars. Because the hive-1.2.1 that spark-2.0.0 depends on has already been patched this way:
1. Replace the Hive-related jars under spark-2.x.x-bin/jars with the Hive jars from 10.39.3.142:/data0/spark/spark-2.0.0/jars/
2. Replace the Hive-related jars in the directory configured in spark.yarn.archive with the Hive jars from 10.39.3.142:/data0/spark/spark-2.0.0/jars/
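The two replacement steps above can be sketched as a script that pulls the patched jars from the build host and pushes them to both locations. The host and directories are this document's example values; the hive-*.jar glob is an assumption about how the patched jars are named, and the script is a dry run by default (RUN=echo):

```shell
#!/bin/sh
# Replace the Hive jars in the new distribution and in the spark.yarn.archive
# directory with the patched ones shipped alongside spark-2.0.0.
PATCHED_HOST=${PATCHED_HOST:-10.39.3.142}
PATCHED_DIR=${PATCHED_DIR:-/data0/spark/spark-2.0.0/jars}
TARGET_DIR=${TARGET_DIR:-/data0/spark/spark-2.0.1-bin/jars}
ARCHIVE_DIR=${ARCHIVE_DIR:-hdfs://ns1/spark/jars/spark2.0.1_jars}

RUN=${RUN:-echo}   # echo = dry run; set RUN= to execute for real

# Pull the patched hive-*.jar files from the build host, then push them to HDFS.
$RUN scp "$PATCHED_HOST:$PATCHED_DIR/hive-*.jar" "$TARGET_DIR/"
$RUN hdfs dfs -put -f "$TARGET_DIR"/hive-*.jar "$ARCHIVE_DIR/"
```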
d. rsync the configured spark-2.x.x-bin to /data0/spark/ on the corresponding servers.
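The rsync deployment can be sketched as a loop over the target servers. The server list is a placeholder (the document does not name the hosts); as with the earlier sketches, it prints the commands by default (RUN=echo):

```shell
#!/bin/sh
# Push the configured distribution to /data0/spark/ on each target server.
# SERVERS is a placeholder -- substitute your own hostnames.
SPARK_BIN=${SPARK_BIN:-/data0/spark/spark-2.0.1-bin}
SERVERS=${SERVERS:-"server1 server2"}

RUN=${RUN:-echo}   # echo = dry run; set RUN= to execute for real

for host in $SERVERS; do
  $RUN rsync -az --delete "$SPARK_BIN" "$host:/data0/spark/"
done
```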
5. References
a. The existing configuration under /data0/spark/spark-2.0.0
b. Official build documentation: http://spark.apache.org/docs/latest/building-spark.html
c. Spark build configuration