Spark Learning Part 2: Worker Startup Flow
1. Startup Script
sbin/start-slaves.sh
# Launch the slaves
if [ "$SPARK_WORKER_INSTANCES" = "" ]; then
  exec "$sbin/slaves.sh" cd "$SPARK_HOME" \; "$sbin/start-slave.sh" 1 "spark://$SPARK_MASTER_IP:$SPARK_MASTER_PORT"
else
  if [ "$SPARK_WORKER_WEBUI_PORT" = "" ]; then
    SPARK_WORKER_WEBUI_PORT=8081
  fi
  for ((i=0; i<$SPARK_WORKER_INSTANCES; i++)); do
    "$sbin/slaves.sh" cd "$SPARK_HOME" \; "$sbin/start-slave.sh" $(( $i + 1 )) "spark://$SPARK_MASTER_IP:$SPARK_MASTER_PORT" --webui-port $(( $SPARK_WORKER_WEBUI_PORT + $i ))
  done
fi
Assume each node starts a single Worker. The command actually executed is:
exec "$sbin/slaves.sh" cd "$SPARK_HOME" \; "$sbin/start-slave.sh" 1 "spark://$SPARK_MASTER_IP:$SPARK_MASTER_PORT"
This statement has two parts:
(1)
exec "$sbin/slaves.sh" cd "$SPARK_HOME"
Log in to each worker server and cd to the SPARK_HOME directory.
(2)
"$sbin/start-slave.sh" 1 "spark://$SPARK_MASTER_IP:$SPARK_MASTER_PORT"
Run the sbin/start-slave.sh script on the worker server.
The argument "1" is the worker number, used to distinguish the log files of different worker instances, for example:
spark-xxx-org.apache.spark.deploy.worker.Worker-1-CentOS-02.out
spark-xxx-org.apache.spark.deploy.worker.Worker-1.pid
The "1" in "Worker-1" is that worker number.
This argument is not passed to the Worker class. The argument passed to the Worker class is:
spark://$SPARK_MASTER_IP:$SPARK_MASTER_PORT
2. Worker.main
def main(argStrings: Array[String]) {
  SignalLogger.register(log)
  val conf = new SparkConf
  val args = new WorkerArguments(argStrings, conf)
  val (actorSystem, _) = startSystemAndActor(args.host, args.port, args.webUiPort, args.cores,
    args.memory, args.masters, args.workDir)
  actorSystem.awaitTermination()
}
Responsibilities of the main function:
(1) Create a WorkerArguments object and initialize its members;
(2) Call startSystemAndActor to create an ActorSystem and start the Worker actor.
2.1. WorkerArguments
var cores = inferDefaultCores()
var memory = inferDefaultMemory()
(1) Compute the default number of cores;
(2) Compute the default memory size.
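The bodies of inferDefaultCores and inferDefaultMemory are not shown here. As a rough sketch of the idea only (the real WorkerArguments code uses reflection to support both Sun and IBM JVMs, so the details differ):

import java.lang.management.ManagementFactory
import com.sun.management.OperatingSystemMXBean

// Approximate sketch, not the actual Spark implementation.
def inferDefaultCores(): Int =
  Runtime.getRuntime.availableProcessors()  // default to all cores on the machine

def inferDefaultMemory(): Int = {
  val bean = ManagementFactory.getOperatingSystemMXBean.asInstanceOf[OperatingSystemMXBean]
  val totalMb = (bean.getTotalPhysicalMemorySize / 1024 / 1024).toInt
  // leave roughly 1 GB for the OS, but never report less than 512 MB
  math.max(totalMb - 1024, 512)
}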
parse(args.toList)
// This mutates the SparkConf, so all accesses to it must be made after this line
propertiesFile = Utils.loadDefaultSparkProperties(conf, propertiesFile)
(1) parse handles the command-line arguments passed in by the startup script (see the sketch below);
(2) loadDefaultSparkProperties loads Spark runtime properties from the properties file; the default file is spark-defaults.conf.
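parse walks the argument list recursively with pattern matching. The following is a minimal, self-contained sketch of that style rather than the actual WorkerArguments code; the option names and variables shown here are simplified assumptions:

// Simplified sketch of the recursive option parsing style used by WorkerArguments.parse.
var host = "localhost"
var port = 0
var cores = 1
var memoryMb = 1024
var masterUrl: String = null

def parse(args: List[String]): Unit = args match {
  case ("--host" | "-h") :: value :: tail => host = value; parse(tail)
  case ("--port" | "-p") :: value :: tail => port = value.toInt; parse(tail)
  case ("--cores" | "-c") :: value :: tail => cores = value.toInt; parse(tail)
  case ("--memory" | "-m") :: value :: tail => memoryMb = value.toInt; parse(tail)
  case value :: tail if value.startsWith("spark://") => masterUrl = value; parse(tail)
  case Nil => // done
  case _ => sys.error("Unrecognized option")
}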
2.2. startSystemAndActor
val (actorSystem, boundPort) = AkkaUtils.createActorSystem(systemName, host, port,
  conf = conf, securityManager = securityMgr)
val masterAkkaUrls = masterUrls.map(Master.toAkkaUrl(_, AkkaUtils.protocol(actorSystem)))
actorSystem.actorOf(Props(classOf[Worker], host, boundPort, webUiPort, cores, memory,
  masterAkkaUrls, systemName, actorName, workDir, conf, securityMgr), name = actorName)
(1) Create an ActorSystem via AkkaUtils.createActorSystem;
(2) Create and start the Worker actor.
3. Worker Actor
3.1. Key Data Members
val executors = new HashMap[String, ExecutorRunner]
val finishedExecutors = new HashMap[String, ExecutorRunner]
val drivers = new HashMap[String, DriverRunner]
val finishedDrivers = new HashMap[String, DriverRunner]
val appDirectories = new HashMap[String, Seq[String]]
val finishedApps = new HashSet[String]
3.2. Worker.preStart
createWorkDir()
context.system.eventStream.subscribe(self, classOf[RemotingLifecycleEvent])
shuffleService.startIfEnabled()
webUi = new WorkerWebUI(this, workDir, webUiPort)
webUi.bind()
registerWithMaster()
(1) Create the Worker node's work directory;
(2) Subscribe to RemotingLifecycleEvent events; it is a trait:
sealed trait RemotingLifecycleEvent extends Serializable {
  def logLevel: Logging.LogLevel
}
Of these events, the Worker only handles the DisassociatedEvent message (see the sketch after this list).
(3) Create and start the WorkerWebUI;
(4) Register with the Master: registerWithMaster calls tryRegisterAllMasters to send a registration message to the Master node(s).
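For reference, the DisassociatedEvent handling mentioned in (2) looks roughly like the following in the Worker's receive handler. This is a sketch based on the Spark 1.x source; the exact log text and helper name may differ by version:

case x: DisassociatedEvent if x.remoteAddress == masterAddress =>
  logInfo(s"$x Disassociated !")
  // mark the master connection as lost and trigger re-registration
  masterDisconnected()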
3.3. Worker.registerWithMaster
registrationRetryTimer match {
  case None =>
    registered = false
    tryRegisterAllMasters()
    connectionAttemptCount = 0
    registrationRetryTimer = Some {
      context.system.scheduler.schedule(INITIAL_REGISTRATION_RETRY_INTERVAL,
        INITIAL_REGISTRATION_RETRY_INTERVAL, self, ReregisterWithMaster)
    }
  case Some(_) =>
    logInfo("Not spawning another attempt to register with the master, since there is an" +
      " attempt scheduled already.")
}
(1) Call tryRegisterAllMasters to send a registration message to the Master;
(2) Create the registration retry timer, which periodically sends a ReregisterWithMaster message to the Worker actor itself.
3.3.1. Worker.tryRegisterAllMasters
for (masterAkkaUrl <- masterAkkaUrls) {
  logInfo("Connecting to master " + masterAkkaUrl + "...")
  val actor = context.actorSelection(masterAkkaUrl)
  actor ! RegisterWorker(workerId, host, port, cores, memory, webUi.boundPort, publicAddress)
}
(1) Create a remote reference to the Master actor;
(2) Send a RegisterWorker message to the Master; if registration succeeds, the Master replies with a RegisteredWorker message.
workerId is a string, defined as:
val workerId = generateWorkerId()
...
def generateWorkerId(): String = {
  "worker-%s-%s-%d".format(createDateFormat.format(new Date), host, port)
}
Format: worker-<timestamp>-<host>-<port>
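For illustration only, with an assumed date format, host, and port (none of these values come from the excerpt above), a generated id looks like this:

import java.text.SimpleDateFormat
import java.util.Date

// Assumed values, purely to show the shape of a worker id.
val createDateFormat = new SimpleDateFormat("yyyyMMddHHmmss")
val host = "192.168.1.102"
val port = 7078
val workerId = "worker-%s-%s-%d".format(createDateFormat.format(new Date), host, port)
// e.g. "worker-20150629170223-192.168.1.102-7078"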
3.4. Worker Message Handling
3.4.1. The RegisteredWorker Message
This message means the Worker has registered with the Master successfully; the main purpose of handling it is to start the heartbeat timer.
case RegisteredWorker(masterUrl, masterWebUiUrl) =>
  logInfo("Successfully registered with master " + masterUrl)
  registered = true
  changeMaster(masterUrl, masterWebUiUrl)
  context.system.scheduler.schedule(0 millis, HEARTBEAT_MILLIS millis, self, SendHeartbeat)
  if (CLEANUP_ENABLED) {
    logInfo(s"Worker cleanup enabled; old application directories will be deleted in: $workDir")
    context.system.scheduler.schedule(CLEANUP_INTERVAL_MILLIS millis,
      CLEANUP_INTERVAL_MILLIS millis, self, WorkDirCleanup)
  }
(1) Set the registration flag;
(2) Call the changeMaster method;
(3) Create the heartbeat timer, which periodically sends a SendHeartbeat message to the Worker actor itself.
3.4.1.1. Worker.changeMaster
// activeMasterUrl it's a valid Spark url since we receive it from master.
activeMasterUrl = url
activeMasterWebUiUrl = uiUrl
master = context.actorSelection(
  Master.toAkkaUrl(activeMasterUrl, AkkaUtils.protocol(context.system)))
masterAddress = Master.toAkkaAddress(activeMasterUrl, AkkaUtils.protocol(context.system))
connected = true
// Cancel any outstanding re-registration attempts because we found a new master
registrationRetryTimer.foreach(_.cancel())
registrationRetryTimer = None
Responsibilities:
(1) Create a remote reference to the Master and assign it to master;
(2) Set the connected flag to true;
(3) Cancel the registrationRetryTimer.
3.4.2. The SendHeartbeat Message
case SendHeartbeat =>
  if (connected) { master ! Heartbeat(workerId) }
While connected, the Worker sends a Heartbeat message to the Master.
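On the Master side, handling this Heartbeat essentially refreshes the worker's last-seen timestamp, which the Master later uses to detect dead workers. The following is a simplified sketch based on the Spark 1.x Master source (field names approximate), not the full handler:

case Heartbeat(workerId) =>
  idToWorker.get(workerId) match {
    case Some(workerInfo) =>
      // record when this worker was last heard from
      workerInfo.lastHeartbeat = System.currentTimeMillis()
    case None =>
      logWarning(s"Got heartbeat from unregistered worker $workerId")
  }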
3.4.3. The ReregisterWithMaster Message
case ReregisterWithMaster =>
  reregisterWithMaster()
Responsibilities of reregisterWithMaster (see the sketch at the end of this subsection):
(1) If registration has already succeeded, cancel the registrationRetryTimer;
(2) If registration has not succeeded yet, resend the RegisterWorker message to the Master; the initial number of retries is 6 and the maximum is 16.
// The first six attempts to reconnect are in shorter intervals (between 5 and 15 seconds)
// Afterwards, the next 10 attempts are between 30 and 90 seconds.
// A bit of randomness is introduced so that not all of the workers attempt to reconnect at
// the same time.
val INITIAL_REGISTRATION_RETRIES = 6
val TOTAL_REGISTRATION_RETRIES = INITIAL_REGISTRATION_RETRIES + 10
The first 6 attempts and the following 10 attempts use different retry intervals.
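Putting the two cases together, reregisterWithMaster behaves roughly as follows. This is a condensed sketch based on the Spark 1.x source (the real method also handles the case where an active master has already been identified), not a verbatim copy:

private def reregisterWithMaster(): Unit = {
  connectionAttemptCount += 1
  if (registered) {
    // case (1): already registered, stop retrying
    registrationRetryTimer.foreach(_.cancel())
    registrationRetryTimer = None
  } else if (connectionAttemptCount <= TOTAL_REGISTRATION_RETRIES) {
    // case (2): not registered yet, try again
    logInfo(s"Retrying connection to master (attempt # $connectionAttemptCount)")
    tryRegisterAllMasters()
    if (connectionAttemptCount == INITIAL_REGISTRATION_RETRIES) {
      // after the first 6 attempts, switch to the longer retry interval
      registrationRetryTimer.foreach(_.cancel())
      registrationRetryTimer = Some {
        context.system.scheduler.schedule(PROLONGED_REGISTRATION_RETRY_INTERVAL,
          PROLONGED_REGISTRATION_RETRY_INTERVAL, self, ReregisterWithMaster)
      }
    }
  } else {
    logError("All masters are unresponsive! Giving up.")
    System.exit(1)
  }
}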
4. Startup Complete
At this point the Worker node has finished starting and periodically sends heartbeats to the Master node. When an application is submitted via SparkSubmit, the Worker will receive a launch-Executor message from the Master, and the launched Executor then communicates with the Driver.