Spark Study Notes: A First Look
Source: Internet  Editor: 程序博客网  Time: 2024/06/04 19:08
1 Spark official site: http://spark.apache.org/
2 Version studied: 1.5.0
Spark architecture: reading the official documentation
Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext object in your main program (called the driver program).
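As in other distributed systems, a Spark application runs on the cluster as a set of independent processes. These processes are coordinated by the SparkContext object in the main program, and that main program is called the driver program.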
Specifically, to run on a cluster, the SparkContext can connect to several types of cluster managers (either Spark’s own standalone cluster manager, Mesos or YARN), which allocate resources across applications.
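There are several kinds of cluster manager (Spark's own standalone manager, Mesos, or YARN). Their main job is to allocate resources across applications, and the SparkContext must connect to one of them before it can run the application on a cluster.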
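To make the choice of cluster manager concrete, here is a minimal PySpark sketch of how the master URL selects the manager the SparkContext connects to. The helper names (`master_url`, `connect`), the host name, and the ports are placeholders of mine, not values from the official docs. The `yarn-client` master string is the Spark 1.x form; later releases changed how YARN is selected, so check this against the version you run.

```python
# Master URL forms for each cluster manager in Spark 1.x.
# Host names and ports are placeholders, not values from these notes.
MASTER_URL_TEMPLATES = {
    "local": "local[*]",                    # no cluster manager: run in-process on all cores
    "standalone": "spark://{host}:{port}",  # Spark's own standalone cluster manager
    "mesos": "mesos://{host}:{port}",       # Apache Mesos
    "yarn": "yarn-client",                  # YARN: cluster location is read from the Hadoop config
}


def master_url(manager, host="master.example.com", port=7077):
    """Return the master URL for the given cluster manager (hypothetical helper)."""
    return MASTER_URL_TEMPLATES[manager].format(host=host, port=port)


def connect(manager, app_name="first-look", **kwargs):
    """Create a SparkContext connected to the chosen cluster manager."""
    # Imported lazily so the helper above can be used without PySpark installed.
    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAppName(app_name).setMaster(master_url(manager, **kwargs))
    return SparkContext(conf=conf)
```

Calling `connect("local")` is the easiest way to try this on a single machine; the other managers need a running cluster at the given address.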
Once connected, Spark acquires executors on nodes in the cluster, which are processes that run computations and store data for your application. Next, it sends your application code (defined by JAR or Python files passed to SparkContext) to the executors. Finally, SparkContext sends tasks to the executors to run.
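Once connected, Spark acquires executors on the cluster nodes, the SparkContext sends the application code to those executors, and finally it sends them tasks to run.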
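The split between driver-side and executor-side code can be seen in a very small job. This is only a sketch: `square` and `sum_of_squares` are names I made up, and `sc` is assumed to be an already created SparkContext.

```python
def square(x):
    """Executor side: Spark serializes this function and ships it along with the tasks."""
    return x * x


def sum_of_squares(sc, numbers, num_slices=4):
    """Driver side: describe the computation, then trigger it with an action."""
    # parallelize splits the data into partitions; each partition becomes one task.
    rdd = sc.parallelize(numbers, num_slices)
    # map is lazy; reduce is the action that makes the driver send tasks to the executors.
    return rdd.map(square).reduce(lambda a, b: a + b)
```

Nothing runs on the executors until `reduce` is called; at that point the driver schedules the tasks and collects the partial results.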
Points to note:
1 Each application gets its own executor processes, which stay up for the duration of the whole application and run tasks in multiple threads. This has the benefit of isolating applications from each other, on both the scheduling side (each driver schedules its own tasks) and executor side (tasks from different applications run in different JVMs). However, it also means that data cannot be shared across different Spark applications (instances of SparkContext) without writing it to an external storage system.
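Different SparkContext instances (that is, different applications) cannot share data directly; the data has to be written to an external storage system first.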
2 Spark is agnostic to the underlying cluster manager. As long as it can acquire executor processes, and these communicate with each other, it is relatively easy to run it even on a cluster manager that also supports other applications (e.g. Mesos/YARN).
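Spark does not depend on any particular cluster manager. It runs well on managers such as Mesos or YARN as long as it can acquire executor processes that can communicate with each other.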
3 The driver program must listen for and accept incoming connections from its executors throughout its lifetime (e.g., see spark.driver.port and spark.fileserver.port in the network config section). As such, the driver program must be network addressable from the worker nodes.
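The driver keeps listening for connections from its executors until the application finishes, so for that whole time the driver must stay reachable over the network from the worker nodes.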
4 Because the driver schedules tasks on the cluster, it should be run close to the worker nodes, preferably on the same local area network. If you’d like to send requests to the cluster remotely, it’s better to open an RPC to the driver and have it submit operations from nearby than to run a driver far away from the worker nodes.
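It is best to run the driver close to the workers: the driver machine and the executor nodes should sit on the same local area network, physically near each other.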
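Notes 3 and 4 both come down to the workers being able to reach the driver. One way to make that easier to arrange is to pin the driver's address and ports so that firewall rules can be written for them. The sketch below assumes Spark 1.5 property names (spark.driver.port and spark.fileserver.port are the ones quoted above; spark.driver.host is the matching address setting). The function names and the port numbers are arbitrary choices of mine.

```python
def driver_network_settings(host, driver_port=51000, fileserver_port=51001):
    """Return Spark properties that fix where executors connect back to the driver.

    The port numbers are arbitrary examples; left unset, Spark 1.5 picks random ports.
    """
    return [
        ("spark.driver.host", host),                      # address the workers use to reach the driver
        ("spark.driver.port", str(driver_port)),          # driver RPC port
        ("spark.fileserver.port", str(fileserver_port)),  # driver's file server (Spark 1.x only)
    ]


def build_conf(app_name, master, host):
    """Build a SparkConf with the fixed driver network settings applied."""
    # Imported lazily so the settings helper can be used without PySpark installed.
    from pyspark import SparkConf

    conf = SparkConf().setAppName(app_name).setMaster(master)
    return conf.setAll(driver_network_settings(host))
```

The `host` value should be an address on the same network as the worker nodes, which is also what note 4 recommends for latency reasons.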