Learning Spark
Spark basic concepts:
1. RDD (Resilient Distributed Dataset): Spark's fundamental data abstraction.
2. Task: two kinds, ShuffleMapTask and ResultTask (roughly analogous to map and reduce).
3. Job: a job is made up of multiple tasks.
4. Stage: a job can be divided into multiple stages.
5. Partition: an RDD is split into partitions that can be distributed across different machines.
6. NarrowDependency: each partition of the child RDD depends on a small number of partitions of the parent RDD. Narrow dependencies allow pipelined execution.
7. ShuffleDependency (also called wide dependency): each partition of the child RDD may depend on all partitions of the parent RDD.
8. DAG: Directed Acyclic Graph; no parent RDD depends on a child RDD, so there are no cycles.
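The narrow vs. wide dependency distinction above can be seen directly in RDD lineage. A minimal sketch (assuming a local Spark installation; the object and RDD names are illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// map() creates a narrow dependency: each child partition reads exactly one
// parent partition, so the two steps are pipelined in one stage.
// reduceByKey() creates a wide (shuffle) dependency: each child partition may
// read from all parent partitions, which introduces a stage boundary.
object DependencyDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("DependencyDemo")
    val sc = new SparkContext(conf)

    val nums = sc.parallelize(1 to 10, numSlices = 2)

    val doubled  = nums.map(_ * 2)                            // narrow: no shuffle
    val byParity = doubled.map(n => (n % 2, n)).reduceByKey(_ + _) // wide: shuffle

    // toDebugString prints the lineage and shows the shuffle stage boundary
    println(byParity.toDebugString)
    sc.stop()
  }
}
```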
Spark Core functions:
1. SparkContext: a driver application must initialize a SparkContext before submitting Spark jobs. SparkContext provides:
   1) communication 2) distributed deployment 3) messaging 4) storage 5) computation 6) cache 7) measurement system 8) file service 9) web service
   An application uses the SparkContext API to create jobs. The DAGScheduler arranges the RDDs of the DAG into stages and submits those stages; the TaskScheduler applies for resources, submits tasks, and requests scheduling from the cluster.
2. Storage System: Spark uses memory first; if memory is insufficient, it falls back to disk or Tachyon (a distributed in-memory file system).
3. Computation Engine
4. Deployment modes: 1) Standalone 2) YARN 3) Mesos
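The memory-first storage behavior described above corresponds to the MEMORY_AND_DISK storage level: partitions are cached in memory and spill to disk only when memory is insufficient. A small sketch (assuming a local Spark setup; names are illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object StorageDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[*]").setAppName("StorageDemo"))

    val data = sc.parallelize(1 to 1000000)
    data.persist(StorageLevel.MEMORY_AND_DISK) // memory first, disk as fallback

    println(data.count()) // first action computes and caches the RDD
    println(data.count()) // second action is served from the cache
    sc.stop()
  }
}
```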
Tuning Spark:
1. Data Serialization:
   1) Java serialization (object --> bytes --> object)
   2) Kryo serialization, often around 10x faster and more compact than Java serialization:

   val conf = new SparkConf().setMaster(...).setAppName(...)
   conf.registerKryoClasses(Array(classOf[MyClass1], classOf[MyClass2]))
   val sc = new SparkContext(conf)

2. Memory Tuning:
   1) object header: 16 bytes
   2) String: about 40 bytes of overhead beyond the raw characters
   3) common collection classes such as HashMap or LinkedList add pointer overhead (about 8 bytes per entry)
   4) collections of primitive types often store them as "boxed" objects such as java.lang.Integer
3. Memory management overview:
   1) Memory usage in Spark largely falls into one of two categories: execution and storage.
      a) Execution memory is used for computation in shuffles, joins, sorts, and aggregations.
      b) Storage memory is used for caching and for propagating internal data across the cluster.
   2) M/R:
      a) When no execution memory is used, storage can acquire all of the available memory (M), and vice versa.
      b) R is a subregion within M where cached blocks are never evicted.
   3) This design ensures several desirable properties:
      a) Applications that do not use caching can use the entire space for execution, avoiding unnecessary disk spills.
      b) Applications that do use caching can reserve a minimum storage space (R) where their data blocks are immune to eviction.
      c) It provides reasonable out-of-the-box performance for a variety of workloads without requiring users to understand how memory is divided internally.
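The M and R regions above map to real Spark configuration keys: spark.memory.fraction controls M (default 0.6 of heap minus ~300MB reserved) and spark.memory.storageFraction controls R within M (default 0.5). A hedged sketch combining this with the Kryo setting; the values shown are the defaults, not tuning recommendations:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object MemoryTuningDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("local[*]")
      .setAppName("MemoryTuningDemo")
      .set("spark.memory.fraction", "0.6")        // size of M
      .set("spark.memory.storageFraction", "0.5") // size of R within M
      // Kryo serializer, as in the registerKryoClasses snippet above
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

    val sc = new SparkContext(conf)
    println(sc.getConf.get("spark.memory.fraction"))
    sc.stop()
  }
}
```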