Properties of RDDs
Source: Internet  Editor: 程序博客网  Time: 2024/06/03 18:47
RDD: Resilient Distributed Dataset
Properties:
- Immutable
- Lazily evaluated
- Cacheable
- Type inferred
What is immutability?
- Once created, the data never changes
- Big data is immutable by nature
- Immutability helps with: (1) parallelization; (2) caching
Why is big data immutable?
- Parallelism comes for free: no locking is needed
- Caching is safe: there is no need to worry about changes made elsewhere
- Immutability is about the value, not about the reference
Immutability in collections
- Changes are expressed as transformations, e.g. map
- A transformation creates a new copy of the collection and leaves the original intact
- Mutable collections, by contrast, are updated in place with a loop
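
The contrast between the two styles can be sketched in plain Python (not Spark). The function names doubled and double_in_place are made up for this illustration; a tuple stands in for an immutable collection and a list for a mutable one.

```python
def doubled(values):
    """Transformation: build a new tuple and leave the input untouched."""
    return tuple(v * 2 for v in values)


def double_in_place(values):
    """Loop-based update: overwrite each slot of a mutable list."""
    for i in range(len(values)):
        values[i] = values[i] * 2


original = (1, 2, 3)
result = doubled(original)   # original is still (1, 2, 3)

mutable = [1, 2, 3]
double_in_place(mutable)     # the old contents of mutable are gone
```

After the transformation both the old and the new values are available; after the in-place loop only the new values remain.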
Challenges of immutability
- Good for parallelism but costly in space
- Multiple transformations result in: (1) multiple copies of the data; (2) multiple passes over the data
- Multiple copies and passes mean poor performance
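
A small plain-Python sketch of the cost, assuming each transformation is evaluated eagerly. The variable names are illustrative only.

```python
source = (1, 2, 3, 4)

# Each eager step walks the whole input and materializes a full copy.
plus_one = [v + 1 for v in source]         # pass 1, intermediate copy 1
squared = [v * v for v in plus_one]        # pass 2, intermediate copy 2
large = [v for v in squared if v > 5]      # pass 3, final result

# Elements held only in intermediate copies that are thrown away afterwards.
intermediate_elements = len(plus_one) + len(squared)
```

Three transformations cost three passes and two throwaway copies, even though only the last list is wanted.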
Getting lazy to meet the challenges
- Do not compute transformations until the result is needed
- Defer evaluation
- Separate execution from evaluation
- Multiple transformations are combined into one
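
Python's built-in map and filter are lazy, so they can serve as a rough analogue of deferred evaluation (this is not Spark's execution model, only a sketch of the idea). The helper traced is hypothetical; it records every read of the source so the number of passes becomes visible.

```python
def traced(values, log):
    """Yield each value and record that the source was read once."""
    for v in values:
        log.append(v)
        yield v


source = (1, 2, 3, 4)
reads = []

# Nothing runs here: map and filter only describe the work to be done.
plus_one = map(lambda v: v + 1, traced(source, reads))
evens = filter(lambda v: v % 2 == 0, plus_one)
reads_before = len(reads)    # no element has been touched yet

# Evaluation happens only when a result is requested.
result = list(evens)
reads_after = len(reads)     # each source element was read exactly once
```

Both transformations run in a single pass over the source, and no intermediate collection is built.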
Laziness and immutability
- You can be lazy only if the underlying data is immutable
- You cannot combine transformations if a transformation has side effects
- Combining laziness and immutability gives better performance and enables distributed processing
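
Why laziness needs immutable data can be shown with a plain-Python sketch: a deferred computation over a mutable list sees whatever the list contains at evaluation time, not at definition time. The variable names are illustrative.

```python
# Lazy transformation over MUTABLE data.
mutable_source = [1, 2, 3]
lazy_doubled = map(lambda v: v * 2, mutable_source)   # nothing computed yet
mutable_source[0] = 100                               # changed before evaluation
surprising = list(lazy_doubled)                       # reflects the later change

# Lazy transformation over IMMUTABLE data.
immutable_source = (1, 2, 3)
lazy_doubled_safe = map(lambda v: v * 2, immutable_source)
# immutable_source[0] = 100 would raise TypeError, so the input cannot drift.
stable = list(lazy_doubled_safe)
```

With the tuple, the result depends only on the definition, so the evaluation can safely be postponed or moved elsewhere.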
Challenges of laziness: type inference
- Laziness poses challenges in terms of data types
- If laziness defers execution, determining the type of a variable becomes challenging
- If the right type cannot be determined, semantic errors can slip in
- Running big data programs and getting semantic errors is no fun
Type inference
- The part of the compiler that determines the type from the value
- Since all transformations are side-effect free, the type can be determined from the operation; e.g. v1.count() on an RDD is inferred as Long
- Every transformation has a specific return type; map over an array yields an array
- Type inference relieves you from thinking about the representation for many transformations
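
Python has no compiler-level type inference, so the following is only an approximation using type hints. Dataset and label are hypothetical names, not Spark classes; the assumption is that a static checker such as mypy would derive the types noted in the comments from the annotated signatures, without any transformation having to run.

```python
from typing import Callable, Generic, Tuple, TypeVar

T = TypeVar("T")
U = TypeVar("U")


class Dataset(Generic[T]):
    """Hypothetical immutable collection used only for this sketch."""

    def __init__(self, items: Tuple[T, ...]) -> None:
        self._items = items

    def map(self, f: Callable[[T], U]) -> "Dataset[U]":
        # The return type follows from the type of f alone.
        return Dataset(tuple(f(x) for x in self._items))

    def count(self) -> int:
        return len(self._items)

    def collect(self) -> Tuple[T, ...]:
        return self._items


def label(n: int) -> str:
    return f"item-{n}"


numbers = Dataset((1, 2, 3))    # expected to be seen as Dataset[int]
labels = numbers.map(label)     # expected to be seen as Dataset[str]
total = labels.count()          # int, fixed by the signature of count
```

Because map and count are side-effect free, their result types depend only on their signatures, which is what makes the types knowable before execution.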
Caching
- Immutable data can be cached for a long time
- Lazy transformations allow data to be recreated on failure, from the lineage
- Transformations can be saved as well, as the lineage
- Caching data improves the performance of the execution engine
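
How caching and lineage fit together can be sketched with a toy class. LazyDataset, collect and drop_cache are hypothetical names chosen for this illustration and are not Spark's API; the sketch assumes the lineage is simply the ordered list of functions applied to an immutable source.

```python
class LazyDataset:
    """Hypothetical sketch: an immutable source plus a recorded lineage."""

    def __init__(self, source, lineage=()):
        self._source = tuple(source)      # immutable input
        self._lineage = tuple(lineage)    # functions to apply, in order
        self._cache = None
        self.computations = 0             # how often the lineage was replayed

    def map(self, f):
        # Returns a new dataset; only the lineage grows, no data is computed.
        return LazyDataset(self._source, self._lineage + (f,))

    def collect(self):
        # Replay the lineage only when no cached result is available.
        if self._cache is None:
            values = self._source
            for f in self._lineage:
                values = tuple(f(v) for v in values)
            self._cache = values
            self.computations += 1
        return self._cache

    def drop_cache(self):
        """Simulate losing the cached data, e.g. after a failure."""
        self._cache = None


ds = LazyDataset((1, 2, 3)).map(lambda v: v + 1).map(lambda v: v * 10)
first = ds.collect()        # computed by replaying the lineage
second = ds.collect()       # served from the cache
ds.drop_cache()
recovered = ds.collect()    # recomputed from the lineage after the loss
```

The cached result can be reused safely because the source never changes, and a lost cache costs only a replay of the recorded transformations.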
An RDD is a big collection of data with the above properties.