Properties of RDDs

Source: Internet. Editor: 程序博客网. Time: 2024/06/03 18:47
RDD: Resilient Distributed Dataset
  • Properties:
  1. Immutable
  2. Lazily evaluated
  3. Cacheable
  4. Type inferred

What is immutability?
  1. Once created, the data never changes
  2. Big data is immutable by nature
  3. Immutability helps with: (1) parallelization; (2) caching

Why is big data immutable?
  1. Parallelism comes for free: no locks are needed
  2. Caching is safe: there is no need to worry about changes made elsewhere
  3. Immutability is about values, not references

Immutability in collections
  1. Immutable collections use transformations to express change, e.g. map
  2. A transformation creates a new copy of the collection and leaves the original intact
  3. Mutable collections, by contrast, are updated in place with loops
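
The contrast between the two styles can be sketched in plain Python (not Spark); the variable names are made up for illustration:

```python
# Transformation style: build a new collection, leave the original intact.
original = (1, 2, 3, 4)                      # a tuple cannot be modified
doubled = tuple(map(lambda x: x * 2, original))

# Mutable style: a loop overwrites the elements in place.
mutable = [1, 2, 3, 4]
for i in range(len(mutable)):
    mutable[i] = mutable[i] * 2

print(original)  # (1, 2, 3, 4)
print(doubled)   # (2, 4, 6, 8)
print(mutable)   # [2, 4, 6, 8]
```

After the loop the old contents of `mutable` are gone, while `original` is still available to any other reader.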

Challenges of immutability
  1. Good for parallelism, but costly in space
  2. Multiple transformations result in: (1) multiple copies of the data; (2) multiple passes over the data
  3. Multiple copies and passes lead to poor performance

Getting lazy to meet the challenges
  1. Don't compute a transformation until its result is needed
  2. Defer evaluation
  3. Separate describing a computation from executing it
  4. Multiple transformations are combined into one
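
A minimal sketch of these points using Python generators, which are lazy in a similar spirit. This is not Spark code; the `traced` helper and the variable names are hypothetical and exist only to make the reads visible:

```python
def traced(source, log):
    """Yield items from source, recording each read in log."""
    for item in source:
        log.append(item)
        yield item

data = [1, 2, 3, 4, 5]
reads = []

# Nothing is computed here: each step only describes work to be done.
step1 = (x + 1 for x in traced(data, reads))
step2 = (x * 10 for x in step1)
step3 = (x for x in step2 if x > 30)

reads_before = list(reads)   # still empty, no element has been read yet

# Evaluation happens here. Each element flows through all three steps
# in a single pass, and no intermediate list is materialized.
result = list(step3)

print(reads_before)  # []
print(result)        # [40, 50, 60]
print(reads)         # [1, 2, 3, 4, 5]
```

The source is read exactly once even though three transformations were declared, which is the effect of combining them.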

Laziness and immutability
  1. You can be lazy only if the underlying data is immutable
  2. You cannot combine transformations if a transformation has side effects
  3. Combining laziness and immutability gives better performance and enables distributed processing

Challenges of laziness: type inference
  1. Laziness poses challenges in terms of data types
  2. If laziness defers execution, determining the type of a variable becomes challenging
  3. If we can't determine the right type, semantic errors can slip through
  4. Running a big data program only to hit a semantic error is no fun

Type inference
  1. The part of the compiler that determines a type from the value
  2. As all transformations are free of side effects, the type can be determined from the operation; e.g. the result of v1.count() is inferred as an integer type (Long for a Spark RDD)
  3. Every transformation has a specific return type; e.g. map over an array returns an array
  4. Type inference relieves you from thinking about the representation of the data across many transformations
Caching
  1. Immutable data can be cached for a long time
  2. Lazy transformations allow data to be recreated on failure, from its lineage
  3. The transformations themselves can also be saved, as lineage
  4. Caching data improves the performance of the execution engine
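
How lineage and caching fit together can be sketched in plain Python. The `Dataset` class below is a hypothetical stand-in, not the Spark API; it caches on first use for simplicity, whereas in Spark caching is requested explicitly:

```python
class Dataset:
    """Hypothetical stand-in for an RDD: stores lineage, not results."""

    def __init__(self, source, lineage=()):
        self._source = source            # immutable input, e.g. a tuple
        self._lineage = tuple(lineage)   # ordered functions to apply
        self._cache = None
        self.computations = 0            # times the lineage was replayed

    def map(self, f):
        # Returns a new Dataset; the current one is left unchanged.
        return Dataset(self._source, self._lineage + (f,))

    def _compute(self):
        self.computations += 1
        out = []
        for x in self._source:
            for f in self._lineage:      # all steps applied in one pass
                x = f(x)
            out.append(x)
        return tuple(out)

    def collect(self):
        if self._cache is None:
            self._cache = self._compute()
        return self._cache

    def lose_cache(self):
        """Simulate a failure that drops the cached data."""
        self._cache = None

ds = Dataset((1, 2, 3)).map(lambda x: x + 1).map(lambda x: x * x)
first = ds.collect()                 # computed from lineage
second = ds.collect()                # served from the cache
after_cache_hit = ds.computations    # 1
ds.lose_cache()
third = ds.collect()                 # recomputed from lineage

print(first)            # (4, 9, 16)
print(ds.computations)  # 2
```

The recomputed result equals the first one because the source is immutable and the functions have no side effects.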

An RDD is a large collection of data with the properties above.

