Spark Study Notes --- RDD Explained
Source: Internet  Editor: 程序博客网  Time: 2024/05/16 19:18
Original text:
Resilient Distributed Datasets
The Resilient Distributed Dataset (RDD) is the fundamental data structure of Spark. It is an immutable distributed collection of objects. Each RDD is divided into logical partitions, which may be computed on different nodes of the cluster. RDDs can contain any type of Python, Java, or Scala objects, including user-defined classes.
Formally, an RDD is a read-only, partitioned collection of records. RDDs can be created through deterministic operations on either data in stable storage or other RDDs. An RDD is a fault-tolerant collection of elements that can be operated on in parallel.
There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared file system, HDFS, HBase, or any data source offering a Hadoop InputFormat.
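To make the idea of a read-only, partitioned collection more concrete, here is a minimal single-machine sketch in plain Python. It is not the Spark API: `ToyRDD` and its methods are hypothetical names made up for illustration, a thread pool stands in for cluster nodes, and `map` runs eagerly here even though real Spark transformations are lazy.

```python
from concurrent.futures import ThreadPoolExecutor


class ToyRDD:
    """A toy, single-machine stand-in for an RDD (not the Spark API)."""

    def __init__(self, partitions):
        # Tuples keep each partition read-only after construction.
        self._partitions = tuple(tuple(p) for p in partitions)

    @classmethod
    def parallelize(cls, data, num_partitions):
        """Split an in-memory collection into roughly equal partitions."""
        data = list(data)
        size, extra = divmod(len(data), num_partitions)
        partitions, start = [], 0
        for i in range(num_partitions):
            end = start + size + (1 if i < extra else 0)
            partitions.append(data[start:end])
            start = end
        return cls(partitions)

    def map(self, fn):
        """Apply fn to every record and return a NEW ToyRDD."""
        def run(partition):
            return [fn(x) for x in partition]

        # Each partition is processed independently; threads stand in for nodes.
        with ThreadPoolExecutor() as pool:
            return ToyRDD(pool.map(run, self._partitions))

    def num_partitions(self):
        return len(self._partitions)

    def collect(self):
        """Gather all records from all partitions into one list."""
        return [x for p in self._partitions for x in p]


numbers = ToyRDD.parallelize(range(10), num_partitions=3)
squares = numbers.map(lambda x: x * x)

print(numbers.num_partitions())
print(squares.collect())
print(numbers.collect())  # the source collection is unchanged
```

The point of the sketch is that a transformation never modifies the collection it is applied to. It produces a new collection instead, and it is this property that allows partitions to be processed independently.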
Spark uses RDDs to make MapReduce-style operations faster and more efficient. Let us first discuss how MapReduce operations take place and why they are inefficient.
Data Sharing is Slow in MapReduce
MapReduce is widely adopted for processing and generating large datasets with a parallel, distributed algorithm on a cluster. It allows users to write parallel computations, using a set of high-level operators, without having to worry about work distribution and fault tolerance.
Unfortunately, in most current frameworks, the only way to reuse data between computations (for example, between two MapReduce jobs) is to write it to an external stable storage system (for example, HDFS). Although these frameworks provide numerous abstractions for accessing a cluster’s computational resources, users still need a more efficient way to share data between computations.
Both iterative and interactive applications require faster data sharing across parallel jobs. Data sharing is slow in MapReduce due to replication, serialization, and disk I/O. As a result, most Hadoop applications spend more than 90% of their time on HDFS read-write operations.
Iterative Operations on MapReduce
Multi-stage applications reuse intermediate results across multiple computations. The following illustration shows how the current framework handles iterative operations on MapReduce. Because every intermediate result goes through stable storage, this incurs substantial overhead from data replication, disk I/O, and serialization, which makes the system slow.
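The cost pattern described above can be sketched in plain Python. This is only a toy model under stated assumptions: `step` is a made-up stage, the local temporary directory stands in for HDFS, and `pickle` stands in for the serialization that a real MapReduce job would perform.

```python
import os
import pickle
import tempfile


def step(values):
    """One stage of a hypothetical iterative job: halve every value."""
    return [v / 2 for v in values]


def run_with_stable_storage(values, iterations, workdir):
    """Run the job so that every stage reads its input from disk
    and writes its output back to disk, as chained MapReduce jobs do."""
    path = os.path.join(workdir, "iter_0.pkl")
    with open(path, "wb") as f:
        pickle.dump(values, f)

    disk_round_trips = 0
    for i in range(iterations):
        with open(path, "rb") as f:      # disk read + deserialization
            current = pickle.load(f)
        result = step(current)
        path = os.path.join(workdir, f"iter_{i + 1}.pkl")
        with open(path, "wb") as f:      # serialization + disk write
            pickle.dump(result, f)
        disk_round_trips += 1

    with open(path, "rb") as f:
        return pickle.load(f), disk_round_trips


with tempfile.TemporaryDirectory() as workdir:
    final, trips = run_with_stable_storage([8.0, 16.0, 24.0], 3, workdir)

print(final)
print(trips)
```

Every iteration pays for one read and one write regardless of how small the actual computation is, which is the overhead that in-memory sharing is meant to avoid.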
Interactive Operations on MapReduce
A user runs ad-hoc queries on the same subset of data. Each query performs disk I/O against stable storage, which can dominate application execution time.
The following illustration explains how the current framework works while doing the interactive queries on MapReduce.
Data Sharing using Spark RDD
Data sharing is slow in MapReduce due to replication, serialization, and disk I/O. Most Hadoop applications spend more than 90% of their time on HDFS read-write operations.
Recognizing this problem, researchers developed a specialized framework called Apache Spark. The key idea of Spark is the Resilient Distributed Dataset (RDD), which supports in-memory computation. This means that Spark keeps state in memory as an object across jobs, and that object can be shared between those jobs. Data sharing in memory is 10 to 100 times faster than sharing over the network or through disk.
Let us now try to find out how iterative and interactive operations take place in Spark RDD.
Iterative Operations on Spark RDD
The illustration given below shows iterative operations on Spark RDD. Spark stores intermediate results in distributed memory instead of stable storage (disk), which makes the system faster.
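Note: If the distributed memory (RAM) is not sufficient to store the intermediate results (the state of the job), Spark stores those results on disk. For explicitly persisted RDDs, this depends on the chosen storage level (for example, MEMORY_AND_DISK).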
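The fallback from memory to disk can be illustrated with a small plain-Python sketch. It is a toy model, not Spark's storage code: `SpillingStore` is a hypothetical class, and memory capacity is counted in number of results rather than bytes to keep the example short.

```python
import os
import pickle
import tempfile


class SpillingStore:
    """Toy store: keeps up to max_in_memory results in RAM and
    spills any further results to files in spill_dir."""

    def __init__(self, max_in_memory, spill_dir):
        self.max_in_memory = max_in_memory
        self.spill_dir = spill_dir
        self.memory = {}     # key -> value held in RAM
        self.on_disk = {}    # key -> path of the spilled file

    def put(self, key, value):
        if key in self.memory or len(self.memory) < self.max_in_memory:
            self.memory[key] = value
            return
        # Memory budget exhausted: serialize the result to disk instead.
        path = os.path.join(self.spill_dir, f"{key}.pkl")
        with open(path, "wb") as f:
            pickle.dump(value, f)
        self.on_disk[key] = path

    def get(self, key):
        if key in self.memory:
            return self.memory[key]          # fast path
        with open(self.on_disk[key], "rb") as f:
            return pickle.load(f)            # slow path: read from disk


with tempfile.TemporaryDirectory() as spill_dir:
    store = SpillingStore(max_in_memory=2, spill_dir=spill_dir)
    for i in range(4):
        store.put(f"stage_{i}", [i] * 3)
    in_memory_keys = sorted(store.memory)
    spilled_keys = sorted(store.on_disk)
    last_result = store.get("stage_3")

print(in_memory_keys)
print(spilled_keys)
print(last_result)
```

Callers use the same `get` call either way. Only the access cost differs, so results that fit in memory stay fast and the overflow remains available.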
Interactive Operations on Spark RDD
This illustration shows interactive operations on Spark RDD. If different queries are run on the same set of data repeatedly, this particular data can be kept in memory for better execution times.
By default, each transformed RDD may be recomputed each time you run an action on it. However, you may also persist an RDD in memory, in which case Spark keeps the elements around on the cluster for much faster access the next time you query it. There is also support for persisting RDDs on disk or replicating them across multiple nodes.
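The difference between recomputing on every action and persisting can be shown with a toy lazy dataset in plain Python. This is a sketch of the idea, not Spark code: `LazyDataset`, `total`, and `compute_count` are hypothetical names, and only the method name `persist` is borrowed from the RDD API.

```python
class LazyDataset:
    """Toy lazy dataset: rebuilt on every action unless persisted."""

    def __init__(self, compute):
        self._compute = compute     # zero-argument function returning a list
        self._persist = False
        self._cached = None
        self.compute_count = 0      # how many times the data was (re)built

    def persist(self):
        """Ask for the data to be kept after it is first computed."""
        self._persist = True
        return self

    def _materialize(self):
        if self._persist and self._cached is not None:
            return self._cached     # reuse the in-memory copy
        self.compute_count += 1
        data = self._compute()
        if self._persist:
            self._cached = data
        return data

    # Two "actions": each one needs the full data.
    def count(self):
        return len(self._materialize())

    def total(self):
        return sum(self._materialize())


def expensive():
    """Stand-in for a costly chain of transformations."""
    return [x * x for x in range(1000)]


plain = LazyDataset(expensive)
plain.count()
plain.total()

cached = LazyDataset(expensive).persist()
cached.count()
cached.total()

print(plain.compute_count)   # 2: rebuilt for each action
print(cached.compute_count)  # 1: built once, then served from memory
```

With repeated queries over the same data, as in the interactive case above, the persisted variant pays the computation cost once and every later action reads the cached copy.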
==============================================================================================================================
Translation:
First, the material is covered in several parts:
Resilient Distributed Datasets
Data Sharing is Slow in MapReduce
Iterative Operations on MapReduce
Interactive Operations on MapReduce
Data Sharing using Spark RDD
Iterative Operations on Spark RDD
Resilient Distributed Datasets
Data Sharing is Slow in MapReduce
Iterative Operations on MapReduce
This section mainly explains that bottlenecks such as disk I/O make iteration slow.
Interactive Operations on MapReduce
This section explains that problems with data sharing and replica storage make interactive operations slow as well.
Data Sharing using Spark RDD
Iterative Operations on Spark RDD