Spark Study Notes --- RDDs Explained


Original:


Resilient Distributed Datasets

A Resilient Distributed Dataset (RDD) is the fundamental data structure of Spark. It is an immutable distributed collection of objects. Each RDD is divided into logical partitions, which may be computed on different nodes of the cluster. RDDs can contain any type of Python, Java, or Scala objects, including user-defined classes.

Formally, an RDD is a read-only, partitioned collection of records. RDDs can be created through deterministic operations on data in stable storage or on other RDDs. An RDD is a fault-tolerant collection of elements that can be operated on in parallel.

There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared file system, HDFS, HBase, or any data source offering a Hadoop InputFormat.
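In Scala, a minimal sketch of both creation paths might look like the following; the SparkConf setup and the HDFS path are illustrative assumptions, not part of the original article.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddCreation {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("rdd-creation").setMaster("local[*]")
    val sc   = new SparkContext(conf)

    // Way 1: parallelize an existing collection in the driver program.
    val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5))

    // Way 2: reference a dataset in an external storage system; any source
    // offering a Hadoop InputFormat works. The HDFS path here is hypothetical.
    val lines = sc.textFile("hdfs://namenode:9000/data/input.txt")

    println(numbers.sum())  // an action triggers the actual computation
    sc.stop()
  }
}
```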

Spark uses the concept of RDDs to achieve faster and more efficient MapReduce-style operations. Let us first discuss how MapReduce operations take place and why they are not so efficient.


Data Sharing is Slow in MapReduce

MapReduce is widely adopted for processing and generating large datasets with a parallel, distributed algorithm on a cluster. It allows users to write parallel computations, using a set of high-level operators, without having to worry about work distribution and fault tolerance.

Unfortunately, in most current frameworks, the only way to reuse data between computations (e.g., between two MapReduce jobs) is to write it to an external stable storage system (e.g., HDFS). Although such frameworks provide numerous abstractions for accessing a cluster's computational resources, users still want more.

Both iterative and interactive applications require faster data sharing across parallel jobs. Data sharing is slow in MapReduce due to replication, serialization, and disk I/O. With respect to storage, most Hadoop applications spend more than 90% of their time performing HDFS read-write operations.


Iterative Operations on MapReduce

Multi-stage applications reuse intermediate results across multiple computations. The following illustration shows how the current framework works while performing iterative operations on MapReduce: each stage writes its intermediate results back to stable storage, which incurs substantial overhead from data replication, disk I/O, and serialization, making the system slow.

(Illustration: iterative operations on MapReduce)

Interactive Operations on MapReduce

The user runs ad-hoc queries on the same subset of data. Each query performs disk I/O against stable storage, which can dominate application execution time.

The following illustration shows how the current framework works while running interactive queries on MapReduce.

(Illustration: interactive operations on MapReduce)


Data Sharing using Spark RDD

As discussed above, data sharing is slow in MapReduce due to replication, serialization, and disk I/O; most Hadoop applications spend more than 90% of their time performing HDFS read-write operations.

Recognizing this problem, researchers developed a specialized framework called Apache Spark. The key idea of Spark is the Resilient Distributed Dataset (RDD), which supports in-memory computation. This means Spark keeps the state of a job in memory as an object, and that object can be shared across jobs. Data sharing in memory is 10 to 100 times faster than sharing over the network or through disk.

Let us now look at how iterative and interactive operations take place on Spark RDDs.


Iterative Operations on Spark RDD

The illustration below shows iterative operations on Spark RDDs. Spark stores intermediate results in distributed memory instead of stable storage (disk), which makes the system faster.

Note − If distributed memory (RAM) is not sufficient to store the intermediate results (the state of the job), Spark spills those results to disk.

(Illustration: iterative operations on Spark RDD)
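A minimal sketch of this iterative pattern, assuming an existing SparkContext named sc (the dataset and loop body are illustrative):

```scala
// Cache the input once in distributed memory; every later pass over it
// reads the partitions from RAM instead of re-reading stable storage.
val data = sc.parallelize(1 to 1000000).cache()

var threshold = 0L
for (_ <- 1 to 10) {
  // Each iteration reuses the in-memory copy of `data`; without cache(),
  // the lineage would be recomputed from stable storage on every pass.
  val above = data.filter(_ > threshold).count()
  threshold += above / 200000
}
println(s"final threshold: $threshold")
```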




Interactive Operations on Spark RDD

This illustration shows interactive operations on Spark RDDs. If different queries are run repeatedly on the same set of data, that data can be kept in memory for better execution times.

(Illustration: interactive operations on Spark RDD)
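A minimal sketch of the interactive pattern, again assuming an existing SparkContext sc; the log path and query strings are illustrative:

```scala
// Load once and keep in memory; each subsequent ad-hoc query hits RAM.
val logs = sc.textFile("hdfs://namenode:9000/logs/app.log").cache()

// Several different queries over the same cached dataset.
val errorCount   = logs.filter(_.contains("ERROR")).count()
val warningCount = logs.filter(_.contains("WARN")).count()
val totalLines   = logs.count()
```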




By default, each transformed RDD may be recomputed every time you run an action on it. However, you may also persist an RDD in memory, in which case Spark keeps the elements around on the cluster for much faster access the next time you query it. There is also support for persisting RDDs on disk, or replicating them across multiple nodes.
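A short sketch of the persistence API, assuming an existing SparkContext sc (the dataset is illustrative; the storage levels named in the comments are standard Spark levels):

```scala
import org.apache.spark.storage.StorageLevel

val events = sc.parallelize(1 to 100000)

// persist() with an explicit storage level; MEMORY_ONLY is what cache() uses.
events.persist(StorageLevel.MEMORY_ONLY)

// Other levels cover the variants mentioned above:
//   StorageLevel.MEMORY_AND_DISK - spill partitions that do not fit in RAM to disk
//   StorageLevel.DISK_ONLY       - keep partitions on disk only
//   StorageLevel.MEMORY_ONLY_2   - replicate each partition on two nodes

events.count()   // the first action materializes the persisted partitions
events.count()   // later actions read them back from the cache
```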




==============================================================================================================================

Translated summary:

The discussion is organized into the following parts:

Resilient Distributed Datasets

Data Sharing is Slow in MapReduce

Iterative Operations on MapReduce

Interactive Operations on MapReduce

Data Sharing using Spark RDD

Iterative Operations on Spark RDD

Interactive Operations on Spark RDD

=================================================================================================================================


Resilient Distributed Datasets

An RDD is Spark's fundamental structure for holding distributed data. Each RDD is split into logical partitions across the cluster's nodes and may contain objects of any Python, Java, or Scala type, including user-defined classes.


Data Sharing is Slow in MapReduce

This part explains why data sharing in MapReduce is limited in capability and slow in performance.


Iterative Operations on MapReduce

This part explains why iteration is slow, owing to bottlenecks such as disk I/O.



Interactive Operations on MapReduce

This part explains why interactive queries are also slow, owing to the way data is shared through replicated storage.


Data Sharing using Spark RDD

This part explains how performing parallel computation in memory makes data sharing far more effective.

Iterative Operations on Spark RDD

By the same token, this part explains that computing over data held in memory is much faster.

Interactive Operations on Spark RDD

This part makes two points: repeated queries can share the same data, and keeping that data in memory greatly improves efficiency.

