What is a Resilient Distributed Dataset (RDD)?


Authors

James Holdsworth, Content Writer

Matthew Kosinski, Staff Editor, IBM Think

A Resilient Distributed Dataset (RDD) is an immutable, fault-tolerant collection of elements that can be distributed across multiple cluster nodes to be processed in parallel. RDDs are the basic data structure within the open source data processing engine Apache Spark.

Spark was developed to address shortcomings in MapReduce, a programming model for “chunking” a large data processing task into smaller parallel tasks.

MapReduce can be slow and inefficient. It requires replication (maintaining multiple copies of data in different locations), serialization (converting data structures into a format that can be written to disk or sent over the network) and intensive I/O (reading from and writing to disk storage).

Spark specifically reduces unnecessary processing. Whereas MapReduce writes intermediate data to disk, Spark uses RDDs to cache and compute data in memory. The result is that Spark’s analytics engine can process data 10–100 times faster than MapReduce.1

RDD and Apache Spark

Apache Spark is a fast, open source, large-scale data-processing engine often used for machine learning (ML) and artificial intelligence (AI) applications. Spark can be viewed as an improvement on Hadoop, and more specifically on Hadoop's native data processing framework, MapReduce.

Spark scales by distributing data-processing workflows across large clusters of computers, with built-in support for parallel computing on multiple nodes and fault tolerance.

It includes application programming interfaces (APIs) for common data science and data engineering programming languages, including Java™, Python (PySpark), Scala and R.

Spark uses RDDs to manage and process data. Each RDD is divided into logical partitions, which can be computed on different cluster nodes simultaneously. Users can perform 2 types of RDD operations: transformations and actions.

  • Transformations are operations that create a new RDD.

  • Actions instruct Spark to apply computation and pass the result back to the Spark driver, the process that manages Spark jobs.

Spark performs transformations and actions on RDDs in memory—the key to Spark’s speed. Spark can also store data in memory or write the data to disk for added persistence. 
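As a minimal PySpark sketch of the difference between the two operation types (assuming a local Spark session; the dataset and variable names are invented for illustration):

```python
from pyspark.sql import SparkSession

# Start a local Spark session for illustration; cluster configuration is omitted
spark = SparkSession.builder.appName("rdd-basics").getOrCreate()
sc = spark.sparkContext

# Create an RDD from an in-memory list (invented example data)
numbers = sc.parallelize([1, 2, 3, 4, 5])

# Transformations return new RDDs; nothing is computed yet
squares = numbers.map(lambda x: x * x)
evens = squares.filter(lambda x: x % 2 == 0)

# An action computes the result and returns it to the driver
print(evens.collect())  # [4, 16]

spark.stop()
```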



How RDD works

True to their name, Resilient Distributed Datasets are both resilient and distributed. That means:

Resilient

RDDs are called "resilient" because they track data lineage information so that lost data can be rebuilt if there is a failure, making RDDs highly fault-tolerant.

As an example of this data resilience, consider an executor that fails while processing an RDD partition. The driver detects the failure and reassigns that partition to a different executor, which rebuilds it from the recorded lineage.
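One way to inspect the lineage Spark records is the RDD's toDebugString method. A brief sketch, again assuming a local PySpark session:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-lineage").getOrCreate()
sc = spark.sparkContext

# Build an RDD through a chain of transformations
base = sc.parallelize(range(100))
derived = base.map(lambda x: (x % 10, x)).reduceByKey(lambda a, b: a + b)

# toDebugString returns the recorded lineage; if a partition is lost,
# Spark re-runs only the steps in this chain to rebuild it
print(derived.toDebugString().decode("utf-8"))

spark.stop()
```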

Distributed

RDDs are called "distributed" because they are split into smaller groups of data that can be distributed to different compute nodes and processed simultaneously.

In addition to these 2 core characteristics, RDDs have other features that contribute to their importance and operation in Spark.

In-memory computation

Many data processing frameworks—and MapReduce in particular—must perform multiple read or write operations from external storage systems, slowing their performance. RDD helps Apache Spark solve this problem.

RDD reduces disk I/O by using in-memory computation that stores intermediate results from iterative operations in random access memory (RAM). Using in-memory computation and storage can support faster access and near real-time processing.

RDDs can also help shorten training times for machine learning algorithms and speed the processing of large-scale big data analytics, because in-memory computation reduces the time required to access data storage.
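A sketch of how in-memory caching helps an iterative workload, assuming a local PySpark session (the points data and loop are invented for illustration):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-caching").getOrCreate()
sc = spark.sparkContext

# An RDD that an iterative algorithm will scan repeatedly (invented data)
points = sc.parallelize([(float(x), float(x) * 2.0) for x in range(10000)])
points.cache()  # keep computed partitions in RAM after the first pass

total = 0.0
for _ in range(5):
    # Each pass after the first reads cached partitions from memory
    # instead of recomputing them from the source
    total += points.map(lambda p: p[0] * p[1]).sum()

print(total)
spark.stop()
```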

Lazy evaluation

In Spark, all transformations—operations applied to create a new RDD—are “lazy,” which is to say that the data is not loaded or computed immediately.

Instead, transformations are tracked in a directed acyclic graph (DAG) and run only when the driver program calls an action.

The driver program coordinates the primary functions and operations for cluster computing on Spark jobs, such as aggregating, collecting, counting or saving output to a file system.

Spark offers dozens of transformations and actions, including aggregateByKey, countByKey, flatMap, groupByKey, reduceByKey and sortByKey.

Lazy evaluation helps optimize data processing pipelines by eliminating unnecessary computations.
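A short PySpark sketch of lazy evaluation in practice (local session and sample strings assumed):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-lazy").getOrCreate()
sc = spark.sparkContext

lines = sc.parallelize(["spark is fast", "rdds are lazy", "actions trigger work"])

# These transformations return immediately: Spark only records them in the DAG
words = lines.flatMap(lambda line: line.split())
pairs = words.map(lambda word: (word, 1))
counts = pairs.reduceByKey(lambda a, b: a + b)

# Only this action triggers execution of the entire recorded chain
print(counts.collect())

spark.stop()
```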

Partitioning

Spark automatically partitions RDDs across multiple nodes so that it can process huge volumes of data that would not fit on a single node. To help avoid corruption, each partition is stored whole on a single node rather than split across multiple nodes.

RDD enables organizations to define the placement of compute partitions so that tasks can run close to the required data. This placement helps increase processing speed.

In addition, the number of executors (worker processes that perform tasks assigned by the driver) in the cluster can be increased to enhance parallelism in the system. The level of parallelism in the output depends on the number of partitions in the parent RDD.

RDDs can be created in logical partitions across a cluster to enable parallel operations on several nodes. They can be created from various stable storage sources, such as Amazon Web Services (AWS) S3, Hadoop Distributed File System (HDFS), Apache HBase and Apache Cassandra, or programmatically through languages such as Scala and Python.
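A minimal PySpark sketch of controlling partition counts (the partition count and commented-out storage paths are illustrative assumptions):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-partitions").getOrCreate()
sc = spark.sparkContext

# Explicitly request 4 partitions; each can be processed by a
# different executor core in parallel
rdd = sc.parallelize(range(1000), numSlices=4)
print(rdd.getNumPartitions())  # 4

# RDDs can also be created from external storage (paths are hypothetical):
# logs = sc.textFile("hdfs://namenode:8020/data/logs.txt")  # HDFS
# logs = sc.textFile("s3a://my-bucket/data/logs.txt")       # Amazon S3

spark.stop()
```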


Persistence

Spark RDD can cache datasets in memory across operations. Every node can store the partitions that it has computed in memory and reuse them for subsequent actions on the dataset or resulting datasets. This persistence can greatly speed processing.

Spark also provides users with storage options, giving them control over how data is persisted. Data can be stored in memory, on disk or a mixture of both.
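A brief PySpark sketch of these storage options, using Spark's built-in StorageLevel settings (local session assumed):

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-persist").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(1000)).map(lambda x: x * x)

# Choose where cached partitions live (one level per RDD at a time):
rdd.persist(StorageLevel.MEMORY_ONLY)        # RAM only, same as cache()
# rdd.persist(StorageLevel.MEMORY_AND_DISK)  # spill to disk if RAM fills up
# rdd.persist(StorageLevel.DISK_ONLY)        # disk only

print(rdd.count())  # the first action computes and caches the partitions
print(rdd.sum())    # later actions reuse the cached partitions

rdd.unpersist()     # release the cached data when it is no longer needed
spark.stop()
```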

Immutability

RDDs are immutable, meaning they cannot be modified after creation. Immutability helps the data remain stable over time throughout multiple operations.

It also makes it easier and safer to share data across multiple processes, and it helps protect against corruption that can be caused by simultaneous updates from different threads. 

While RDDs are immutable, users can create new RDDs by applying transformations to existing ones, which allows for datasets to be updated without altering the original data.
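A small PySpark sketch illustrating immutability (local session and sample data assumed):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-immutability").getOrCreate()
sc = spark.sparkContext

original = sc.parallelize([1, 2, 3])

# A transformation never modifies `original`; it returns a new RDD
doubled = original.map(lambda x: x * 2)

print(original.collect())  # [1, 2, 3] -- the source data is unchanged
print(doubled.collect())   # [2, 4, 6] -- a separate, derived dataset

spark.stop()
```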

Capacity for unstructured data

RDDs can process both structured and unstructured data. When processing unstructured data, information can be drawn from multiple types of databases, media streams or text files without a fixed schema or the need to create a DataFrame.

That said, users can create DataFrames in Spark, which enables them to take advantage of certain optimizations for improved performance.
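A short PySpark sketch of working with schema-less text and then optionally imposing structure as a DataFrame (local session and sample strings assumed):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-unstructured").getOrCreate()
sc = spark.sparkContext

# Raw, schema-less text; in practice this might come from a file,
# for example sc.textFile("logs.txt") (hypothetical path)
lines = sc.parallelize(["error disk full", "info job done", "error timeout"])

# Process the unstructured text directly with RDD operations
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

# Optionally impose structure later to unlock DataFrame optimizations
df = spark.createDataFrame(counts, ["word", "count"])
df.show()

spark.stop()
```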

Footnote

1 Apache Spark™, Apache Software Foundation, 20 December 2024.