A Resilient Distributed Dataset (RDD) is an immutable, fault-tolerant collection of elements that can be distributed across multiple cluster nodes to be processed in parallel. RDDs are the basic data structure within the open source data processing engine Apache Spark.
Spark was developed to address shortcomings in MapReduce, a programming model for “chunking” a large data processing task into smaller parallel tasks.
MapReduce can be slow and inefficient. It requires replication (maintaining multiple copies of data in different locations), serialization (converting data objects into a format that can be stored or transmitted) and heavy disk I/O (repeatedly reading intermediate results from and writing them to disk storage).
Spark was designed to cut out this overhead. Whereas MapReduce writes intermediate data to disk, Spark uses RDDs to cache and compute data in memory. The result is that Spark’s analytics engine can process data 10–100 times faster than MapReduce.1
Apache Spark is a fast, open source, large-scale data-processing engine often used for machine learning (ML) and artificial intelligence (AI) applications. Spark can be viewed as an improvement on Hadoop, and more specifically on Hadoop's native data processing framework, MapReduce.
Spark scales by distributing data-processing workflows across large clusters of computers, with built-in support for parallel computing on multiple nodes and fault tolerance.
It includes application programming interfaces (APIs) for common data science and data engineering programming languages, including Java™, Python (PySpark), Scala and R.
Spark uses RDDs to manage and process data. Each RDD is divided into logical partitions, which can be computed on different cluster nodes simultaneously. Users can perform 2 types of RDD operations: transformations and actions.
Spark performs transformations and actions on RDDs in memory—the key to Spark’s speed. Spark can also store data in memory or write the data to disk for added persistence.
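As a minimal sketch of the distinction, the hypothetical PySpark snippet below applies a transformation (map), which only describes a new RDD, and then an action (reduce), which triggers the actual computation; the application name and sample values are illustrative.

```python
from pyspark.sql import SparkSession

# Create a SparkSession; its SparkContext exposes the RDD API.
spark = SparkSession.builder.appName("rdd-basics").getOrCreate()
sc = spark.sparkContext

numbers = sc.parallelize([1, 2, 3, 4, 5])    # create an RDD from a local collection

squares = numbers.map(lambda x: x * x)       # transformation: defines a new RDD, nothing runs yet

total = squares.reduce(lambda a, b: a + b)   # action: triggers the computation and returns 55
print(total)
```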
As the name suggests, Resilient Distributed Datasets are both resilient and distributed. That means:
RDDs are called "resilient" because they track data lineage information so that lost data can be rebuilt if there is a failure, making RDDs highly fault-tolerant.
As an example of this resilience, consider an executor that fails while it is processing an RDD partition. The driver detects the failure and reassigns the partition's task to another executor, which rebuilds the lost data from the RDD's lineage.
RDDs are called "distributed" because they are split into smaller groups of data that can be distributed to different compute nodes and processed simultaneously.
In addition to these 2 core characteristics, RDDs have other features that contribute to their importance and operation in Spark.
Many data processing frameworks—and MapReduce in particular—must perform multiple read or write operations from external storage systems, slowing their performance. RDD helps Apache Spark solve this problem.
RDD reduces disk I/O by using in-memory computation that stores intermediate results from iterative operations in random access memory (RAM). Using in-memory computation and storage can support faster access and near real-time processing.
RDDs can also help speed up training time for machine learning algorithms and the processing of large-scale big data analytics. The use of in-memory computation can reduce the time required to access data storage.
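The sketch below, which again assumes the sc context from earlier, shows why in-memory reuse matters for iterative workloads: without cache(), the points RDD would be recomputed on every pass of the loop. The data and the loop are illustrative stand-ins for an iterative algorithm such as ML training.

```python
# Assumes `sc` from the earlier sketches.
points = sc.parallelize(range(1_000_000)).map(lambda x: x * 0.001)
points.cache()                      # keep the computed values in memory across iterations

running_sum = 0.0
for _ in range(10):                 # stand-in for an iterative algorithm
    running_sum += points.sum()     # each pass reuses the cached partitions instead of recomputing them
```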
In Spark, all transformations—operations applied to create a new RDD—are “lazy,” which is to say that the data is not loaded or computed immediately.
Instead, transformations are tracked in a directed acyclic graph (DAG) and run only when there is a specific call to action for a driver program.
The driver program directs the primary function and operations for cluster computing on Spark jobs, such as aggregation, collecting, counting or saving output to a file system.
Dozens of transformations and actions are available, including aggregateByKey, countByKey, flatMap, groupByKey, reduceByKey and sortByKey.
Lazy evaluation helps optimize data processing pipelines by letting Spark plan the whole job before running it and skip unnecessary computations.
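The following sketch, assuming sc and a hypothetical log file path, shows lazy evaluation in practice: the filter and map calls return immediately because they only extend the DAG, and work begins when the count action is called.

```python
# Assumes `sc`; the log file path is a placeholder.
lines = sc.textFile("logs/app.log")                  # no data is read yet
errors = lines.filter(lambda l: "ERROR" in l)        # transformation: added to the DAG
pairs = errors.map(lambda l: (l.split()[0], 1))      # transformation: still nothing executed

num_errors = errors.count()                          # action: Spark now turns the DAG into a job and runs it
```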
Spark automatically partitions RDDs across multiple nodes so that it can process huge volumes of data that would not fit on a single node. To help avoid corruption, each partition is stored on a single node rather than being split across multiple nodes.
RDD enables organizations to define the placement of compute partitions so that tasks can run close to the required data. This placement helps increase processing speed.
In addition, the number of executors (worker processes that run tasks assigned by the driver) in the cluster can be increased to enhance parallelism in the system. The level of parallelism in the output depends on the number of partitions in the parent RDD.
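As a rough illustration of controlling parallelism, assuming sc: the second argument to parallelize sets the initial number of partitions, and repartition changes it when more parallel tasks are wanted; derived RDDs inherit the parent's partition count.

```python
# Assumes `sc`.
data = sc.parallelize(range(100), 4)    # explicitly request 4 partitions
print(data.getNumPartitions())          # 4

wider = data.repartition(8)             # reshuffle into 8 partitions for more parallel tasks
print(wider.getNumPartitions())         # 8

doubled = wider.map(lambda x: x * 2)    # the derived RDD keeps the parent's partitioning
print(doubled.getNumPartitions())       # 8
```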
RDDs can be created in logical partitions across a cluster to enable parallel operations on several nodes. The RDDs can be created from various stable storage sources, such as Amazon Web Services (AWS) S3, Hadoop Distributed File System (HDFS), Apache HBase and Cassandra. They can also be created through programming languages such as Scala and Python.
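Hedged examples of these creation paths, assuming sc: the HDFS and S3 URIs are placeholders, and reading from S3 assumes the appropriate Hadoop connector is available on the cluster.

```python
# Assumes `sc`. Paths are placeholders for illustration only.
from_collection = sc.parallelize([("a", 1), ("b", 2)])           # from an in-memory collection

from_hdfs = sc.textFile("hdfs://namenode:8020/data/events.txt")  # from HDFS
from_s3 = sc.textFile("s3a://my-bucket/data/events.txt")         # from Amazon S3 via the s3a connector
```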
Spark can cache RDDs in memory across operations. Every node can store the partitions that it has computed in memory and reuse them for subsequent actions on the dataset or on datasets derived from it. This persistence can greatly speed up processing.
Spark also gives users control over how data is persisted: in memory, on disk or a mixture of both.
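A sketch of those storage choices, assuming sc: cache() is shorthand for in-memory persistence, while persist() accepts an explicit storage level such as MEMORY_AND_DISK, which spills partitions to disk when memory runs short.

```python
from pyspark import StorageLevel

# Assumes `sc`.
counts = sc.parallelize([("a", 1), ("b", 2), ("a", 3)]).reduceByKey(lambda x, y: x + y)

counts.cache()                                # in memory only (MEMORY_ONLY)
counts.unpersist()                            # release it before choosing another storage level
counts.persist(StorageLevel.MEMORY_AND_DISK)  # memory first, spill to disk if needed

print(counts.collect())
```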
RDDs are immutable, meaning they cannot be modified after creation. Immutability helps the data remain stable over time throughout multiple operations.
It also makes it easier and safer to share data across multiple processes, and it helps protect against corruption that can be caused by simultaneous updates from different threads.
While RDDs are immutable, users can create new RDDs by applying transformations to existing ones, which allows for datasets to be updated without altering the original data.
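A minimal sketch of immutability, assuming sc: the transformations below return new RDDs and leave the original untouched.

```python
# Assumes `sc`.
original = sc.parallelize([1, 2, 3, 4])

evens = original.filter(lambda x: x % 2 == 0)   # a new RDD; `original` is not modified
doubled = evens.map(lambda x: x * 2)            # another new RDD derived from `evens`

print(original.collect())   # [1, 2, 3, 4]  (unchanged)
print(doubled.collect())    # [4, 8]
```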
RDDs can process both unstructured and structured data. When processing unstructured data, information can be drawn from multiple types of databases, media streams or text files without the need for a fixed schema or a DataFrame.
That said, users can create DataFrames in Spark, which enables them to take advantage of certain optimizations for improved performance.
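A hedged sketch of that step, assuming the spark session and sc from the first example: an RDD of raw, schema-less text lines is parsed into rows and turned into a DataFrame so that Spark SQL's optimizations can apply. The record format and column names are illustrative.

```python
from pyspark.sql import Row

# Assumes `spark` and `sc` from the first sketch. Raw, schema-less text lines:
raw = sc.parallelize(["alice 34", "bob 29"])

# Parse each line into a Row, then create a DataFrame whose schema is inferred from the Rows.
rows = raw.map(lambda line: Row(name=line.split()[0], age=int(line.split()[1])))
people = spark.createDataFrame(rows)

people.show()
```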
1 Apache Spark™, Apache Software Foundation, 20 December 2024.