A Resilient Distributed Dataset (RDD) is an immutable, fault-tolerant collection of elements that can be distributed across multiple cluster nodes to be processed in parallel. RDDs are the basic data structure within the open source data processing engine Apache Spark.
Spark was developed to address shortcomings in MapReduce, a programming model for “chunking” a large data processing task into smaller parallel tasks.
MapReduce can be slow and inefficient. It requires replication (maintaining multiple copies of data in different locations), serialization (converting data objects into a format that can be stored or transmitted) and heavy disk I/O (repeatedly reading intermediate results from and writing them to disk storage).
Spark was designed to cut out this overhead. Whereas MapReduce writes intermediate data to disk, Spark uses RDDs to cache and compute data in memory. The result is that Spark’s analytics engine can process data 10–100 times faster than MapReduce.1
Apache Spark is a fast, open source, large-scale data-processing engine often used for machine learning (ML) and artificial intelligence (AI) applications. Spark can be viewed as an improvement on Hadoop, and more specifically on Hadoop's native data processing framework, MapReduce.
Spark scales by distributing data-processing workflows across large clusters of computers, with built-in support for parallel computing on multiple nodes and fault tolerance.
It includes application programming interfaces (APIs) for common data science and data engineering programming languages, including Java™, Python (PySpark), Scala and R.
Spark uses RDDs to manage and process data. Each RDD is divided into logical partitions, which can be computed on different cluster nodes simultaneously. Users can perform 2 types of RDD operations: transformations and actions.
Spark performs transformations and actions on RDDs in memory—the key to Spark’s speed. Spark can also store data in memory or write the data to disk for added persistence.
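As a minimal sketch of the distinction, the hypothetical PySpark snippet below applies a transformation (map), which only describes a new RDD, and then an action (reduce), which triggers the actual computation; the application name and sample values are illustrative.

```python
from pyspark.sql import SparkSession

# Create a SparkSession; its SparkContext exposes the RDD API.
spark = SparkSession.builder.appName("rdd-basics").getOrCreate()
sc = spark.sparkContext

numbers = sc.parallelize([1, 2, 3, 4, 5])    # create an RDD from a local collection

squares = numbers.map(lambda x: x * x)       # transformation: defines a new RDD, nothing runs yet

total = squares.reduce(lambda a, b: a + b)   # action: triggers the computation and returns 55
print(total)
```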
As the name suggests, Resilient Distributed Datasets are both resilient and distributed. That means:
RDDs are called "resilient" because they track data lineage information so that lost data can be rebuilt if there is a failure, making RDDs highly fault-tolerant.
As an example of this resilience, consider an executor that fails while it is processing an RDD partition. The driver detects the failure and reassigns the partition's task to another executor, which rebuilds the lost data from the RDD's lineage.
RDDs are called "distributed" because they are split into smaller groups of data that can be distributed to different compute nodes and processed simultaneously.
In addition to these 2 core characteristics, RDDs have other features that contribute to their importance and operation in Spark.
Many data processing frameworks—and MapReduce in particular—must perform multiple read or write operations from external storage systems, slowing their performance. RDD helps Apache Spark solve this problem.
RDD reduces disk I/O by using in-memory computation that stores intermediate results from iterative operations in random access memory (RAM). Using in-memory computation and storage can support faster access and near real-time processing.
RDDs can also help speed up training time for machine learning algorithms and the processing of large-scale big data analytics. The use of in-memory computation can reduce the time required to access data storage.
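The sketch below, which again assumes the sc context from earlier, shows why in-memory reuse matters for iterative workloads: without cache(), the points RDD would be recomputed on every pass of the loop. The data and the loop are illustrative stand-ins for an iterative algorithm such as ML training.

```python
# Assumes `sc` from the earlier sketches.
points = sc.parallelize(range(1_000_000)).map(lambda x: x * 0.001)
points.cache()                      # keep the computed values in memory across iterations

running_sum = 0.0
for _ in range(10):                 # stand-in for an iterative algorithm
    running_sum += points.sum()     # each pass reuses the cached partitions instead of recomputing them
```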
In Spark, all transformations—operations applied to create a new RDD—are “lazy,” which is to say that the data is not loaded or computed immediately.
Instead, transformations are tracked in a directed acyclic graph (DAG) and run only when there is a specific call to action for a driver program.
The driver program directs the primary function and operations for cluster computing on Spark jobs, such as aggregation, collecting, counting or saving output to a file system.
Dozens of transformations and actions are available, including aggregateByKey, countByKey, flatMap, groupByKey, reduceByKey and sortByKey.
Lazy evaluation helps optimize data processing pipelines by letting Spark plan the whole job before running it and skip unnecessary computations.
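The following sketch, assuming sc and a hypothetical log file path, shows lazy evaluation in practice: the filter and map calls return immediately because they only extend the DAG, and work begins when the count action is called.

```python
# Assumes `sc`; the log file path is a placeholder.
lines = sc.textFile("logs/app.log")                  # no data is read yet
errors = lines.filter(lambda l: "ERROR" in l)        # transformation: added to the DAG
pairs = errors.map(lambda l: (l.split()[0], 1))      # transformation: still nothing executed

num_errors = errors.count()                          # action: Spark now turns the DAG into a job and runs it
```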
Spark automatically partitions RDDs across multiple nodes so that it can process huge volumes of data that would not fit on a single node. To help avoid corruption, each partition is stored on a single node rather than being split across multiple nodes.
RDD enables organizations to define the placement of compute partitions so that tasks can run close to the required data. This placement helps increase processing speed.
In addition, the number of executors (worker processes that run tasks assigned by the driver) in the cluster can be increased to enhance parallelism in the system. The level of parallelism in the output depends on the number of partitions in the parent RDD.
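As a rough illustration of controlling parallelism, assuming sc: the second argument to parallelize sets the initial number of partitions, and repartition changes it when more parallel tasks are wanted; derived RDDs inherit the parent's partition count.

```python
# Assumes `sc`.
data = sc.parallelize(range(100), 4)    # explicitly request 4 partitions
print(data.getNumPartitions())          # 4

wider = data.repartition(8)             # reshuffle into 8 partitions for more parallel tasks
print(wider.getNumPartitions())         # 8

doubled = wider.map(lambda x: x * 2)    # the derived RDD keeps the parent's partitioning
print(doubled.getNumPartitions())       # 8
```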
RDDs can be created in logical partitions across a cluster to enable parallel operations on several nodes. The RDDs can be created from various stable storage sources, such as Amazon Web Services (AWS) S3, Hadoop Distributed File System (HDFS), Apache HBase and Cassandra. They can also be created through programming languages such as Scala and Python.
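Hedged examples of these creation paths, assuming sc: the HDFS and S3 URIs are placeholders, and reading from S3 assumes the appropriate Hadoop connector is available on the cluster.

```python
# Assumes `sc`. Paths are placeholders for illustration only.
from_collection = sc.parallelize([("a", 1), ("b", 2)])           # from an in-memory collection

from_hdfs = sc.textFile("hdfs://namenode:8020/data/events.txt")  # from HDFS
from_s3 = sc.textFile("s3a://my-bucket/data/events.txt")         # from Amazon S3 via the s3a connector
```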
Spark can cache RDDs in memory across operations. Every node can store the partitions that it has computed in memory and reuse them for subsequent actions on the dataset or on datasets derived from it. This persistence can greatly speed up processing.
Spark also gives users control over how data is persisted: in memory, on disk or a mixture of both.
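A sketch of those storage choices, assuming sc: cache() is shorthand for in-memory persistence, while persist() accepts an explicit storage level such as MEMORY_AND_DISK, which spills partitions to disk when memory runs short.

```python
from pyspark import StorageLevel

# Assumes `sc`.
counts = sc.parallelize([("a", 1), ("b", 2), ("a", 3)]).reduceByKey(lambda x, y: x + y)

counts.cache()                                # in memory only (MEMORY_ONLY)
counts.unpersist()                            # release it before choosing another storage level
counts.persist(StorageLevel.MEMORY_AND_DISK)  # memory first, spill to disk if needed

print(counts.collect())
```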
RDDs are immutable, meaning they cannot be modified after creation. Immutability helps the data remain stable over time throughout multiple operations.
It also makes it easier and safer to share data across multiple processes, and it helps protect against corruption that can be caused by simultaneous updates from different threads.
While RDDs are immutable, users can create new RDDs by applying transformations to existing ones, which allows for datasets to be updated without altering the original data.
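A minimal sketch of immutability, assuming sc: the transformations below return new RDDs and leave the original untouched.

```python
# Assumes `sc`.
original = sc.parallelize([1, 2, 3, 4])

evens = original.filter(lambda x: x % 2 == 0)   # a new RDD; `original` is not modified
doubled = evens.map(lambda x: x * 2)            # another new RDD derived from `evens`

print(original.collect())   # [1, 2, 3, 4]  (unchanged)
print(doubled.collect())    # [4, 8]
```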
RDDs can process both unstructured and structured data. When processing unstructured data, information can be drawn from multiple types of databases, media streams or text files without the need for a fixed schema or a DataFrame.
That said, users can create DataFrames in Spark, which enables them to take advantage of certain optimizations for improved performance.
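A hedged sketch of that step, assuming the spark session and sc from the first example: an RDD of raw, schema-less text lines is parsed into rows and turned into a DataFrame so that Spark SQL's optimizations can apply. The record format and column names are illustrative.

```python
from pyspark.sql import Row

# Assumes `spark` and `sc` from the first sketch. Raw, schema-less text lines:
raw = sc.parallelize(["alice 34", "bob 29"])

# Parse each line into a Row, then create a DataFrame whose schema is inferred from the Rows.
rows = raw.map(lambda line: Row(name=line.split()[0], age=int(line.split()[1])))
people = spark.createDataFrame(rows)

people.show()
```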
1 Apache Spark™, Apache Software Foundation, 20 December 2024.