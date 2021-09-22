Apache Spark (Spark) is a fast, general-purpose cluster computing system that easily handles large-scale data sets and can be programmed from Python through its PySpark API. It is designed to deliver the computational speed, scalability, and programmability required for big data, specifically for streaming data, graph data, analytics, machine learning, large-scale data processing, and artificial intelligence (AI) applications.

Spark's analytics engine processes data 10 to 100 times faster than some alternatives, such as Hadoop, for smaller workloads. It scales by distributing processing workflows across large clusters of computers, with built-in parallelism and fault tolerance. It also includes APIs for programming languages that are popular among data analysts and data scientists, including Scala, Java, Python, and R.
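
To make the idea of distributing a workload concrete, here is a minimal sketch in plain Python. It does not use the Spark API. It assumes a toy word-count job, uses threads as stand-ins for cluster nodes, and all function names (`split_into_partitions`, `count_words`, `parallel_word_count`) are hypothetical names chosen for illustration. It shows only the partition, process in parallel, and merge pattern; it does not model Spark's fault tolerance or scheduling.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(records, num_partitions):
    """Deal the records round-robin into num_partitions lists."""
    return [records[i::num_partitions] for i in range(num_partitions)]


def count_words(partition):
    """Work done independently on one partition: count its words."""
    counts = Counter()
    for line in partition:
        counts.update(line.lower().split())
    return counts


def parallel_word_count(records, num_partitions=4):
    """Process each partition in parallel, then merge the partial results."""
    partitions = split_into_partitions(records, num_partitions)
    # Threads stand in for cluster nodes in this illustration.
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partial_counts = list(pool.map(count_words, partitions))
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total


lines = ["spark scales out", "spark runs in parallel", "hadoop scales out too"]
word_counts = parallel_word_count(lines, num_partitions=2)
```

Because each partition is counted independently, adding workers (or, in a real cluster, machines) lets the same job cover more data without changing the per-partition logic.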

Spark is often compared to Apache Hadoop, and specifically to Hadoop MapReduce, Hadoop’s native data-processing component. The chief difference between Spark and MapReduce is that Spark processes and keeps the data in memory for subsequent steps—without writing to or reading from disk—which results in dramatically faster processing speeds. (You’ll find more on how Spark compares to and complements Hadoop elsewhere in this article.)
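
The difference between the two execution styles can be sketched with a toy two-step pipeline in plain Python. This is an illustration under stated assumptions, not Spark or Hadoop code: the steps `clean` and `lengths`, the file name `stage1.json`, and both runner functions are hypothetical. One runner hands intermediate results to the next step through a file, as a MapReduce-style job would; the other keeps them in memory.

```python
import json
import os
import tempfile


def clean(records):
    """Step 1: trim whitespace, lowercase, and drop empty records."""
    return [r.strip().lower() for r in records if r.strip()]


def lengths(records):
    """Step 2: compute the length of each cleaned record."""
    return [len(r) for r in records]


def run_with_disk_handoff(records, workdir):
    """Each step writes its output to disk; the next step reads it back."""
    stage1_path = os.path.join(workdir, "stage1.json")
    with open(stage1_path, "w", encoding="utf-8") as f:
        json.dump(clean(records), f)
    with open(stage1_path, encoding="utf-8") as f:
        cleaned = json.load(f)
    return lengths(cleaned)


def run_in_memory(records):
    """The intermediate result stays in memory between steps."""
    cleaned = clean(records)
    return lengths(cleaned)


raw = ["  Spark ", "", "Hadoop MapReduce", "  RDD"]
with tempfile.TemporaryDirectory() as workdir:
    disk_result = run_with_disk_handoff(raw, workdir)
memory_result = run_in_memory(raw)
```

Both runners produce the same answer; the in-memory version simply skips the serialize, write, and read round trip between steps, which is the cost that grows with data size and with the number of steps in a job.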

Spark was developed in 2009 at UC Berkeley’s AMPLab. Today, it is maintained by the Apache Software Foundation and boasts the largest open-source community in big data, with over 1,000 contributors. It’s also included as a core component of several commercial big data offerings.

