We often forget how new Spark is. Although the project began years earlier, Apache Spark only became a top-level Apache project in February 2014 (a milestone that generally signals a project is ready for anyone to use), just 18 months ago. I might have a toothbrush that is older than Apache Spark!
Since then, Spark has generated tremendous interest because the new data processing platform scales so well, delivers high performance (up to 100 times faster than some alternatives), and is more flexible than competing platforms, both open source and commercial. (If you're interested, see the trends on both Google searches and Indeed job postings.)
Spark gives data scientists, business analysts, and developers a new platform for managing data and building services, providing near real-time computation through in-memory processing. The project is under extremely active development and has serious investment from IBM and other key players in Silicon Valley.
Tips for getting started with Apache Spark
Given its great potential to revolutionize advanced analytics for big data and modern applications, the IBM Analytics for Apache Spark team is frequently asked for tips on great resources for getting up to speed on Spark.
Below is our team’s list of recommended resources that we share with you in anticipation of the IBM Analytics for Apache Spark open beta:
You have no idea what Spark is and want to at least be informed
- Turning Data into Value — Ion Stoica, Spark Summit 2013 (video & slides)
- How Companies are Using Spark, and Where the Edge in Big Data Will Be — Matei Zaharia (video & slides)
- Spark Fundamentals I (lesson 1 only) — Big Data University
- Quick Start — Apache Spark website
- Overview of Apache Spark (video) and slides — Jim Scott
You want to use Spark and want to understand the basics
- First Steps to Scala — Bill Venners, Martin Odersky, Lex Spoon (creators of Scala)
- Spark Fundamentals I (all lessons) — Big Data University
- Spark Fundamentals II — Big Data University
- Overview — Apache Spark website
- Programming Guide — Apache Spark website
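Once you start the Scala resources above, it helps to see why Scala is Spark's native language: Spark's core RDD API deliberately mirrors Scala's collection operations. Here is a minimal word-count sketch using plain Scala collections only, so it runs with no Spark dependency; the comments note where Spark would differ. The sample sentences are made up for illustration.

```scala
// Word count in the functional style Spark's RDD API mirrors.
// This uses plain Scala collections; in Spark, `lines` would come
// from sc.textFile(...) and a similar transformation chain would
// run distributed across a cluster.
object WordCountSketch {
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split("\\s+"))           // split each line into words
      .filter(_.nonEmpty)                 // drop empty tokens
      .map(word => (word.toLowerCase, 1)) // pair each word with a count of 1
      .groupBy(_._1)                      // group pairs by word
      .map { case (word, pairs) => (word, pairs.map(_._2).sum) } // sum counts

  def main(args: Array[String]): Unit = {
    val counts = wordCount(Seq("Spark makes big data simple",
                               "big data big insight"))
    println(counts("big")) // 3
  }
}
```

The `groupBy` plus per-group sum plays the role of Spark's `reduceByKey`; working through this style on ordinary collections first makes the Programming Guide's RDD examples read naturally.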
You are familiar with Spark and want to continue learning
- Spark Programming Guide lists component-specific guides under the Programming Guides pulldown (Spark SQL, MLlib, etc.) and sections on deployment. See, for example, the Spark SQL and DataFrame Guide
- Advanced Apache Spark (video) and slides
- Tuning and Debugging Spark (video)
- How to Tune Your Apache Spark Jobs — Sandy Ryza
- Introduction to AMPLab Spark Internals (video) — Matei Zaharia
- A Deeper Understanding of Spark Internals (video) and PDF — Aaron Davidson
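To get a feel for the relational operations the Spark SQL and DataFrame Guide covers (filter, groupBy, aggregate), here is a dependency-free sketch of the same query shape on a plain Scala collection. The `Person` records and the query are invented for illustration; in Spark SQL the equivalent would be roughly `people.filter($"age" >= 21).groupBy($"city").count()`.

```scala
// A relational query expressed on plain Scala collections, mimicking
// the shape of Spark's DataFrame API (no Spark dependency needed).
case class Person(name: String, age: Int, city: String)

object DataFrameSketch {
  // Illustrative sample data, not from any real dataset.
  val people: Seq[Person] = Seq(
    Person("Ann", 34, "Austin"),
    Person("Bob", 19, "Boston"),
    Person("Cal", 42, "Austin")
  )

  // Count adults per city: filter, then group, then aggregate,
  // the same three steps a DataFrame query would declare.
  def adultsPerCity(rows: Seq[Person]): Map[String, Int] =
    rows.filter(_.age >= 21)
        .groupBy(_.city)
        .map { case (city, rs) => (city, rs.size) }

  def main(args: Array[String]): Unit =
    println(adultsPerCity(people)) // Map(Austin -> 2)
}
```

The payoff of the real DataFrame API over this hand-rolled version is that Spark's Catalyst optimizer can reorder and push down these declared operations before execution, which is exactly what the Spark SQL paper listed below describes.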
You are already experienced with Spark and want to reach expert level
- Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing (PDF) — Matei Zaharia, et al.
- Discretized Streams: An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters (PDF) — Matei Zaharia, et al.
- Spark SQL: Relational Data Processing in Spark — Michael Armbrust, et al.
- Functional Programming Principles in Scala — Odersky
- Spark Summit conference session videos