What does the rise of Apache Spark mean for Hadoop?

By: Andrea Braida

There are many blogs and analyst reports that have provocative titles like “Why the days are numbered for Apache Hadoop as we know it,” or “Does Spark Mean the End of Hadoop?”

Many of these articles are heavily sensationalized and ignore the reality that Apache Spark integrates deeply with Hadoop. Yes, Spark can run in standalone mode or in other distributed environments such as Mesos or AWS, and it can work with data stores such as Cassandra. But the majority of Spark’s early adoption has been in concert with Hadoop. While Spark is an impressively fast and advanced general-purpose processing framework, it is not a data storage system.

Spark was designed to work with Hadoop’s distributed file system and to improve upon MapReduce technology. Spark makes it easier for developers and data scientists to work with and iterate over data, and to deliver advanced insights faster. But Spark does not replace Hadoop; Spark enhances Hadoop.

Apache Spark improves Hadoop

Apache Spark has been an active open source project since 2010, but its popularity surged around the middle of 2014. Today, it is one of the most active projects in the Apache Software Foundation.

The major reason for its rapid rise in popularity is that it addresses the weak points of Hadoop MapReduce:

  • Programming language choice. Spark lets you write applications in Java, Python, Scala or R.

  • Improved performance. Spark can run the same applications 4x to 100x faster than MapReduce. The gain comes from in-memory data processing, which is built on the Spark concept of a Resilient Distributed Dataset (RDD). Keeping an RDD in memory can greatly speed up workloads that reuse the same data.

  • Ease of use. Spark offers over 80 high-level operators that make it easy to build parallel apps. And you can use it interactively from the Scala, Python and R shells.

  • Multiple workload types. With Spark’s in-memory capabilities, interactive workloads (like running SQL queries) and iterative algorithms (running machine learning models against the same data set) are possible. Spark Streaming also enables the running of micro-batch workloads.

  • Runs everywhere. Spark runs on Hadoop, Mesos, standalone, or in the cloud.
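
To make the in-memory point concrete, here is a minimal sketch in plain Python of why caching helps iterative workloads. ToyDataset and count_source_reads are hypothetical names invented for this illustration; this is not Spark’s API, only a toy model of lazy transformations plus an optional in-memory cache.

```python
from typing import Callable, Iterable, List, Optional


class ToyDataset:
    """Toy stand-in for the RDD idea (hypothetical, not Spark's API)."""

    def __init__(self, compute: Callable[[], Iterable]):
        self._compute = compute              # how to rebuild the data from its source
        self._cached: Optional[List] = None  # filled on first use when caching is on
        self._use_cache = False

    def map(self, fn: Callable) -> "ToyDataset":
        # Lazy: nothing runs until collect() is called on the result.
        return ToyDataset(lambda: (fn(x) for x in self.collect()))

    def filter(self, pred: Callable) -> "ToyDataset":
        return ToyDataset(lambda: (x for x in self.collect() if pred(x)))

    def cache(self) -> "ToyDataset":
        self._use_cache = True
        return self

    def collect(self) -> List:
        if self._use_cache:
            if self._cached is None:
                self._cached = list(self._compute())
            return self._cached
        # Without a cache, every collect() recomputes from the source.
        return list(self._compute())


def count_source_reads(use_cache: bool, iterations: int = 5) -> int:
    """Run an iterative job and report how often the source was read."""
    reads = 0

    def load():
        nonlocal reads
        reads += 1  # stands in for an expensive read from disk
        return range(1, 11)

    evens = ToyDataset(load).filter(lambda x: x % 2 == 0)
    if use_cache:
        evens.cache()
    for k in range(1, iterations + 1):
        # Each pass reuses the same filtered data set.
        evens.map(lambda x: x * k).collect()
    return reads


if __name__ == "__main__":
    print("source reads without cache:", count_source_reads(False))
    print("source reads with cache:", count_source_reads(True))
```

In this toy model, the uncached job goes back to the source on every pass, while the cached job reads it once and serves later passes from memory. That is the same pattern that makes iterative algorithms a good fit for Spark.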

Integrated libraries and use cases

Spark’s ease of use and flexibility are enhanced by a set of powerful, higher-level libraries that can be used together in the same application. These libraries currently include Spark SQL, Spark Streaming, MLlib (for machine learning), and GraphX. Additional Spark libraries and extensions are under development as well.

Here are a few of the major use cases where Spark has an advantage over MapReduce:

  • Iterative algorithms in machine learning

  • Interactive data mining and data processing

  • Data warehousing – Spark SQL is compatible with Apache Hive queries and data, and can reportedly run up to 100x faster than Hive

  • Stream processing – Log processing and fraud detection in live streams for alerts, aggregates and analysis

  • Sensor data processing – Where data is fetched and joined from multiple sources
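
The stream-processing use case relies on micro-batching: events are grouped into small time windows, and each window is processed as an ordinary batch. The following is a rough sketch of that idea in plain Python. It does not use Spark Streaming’s API; the event layout, the function names micro_batches and process_batch, and the alert threshold are all assumptions made up for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical event layout: (timestamp in seconds, account id, amount)
Event = Tuple[float, str, float]


def micro_batches(events: Iterable[Event], interval: float) -> Dict[int, List[Event]]:
    """Group events into fixed-length time windows keyed by window index."""
    batches: Dict[int, List[Event]] = defaultdict(list)
    for event in events:
        batches[int(event[0] // interval)].append(event)
    return dict(sorted(batches.items()))


def process_batch(batch: List[Event], alert_threshold: float):
    """Aggregate one window and flag accounts whose total exceeds the threshold."""
    totals: Dict[str, float] = defaultdict(float)
    for _, account, amount in batch:
        totals[account] += amount
    alerts = sorted(acct for acct, total in totals.items() if total > alert_threshold)
    return dict(totals), alerts


if __name__ == "__main__":
    events = [(0.2, "a", 50.0), (0.9, "a", 70.0), (1.1, "b", 10.0), (2.5, "a", 20.0)]
    for window, batch in micro_batches(events, interval=1.0).items():
        totals, alerts = process_batch(batch, alert_threshold=100.0)
        print(window, totals, alerts)
```

A real deployment would receive events continuously and process each window as it closes. The sketch only shows the grouping and per-window aggregation that the alerts and aggregates depend on.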

It is clear that Spark fills many of the gaps that have been identified in Hadoop itself. The key point here is that it’s not “Spark or Hadoop,” but “Spark and Hadoop.”

Haven’t found the time or money to try Spark?

Despite the potential of Spark described above, widespread adoption has been slowed by time and resource constraints. How do we overcome these obstacles to innovation? One very good way to accelerate your Spark installation and lower the cost of deployment is to use Spark as a managed service, that is, Spark in the cloud. A managed Spark service lets developers and data scientists start exploring quickly with no long-term commitment or risk, scale up or down as needed, and take advantage of pay-as-you-go pricing and an always-on service.

IBM’s Analytics for Apache Spark service is one of only a few managed Spark services, and it stands out in several ways:

  • A powerful interactive notebooks environment, with additional ways to access and analyze your data coming soon

  • Available on the Bluemix platform, a rich, open, and expanding ecosystem of data services and deployment options

  • Backed by more data centers in more countries than any other cloud provider, by 24×7 support, and by a commitment to open source Spark. That commitment includes the IBM Spark Technology Center, a Spark brain trust of data scientists and designers who are advancing Spark.

Don’t take our word for it. Try it for yourself now: Register for a 30-day trial on Bluemix.

Learn more about IBM’s Spark investment: www.ibm.com/analytics.
