6 must-haves for making Apache Spark live up to the hype

What’s driving the viral growth of the cluster-computing framework Fortune calls the “Taylor Swift” of big data software? Performance. Enterprises able to harness Apache Spark can run critical analytic applications up to 100 times faster than before. Once deployed, the framework helps organizations make crucial decisions rapidly as they process the ever-growing volumes of data needed to stay competitive, from competitive intelligence to the Internet of Things (IoT).

What’s more, organizations can use Spark to process more data with infrastructure and tools already in place. But getting the open source framework to live up to its star billing takes an understanding of six vital deployment and management requirements.

In-memory processing cranks out the hits
Business users and developers alike swoon over Spark’s in-memory processing, which can run programs up to 100 times faster than Hadoop MapReduce when data fits in memory, or up to 10 times faster on disk.
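
As a minimal sketch of what that looks like in practice, the PySpark snippet below loads a data set once and pins it in memory so that repeated queries skip the disk. The events.json file and device_type column are hypothetical stand-ins.

```python
# A minimal sketch of Spark's in-memory caching (hypothetical
# events.json file and device_type column).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-demo").getOrCreate()

# Read once from disk, then mark the DataFrame for in-memory caching.
events = spark.read.json("events.json")
events.cache()

# The first action materializes the cache; subsequent queries over
# the same data are served from memory instead of re-reading disk.
events.count()
events.groupBy("device_type").count().show()

spark.stop()
```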

The telecommunications, biotech and finance industries are among Spark’s growing legions of followers—and count IBM as a fan, too. By leveraging Spark’s in-memory processing for frequently accessed information, the company has simplified the architecture of some of its most widely used software solutions and cloud data services, such as IBM BigInsights, IBM Streams and IBM SPSS.

Along the way, IBM customers have benefited from the new sound of rapid big data transformation. Take Nova Scotia-based SolutionInc, an international provider of public Wi-Fi and wired access. Using IBM Analytics for Apache Spark on IBM Bluemix, SolutionInc extracted insights critical to customer satisfaction, such as peak volume times, busiest locations, route patterns and device types.

Ease of use quickens the production rhythm
Developers tuned in to rising productivity expectations are driving much of Spark’s popularity: 77 percent of users cite ease of use as a primary reason for adoption. The framework can radically simplify the development of intelligent apps, and it comes loaded with libraries and intuitive application programming interfaces (APIs) for a variety of programming languages, as well as a rich set of high-level tools to support big data analytics.
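
As one hedged illustration of that ease of use, here is the same query written two ways: with Spark’s DataFrame API and with plain SQL. The sessions.parquet path and column names are invented for the example.

```python
# A sketch of Spark's high-level APIs: one query, two expressions.
# The sessions.parquet path and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("api-demo").getOrCreate()
sessions = spark.read.parquet("sessions.parquet")

# DataFrame API: a declarative query, optimized by Spark's planner.
(sessions
    .where(F.col("duration_sec") > 60)
    .groupBy("location")
    .agg(F.avg("duration_sec").alias("avg_duration"))
    .orderBy(F.desc("avg_duration"))
    .show(10))

# The identical query expressed in Spark SQL.
sessions.createOrReplaceTempView("sessions")
spark.sql("""
    SELECT location, AVG(duration_sec) AS avg_duration
    FROM sessions
    WHERE duration_sec > 60
    GROUP BY location
    ORDER BY avg_duration DESC
""").show(10)
```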

Plus, Spark allows the same code to be reused for batch processing, running ad hoc queries or joining streams against historical data. It’s a hit for developers working on everything from streaming analytics to machine-learning algorithms, because Spark lets user programs load data into a cluster’s memory and query it repeatedly.
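
The sketch below illustrates that reuse under assumed inputs: one transformation function is applied unchanged to a batch DataFrame over historical files and to a Structured Streaming DataFrame over newly arriving files. The history/ and incoming/ directories and the device_type column are hypothetical.

```python
# A sketch of reusing one piece of logic for batch and streaming via
# Structured Streaming. Paths and column names are hypothetical.
from pyspark.sql import SparkSession, DataFrame

spark = SparkSession.builder.appName("reuse-demo").getOrCreate()

def counts_by_device(df: DataFrame) -> DataFrame:
    # The transformation is written once and works on any DataFrame.
    return df.groupBy("device_type").count()

# Batch: apply the logic to historical data at rest.
historical = spark.read.json("history/")
counts_by_device(historical).show()

# Streaming: apply the same logic to files arriving in a directory.
live = spark.readStream.schema(historical.schema).json("incoming/")
query = (counts_by_device(live)
         .writeStream
         .outputMode("complete")   # keep full running counts
         .format("console")
         .start())
query.awaitTermination()           # blocks; stop with query.stop()
```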

6 keys to making Spark sing in any environment

Getting Spark to perform consistently takes planning. Business and IT professionals must answer six critical questions so the framework resonates with specific goals:

  • Will you be able to support multiple instances of Spark? Doing so will maximize resource utilization, increase performance and scale—and eliminate inefficient silos of resources.
  • Do you know how to best leverage available Spark resources? If you’re running multiple Spark application workloads, service levels must be met while preserving security isolation between Spark instances (see the configuration sketch after this list).
  • How will you minimize administration costs? Optimizing the use of existing hardware will help defer incremental capital investment, while managing multiple Spark frameworks from a single point of control will cut the time spent reviewing each framework’s metrics individually.
  • Can you manage fast-moving Spark lifecycles? As an open source project, Spark is a moving target with rapid updates. The ability to deploy different versions of Spark simultaneously in a shared environment will be critical.
  • Can you deliver enterprise-class security with role-based access control? Different users often manage diverse activities. Role-based access control can minimize the risk of a single user causing damage to the entire system.
  • Do you have storage management in place? While Hadoop users will be familiar with the open source Hadoop Distributed File System (HDFS), some organizations may look for additional POSIX compliance.
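
For the resource-sharing and access-control questions above, the following configuration sketch shows one possible starting point, not a complete multi-tenancy solution: Spark’s dynamic allocation lets concurrent applications share cluster resources, and its ACL settings restrict who can view or modify an application. All user and group names are illustrative.

```python
# A configuration sketch, not a complete multi-tenancy solution.
# User and group names here are illustrative.
from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = (SparkConf()
        # Scale executor count with demand so several Spark
        # instances can share one cluster; requires the cluster's
        # external shuffle service to be running.
        .set("spark.shuffle.service.enabled", "true")
        .set("spark.dynamicAllocation.enabled", "true")
        .set("spark.dynamicAllocation.minExecutors", "1")
        .set("spark.dynamicAllocation.maxExecutors", "20")
        # Role-based separation: only the named users and groups may
        # view or administer this application.
        .set("spark.acls.enable", "true")
        .set("spark.ui.view.acls", "analyst_team")
        .set("spark.modify.acls", "spark_admins"))

spark = (SparkSession.builder
         .appName("shared-cluster-app")
         .config(conf=conf)
         .getOrCreate())
```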

Understanding and planning for these issues gives organizations a shot at big data stardom as they tap into Spark’s performance. To help, IBM continues to work on solutions that streamline Spark implementations and provides a number of resources for developers.
