
6 must-haves for making Apache Spark live up to the hype


What’s driving the viral growth of the cluster-computing framework Fortune calls the “Taylor Swift” of big data software? Performance. Enterprises that harness Apache Spark can run critical analytic applications up to 100 times faster than before. Once deployed, the framework helps organizations make crucial decisions rapidly as they process the ever-growing volumes of data needed to remain competitive, from competitive intelligence sources to the Internet of Things (IoT).

What’s more, organizations can use Spark to process more data with infrastructure and tools already in place. But getting the open source framework to live up to its star billing takes an understanding of six vital deployment and management requirements.

In-memory processing cranks out the hits
Business users and developers alike swoon over Spark’s in-memory processing, which can run programs up to 100 times faster than Hadoop MapReduce in memory—or 10 times faster than on disk.

The telecommunications, biotech and finance industries are among Spark’s growing legions of followers—and count IBM as a fan, too. By leveraging Spark’s in-memory processing for frequently accessed information, the company has simplified the architecture of some of its most widely used software solutions and cloud data services, such as IBM BigInsights, IBM Streams and IBM SPSS.

Along the way, IBM customers have benefited from the new sound of rapid big data transformation. Take Nova Scotia-based SolutionInc, an international provider of public Wi-Fi and wired access. Using IBM Analytics for Apache Spark on IBM Bluemix, SolutionInc extracted data sets such as peak volume times, busiest locations, route patterns and device types critical to customer satisfaction.

Ease of use quickens the production rhythm
Developers facing rising productivity expectations are driving much of Spark’s popularity: 77 percent of users cite ease of use as a primary reason for adoption. The framework can radically simplify the process of developing intelligent apps, and Spark comes loaded with libraries and intuitive application programming interfaces (APIs) for a variety of programming languages, as well as a rich set of high-level tools to support big data analytics.

Plus, Spark allows reuse of the same code for batch processing, running ad hoc queries or joining streams against historical data. It’s a hit for developers working with a broad range of data from streaming analytics to machine-learning algorithms: Spark allows user programs to load data into a cluster’s memory for repeated queries.

6 keys to making Spark sing in any environment

Getting Spark to perform consistently takes planning. Business and IT professionals must answer six critical questions so the framework resonates with specific goals:

  • Will you be able to support multiple instances of Spark? Doing so will maximize resource utilization, increase performance and scale—and eliminate inefficient silos of resources.
  • Do you know how to best leverage available Spark resources? If you’re running multiple Spark workloads, you’ll need to meet service levels while preserving security isolation between Spark instances.
  • How will you minimize administration costs? Optimizing the use of existing hardware helps defer incremental capital investment, while managing multiple Spark frameworks from a single point cuts the time spent reviewing each framework’s metrics individually.
  • Can you manage fast-moving Spark lifecycles? As an open source project, Spark is a moving target with rapid updates. The ability to deploy different versions of Spark simultaneously in a shared environment will be critical.
  • Can you deliver enterprise-class security with role-based access control? Different users often manage diverse activities. Role-based access control can minimize the risk of a single user causing damage to the entire system.
  • Do you have storage management in place? While Hadoop users will be familiar with the open source Hadoop Distributed File System (HDFS), some organizations may look for additional POSIX compliance.

Understanding and planning for these issues gives organizations a shot at big data stardom as they tap into Spark’s performance. To help, IBM continues to work on solutions that streamline Spark implementations and provides a number of resources for developers.
