IBM Systems Lab Services

Are you making the best of your Spark environment?

All companies are in the data business now. By empowering your organization to make data-driven decisions at high speed, with optimal resource utilization, IT will soon become the data hero that helps shape the future of the business. Organizations from varied spheres are thus enthusiastic about 21st-century data science. “Big data” has been a ubiquitous term in the IT industry for some time, and many organizations are now turning to Apache Spark, the new go-to framework for big data analytics.

The Apache Spark community and enterprise Spark adoption have been growing rapidly. Many organizations are now experimenting with Spark as an in-memory engine to accelerate common analytics workloads.

If you’re deploying and using Spark in a production environment, you may encounter challenges such as:

  • Integration of Spark into your existing environment
  • Investment in building skills, tools and workflows
  • Proliferation of numerous ad hoc Spark clusters (“Spark silos”)
    • Dedicated environments – Dev, UAT, Prod
    • Distinct user groups and different security roles
    • Multiple storage tiers for performance and cost saving
  • Fast-moving Spark lifecycles

IBM Spectrum Conductor with Spark has a rich set of features to address these challenges. It provides a complete, enterprise-grade, multi-tenant solution for Apache Spark. As part of the IBM Spectrum Computing portfolio, IBM Spectrum Conductor with Spark is purpose-built for Apache Spark deployments and offers:

  • A service orchestration framework
  • Independent scaling of compute and storage infrastructure
  • Flexible allocation of compute and memory as per application requirements
  • IBM Spectrum Scale storage management for a space-efficient alternative to Hadoop Distributed File System (HDFS)
  • Coexistence of different versions of Spark
  • The ability to run Spark workloads efficiently without Hadoop and its complexity
  • Simplified management, monitoring and reporting

Figure 1: High Level Architecture: Spectrum Conductor with Spark

Putting Spark to work

As an example of the advantages Spark can offer, let me share the story of a large telecommunications company my team worked with. The company implemented IBM Spectrum Conductor with Spark for faster analytics, lower licensing costs and a centralized data store.

The company was analyzing data from call detail records (CDRs) with its analytics applications. It used them to produce customer account statements, to analyze location-based data such as call latency and call quality, and to track usage of mobile data services. Its infrastructure was struggling to scale and perform well enough to meet growing business needs, so the company was looking at viable alternatives.

The organization’s extract, transform and load (ETL) workloads involved transformation and derivation from numerous CSV files, each with a large number of columns. The source data arrived in gzip-compressed (.GZ) format, with 100-200K of CSV files that had to be transformed and derived in batches. In short, it was a lot of complicated data that wasn’t easy to process.
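To make the shape of one such batch step concrete, here is a minimal sketch in plain Python using only the standard library. It is not the customer's code and it does not use Spark. The column names (duration_s, data_kb), the derived column (kb_per_s) and the function names are all hypothetical, chosen only for illustration.

```python
import csv
import gzip
from pathlib import Path
from typing import Dict, Iterable, Iterator, List


def read_gz_csv(path: Path) -> Iterator[Dict[str, str]]:
    """Stream rows from one gzip-compressed CSV file without unpacking it to disk."""
    with gzip.open(path, mode="rt", newline="", encoding="utf-8") as handle:
        yield from csv.DictReader(handle)


def derive(row: Dict[str, str]) -> Dict[str, object]:
    """Hypothetical derivation: add a data-rate column computed from two source columns."""
    seconds = float(row["duration_s"])   # assumed column name
    kilobytes = float(row["data_kb"])    # assumed column name
    derived: Dict[str, object] = dict(row)
    # Guard against zero-length records so the division cannot fail.
    derived["kb_per_s"] = round(kilobytes / seconds, 2) if seconds > 0 else 0.0
    return derived


def transform_batch(paths: Iterable[Path]) -> List[Dict[str, object]]:
    """Apply the derivation to every row of every file in one batch."""
    return [derive(row) for path in paths for row in read_gz_csv(path)]
```

Calling transform_batch with a list of .csv.gz paths returns one dictionary per input row, with the derived column added. On a Spark cluster the same read-derive-collect pattern would be spread across many files in parallel. Spark's CSV reader can read .gz files directly, though a gzip file is not splittable, so each file is handled by a single task.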

My team in IBM Systems Lab Services proposed a solution using IBM Spectrum Conductor with Spark running on multiple IBM Power Systems compute nodes with IBM Spectrum Scale as a distributed file system. The transformations were implemented in the Scala programming language and deployed on Spectrum Conductor with Spark for efficient processing. You can see the high-level architecture of the solution in the following figure.

Figure 2: Solution diagram

The proposed solution was implemented to provide the following benefits:

  • Over 1.5 times increase in computation performance on IBM Power Systems compared to x86
  • Efficient input/output operations with IBM Spectrum Scale
  • Improved time to results through efficient resource scheduling with the resource orchestrator (EGO – Enterprise Grid Orchestrator), which maintains a resource pool
  • A cost-effective, scalable solution with twice the overall performance of the existing commercial ETL solution
  • Enhanced security through role-based access control between Spark instances
  • Interactive analytics programming capability with Zeppelin and Jupyter notebooks

For more details and support

IBM Systems Lab Services offers services for IBM Spectrum Conductor with Spark, along with a range of solutions to help you efficiently capture, deliver, manage, protect and reuse data. If you’re interested in talking about how to tear down Spark silos and overhaul your infrastructure, contact us today.
