Are you making the most of your Spark environment?

3 minute read | November 1, 2018

All companies are in the data business now. By empowering your organization to make data-driven decisions quickly and with optimal resource utilization, IT will soon become the data hero that helps shape the future of the business. Organizations across industries are therefore enthusiastic about 21st-century data science. “Big data” has been a ubiquitous term in the IT industry for some time, and many organizations are now turning to Apache Spark, the new go-to framework for big data analytics.

The Apache Spark community and enterprise Spark adoption have been growing rapidly. Many organizations are now experimenting with Spark as an in-memory engine to accelerate common analytics workloads.

If you’re deploying and using Spark in a production environment, you may encounter challenges such as:

  • Integration of Spark into your existing environment
  • Investment in building skills, tools and workflows
  • Proliferation of numerous ad hoc Spark clusters (“Spark silos”)
    • Dedicated environment – Dev, UAT, Prod
    • Distinct user groups and different security roles
    • Multiple storage tiers for performance and cost saving
  • Fast-moving Spark lifecycles

IBM Spectrum Conductor with Spark has a rich set of features to address these challenges. It provides a complete, enterprise-grade, multi-tenant solution for Apache Spark. As part of the IBM Spectrum Computing portfolio, IBM Spectrum Conductor with Spark is purpose-built for Apache Spark deployments and offers:

  • A service orchestration framework
  • Independent scaling of compute and storage infrastructure
  • Flexible allocation of compute and memory as per application requirements
  • IBM Spectrum Scale storage management for a space-efficient alternative to Hadoop Distributed File System (HDFS)
  • Coexistence of different versions of Spark
  • The ability to run Spark workloads efficiently without Hadoop and its complexity
  • Simplified management, monitoring and reporting

Figure 1: High Level Architecture: Spectrum Conductor with Spark

Putting Spark to work

As an example of the advantages Spark can offer, let me share the story of a large telecommunications company my team worked with. It implemented IBM Spectrum Conductor with Spark for faster analytics, lower licensing costs and a centralized data store.

The company used its analytics applications to analyze call detail records (CDRs) for customer account statements, for location-based metrics such as call latency and call quality, and for usage of mobile data services. Its infrastructure was struggling to scale and perform well enough to meet growing business needs, so the company was looking for viable alternatives.

The organization’s extract, transform and load (ETL) workloads involved transforming and deriving data from numerous CSV files, each with a large number of columns. The source data arrived in .GZ format, with 100-200K CSV files that had to be transformed and derived in batches. All that is to say: a lot of complicated data that isn’t easy to process.
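To make the shape of such a workload concrete, here is a minimal sketch in plain Python using only the standard library. It is not the code used in this project, and it does not use Spark. It only illustrates reading gzip-compressed CSV files in batches and deriving a new column. The column names (duration_sec, duration_min), the function names and the batch size are all hypothetical.

```python
import csv
import gzip
from itertools import islice


def batched(paths, size):
    """Yield successive lists of at most `size` items from `paths`."""
    it = iter(paths)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def transform_row(row):
    """Derive a new field from a raw record (hypothetical columns)."""
    out = dict(row)
    # Assumed derivation: convert a duration in seconds to minutes.
    out["duration_min"] = round(int(row["duration_sec"]) / 60, 2)
    return out


def process_gz_csv(path):
    """Decompress one .gz file and return its transformed rows."""
    with gzip.open(path, mode="rt", newline="", encoding="utf-8") as fh:
        return [transform_row(row) for row in csv.DictReader(fh)]


def run_etl(paths, batch_size=1000):
    """Process files batch by batch, yielding rows as they are produced."""
    for batch in batched(paths, batch_size):
        for path in batch:
            yield from process_gz_csv(path)
```

In a distributed engine, the files in each batch would be spread across worker nodes rather than read one after another in a single loop, which is where a cluster and a shared file system start to matter.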

My team in IBM Systems Lab Services proposed a solution using IBM Spectrum Conductor with Spark running on multiple IBM Power Systems compute nodes with IBM Spectrum Scale as a distributed file system. The transformations were implemented in the Scala programming language and deployed on Spectrum Conductor with Spark for efficient processing. You can see the high-level architecture of the solution in the following figure.

Figure 2: Solution diagram

The proposed solution was implemented to provide the following benefits:

  • An increase of more than 1.5 times in computation performance on IBM Power Systems compared to x86
  • Efficient input/output operations with IBM Spectrum Scale
  • Improved time to results through efficient resource scheduling with the resource orchestrator, Enterprise Grid Orchestrator (EGO), which maintains a resource pool
  • A cost-effective, scalable solution with twice the overall performance of the existing commercial ETL solution
  • Enhanced security through role-based access control between Spark instances
  • Interactive analytics programming capability with Zeppelin and Jupyter notebooks

For more details and support

IBM Systems Lab Services offers services for IBM Spectrum Conductor with Spark, along with a range of solutions to help you efficiently capture, deliver, manage, protect and reuse data. If you’re interested in talking about how to tear down Spark silos and overhaul your infrastructure, contact us today.