August 6, 2018 By Sreeni Pamidala 5 min read

IBM Cloud and the implementation of a Big Data platform

Setting up a Big Data platform on premises often requires a significant infrastructure investment to support data ingestion, processing, enrichment, storage, and analytics. Enterprises looking to migrate their applications and Big Data platforms to the cloud (to leverage its agility and scalability and move from a significant capex investment to a pay-as-you-go model) should consider setting up a Big Data platform on IBM Cloud.

Businesses can reap the benefits of Big Data as a service solution on the cloud by leveraging IBM Message Hub (managed Kafka), IBM Streaming Analytics, IBM Analytics Engine (built on open-source Apache Hadoop and Apache Spark), and IBM Cloud Object Storage. Deploying a cloud solution provides flexibility and ease of use without the headaches of setup or high maintenance costs. Furthermore, data scientists can start providing value right away by accessing and analyzing data sets directly from Cloud Object Storage with IBM Data Science Experience.

Helping a mid-size company migrate to the cloud

A few months ago, the IBM Cloud Garage partnered with a mid-size company to assess and transform their entire application portfolio for the cloud. At the end of our initial assessment, we provided a transformation vision and implementation plan based on the IBM Cloud Garage Method. We also provided a target cloud architecture, implemented squad models, and created an actionable plan to divide projects into multiple Minimum Viable Products (MVPs), using a strangler pattern to modernize and migrate their applications to the cloud.
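The strangler pattern mentioned above can be sketched as a simple router: requests for functionality that has already been modernized go to the new cloud service, while everything else still hits the legacy system. The path prefixes and backend names below are hypothetical, not the client's actual services.

```python
# Minimal strangler-pattern routing sketch. As each MVP ships, its path
# prefix is added to MIGRATED_PREFIXES and traffic shifts incrementally.

MIGRATED_PREFIXES = {"/catalog", "/recommendations"}  # hypothetical MVPs already moved

def route(path: str) -> str:
    """Return which backend should serve this request."""
    for prefix in MIGRATED_PREFIXES:
        if path.startswith(prefix):
            return "cloud-service"
    return "legacy-monolith"

print(route("/catalog/items/42"))   # -> cloud-service
print(route("/billing/invoice/7"))  # -> legacy-monolith
```

In practice this routing usually lives in an API gateway or reverse proxy, so the cutover for each MVP is a configuration change rather than a code change.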

In this article, I share insights from this engagement and target architecture components to help guide the implementation of a Big Data platform on the cloud for others.

Based on this engagement and several others, here are a few key requirements for establishing a secure cloud platform:

  • Having flexibility to scale up and down the infrastructure

  • Establishing a simplified and consolidated technology stack

  • Using cloud-native technologies

  • Ensuring governance is part of the technology stack

  • Using continuous delivery

  • Reducing IT costs

  • Minimizing vendor lock-in in technology choices

Motivation to move Big Data stack to the cloud

While many companies today seek to harness Big Data to cultivate new business insights, this mid-size company’s use of Big Data is integral to their core mission and is baked into many of their business decisions. Big Data powers their innovation in customer service by anticipating what customers like and how they will interact. They then learn from these interactions to improve future experiences.

Their current Big Data platform was adequate but fairly expensive to maintain. It was also lagging behind current software and hardware technologies due to multiple acquisitions and several integrations. Their platform direly needed a technology stack upgrade and an update to processes. An update would enable faster innovation, provide data governance, reduce maintenance costs, and, most importantly, create a single source of truth to resolve data inconsistencies—inevitably increasing their technology’s business value.

Assessing the current application portfolio and drafting target deployment models

Before executing a cloud migration and embarking on digital transformation, an organization must understand their long-term business goals, pain points, the architecture of their application and data infrastructure portfolio, and their organizational structure, operations, and processes. The IBM Cloud Garage performed a comprehensive review of the client’s applications supporting the core business functions, grouping them into different categories to evaluate against various cloud deployment models. As previously discussed, this client’s Big Data platform is a core component for all their applications. Moreover, in our assessment, we discovered their platform was built using a mix of various technologies over time, which presented a series of complexities to consider.

Target cloud architecture for the Big Data platform

When we joined the client, they had already started building a target architecture model based on leading open-source technologies. In doing so, however, the client planned to roll their own technology stack for a Big Data platform on the cloud without leveraging any cloud-native services, such as rapidly provisioned Hadoop and Spark clusters or flexible object storage for data at rest.

Noting their goal of embracing open-source technologies, our team proposed a new target architecture that would still meet their key requirements. Our proposed architecture was split into three major categories to address the data flows:

  1. Real-time

  2. Near real-time

  3. Batch jobs execution

First off, all three workflows required data ingestion. We proposed consolidating all data ingestion through Kafka to ingest large volumes of streaming data across different channels. Although their existing platform already used Kafka alongside other custom ingestion sources, we recommended consolidating the multiple ingestion paths into Kafka and leveraging it as a service on IBM Cloud through IBM Message Hub. Kafka provides guaranteed delivery of streaming messages for real-time applications. Since Kafka is an open-source technology with no vendor lock-in, the client would retain flexibility in vendor choice.
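Consolidating channels into Kafka mostly means agreeing on topics and a message envelope. The sketch below shows one way to map ingestion channels to topics and build a keyed, JSON-encoded record; the channel names and topic map are illustrative, and in production the resulting tuple would be handed to a Kafka producer client pointed at the Message Hub brokers.

```python
import json
import time

# Hypothetical channel-to-topic map; one topic per ingestion channel.
TOPIC_BY_CHANNEL = {
    "web": "ingest.web",
    "mobile": "ingest.mobile",
    "pos": "ingest.pos",
}

def build_record(channel: str, payload: dict) -> tuple:
    """Return (topic, key, value) ready to hand to a Kafka producer.

    Keying by channel keeps each channel's events ordered within a partition.
    """
    topic = TOPIC_BY_CHANNEL[channel]
    value = json.dumps({"channel": channel, "ts": int(time.time()), **payload})
    return topic, channel.encode(), value.encode()

topic, key, value = build_record("web", {"event": "page_view", "user": "u1"})
print(topic)  # -> ingest.web
```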

Second, we proposed IBM Streaming Analytics on IBM Cloud to combine real-time and near real-time application use cases for message processing. IBM Streams is a technology proven to scale to hundreds of millions of users. Although Streams is not open source, it supports the development of applications in Java, Scala, and Python. IBM Streams also supports Apache Beam, a programming model that allows users to create platform-agnostic streaming applications. Operators are the lowest-level building blocks of Streams programming; an operator might be an adapter that reads from or writes to Kafka, or it might apply a machine-learning model to make predictions.
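The operator composition described above can be illustrated in plain Python: a source operator, a scoring operator that applies a model, and a sink, chained as generators. This is a stand-in sketch, not Streams or Beam code; the threshold "model" is a placeholder for a real trained model.

```python
# Streams-style operator composition sketched with Python generators.

def source(messages):
    """Source operator: in Streams this would be a Kafka read adapter."""
    for m in messages:
        yield m

def score(stream, threshold=0.5):
    """Scoring operator: stand-in 'model' flags messages above a threshold."""
    for m in stream:
        yield {**m, "flagged": m["amount"] > threshold}

def sink(stream):
    """Sink operator: in Streams this would write to a downstream store."""
    return list(stream)

out = sink(score(source([{"amount": 0.2}, {"amount": 0.9}])))
print(out)  # -> [{'amount': 0.2, 'flagged': False}, {'amount': 0.9, 'flagged': True}]
```

Because each operator consumes and produces a stream, operators can be reordered, swapped, or extended without touching the rest of the pipeline, which is the property the Streams programming model provides at scale.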

For real-time and near real-time data ingestion, we recommended IBM Streaming Analytics to process the messages. The output of IBM Streams flows into a downstream component (Druid, the Elastic Stack, Cassandra, or HDFS), depending on the workflow. Additionally, the decision engine in the data pipeline is triggered through an IBM Streams operator based on the type of event. The event data can be recorded in a Cassandra data store (deployed on a customer cloud account) for historical purposes and for real-time/near real-time decision making. The client was already using Druid as a columnar data store, and we kept it for OLAP queries, which can provide near real-time dashboarding capability for interactive user applications.
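The event-type-based routing described above amounts to a dispatch table from event types to downstream sinks. In this sketch the sink names mirror the architecture (a Druid-style metrics store, a Cassandra-style interaction store, and an HDFS-style raw archive), but the event types and writers are illustrative stand-ins that just collect records.

```python
# Routing processed events to downstream stores by event type.
sinks = {"metric": [], "interaction": [], "raw": []}

ROUTE_BY_TYPE = {
    "dashboard_metric": "metric",     # -> Druid for OLAP dashboards
    "customer_event": "interaction",  # -> Cassandra for decision making
}

def dispatch(event: dict) -> str:
    """Send an event to its sink; unknown types fall through to the archive."""
    sink = ROUTE_BY_TYPE.get(event["type"], "raw")  # default -> HDFS archive
    sinks[sink].append(event)
    return sink

print(dispatch({"type": "customer_event", "id": 1}))  # -> interaction
print(dispatch({"type": "debug_trace", "id": 2}))     # -> raw
```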

The other important technology decision for the target architecture centered on Hadoop clusters, which can be deployed either as an IaaS solution or as a cloud-native offering through IBM Analytics Engine. We recommended IBM Analytics Engine to provide two key capabilities through a single solution: compute-as-a-service and object storage. A major benefit of IBM Analytics Engine is that compute and storage scale independently rather than being tightly coupled. This empowers the organization’s many data scientists and developers to focus on building new algorithms and models to extract insights from massive data sets without worrying about maintaining a permanent cluster. IBM Analytics Engine hides the complexity of cluster deployment and provides a straightforward way to spin up Hadoop and Spark clusters on the cloud within minutes.
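The "no permanent cluster" idea is the key operational shift: provision a cluster, run the job, tear it down. The sketch below expresses that lifecycle as a context manager; `ClusterClient` is a hypothetical stand-in, not the real IBM Analytics Engine API.

```python
import contextlib

class ClusterClient:
    """Hypothetical stand-in for a cluster provisioning API."""
    def create(self, nodes):
        self.state = "running"
        return "cluster-1"
    def delete(self, cluster_id):
        self.state = "deleted"

@contextlib.contextmanager
def ephemeral_cluster(client, nodes=3):
    """Provision a cluster for the duration of a job, then always tear it down."""
    cid = client.create(nodes)
    try:
        yield cid
    finally:
        client.delete(cid)  # no permanent cluster to maintain or pay for

client = ClusterClient()
with ephemeral_cluster(client) as cid:
    result = f"job ran on {cid}"
print(result, client.state)  # -> job ran on cluster-1 deleted
```

The `finally` clause is what makes the pattern safe: even if the Spark or Hadoop job fails, the cluster is released and stops incurring cost.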

The role of data governance

Implementing data governance was one of the client’s key requirements in migrating their data platform to the cloud. Our team recommended IBM Data Catalog and Data Refinery, which are now part of IBM Watson Knowledge Catalog. The Knowledge Catalog provides a broad range of capabilities and tools for data cataloging, discovery, and governance. IBM Watson Knowledge Catalog includes built-in data discovery algorithms that use machine learning to auto-classify the contents of each data set, along with a governance policy manager and engine. When data is added to the enterprise catalog, for example, sensitive data is automatically classified. The Catalog also immediately redacts or masks any data that users are not allowed to see, in line with the organization’s data governance policies.
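To make the classify-then-mask flow concrete, here is a deliberately simplified rule-based sketch. The real Knowledge Catalog uses machine-learning classifiers and policy definitions rather than these hand-written regular expressions, which cover only two illustrative data classes.

```python
import re

# Toy classification rules; a real catalog would use ML-based classifiers.
RULES = {
    "email": re.compile(r"[^@\s]+@[^@\s]+\.\w+"),
    "ssn": re.compile(r"\d{3}-\d{2}-\d{4}"),
}

def classify(value: str):
    """Return the data classes a value matches (empty list = unclassified)."""
    return [name for name, rx in RULES.items() if rx.fullmatch(value)]

def mask(value: str) -> str:
    """Redact any value classified as sensitive; pass the rest through."""
    return "***REDACTED***" if classify(value) else value

print(classify("jane@example.com"))  # -> ['email']
print(mask("123-45-6789"))           # -> ***REDACTED***
print(mask("hello"))                 # -> hello
```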

IBM Data Refinery is embedded into the IBM Watson Knowledge Catalog and Watson Studio, allowing data scientists to connect to data sources regardless of where the data resides, explore data, and use a wide range of transformations to cleanse and transform data into the format necessary for analysis.
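The cleanse-and-shape step Data Refinery performs can be pictured as a chain of row-level transformations. The sketch below uses plain Python with illustrative column names, not Data Refinery's own operations.

```python
# Example raw rows with the usual defects: stray whitespace, string-typed
# numbers, and missing values.
rows = [
    {"name": "  Ada ", "age": "36"},
    {"name": "Grace", "age": ""},
]

def cleanse(row: dict) -> dict:
    """Trim strings and cast the age column, mapping empties to None."""
    return {
        "name": row["name"].strip(),
        "age": int(row["age"]) if row["age"] else None,
    }

cleaned = [cleanse(r) for r in rows]
print(cleaned)  # -> [{'name': 'Ada', 'age': 36}, {'name': 'Grace', 'age': None}]
```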

IBM Data Science Experience (DSX), now also part of Watson Studio, provides tooling that lets data scientists focus on analyzing data. Models and Spark jobs can be created and run directly from DSX.

Big data on cloud = no brainer

Implementing a Big Data platform stack on the cloud can provide flexibility, agility, and innovation for the enterprise. The role of IT infrastructure has changed from a cost center to a flexible, innovative capability. By leveraging the cloud and a Big Data platform in the cloud, IT can become an innovation team driving value.

Schedule your complimentary IBM Cloud Garage consultation to get started.

Amit Rai, Big Data Platform Architect, supported this Garage engagement, and he contributed to the article.
