Implementing a Big Data Platform on IBM Cloud
IBM Cloud and the implementation of a Big Data platform
Setting up a Big Data platform on-premises often requires a significant infrastructure investment to support data ingestion, processing, enrichment, storage, and analytics. Enterprises looking to migrate their applications and Big Data platforms to the cloud (to leverage its agility and scalability and to move from a significant capex investment to a pay-as-you-go model) should consider setting up a Big Data platform on IBM Cloud.
Businesses can reap the benefits of a Big Data-as-a-service solution on the cloud by leveraging IBM Message Hub (managed Kafka), IBM Streaming Analytics, IBM Analytics Engine (built on open-source Apache Hadoop and Apache Spark), and IBM Cloud Object Storage. Deploying a cloud solution provides flexibility and ease of use without the headaches of setup or high maintenance costs. Furthermore, data scientists can start providing value right away by accessing and analyzing data sets directly from Cloud Object Storage with IBM Data Science Experience.
Helping a mid-size company migrate to the cloud
A few months ago, the IBM Cloud Garage partnered with a mid-size company to assess and transform their entire application portfolio for the cloud. At the end of our initial assessment, we provided a transformation vision and implementation plan based on the IBM Cloud Garage Method. We also provided a target cloud architecture, implemented squad models, and created an actionable plan that divided projects into multiple Minimum Viable Products and used a strangler pattern to modernize and migrate their applications to the cloud.
In this article, I share insights from this engagement, along with the components of the target architecture, to help others implement a Big Data platform on the cloud.
Based on this engagement and several others, here are a few key requirements for establishing a secure cloud platform:
Having the flexibility to scale the infrastructure up and down
Establishing a simplified and consolidated technology stack
Using cloud-native technologies
Ensuring governance is part of the technology stack
Using continuous delivery
Reducing IT costs
Minimizing vendor lock-in technology choices
Motivation to move Big Data stack to the cloud
While many companies today seek to harness Big Data to cultivate new business insights, this mid-size company’s use of Big Data is integral to their core mission and is baked into many of their business decisions. Big Data powers their innovation in customer service by anticipating what customers like and how they will interact. They then learn from these interactions to improve future experiences.
Their current Big Data platform was adequate but fairly expensive to maintain. It was also lagging behind current software and hardware technologies due to multiple acquisitions and several integrations. Their platform badly needed a technology stack upgrade and updated processes. An update would enable faster innovation, provide data governance, reduce maintenance costs, and, most importantly, create a single source of truth to resolve data inconsistencies, ultimately increasing the business value of their technology.
Assessing the current application portfolio and drafting target deployment models
Before executing a cloud migration and embarking on a digital transformation, an organization must understand its long-term business goals, its pain points, the archaeology of its application and data infrastructure portfolio, and its own structure, operations, and processes. The IBM Cloud Garage performed a comprehensive review of the client’s applications supporting the core business functions, grouping them into different categories to evaluate against various cloud deployment models. As previously discussed, this client’s Big Data platform is a core component of all their applications. Moreover, our assessment found that the platform had been built from a mix of technologies over time, which introduced a series of complexities to consider.
Target cloud architecture for the Big Data platform
When we joined the client, they had already started building a target architecture model that applied leading open-source technologies. In doing so, however, the client planned to build a roll-your-own technology stack for a Big Data platform on the cloud without leveraging any of the cloud-native services that allow for rapid provisioning (of Hadoop and Spark clusters, for example) or for flexible handling of data at rest with object storage.
Noting their goal of embracing open-source technologies, our team proposed a new target architecture that would still meet their key requirements. Our proposed architecture was split into three major categories to address the data flows:
Batch jobs execution
First off, all three workflows required data ingestion. Their existing platform already used Kafka alongside other custom ingestion sources. We proposed consolidating all of these ingestion paths into Kafka, which can ingest large volumes of streaming data across different channels, and consuming it as a service on IBM Cloud through IBM Message Hub. Kafka provides guaranteed delivery of streaming messages for real-time applications. Since Kafka is an open-source technology with no vendor lock-in, the client would have flexibility in vendor choice.
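To make the consolidation idea concrete, here is a minimal sketch in Python of wrapping events from different channels in one common envelope before they are published to Kafka. It is an illustration under stated assumptions, not the client's implementation: the envelope fields, the function names, and the source name in the usage note are all hypothetical. The configuration keys follow the librdkafka naming used by common Kafka clients, and the SASL_SSL/PLAIN settings are an assumption that should be checked against the credentials of the managed Kafka service.

```python
import json
import time
import uuid


def kafka_client_config(brokers, username, password):
    """Build a librdkafka-style config dict (assumes SASL_SSL with PLAIN auth)."""
    return {
        "bootstrap.servers": ",".join(brokers),
        "security.protocol": "SASL_SSL",
        "sasl.mechanisms": "PLAIN",
        "sasl.username": username,
        "sasl.password": password,
    }


def to_envelope(source, payload, event_time=None):
    """Wrap a payload from any ingestion channel in one common envelope (hypothetical schema)."""
    return {
        "id": str(uuid.uuid4()),
        "source": source,
        "event_time": event_time if event_time is not None else time.time(),
        "payload": payload,
    }


def serialize(envelope):
    """Return (key, value) as bytes, ready to hand to a Kafka producer.

    Keying by source sends a channel's events to the same partition under
    key-based partitioning, which preserves their relative order.
    """
    key = envelope["source"].encode("utf-8")
    value = json.dumps(envelope, sort_keys=True).encode("utf-8")
    return key, value
```

For example, serialize(to_envelope("web-clickstream", {"page": "/home"})) yields a key and value that any Kafka producer client could publish. With every channel sharing one envelope, downstream consumers need to understand only one message format.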
Second, we proposed IBM Streaming Analytics on IBM Cloud to handle both real-time and near real-time message-processing use cases. IBM Streams is a technology proven to scale to hundreds of millions of users. Although Streams is not open source, it supports the development of applications in Java, Scala, and Python. IBM Streams also supports Apache Beam, a programming model that allows users to create platform-agnostic streaming applications. Operators are the lowest-level building blocks of Streams programming. An operator might be an adapter that reads from or writes to Kafka, or it might apply a machine-learning model to make predictions.
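The operator idea can be sketched in plain Python with generators. This is not the IBM Streams API. It is only an illustration of how an adapter operator and a scoring operator compose into a flow, and the function names, the tuple fields, and the toy threshold "model" are all hypothetical.

```python
def kafka_source(messages):
    """Adapter operator: stands in for a Kafka reader and yields one tuple per message."""
    for message in messages:
        yield dict(message)


def score(stream, model):
    """Scoring operator: attaches the model's prediction to each tuple."""
    for item in stream:
        item["prediction"] = model(item)
        yield item


def threshold_model(item):
    """Hypothetical model: labels a session 'engaged' at three or more clicks."""
    return "engaged" if item.get("clicks", 0) >= 3 else "casual"


def run_pipeline(messages):
    """Compose the operators into a source -> score flow and collect the results."""
    return list(score(kafka_source(messages), threshold_model))
```

Each operator consumes one stream and produces another, so new steps (filtering, enrichment, a sink adapter) can be added without changing the existing ones.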
For real-time and near real-time data ingestion, we recommended IBM Streaming Analytics to process the messages. The output of IBM Streams flows into a downstream component (Druid, the Elastic Stack, Cassandra, or HDFS), depending on the workflow. Additionally, the decision engine in the data pipeline is triggered through an IBM Streams operator based on the type of event. The event data can be recorded in a Cassandra data store (deployed on a customer cloud account) for historical purposes and for real-time and near real-time decision making. The client was already using Druid as a columnar data store, and we kept it for OLAP queries, which can provide the near real-time dashboarding capability that interactive user applications need.
The other important technology decision for the target architecture centered on Hadoop clusters, which can be deployed either as an IaaS solution or as a cloud-native offering through IBM Analytics Engine. We recommended IBM Analytics Engine to provide two key capabilities through a single solution: compute-as-a-service and object storage. A major benefit of IBM Analytics Engine is that compute and storage scale independently rather than being tightly coupled. This frees the organization’s many data scientists and developers to focus on building new algorithms and models that extract insights from massive data sets, without worrying about building a permanent cluster. IBM Analytics Engine hides the complexity of cluster deployment and provides a straightforward way to spin up Hadoop and Spark clusters on the cloud within minutes.
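As a rough sketch of routing by event type, the Python below groups events by the downstream component that should receive them. The event types and the mapping from type to sink are hypothetical and chosen only for illustration. The article does not specify which workflow feeds which store, and real sink writers would replace the in-memory batches.

```python
from collections import defaultdict

# Hypothetical mapping from event type to downstream components.
ROUTES = {
    "metric": ["druid"],
    "log": ["elastic"],
    "decision": ["cassandra"],
}
# Hypothetical fallback for event types without an explicit route.
DEFAULT_SINKS = ["hdfs"]


def route(event):
    """Return the sink names for an event; unknown types go to the default sink."""
    return ROUTES.get(event.get("type"), DEFAULT_SINKS)


def dispatch(events):
    """Group events by destination sink, as a stand-in for real sink writers."""
    batches = defaultdict(list)
    for event in events:
        for sink in route(event):
            batches[sink].append(event)
    return dict(batches)
```

Keeping the routing table as data rather than code means a workflow can be redirected to a different store by changing one entry.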
The role of data governance
Implementing data governance was one of the client’s key requirements in migrating their data platform to the cloud. Our team recommended IBM Data Catalog and Data Refinery, which are now part of the IBM Watson Knowledge Catalog. The Knowledge Catalog provides a broad range of capabilities and tools for data cataloging, discovery, findability, and governance. It includes built-in data discovery algorithms that use machine learning to auto-classify the contents of each data set, as well as a governance policy manager and engine. When you add data to the enterprise catalog, for example, sensitive data is classified automatically. The Catalog also immediately redacts or masks any data that users are not allowed to see, in line with the organization’s data governance policies.
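The classify-then-mask flow can be illustrated with a small Python sketch. It is not the Watson Knowledge Catalog API and it does not use machine learning: simple regular expressions stand in for the classifiers, and the class names, the mask string, and the function names are all hypothetical.

```python
import re

# Hypothetical classifiers: a column is tagged when every non-empty value matches.
CLASSIFIERS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "us_ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
}


def classify_columns(rows):
    """Return {column: classification} for columns whose values all match a classifier."""
    tags = {}
    columns = rows[0].keys() if rows else []
    for column in columns:
        values = [str(r[column]) for r in rows if r.get(column) not in (None, "")]
        for name, pattern in CLASSIFIERS.items():
            if values and all(pattern.match(v) for v in values):
                tags[column] = name
                break
    return tags


def mask_rows(rows, tags, allowed_classes):
    """Mask values in classified columns unless the caller may see that class."""
    return [
        {
            col: ("****" if col in tags and tags[col] not in allowed_classes else val)
            for col, val in row.items()
        }
        for row in rows
    ]
```

A caller allowed to see only the "email" class would receive contact addresses unchanged and social security numbers replaced by the mask, while unclassified columns pass through untouched.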
IBM Data Refinery is embedded into the IBM Watson Knowledge Catalog and Watson Studio, allowing data scientists to connect to data sources regardless of where the data resides, explore data, and use a wide range of transformations to cleanse and transform data into the format necessary for analysis.
IBM Data Science Experience (DSX), which is now also part of Watson Studio, provides tooling that lets data scientists focus on analyzing data. Models and Spark jobs can be created and run through DSX.
Big Data on the cloud = no-brainer
Implementing a Big Data platform stack on the cloud can provide flexibility, agility, and innovation for the enterprise. IT infrastructure has changed from a cost center into a flexible source of innovation. By leveraging a Big Data platform on the cloud, IT can become an innovation team that drives value.
Schedule your complimentary IBM Cloud Garage consultation to get started.
Amit Rai, Big Data Platform Architect, supported this Garage engagement, and he contributed to the article.