When GCDO started its data and AI journey, the IBM Cloud Pak for Data solution didn’t exist. While the CEDP drove significant advancement, the development of the IBM Cloud Pak for Data solution gave GCDO a home-field advantage in taking its own platform to the next level.
As a suite of services and extensions that can be used as needed, the IBM Cloud Pak for Data solution gave GCDO the required flexibility to modernize in stages and start with the highest needs first. There was no prescriptive order to adoption or deployment.
GCDO first started using the AI suite of services within the IBM Cloud Pak for Data solution, including the IBM Watson® Studio solution. IBM Watson Studio technology runs on premises and in the cloud, analyzing data in the IBM Db2® Big SQL solution. The details of this part of GCDO’s modernization journey are described in this case study.
For the next step in the journey, GCDO turned to DataStage technology to dramatically increase the speed of ingesting vast amounts of data with stability and accuracy.
“After several months setting up servers, establishing database connections, and trial and error configuration and self-learning efforts, a 60 million record table would still take three days to replicate,” says Frank Duffy, Senior Project Manager with GCDO Master Data. “Looking at those statistics, with approximately 20 large tables to go, we were looking at another 60 days just to migrate the data.”
GCDO’s Data Movement team tested the performance of DataStage and Spark technology in executing common data load use cases. In more than 75% of the cases, DataStage technology outperformed Spark technology. In the remaining cases, the results were a close match.
Beyond performance, factors that attracted GCDO to the DataStage solution include:
- Integration with the IBM Cloud Pak for Data ecosystem, particularly the IBM Watson Knowledge Catalog and data lineage
- Breadth of supported sources, targets and intermediate stages that met current and forward-looking needs
- Custom stages to encapsulate needs into reusable units when necessary
- Capabilities that supported a pattern-based approach
The IBM Cloud Pak for Data solution supports several industry data sources and continually updates that support as technology changes. The DataStage for IBM Cloud Pak for Data solution comes bundled with a large inventory of industry connectors, representing most of the data stores that GCDO users wanted to work with. With these connectors, GCDO could work with different storage formats and systems without writing any code.
In those instances where a connector wasn’t already available, custom connectors could be developed, deployed and dropped onto the canvas.
The DataStage for IBM Cloud Pak for Data solution also offers Runtime Column Propagation functionality, which appealed to GCDO engineers because it allowed a pattern-based approach to data movement. By expressing common data movement patterns as jobs, GCDO scaled up operations to support thousands of tables without needing to increase staffing.
“The DataStage for IBM Cloud Pak for Data pattern capability allowed us to have one job that could run thousands of ways,” says Rick McCall, GCDO Technical Lead for the Data Movement Tool. “In some cases, we had upwards of 8,000 jobs — pages and pages of them — that could be associated to a single pattern and run as a single job. That means one set of code, optimized performance, and source control all rolled into one super-fast, super-reliable solution.”
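The idea of one job serving many tables can be illustrated outside DataStage. The following is a minimal Python sketch of the general pattern, not DataStage code and not GCDO's implementation: a single copy routine reads each table's columns at run time instead of hardcoding them, so the same logic can be reused across tables with different schemas. The function names `copy_table` and `run_pattern`, the in-memory SQLite databases, and the sample tables are all hypothetical and exist only for illustration.

```python
import sqlite3


def copy_table(src, dst, table):
    """Copy one table, discovering its columns at run time (hypothetical helper)."""
    cur = src.execute(f'SELECT * FROM "{table}"')
    # Column names come from the source at run time; nothing is hardcoded.
    columns = [d[0] for d in cur.description]
    col_list = ", ".join(f'"{c}"' for c in columns)
    placeholders = ", ".join("?" for _ in columns)
    dst.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({col_list})')
    rows = cur.fetchall()
    dst.executemany(
        f'INSERT INTO "{table}" ({col_list}) VALUES ({placeholders})', rows
    )
    dst.commit()
    return len(rows)


def run_pattern(src, dst, tables):
    """Apply the same copy pattern to every table and return row counts."""
    return {t: copy_table(src, dst, t) for t in tables}


# Illustrative data only: two tables with different schemas.
src = sqlite3.connect(":memory:")
dst = sqlite3.connect(":memory:")
src.execute("CREATE TABLE clients (id INTEGER, name TEXT)")
src.execute("CREATE TABLE orders (order_id INTEGER, client_id INTEGER, total REAL)")
src.executemany("INSERT INTO clients VALUES (?, ?)", [(1, "Acme"), (2, "Globex")])
src.executemany("INSERT INTO orders VALUES (?, ?, ?)", [(10, 1, 99.5)])

counts = run_pattern(src, dst, ["clients", "orders"])
print(counts)
```

Because the routine never names a column, adding another table means adding one entry to the list rather than writing a new job, which is the property the pattern-based approach relies on.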
Another benefit of the DataStage for IBM Cloud Pak for Data solution is that it integrates seamlessly with Red Hat® OpenShift®. It also offers API support so users can build custom workflows around it if needed.
“DataStage for IBM Cloud Pak for Data was a game changer for our data ingestion,” says Peter Herr, Global Leader for Client Master Data. “Our team had tried everything within the constraints of our existing system and were still at an impasse for acceptably accomplishing the massive amount of data migration we required. When Rick and team showed us the speed and power of DataStage, we were productive within weeks instead of months.”