Data integration use case

To cope with the influx of disparate data sources, enterprises need to leverage different data integration styles. Watsonx.data integration provides tools to dynamically integrate and observe data across a distributed landscape to create a high-performance network of instantly available information for analytics, business intelligence, and AI purposes.

Challenges

As their data types and volumes grow, enterprises face the following data integration challenges:

Ingesting data from across the enterprise
Processes must ingest data from any application or system regardless of whether the data resides on premises, in the cloud, or in a hybrid environment.
Integrating data from multiple sources
Data engineers must combine data from multiple data sources into a single data set.
Developing data flows for diverse data integration styles
Data engineers must develop data flows that address a range of diverse data integration styles, including streaming, replication, and bulk or batch processing.
Adjusting to constant shifts in data architectures
Data engineers must constantly adjust to shifts in data architectures, which can cause significant rework costs and reduced business responsiveness.
Responding immediately to data incidents
Data engineers must observe end-to-end data movement and immediately respond to any problems or issues that occur in data quality, integrity, and access.

You can integrate your data and solve these challenges by using watsonx.data integration.

Example: Golden Bank's challenges

Follow the story of Golden Bank as the data engineering team implements diverse data integration styles:

  • Transform batch data: Golden Bank has a large amount of customer and mortgage data that is stored in three external data sources. Lenders use this information to help them decide whether they should approve or deny mortgage applications. The bank wants to integrate the data from the different sources, and then deliver that transformed data to a single output file.

  • Replicate data: Golden Bank needs to replicate the credit score information from a database owned by an external provider into Golden Bank's Apache Kafka cluster. The bank sets up a continuous, near-real-time replication feed that efficiently captures data from the source database and delivers it to the target Kafka cluster.

  • Stream real-time data: Golden Bank needs to enrich the replicated credit score information in real time, as soon as new data arrives in Golden Bank's Kafka cluster. The bank wants to ingest the data, mask sensitive information, and look up additional information about the applicant.

  • Observe the data: Golden Bank needs to make sure that all of its integration processes finish successfully. To catch and resolve errors before they affect the data, the data engineering team sets up Job run state alerts. With alerts in place, the team is notified as soon as an error occurs. The team can quickly fix the underlying issue and focus on the task at hand instead of constantly watching for errors.


Process

To implement a data integration solution for your enterprise, your organization can follow this process:

  1. Integrate the data
  2. Observe the data

1. Integrate the data

With watsonx.data integration, data engineers can efficiently access and work with data from different sources, types, and clouds as if the data were from a single data source. In this step of the process, raw data is extracted, ingested, and transformed into consumable, high-quality data that is ready to be explored for analytics, business intelligence, and AI purposes.

What you can use: DataStage
What you can do: Design and run complex ETL data flows that move and transform batch data.
Best to use when: You need to design and run batch data flows. The flows must handle large volumes of data and connect to a wide range of data sources, integrate and perform complex transformations on the data, and deliver it to your target system.

What you can use: Data Replication
What you can do:
  • Distribute a data integration workload across multiple sites.
  • Provide continuous availability of data.
Best to use when:
  • Your data is distributed across multiple sites.
  • You need your data to be continuously available.

What you can use: StreamSets
What you can do: Design and run streaming data flows that read data as soon as it becomes available and that perform light in-flight transformations on the data.
Best to use when:
  • You expect the data to arrive continuously in the source system.
  • You need to process the data as soon as it becomes available.

Example: Golden Bank's data integration

Risk analysts at Golden Bank calculate the daily interest rate that they recommend offering to borrowers for each credit score range.

Data engineers use DataStage to aggregate anonymized mortgage application data with the personally identifiable information from mortgage applicants. A DataStage flow integrates this information, including credit score information for each applicant, the applicant’s total debt, and an interest-rate lookup table. The data engineers then load the data into a target output .csv file that can be shared for use by lenders and analysts.
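
The batch integration pattern in this example can be illustrated outside of DataStage with a small Python sketch. This is not DataStage code and does not reflect how a DataStage flow is built. It only shows the general idea of joining application records with applicant records, looking up a rate by credit score range, and writing one CSV output. All field names, sample values, score ranges, and rates are hypothetical.

```python
import csv
import io

# Hypothetical in-memory stand-ins for the three external data sources.
applications = [
    {"id": "A1", "loan_amount": 250000, "credit_score": 720, "total_debt": 18000},
    {"id": "A2", "loan_amount": 400000, "credit_score": 640, "total_debt": 52000},
]
applicants = {
    "A1": {"name": "Ana Ortiz", "email": "ana@example.com"},
    "A2": {"name": "Ben Lee", "email": "ben@example.com"},
}
# (lowest score, highest score, rate in percent); the values are made up.
rate_table = [(300, 669, 7.1), (670, 739, 6.4), (740, 850, 5.9)]


def lookup_rate(score):
    """Return the rate for the range that contains score, or None if none matches."""
    for low, high, rate in rate_table:
        if low <= score <= high:
            return rate
    return None


def integrate(apps, people):
    """Join each application with its applicant record and add the looked-up rate."""
    rows = []
    for app in apps:
        person = people.get(app["id"])
        if person is None:
            continue  # skip applications that have no matching applicant record
        rows.append({**app, **person, "interest_rate": lookup_rate(app["credit_score"])})
    return rows


def to_csv(rows):
    """Serialize the integrated rows as CSV text with a header line."""
    if not rows:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = integrate(applications, applicants)
csv_text = to_csv(rows)
```

In a real flow, the three inputs would come from connections to the external sources, and the output would be written to a file rather than kept in memory.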

Data engineers use Data Replication to replicate credit score information from a database owned by an external provider into Golden Bank's Kafka cluster.

Data engineers use StreamSets to ingest the credit score information as soon as it is replicated into Golden Bank's Kafka cluster. A StreamSets flow uses light in-flight transformations to enrich the data: it masks sensitive information and looks up additional information about the applicant, such as whether the applicant is an existing customer or has other active loans. The data engineers write the enriched data to a target output .csv file that can be shared for use by lenders and analysts.
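
The idea of a light in-flight transformation can be sketched in plain Python with a generator that handles one record at a time. This is not StreamSets code. The record layout, the choice of a social security number as the sensitive field, and the lookup data are all assumptions made for illustration, and a list stands in for the records that would arrive from Kafka.

```python
def mask_ssn(ssn):
    """Hide everything except the last four characters of the identifier."""
    return "***-**-" + ssn[-4:]


# Hypothetical lookup data about existing customers, keyed by applicant ID.
customers = {"A1": {"existing_customer": True, "active_loans": 1}}

# Default values for applicants that are not found in the lookup data.
UNKNOWN_APPLICANT = {"existing_customer": False, "active_loans": 0}


def enrich(records, lookup):
    """Mask and enrich each record as it arrives, without buffering the stream."""
    for record in records:
        extra = lookup.get(record["applicant_id"], UNKNOWN_APPLICANT)
        yield {**record, "ssn": mask_ssn(record["ssn"]), **extra}


# A list stands in for the continuous feed of replicated records.
incoming = [
    {"applicant_id": "A1", "ssn": "123-45-6789", "credit_score": 720},
    {"applicant_id": "A9", "ssn": "987-65-4321", "credit_score": 655},
]
enriched = list(enrich(iter(incoming), customers))
```

Because `enrich` is a generator, each record is transformed as soon as it is read, which mirrors the requirement to process data as soon as it becomes available.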


2. Observe the data

Data engineers use Data Observability to track the data health of the end-to-end data integration process. The Data Observability system scans projects in real time and reports on the metadata that it collects from the job runs that data engineers choose to observe. With the collected metadata, data engineers can use the alerting system to notify the data team about the health of jobs and the quality of data inputs and outputs.

What you can use: Data Observability
What you can do: Create an alert to observe the data throughout the end-to-end integration process.
Best to use when: You want your data to be continuously observed for issues, and you need to be notified about these issues as soon as they occur.

Example: Golden Bank's data observability

Data engineers at Golden Bank create alerts to observe the DataStage flow. They create a Job run state alert to get notified when the state of the job run changes to Failed. If it does, data engineers can investigate whether the cause is one of the data sources, the amount of allocated resources, or some other problem.
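
The logic behind a state-change alert can be sketched in a few lines of Python. This is not the Data Observability API. The function name, the `notify` callback, and every state name other than Failed are hypothetical. The sketch only shows the rule of sending one notification when a job run moves into the Failed state.

```python
def check_run_state(previous, current, notify):
    """Call notify once when a job run moves into the Failed state."""
    if current == "Failed" and previous != "Failed":
        notify(f"Job run state changed from {previous} to {current}")
        return True
    return False


# Collect notifications in a list instead of sending real messages.
sent = []

# Hypothetical sequence of observed states for one job run.
states = ["Queued", "Running", "Failed", "Failed"]

previous = None
for state in states:
    check_run_state(previous, state, sent.append)
    previous = state
```

Comparing the current state with the previous one means that a run that stays in the Failed state does not trigger repeated notifications.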