Data integration use case

To cope with growing data volumes and an influx of disparate data sources, enterprises need to build automation and intelligence into their data integration processes. Cloud Pak for Data provides the platform and tools to dynamically and intelligently orchestrate data across a distributed landscape, creating a high-performance network of instantly available information for data consumers.

Challenges

As their data types and volumes grow, enterprises face the following data integration challenges:

Ingesting data from across the enterprise
Processes need to be able to ingest data from any application or system regardless of whether the data resides on premises, in the cloud, or in a hybrid environment.
Integrating data from multiple sources
Data engineers must be able to combine data from multiple data sources into a single data set as a file or a virtual table.
Making the data available for users
Data engineers need to be able to publish each integrated data set to a single catalog, and all users who need to consume the data need to have self-service access to it.

You can solve these challenges and integrate your data by using Cloud Pak for Data.

Example: Golden Bank's challenges

Follow the story of Golden Bank as the data engineering team implements Data integration. Golden Bank has a large amount of customer and mortgage data that is stored in three external data sources. Lenders use this information to help them decide whether they should approve or deny mortgage applications. The bank wants to integrate the data from the different sources, and then deliver that transformed data to a single output file that can be shared.

Process

To implement a Data integration solution for your enterprise, your organization can follow this process:

  1. Integrate the data
  2. Share the data
  3. Automate the data lifecycle

The DataStage, Watson Query, Watson Pipelines, Data Replication, IBM Knowledge Catalog, and Databand services in Cloud Pak for Data provide all of the tools and processes that your organization needs to implement a Data integration solution.

Image showing the flow of the Data integration use case

1. Integrate the data

With a data fabric architecture that uses Cloud Pak for Data, data engineers can optimize data integration by using workloads and data policies to access and work with data efficiently. They can combine virtualized data from different sources, types, and clouds as if the data were from a single data source. In this step of the process, the raw data is extracted, ingested, virtualized, and transformed into consumable, high-quality data that is ready to be explored and then orchestrated in your AI lifecycle.

What you can use, what you can do, and when each is best to use:

Watson Query
What you can do: Query many data sources as one. Data engineers can create virtual data tables that can combine, join, or filter data from various relational data sources. Data engineers can then make the resulting combined data available as data assets in catalogs. For example, you can use the combined data to feed dashboards, notebooks, and flows so that the data can be explored.
Best to use when: You need to combine data from multiple sources to generate views and make combined data available as data assets in a catalog.

DataStage
What you can do: Data engineers can design and run complex data flows that move and transform data.
Best to use when: You need to design and run complex data flows. The flows must handle large volumes of data and connect to a wide range of data sources, integrate and transform data, and deliver it to your target system in batch or real time.

Databand
What you can do: Track the execution and proactively identify problems with the health of your DataStage jobs.
Best to use when: You want your data team to be notified on the condition of your jobs and the quality of your inputs and outputs.

Data Refinery
What you can do: Access and refine data from diverse data source connections. Materialize the resulting data sets as snapshots in time that might combine, join, filter, or mask data to make it usable for data scientists to analyze and explore. Make the resulting data sets available in catalogs.
Best to use when: You need to visualize the data when you want to shape or cleanse it. You want to simplify the process of preparing large amounts of raw data for analysis.

Data Replication
What you can do: Distribute a data integration workload across multiple sites. Provide continuous availability of data.
Best to use when: Your data is distributed across multiple sites. You need your data to be continuously available.
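
The Watson Query entry above describes querying many data sources as one through virtual tables. As a rough analogy only, the following sketch uses Python's built-in sqlite3 module, not Watson Query itself. Two separate in-memory databases stand in for two data sources, and a temporary view joins them so that a consumer queries a single object. The table names, column names, and sample rows are all hypothetical.

```python
import sqlite3


def build_virtual_view() -> sqlite3.Connection:
    """Create two stand-in data sources and one view that spans both."""
    conn = sqlite3.connect(":memory:")
    # A second, separate in-memory database plays the role of another source.
    conn.execute("ATTACH DATABASE ':memory:' AS credit")

    conn.execute("CREATE TABLE main.applicants (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("CREATE TABLE credit.scores (applicant_id INTEGER, score INTEGER)")
    conn.executemany(
        "INSERT INTO main.applicants VALUES (?, ?)",
        [(1, "A. Rivera"), (2, "B. Chen")],  # hypothetical sample rows
    )
    conn.executemany(
        "INSERT INTO credit.scores VALUES (?, ?)",
        [(1, 720), (2, 655)],  # hypothetical sample rows
    )

    # SQLite only lets TEMP views reference tables in other attached databases.
    conn.execute(
        """
        CREATE TEMP VIEW applicant_scores AS
        SELECT a.id, a.name, s.score
        FROM main.applicants AS a
        JOIN credit.scores AS s ON s.applicant_id = a.id
        """
    )
    return conn


def high_score_applicants(conn: sqlite3.Connection, minimum: int) -> list:
    """Query the combined view without knowing where each column lives."""
    cursor = conn.execute(
        "SELECT name, score FROM applicant_scores WHERE score >= ? ORDER BY id",
        (minimum,),
    )
    return cursor.fetchall()
```

The view stores no data of its own. Each query reads the underlying tables at run time, which is the property that the virtual table description relies on.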

Example: Golden Bank's data integration

Risk analysts at Golden Bank calculate the daily interest rate that they recommend offering to borrowers for each credit score range. Data engineers use DataStage to aggregate anonymized mortgage application data with the personally identifiable information from mortgage applicants. DataStage integrates this information, including credit score information for each applicant, the applicant’s total debt, and an interest-rate lookup table. The data engineers then load the data into a target output .csv file that can be published to a catalog and shared for use by lenders and analysts.
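
To make the shape of this integration concrete, the following is a minimal sketch in plain Python, not a DataStage flow. It joins application records to applicant records by ID, looks up an interest rate by credit score range, and writes the result as CSV text. The field names, the rate table values, and the function names are assumptions made for illustration. They do not come from Golden Bank's data or from any IBM API.

```python
import csv
import io

# Hypothetical lookup table: (minimum credit score, interest rate in percent),
# ordered from the highest score range to the lowest.
RATE_TABLE = [(740, 5.9), (670, 6.4), (0, 7.2)]

OUTPUT_FIELDS = ["id", "name", "loan_amount", "credit_score", "total_debt", "interest_rate"]


def lookup_rate(score: int) -> float:
    """Return the rate for the first range whose minimum the score meets."""
    for minimum, rate in RATE_TABLE:
        if score >= minimum:
            return rate
    raise ValueError(f"no rate range covers score {score}")


def integrate(applications: list, applicants: list) -> list:
    """Join applications to applicants on 'id' and attach a looked-up rate."""
    applicants_by_id = {person["id"]: person for person in applicants}
    merged = []
    for application in applications:
        person = applicants_by_id.get(application["id"])
        if person is None:
            continue  # skip applications that have no matching applicant
        score = int(person["credit_score"])
        merged.append(
            {
                "id": application["id"],
                "name": person["name"],
                "loan_amount": application["loan_amount"],
                "credit_score": score,
                "total_debt": person["total_debt"],
                "interest_rate": lookup_rate(score),
            }
        )
    return merged


def to_csv(rows: list) -> str:
    """Serialize the integrated rows as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=OUTPUT_FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

In this sketch, unmatched applications are dropped, which corresponds to an inner join. A real flow might instead route them to a reject output for review.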


2. Share the data

The catalog helps your teams understand your customer data and makes the right data available for the right use. Data scientists and other types of users can help themselves to the integrated data that they need while they remain compliant with corporate access and data protection policies. They can add data assets from a catalog into a project, where they collaborate to prepare, analyze, and model the data.

What you can use, what you can do, and when each is best to use:

Catalogs
What you can do: Use catalogs in IBM Knowledge Catalog to organize your assets to share among the collaborators in your organization. Take advantage of AI-powered semantic search and recommendations to help users find what they need.
Best to use when: Your users need to easily understand, collaborate on, enrich, and access high-quality data. You want to increase visibility of data and collaboration between business users. You need users to view, access, manipulate, and analyze data without understanding its physical format or location, and without having to move or copy it. You want users to enhance assets by rating and reviewing them.

Example: Golden Bank's catalog

The governance team leader at Golden Bank creates a catalog, "Mortgage Approval Catalog," and adds the data stewards and data scientists as catalog collaborators. The data stewards publish the data assets that they created into the catalog. The data scientists find the data assets, curated by the data stewards, in the catalog and copy those assets to a project. In their project, the data scientists can refine the data to prepare it for training a model.


3. Automate the data lifecycle

Your team can automate and simplify the data lifecycle with Watson Pipelines.

What you can use, what you can do, and when each is best to use:

Watson Pipelines
What you can do: Use pipelines to create repeatable and scheduled flows that automate your data ingestion and integration.
Best to use when: You want to automate some or all of the steps in a data integration flow.

Example: Golden Bank's automated data lifecycle

The data scientists at Golden Bank can use pipelines to automate their data integration lifecycle to keep the data current.
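
The idea of a repeatable flow can be sketched as an ordered list of steps that always run in the same sequence. The example below is plain Python and does not use the Watson Pipelines API. The step names, the context dictionary, and the sample values are hypothetical, and scheduling is left out because the pipeline service would handle it.

```python
from typing import Callable, Dict, List

# A step takes the current context and returns an updated context.
Step = Callable[[Dict], Dict]


def run_pipeline(steps: List[Step], context: Dict) -> Dict:
    """Run each step in order and record which steps completed."""
    for step in steps:
        context = step(context)
        context.setdefault("completed", []).append(step.__name__)
    return context


def ingest(context: Dict) -> Dict:
    # Hypothetical raw input: credit scores as untidy strings.
    return {**context, "rows": [" 720", "655 ", ""]}


def transform(context: Dict) -> Dict:
    # Drop empty values and convert the rest to integers.
    cleaned = [int(value) for value in context["rows"] if value.strip()]
    return {**context, "rows": cleaned}


def publish(context: Dict) -> Dict:
    # Stand-in for delivering the result; only records how many rows went out.
    return {**context, "published": len(context["rows"])}
```

Calling `run_pipeline([ingest, transform, publish], {})` runs the same three steps in the same order every time, which is what makes the flow repeatable when it is triggered on a schedule.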

