Data integration use case
To cope with growing data volumes and disparate data sources, enterprises need to build automation and intelligence into their data integration processes. Cloud Pak for Data provides the platform and tools to dynamically and intelligently orchestrate data across a distributed landscape, creating a high-performance network of instantly available information for data consumers.
Challenges
As their data types and volumes grow, enterprises face the following data integration challenges:
- Ingesting data from across the enterprise
  - Processes need to be able to ingest data from any application or system regardless of whether the data resides on premises, in the cloud, or in a hybrid environment.
- Integrating data from multiple sources
  - Data engineers must be able to combine data from multiple data sources into a single data set as a file or a virtual table.
- Making the data available for users
  - Data engineers need to be able to publish each integrated data set to a single catalog, and all users who need to consume the data need to have self-service access to it.
You can solve these challenges and integrate your data by using Cloud Pak for Data.
Example: Golden Bank's challenges
Follow the story of Golden Bank as the data engineering team implements Data integration. Golden Bank has a large amount of customer and mortgage data that is stored in three external data sources. Lenders use this information to help them decide whether they should approve or deny mortgage applications. The bank wants to integrate the data from the different sources, and then deliver that transformed data to a single output file that can be shared.
Process
The DataStage, Watson Query, Watson Pipelines, Data Replication, IBM Knowledge Catalog, and Databand services in Cloud Pak for Data provide the tools and processes that your organization needs to implement a Data integration solution.
To implement a Data integration solution for your enterprise, your organization can follow this process:
1. Integrate the data
With a data fabric architecture that uses Cloud Pak for Data, data engineers can optimize data integration by using workloads and data policies to access and work with data efficiently. They can combine virtualized data from different sources, types, and clouds as if the data were from a single data source. In this step of the process, the raw data is extracted, ingested, virtualized, and transformed into consumable, high-quality data that is ready to be explored and then orchestrated in your AI lifecycle.
What you can use | What you can do | Best to use when |
---|---|---|
Watson Query | Query many data sources as one. Data engineers can create virtual data tables that can combine, join, or filter data from various relational data sources. Data engineers can then make the resulting combined data available as data assets in catalogs. For example, you can use the combined data to feed dashboards, notebooks, and flows so that the data can be explored. | You need to combine data from multiple sources to generate views and make combined data available as data assets in a catalog. |
DataStage | Data engineers can design and run complex data flows that move and transform data. | You need to design and run complex data flows. The flows must handle large volumes of data and connect to a wide range of data sources, integrate and transform data, and deliver it to your target system in batch or real time. |
Databand | Track the execution of your DataStage jobs and proactively identify problems with their health. | You want your data team to be notified about the condition of your jobs and the quality of your inputs and outputs. |
Data Refinery | Access and refine data from diverse data source connections. Materialize the resulting data sets as snapshots in time that might combine, join, filter, or mask data to make it usable for data scientists to analyze and explore. Make the resulting data sets available in catalogs. | You need to visualize the data when you want to shape or cleanse it. You want to simplify the process of preparing large amounts of raw data for analysis. |
Data replication | Distribute a data integration workload across multiple sites. Provide continuous availability of data. | Your data is distributed across multiple sites. You need your data to be continuously available. |
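
The idea of querying many data sources as one through a virtual table can be illustrated with a minimal sketch. It uses Python's built-in sqlite3 module, not Watson Query itself. Two attached in-memory databases stand in for independent data sources, and a temporary view plays the role of the virtual table. The names crm, loans, applicants, applications, and large_applications, and all sample rows, are hypothetical.

```python
import sqlite3

# One connection with two attached in-memory databases.
# Each attached database stands in for a separate data source (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS crm")
conn.execute("ATTACH DATABASE ':memory:' AS loans")

# Hypothetical source tables and sample rows.
conn.execute("CREATE TABLE crm.applicants (applicant_id INTEGER, name TEXT)")
conn.execute("CREATE TABLE loans.applications (applicant_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO crm.applicants VALUES (?, ?)",
                 [(1, "Ana"), (2, "Ben")])
conn.executemany("INSERT INTO loans.applications VALUES (?, ?)",
                 [(1, 250000.0), (2, 90000.0)])

# A TEMP view can reference tables in several attached databases.
# It stores only the query, not the data, so it behaves like a virtual table
# that joins and filters the two sources.
conn.execute("""
    CREATE TEMP VIEW large_applications AS
    SELECT a.applicant_id, a.name, l.amount
    FROM crm.applicants AS a
    JOIN loans.applications AS l ON l.applicant_id = a.applicant_id
    WHERE l.amount >= 100000
""")

QUERY = "SELECT * FROM large_applications ORDER BY applicant_id"
rows = conn.execute(QUERY).fetchall()
print(rows)

# New source rows show up in the view without any reload step.
conn.execute("INSERT INTO crm.applicants VALUES (3, 'Cy')")
conn.execute("INSERT INTO loans.applications VALUES (3, 300000.0)")
rows_after = conn.execute(QUERY).fetchall()
print(rows_after)
```

Because the view holds no copy of the data, consumers always see the current state of both sources, which is the main contrast with a materialized snapshot such as the ones described for Data Refinery.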
Example: Golden Bank's data integration
Risk analysts at Golden Bank calculate the daily interest rate that they recommend offering to borrowers for each credit score range. Data engineers use DataStage to aggregate anonymized mortgage application data with the personally identifiable information from mortgage applicants. DataStage integrates this information, including credit score information for each applicant, the applicant’s total debt, and an interest-rate lookup table. The data engineers then load the data into a target output .csv file that can be published to a catalog and shared for use by lenders and analysts.
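The shape of this integration can be sketched in plain Python. This is not how DataStage is configured; it only illustrates the logic of joining application records with applicant details, looking up a rate by credit score range, and writing a .csv output. All field names, sample records, score ranges, and rates below are hypothetical.

```python
import csv
import io
from bisect import bisect_right

# Hypothetical anonymized mortgage applications, keyed by applicant id.
applications = [
    {"id": "A1", "loan_amount": 250000},
    {"id": "A2", "loan_amount": 120000},
]

# Hypothetical applicant details from a second source.
applicants = {
    "A1": {"name": "Ana", "credit_score": 742, "total_debt": 18000},
    "A2": {"name": "Ben", "credit_score": 615, "total_debt": 42000},
}

# Hypothetical interest-rate lookup table: each rate applies from the
# matching lower bound up to (but not including) the next lower bound.
RATE_LOWER_BOUNDS = [300, 580, 670, 740]
RATES = [7.5, 6.9, 6.2, 5.6]

FIELDS = ["id", "name", "credit_score", "total_debt", "loan_amount", "interest_rate"]


def rate_for(score):
    """Return the rate for the credit score range that contains score."""
    index = bisect_right(RATE_LOWER_BOUNDS, score) - 1
    if index < 0:
        raise ValueError(f"credit score {score} is below the lowest range")
    return RATES[index]


def integrate(apps, people):
    """Join applications with applicant details and attach a looked-up rate."""
    result = []
    for app in apps:
        person = people[app["id"]]
        row = {**app, **person}
        row["interest_rate"] = rate_for(person["credit_score"])
        result.append(row)
    return result


def to_csv(rows):
    """Serialize the integrated rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


integrated = integrate(applications, applicants)
print(to_csv(integrated))
```

Writing to an in-memory buffer keeps the sketch self-contained; replacing the buffer with an open file would produce the kind of output file that could then be published to a catalog.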
2. Automate the data lifecycle
Your team can automate and simplify the data lifecycle with Watson Pipelines.
What you can use | What you can do | Best to use when |
---|---|---|
Watson Pipelines | Use pipelines to create repeatable and scheduled flows that automate your data ingestion and integration. | You want to automate some or all of the steps in a data integration flow. |
Example: Golden Bank's automated data lifecycle
The data scientists at Golden Bank can use pipelines to automate their data integration lifecycle to keep the data current.
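As a loose illustration of what a repeatable flow means, the following sketch runs named steps in a fixed order and stops at the first failure. It is plain Python and does not use the Watson Pipelines API; the step names, the run_pipeline helper, and the sample records are all hypothetical.

```python
def ingest(_records):
    # Hypothetical ingestion step: returns fresh sample rows.
    return [
        {"id": "A1", "credit_score": 742},
        {"id": "A2", "credit_score": None},
    ]


def drop_incomplete(records):
    # Remove rows that are missing a credit score.
    return [r for r in records if r["credit_score"] is not None]


def flag_prime(records):
    # Add a derived column; the threshold of 670 is an illustrative assumption.
    return [{**r, "prime": r["credit_score"] >= 670} for r in records]


STEPS = [
    ("ingest", ingest),
    ("drop_incomplete", drop_incomplete),
    ("flag_prime", flag_prime),
]


def run_pipeline(steps, records):
    """Run each step on the output of the previous one, failing fast."""
    completed = []
    for name, step in steps:
        try:
            records = step(records)
        except Exception as exc:
            raise RuntimeError(
                f"step {name!r} failed; completed steps: {completed}"
            ) from exc
        completed.append(name)
    return records, completed


result, done = run_pipeline(STEPS, [])
print(done)
print(result)
```

Because every run starts from the same ordered step list, the flow can be rerun on a schedule to keep the integrated data current.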
Learn more
- Use case tutorials
- DataStage overview
- DataStage observability with Databand
- Watson Query overview
- IBM Knowledge Catalog overview
- Watson Pipelines
- Videos