What is data integration?

Data integration is a group of technical and business processes — such as ETL, data replication and data virtualization — that combine data from disparate sources into a meaningful and valuable data set for business intelligence and business analytics. A complete data integration solution delivers data from multiple on-premises and cloud sources to support a business-ready trusted data pipeline for DataOps.

Data integration solutions from IBM — including data integration on the IBM Cloud Pak® for Data platform — offer scalable, multicloud solutions to accelerate your journey to AI. Extract large volumes of data from source systems, transform it in any style and load it to an enterprise data warehouse or cloud sources.

IBM data integration products can also be used stand-alone or as managed services on IBM Cloud®.

See why IBM was recognized as a Leader in the 2021 Gartner Magic Quadrant for Data Integration Tools.

IBM ranks second highest in the Data Fabric Use Case

See why in the 2021 Gartner Critical Capabilities for Data Integration Tools.

Data integration use cases

Customer data integration

Connect data from distributed databases and systems to boost customer relationship management (CRM) and deliver what customers want or need.

Healthcare data integration

Combine clinical, genomic, radiology and image data for rapid insights and make it available for patient treatment, cohort treatment and population health analytics.

Big data integration

Use sophisticated data warehouses that deliver a unified view of big data from numerous sources to simplify business intelligence processes.

Why IBM for data integration solutions

Open source platform

Gain enterprise scale and security with a data integration platform running on Red Hat® OpenShift®.

AI-powered automation

Accelerate delivery and reduce TCO with AI-powered automation of tasks.

Multicloud deployment

Leverage container technology to run data integration across hybrid and multicloud environments.

IBM DataStage

A leader in ETL, IBM® DataStage® is a highly scalable data integration tool for designing, developing and running jobs that move and transform data on premises and in the cloud. 

With a modern container-based architecture on Red Hat OpenShift, IBM DataStage for IBM Cloud Pak for Data combines this industry-leading data integration with DataOps, governance and analytics on a single data and AI platform. Deliver trusted data at scale across hybrid or multicloud environments.

Dive deeper

Data integration techniques

Data integration is critical to helping companies consolidate data into a single, trusted view for analysis and, ultimately, to drive business. For example, a unified view of customer data can fuel more successful marketing strategies. Different techniques are used in the data integration process, including:

  • Extract, transform, load (ETL): Extract data from multiple sources, transform it in a staging area, and then load it into a data warehouse or other target system. Transforming (cleansing and preparing) the raw data in a staging area instead of the source system improves performance and reduces the chance of data corruption.
  • Extract, load, transform (ELT): Extract and load raw data from source locations into the target data store, where it can then be transformed when needed. Often, the target system for ELT is a data lake, which can house massive amounts of structured and unstructured data, or a cloud data warehouse. This method is ideal for supporting artificial intelligence (AI), machine learning, predictive analytics and applications that use real-time data.
  • Data replication: Copy data from source systems to target systems to keep them synchronized or to distribute data in near real time, using low-impact, log-based data capture.
  • Data virtualization: Abstract data access from multiple sources by creating a virtual view for business users who need to access and query data on demand.
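
To make the difference between the first two techniques concrete, here is a minimal sketch in Python. It is illustrative only and not how IBM DataStage or any other product implements ETL or ELT. It assumes an in-memory SQLite database stands in for the target system, and the names CRM_ROWS, clean, run_etl, run_elt and load_customers are hypothetical.

```python
import sqlite3

# Hypothetical raw records from a source system (illustrative data only).
CRM_ROWS = [("  Alice ", "alice@example.com"), ("Bob", " BOB@EXAMPLE.COM")]


def clean(name, email):
    """Transform step: trim whitespace and normalize the email to lowercase."""
    return name.strip(), email.strip().lower()


def run_etl(target):
    """ETL: transform in application memory (the staging area), then load."""
    target.execute("CREATE TABLE customers (name TEXT, email TEXT)")
    staged = [clean(name, email) for name, email in CRM_ROWS]
    target.executemany("INSERT INTO customers VALUES (?, ?)", staged)


def run_elt(target):
    """ELT: load the raw rows first, then transform inside the target with SQL."""
    target.execute("CREATE TABLE raw_customers (name TEXT, email TEXT)")
    target.executemany("INSERT INTO raw_customers VALUES (?, ?)", CRM_ROWS)
    target.execute(
        "CREATE TABLE customers AS "
        "SELECT TRIM(name) AS name, LOWER(TRIM(email)) AS email "
        "FROM raw_customers"
    )


def load_customers(pipeline):
    """Run one pipeline against a fresh in-memory target and return its rows."""
    target = sqlite3.connect(":memory:")
    pipeline(target)
    rows = target.execute(
        "SELECT name, email FROM customers ORDER BY name"
    ).fetchall()
    target.close()
    return rows
```

Calling load_customers(run_etl) and load_customers(run_elt) yields the same cleaned rows. The difference is where the transformation runs: before the load in ETL, and inside the target store in ELT, which also keeps the raw data available for later use.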

Data integration challenges

Many organizations are facing an avalanche of data originating from different systems, such as relational databases or streaming data services. The business intelligence required for better decision-making is hidden within all of that data, but solid data integration processes must be followed to make sure that the data is managed, governed and ultimately trusted. Your integration efforts may be hindered by:

Data latency in multicloud environments
Moving data volumes across multicloud and data lake environments can be slow and prevent you from using that data in real time within your applications or operational systems.

The complexity and cost of multiple tools
Managing multiple data integration tools is time consuming for your resources and can be expensive for your business.

Manual processes and workflows
Manual tasks such as hand coding and job design can delay application building and updates. Manual processes also must be designed for each cloud environment, so if you are working with multiple clouds, this increases development time and costs.

Lack of data quality and governance
Data originating from so many different sources can be difficult to govern and can put your business at risk. Trusted and clean data is also required for effective AI models.
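
As a rough illustration of what an in-line data quality check can look like, the sketch below separates clean records from rejected ones before they reach a target system. The rules (a required customer_id, a plausible email, no duplicate IDs), the field names and the function names are assumptions chosen for the example, not rules taken from any IBM product.

```python
import re

# Deliberately simple pattern: something@something.something, no spaces.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def check_record(record):
    """Return the list of rule violations for one record (empty means clean)."""
    problems = []
    if not record.get("customer_id"):
        problems.append("missing customer_id")
    if not EMAIL_PATTERN.match(record.get("email") or ""):
        problems.append("invalid email")
    return problems


def split_by_quality(records):
    """Split records into clean rows and (row, problems) rejections."""
    clean, rejected = [], []
    seen_ids = set()
    for record in records:
        problems = check_record(record)
        customer_id = record.get("customer_id")
        # Flag an ID that an earlier clean record already used.
        if customer_id and customer_id in seen_ids:
            problems.append("duplicate customer_id")
        if problems:
            rejected.append((record, problems))
        else:
            seen_ids.add(customer_id)
            clean.append(record)
    return clean, rejected
```

Keeping the rejected rows together with their reasons, rather than silently dropping them, gives data stewards something to review and supports the governance goal described above.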

Cloud data integration

Data repositories include on-premises, cloud and data lake environments. Often, organizations are also using clouds from different vendors to meet specific needs for storage or application deployment. The practice of integrating data across all of these environments for a unified view is cloud data integration.

The complexity of cloud data integration requires a modernized approach. A robust multicloud data integration solution should:

  • Simplify and accelerate synchronization of diverse data sources across hybrid multicloud environments
  • Locate run times closer to data sources
  • Use embedded analytics and AI services on different cloud platforms
  • Automate job design and feature pre-built connectors for faster access to data sources
  • Include in-line data quality to manage governance and compliance

IBM DataStage for IBM Cloud Pak for Data can deliver this modernized approach.

Data integration versus application integration

Data integration and application integration may seem similar, but in fact, the concepts are very different. As described, data integration is the practice of locating and retrieving information from disparate data sources and delivering it in a unified structure and view. Application integration directly links multiple independent applications so that they can work with each other, often through modern APIs or traditional service-oriented architectures. Data and workflows are merged and optimized, helping bridge the gap between on-premises systems and cloud-based applications.

Data integration versus data migration

Data migration is simply the process of transferring data between storage types. This can include moving data from on-premises environments to the cloud. Data integration, however, is more complex, because the data goes through an ETL or ELT process to be made ready for analysis.
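
The contrast can be sketched in a few lines of Python. This is a simplified illustration under assumed inputs: the record layouts (name, email, amount) and the functions migrate and integrate are hypothetical, and real migrations and integrations involve far more than this.

```python
def migrate(rows):
    """Migration: copy records to the new store unchanged."""
    return [dict(row) for row in rows]


def integrate(crm_rows, billing_rows):
    """Integration: normalize the shared key and merge two sources into one view."""
    spend_by_email = {}
    for row in billing_rows:
        key = row["email"].strip().lower()
        spend_by_email[key] = spend_by_email.get(key, 0.0) + row["amount"]

    unified = []
    for row in crm_rows:
        key = row["email"].strip().lower()
        unified.append(
            {
                "name": row["name"].strip(),
                "email": key,
                "total_spend": spend_by_email.get(key, 0.0),
            }
        )
    return unified
```

Here migrate returns the same records it was given, while integrate has to reconcile inconsistent keys across sources before it can produce a single analysis-ready record per customer.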

Related products

IBM Cloud Pak for Data

Integrate all of your data, whether on premises or on any cloud, to keep it more secure at its source with this flexible multicloud data platform.

IBM InfoSphere Master Data Management

Master data management for single or multiple domains, including customers, suppliers, products, accounts and more.

IBM InfoSphere Data Replication

Help replicate data across a wide range of RDBMS and non-RDBMS sources and targets with low latency while improving transactional integrity.

Next steps