As enterprises shift from traditional analytics to AI-driven, agentic workflows, data integration becomes far more complex. Data must be prepared not for one platform, but for many intelligent consumers, including AI agents, copilots, decision services, automation bots and partner APIs. And it must operate across clouds, regions and organizational boundaries.
Centralizing everything in a single cloud is increasingly impractical due to data gravity, regulations, latency and cost, especially when agents need real-time, policy-compliant context close to where data is created or governed.
The new imperative is hybrid data integration: orchestrating trusted, governed processing and transformation where data lives so every agent gets the right data without excessive movement or duplication.
Against this backdrop, enterprises building on AWS, Microsoft Azure and Google Cloud are increasingly pairing cloud-native integration services with IBM watsonx.data® integration. This combination enables organizations to prepare data for multi-agent systems at scale: cloud-native where it makes sense, hybrid where it matters most.
AWS, Microsoft Azure, Google Cloud and other hyperscalers provide strong, cloud-native data integration services (such as AWS Glue) that work exceptionally well inside their own ecosystems. For cloud-first use cases, this approach reduces friction, especially when data originates and stays within AWS.
However, this strength is also a constraint. Cloud-native integration is, by design, not independent. As enterprises expand into hybrid and multi-platform environments, running analytics platforms, lakehouses, legacy systems and streaming workloads side by side, the cost and complexity of keeping everything inside one ecosystem begins to surface.
Portability becomes limited and governance must be stitched together across services. Scaling complex pipelines often requires deeper technical expertise and more support.
This is not a shortcoming of hyperscalers; it reflects their optimization for cloud-native patterns.
For enterprises in highly regulated industries, cloud-native integration alone quickly reaches its limits. As data sovereignty rules tighten and AI-driven decision systems demand fresh, trusted data in real time, centralizing everything in one cloud increases regulatory risk. And duplicating pipelines across regions drives up cost and complexity.
What begins as a cloud-first strategy often becomes brittle at scale. Chief information officers (CIOs) increasingly recognize that the issue isn’t cloud adoption, but cloud dependency. Sustainable AI requires an integration approach that processes and governs data where it originates, delivering only compliant, purpose-built datasets to cloud analytics and AI platforms.
IBM watsonx.data integration is purpose-built for hybrid, multi-cloud environments operating at global scale, where data must serve analytics, AI and intelligent agents. Today’s enterprises cannot be locked into a single system.
Unlike cloud-native tools optimized for a single platform, watsonx.data integration provides an open, enterprise-grade integration layer that works across AWS, Azure, Google Cloud, on-premises systems and modern lakehouse architectures.
As organizations deploy intelligent agents to support advanced analytics and AI use cases, the need for fresh, enriched data from operational systems, regional databases, event streams and cloud services intensifies. Centralizing all raw data in one location is not practical or economical.
IBM watsonx.data integration enables enterprises to process and enrich data where it resides. Through flexible deployment models and remote parallel engines, organizations can apply joins, transformations and business rules upstream, before data moves across clouds or regions.
This approach reduces latency, minimizes unnecessary data movement and controls network and egress costs. The result is curated, agent-ready datasets delivered efficiently to analytics and AI platforms, improving cost predictability and supporting long-term optimization.
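The "process where data lives" pattern described above can be sketched in miniature. This is a hypothetical illustration, not a watsonx.data integration API: the record shapes, field names and business rule are assumptions chosen to show how joining, filtering and aggregating at the source lets only a small curated result cross the region boundary.

```python
# Sketch: apply joins and business rules upstream, then ship only aggregates.
# Record shapes and the "completed orders only" rule are illustrative assumptions.
from collections import defaultdict

def curate_locally(orders, customers):
    """Join, filter and aggregate in the source region before any transfer."""
    region_of = {c["customer_id"]: c["region"] for c in customers}
    totals = defaultdict(lambda: {"revenue": 0.0, "orders": 0})
    for o in orders:
        if o["status"] != "completed":        # business rule enforced at the source
            continue
        region = region_of[o["customer_id"]]  # join with local reference data
        totals[region]["revenue"] += o["amount"]
        totals[region]["orders"] += 1
    return dict(totals)                        # tiny payload versus raw rows

orders = [
    {"order_id": 1, "customer_id": 10, "status": "completed", "amount": 120.0},
    {"order_id": 2, "customer_id": 11, "status": "pending",   "amount": 80.0},
    {"order_id": 3, "customer_id": 10, "status": "completed", "amount": 40.0},
]
customers = [{"customer_id": 10, "region": "EU"}, {"customer_id": 11, "region": "US"}]
curated = curate_locally(orders, customers)
```

Only the per-region aggregate leaves the source region; the raw order rows never do, which is what keeps latency, egress cost and duplication down.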
Many AI-driven workflows operate in environments that handle sensitive or regulated data. IBM watsonx.data integration embeds governance by design, tightly integrating with watsonx.data intelligence to help ensure consistent policy enforcement. Built-in capabilities for data quality, lineage, observability and auditability allow enterprises to mask, tokenize and validate sensitive data before it is consumed.
Governance is no longer stitched together across services; it is enforced consistently across hybrid environments. By exposing only compliant, purpose-specific datasets, organizations reduce compliance risk while avoiding the cost and complexity of duplicating sensitive data across regions or platforms.
At the same time, enterprises increasingly rely on event-driven agents that respond to real-time operational signals (such as transactions, alerts or sensor data) alongside historical and reference datasets. Managing separate pipelines for batch and streaming workloads introduces inconsistency and operational overhead.
IBM watsonx.data integration unifies these patterns under a single integration framework. Supporting ETL, ELT, replication and real-time streaming, it helps ensure that the same transformation logic and business rules apply regardless of how or when data arrives. This process eliminates the need to assemble multiple tools for different workloads.
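The value of a unified framework is that one transformation rule serves both arrival patterns. The sketch below is a simplified assumption of that idea, not product code: a single `enrich` rule (with an invented "high value" threshold) is applied once in a batch pass and once event by event, producing identical results.

```python
# Sketch: one business rule shared by batch and streaming paths, so results
# stay consistent however data arrives. Rule and record shape are assumptions.

def enrich(record: dict) -> dict:
    """Single transformation rule used by both execution modes."""
    return {**record, "high_value": record["amount"] >= 100.0}

def run_batch(records):
    """ETL-style bulk pass over a finite dataset."""
    return [enrich(r) for r in records]

def run_stream(event_source):
    """Event-driven pass: transform each event as it arrives."""
    for event in event_source:
        yield enrich(event)

events = [{"id": 1, "amount": 150.0}, {"id": 2, "amount": 20.0}]
batch_out = run_batch(events)
stream_out = list(run_stream(iter(events)))
# Same logic, same result, regardless of arrival pattern.
```

Because both paths call the same function, there is no drift between the batch definition of a rule and its streaming counterpart, which is the inconsistency that separate pipelines tend to introduce.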
IBM watsonx.data integration accelerates delivery at scale by offering multiple pipeline authoring options, including no-code, low-code, SQL and AI-assisted development.
With agentic data integration, users can build pipelines with natural language, democratizing integration across technical and business teams. This approach reduces reliance on scarce specialists and empowers organizations to scale AI initiatives without skill bottlenecks.
Together, these capabilities enable enterprises to avoid vendor lock-in, maintain portability across execution engines and build agent-ready data architectures that balance cloud-native speed with enterprise-grade control. This method helps ensure intelligence scales responsibly, securely and predictably over time.
The path forward is a complementary, agent-ready strategy. For leaders, the implications are straightforward.
Predictability at scale: As AI agents and analytics pipelines grow more complex and run continuously, costs can escalate quietly. An enterprise-grade integration layer that supports hybrid deployment and efficient parallel processing helps keep performance stable and economics predictable over time.
Trust by design: When AI agents are making or influencing decisions, governance can’t be bolted on later. Consistent data quality, lineage and policy enforcement must be embedded directly into data flows. This approach allows agents to operate on trusted, compliant data across platforms, regions and clouds.
Speed through productivity: AI initiatives stall when integration depends on scarce technical skills. By combining low-code, no-code and AI-assisted integration, enterprises expand who can build and manage pipelines. This process accelerates data readiness for AI agents and turns experimentation into scale.
These leadership priorities are reflected in IBM internal benchmark testing (2025) comparing IBM DataStage Serverless PX Engine (part of watsonx.data integration) against AWS Glue 4.0 (Serverless).
DataStage delivered nearly 2 times faster ETL pipeline development (a 48% reduction in development time) when testing the two products’ minimum size configurations. In performance testing, AWS Glue required 16 times more compute to match the performance shown by DataStage, highlighting significant differences in compute efficiency.
IBM DataStage also demonstrated the potential to reduce total cost of ownership by more than USD 300,000 over 3 years compared with AWS Glue under the tested workload conditions (factoring in licensing, compute and effort cost savings from lower compute use, reusability and faster builds).1
While results might vary by environment and workload, the findings underscore why performance efficiency, governance and productivity become decisive advantages as enterprises scale AI and hybrid data architectures.
The most effective enterprises on AWS, Azure or any hyperscale cloud are no longer debating tools—they are designing for coexistence. Cloud-native integration services handle intra-cloud speed and simplicity. IBM watsonx.data integration provides the enterprise layer that transforms, governs and optimizes data across hybrid and distributed environments.
This is not redundancy; it is intent. Like the blades of a Swiss army knife, each capability is purpose-built, and together they form a system that adapts as demands change.
As AI agents become first-class data consumers that operate continuously across platforms at global scale, this complementary approach becomes essential. It delivers resilience over rigidity, trust over shortcuts and predictability over surprise costs. It allows enterprises to modernize without disruption, protect long-standing investments and scale intelligence responsibly.
At this stage, data integration is no longer an implementation detail—it is a strategic posture. And choosing architectures that balance cloud-native speed with enterprise-grade control is more than good integration practice—it’s good leadership.
1 Based on internal IBM benchmarking conducted in 2025 under controlled test conditions comparing DataStage Serverless PX Engine in IBM Cloud Toronto and AWS Glue 4.0 Serverless Engine in US-East-1. Both platforms processed identical datasets derived from the National Highway Traffic Safety Administration (NHTSA) open data repository, which included structured vehicle, crash and defect data representative of real-world enterprise integration workloads. Each test used equivalent data models, transformation logic and optimized configurations for fair comparison.
Test scenarios included: (1) pipeline development and validation of the ETL data pipeline; (2) compute resource utilization and scaling efficiency during ETL job execution; and (3) overall total cost of ownership (TCO) impact. DataStage demonstrated nearly 2 times faster (48%) ETL pipeline development time when testing the two products’ minimum size configurations. In performance testing, while average job runtime was approximately 149.67 seconds across both offerings, AWS Glue required 16 times more compute to match the performance shown by DataStage. The total cost of ownership (TCO) analysis highlights over USD 300,000 in potential cost savings over three years when using IBM DataStage (Serverless) compared to AWS Glue 4.0 (Serverless) under the tested conditions (factoring in licensing, compute and effort cost savings from lower compute use, reusability and faster builds). The modelled workload assumed 5,000 pipeline designs per year and 1,000 job executions per day, with equivalent data models and transformation logic.
Performance results are based on specific configurations and workloads; results may vary depending on system setup, data volume, workload complexity and usage environment. IBM does not guarantee equivalent results in all customer scenarios. AWS Glue is a trademark of Amazon Web Services, Inc. NHTSA data used solely for internal performance testing; no endorsement or affiliation is implied.