As its name suggests, agentic AI data engineering is the fusion of data engineering and agentic AI. The former is the practice of developing and maintaining data infrastructure and data pipelines integral to data management.
The latter refers to artificial intelligence systems that can accomplish specific goals with limited human oversight. In a multiagent system framework, subtasks performed by multiple AI agents—machine learning models that mimic human decision-making—are coordinated through AI orchestration.
In data engineering, AI agents can perform multi-step problem-solving processes central to ensuring high-quality data is available for enterprise use cases. These processes include designing data pipelines and executing critical data processing tasks, such as carrying out data transformations and detecting data problems.
Also known as agentic data engineering, agentic AI data engineering can significantly reduce the workloads of data engineering teams while also optimizing the performance of data pipelines. In addition, agentic AI data engineering can empower business users to access and derive insights from enterprise data even if they lack technical skills.
To understand why agentic AI systems are being adopted for data engineering, it’s helpful to take a closer look at the nature of modern data engineering.
Data engineering is critical for enterprises seeking to unlock value from increasingly vast and complex data ecosystems. Data engineers help structure and ensure the functionality of the workflows that convert raw data into outputs that provide real-world business value. When executed successfully, data engineering results in the delivery of clean, accurate and timely datasets that can be analyzed to yield actionable insights or used to fuel AI initiatives.
As organizations rely more heavily on data-driven decision-making, including time-sensitive decisions based on real-time data, the need for reliable data pipelines has never been greater. But maintaining such pipelines has also never been harder: data engineers are now tasked with overseeing ever more complex data stacks and orchestration processes.
Inevitably, that means data teams spend much of their time on “firefighting.” In other words, they concentrate on maintenance and troubleshooting to address data pipeline problems and, worse, data pipeline failures.
“When data engineering teams are building pipelines, the engineers often depend on a mix of scheduled jobs, stored procedures, complicated scripts, as well as transformation logic. And each of these works together just to keep the data flowing. Sometimes when a single schema change or column rename happens on a source system, this can trigger hours of debugging and retesting,” Justin Yan, a senior product manager for IBM Data & AI, explained in an IBM Technology video.
Fortunately, AI agents can now be deployed to handle much of this work—and to prevent issues from arising in the first place. Intelligent agents can “solve problems in data integration, helping to plan, monitor and adapt to data challenges so data arrives where it needs to be with the quality and timeliness that your workloads require,” Yan said.
A combination of technologies supports the deployment of agentic AI for data engineering.
An AI agent is a system that autonomously performs tasks by designing workflows, including data workflows, with the tools available to it. Agents use the natural language processing techniques of large language models to understand and respond to user inputs in a step-by-step fashion and to determine when to call on external tools.
Natural language processing (NLP) is a subfield of computer science and AI that uses machine learning to enable computers to understand and communicate with human language. NLP plays a growing role in enterprise solutions that help streamline and automate business operations.
Machine learning is the subset of AI focused on algorithms that can “learn” the patterns of training data. Such algorithms then use that pattern recognition to make accurate inferences about new data. Machine learning provides the backbone of most modern AI systems, including large language models and other generative AI tools.
Large language models (LLMs) are deep learning models capable of understanding and generating natural language and other types of content to perform a multitude of tasks. Their capabilities stem from natural language processing techniques and from training on massive amounts of data, which helps them handle unstructured human language at scale.
While the use of autonomous agents for data engineering can vary by data system and engineering team, here’s an overview of how AI-powered systems can handle different data engineering processes and tasks across a data lifecycle.
Agentic AI data engineering enables organizations to automate the creation of data pipelines. Users describe in natural language what a pipeline should deliver, without spelling out the steps needed to get there; it's up to the AI agent to determine how the pipeline will work. This is known as declarative pipeline authoring, and it's an alternative to the more hands-on approach of coding each pipeline step.
After a user submits a natural language request, LLMs parse the request and understand the user’s intent. Then, an AI agent designs and often implements an end-to-end process that includes:
Users with more technical knowledge may choose to specify the structure of their requested data pipeline. They can do so by using a Python software development kit (SDK) that enables LLMs to write and execute Python scripts based on user requests for various data-related tasks, such as selecting a data source or engaging in data cleaning.
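The contrast between declaring an outcome and coding each step can be sketched in a few lines of Python. This is only an illustration: `PipelineIntent`, `plan_steps` and the table names are hypothetical, and the hand-written `plan_steps` function stands in for the planning an LLM-backed agent would do.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PipelineIntent:
    """Hypothetical declarative spec: what the pipeline should deliver."""
    source: str
    target: str
    freshness: str = "daily"
    quality_rules: List[str] = field(default_factory=list)


def plan_steps(intent: PipelineIntent) -> List[str]:
    """Stand-in for the agent: expand an intent into ordered steps."""
    steps = [f"extract from {intent.source}"]
    steps += [f"validate: {rule}" for rule in intent.quality_rules]
    steps.append(f"load into {intent.target}")
    steps.append(f"schedule {intent.freshness}")
    return steps


# The user states the outcome; the steps are derived, not hand-coded.
intent = PipelineIntent(
    source="crm.orders",
    target="warehouse.orders_clean",
    quality_rules=["order_id is not null"],
)
plan = plan_steps(intent)
```

The point of the design is that the user only ever edits the intent object; how extraction, validation and loading are ordered is left to the planner.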
Once the pipeline is designed, an agentic AI system can execute workloads. AI agents engage in tool calling to interact with external tools, application programming interfaces (APIs) or systems necessary for connecting to data sources, understanding metadata and carrying out transformations.
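As a rough illustration of tool calling, the sketch below assumes the model emits a JSON object naming a tool and its arguments, and that the agent runtime looks the tool up in a registry. The tool names, the in-memory catalog and the JSON shape are all assumptions made for the example, not the format of any particular product.

```python
import json
from typing import Callable, Dict


def get_table_metadata(table: str) -> dict:
    """Hypothetical tool: return column metadata for a table."""
    catalog = {"orders": {"columns": ["order_id", "amount", "created_at"]}}
    return catalog.get(table, {"columns": []})


def uppercase_columns(columns: list) -> list:
    """Hypothetical tool: a trivial transformation on column names."""
    return [name.upper() for name in columns]


# Registry of tools the agent is allowed to call.
TOOLS: Dict[str, Callable] = {
    "get_table_metadata": get_table_metadata,
    "uppercase_columns": uppercase_columns,
}


def dispatch(tool_call_json: str):
    """Parse a model-issued tool call and run the matching function."""
    call = json.loads(tool_call_json)
    name = call["name"]
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name}")
    return TOOLS[name](**call.get("arguments", {}))


metadata = dispatch(
    '{"name": "get_table_metadata", "arguments": {"table": "orders"}}'
)
renamed = dispatch(json.dumps(
    {"name": "uppercase_columns", "arguments": {"columns": metadata["columns"]}}
))
```

Restricting calls to a fixed registry is one simple way to keep an agent from invoking anything it was not explicitly given.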
Agents also select the optimal execution path for data workflows across hybrid environments. This includes dynamically choosing the best integration approaches (real-time streaming, batch ETL/ELT or replication) and runtime environments (on-premises, in a cloud environment or through pushdown and remote engines) for each part of the job.
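To make the selection idea concrete, here is a deliberately simple, rule-based sketch. A real agent would weigh many more signals; the `JobProfile` fields and the 60-second threshold are assumptions chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class JobProfile:
    """Hypothetical description of one part of a data job."""
    max_latency_seconds: int
    needs_transformation: bool
    data_must_stay_on_premises: bool


def choose_integration(job: JobProfile) -> str:
    """Pick an integration approach from the job's requirements."""
    if job.max_latency_seconds < 60:  # assumed cutoff for "real time"
        return "real-time streaming"
    if job.needs_transformation:
        return "batch ETL/ELT"
    return "replication"


def choose_runtime(job: JobProfile) -> str:
    """Pick where the work runs."""
    return "on-premises" if job.data_must_stay_on_premises else "cloud"


fraud_feed = JobProfile(5, True, False)
nightly_report = JobProfile(86_400, True, True)
archive_copy = JobProfile(86_400, False, False)

choices = [
    (choose_integration(job), choose_runtime(job))
    for job in (fraud_feed, nightly_report, archive_copy)
]
```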
Reinforcement learning can help agents improve pipeline plans over time by rewarding correctly configured and completed pipeline runs.
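One of the simplest forms of this idea is an epsilon-greedy bandit, sketched below. It is a stand-in for whatever learning method a production system uses: the plan names are hypothetical, and `simulated_run` replaces real pipeline runs with a fixed reward of 1.0 for success and 0.0 for failure.

```python
import random
from typing import Callable, Dict, List, Tuple


def update_estimate(estimates: Dict[str, float], counts: Dict[str, int],
                    plan: str, reward: float) -> None:
    """Incrementally update the running mean reward for one plan."""
    counts[plan] += 1
    estimates[plan] += (reward - estimates[plan]) / counts[plan]


def learn_best_plan(plans: List[str], run_pipeline: Callable[[str], float],
                    rounds: int = 200, epsilon: float = 0.2,
                    seed: int = 0) -> Tuple[str, Dict[str, float]]:
    """Epsilon-greedy: mostly reuse the best-known plan, sometimes explore."""
    rng = random.Random(seed)
    estimates = {plan: 0.0 for plan in plans}
    counts = {plan: 0 for plan in plans}
    for _ in range(rounds):
        if rng.random() < epsilon:
            plan = rng.choice(plans)  # explore a random plan
        else:
            plan = max(plans, key=lambda p: estimates[p])  # exploit
        update_estimate(estimates, counts, plan, run_pipeline(plan))
    return max(plans, key=lambda p: estimates[p]), estimates


def simulated_run(plan: str) -> float:
    """Assumed environment: only the incremental plan completes correctly."""
    return 1.0 if plan == "incremental_load" else 0.0


best, estimates = learn_best_plan(
    ["full_reload", "incremental_load"], simulated_run
)
```

Because successful runs are rewarded, the estimate for the plan that completes correctly rises above the others and the agent comes to prefer it.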
An agentic system can enable observability by continuously monitoring pipelines. Agents can detect schema drift, data anomalies and data quality issues. They can also support root cause analysis for pipeline problems, recommend remediation steps and execute those steps.
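Schema drift detection, at its core, is a comparison between the schema a pipeline expects and the one it observes. A minimal sketch, assuming schemas are available as plain column-to-type dictionaries (the column names here are made up):

```python
from typing import Dict, List


def detect_schema_drift(expected: Dict[str, str],
                        observed: Dict[str, str]) -> Dict[str, List[str]]:
    """Compare two column->type mappings and report the differences."""
    expected_cols, observed_cols = set(expected), set(observed)
    return {
        "added": sorted(observed_cols - expected_cols),
        "removed": sorted(expected_cols - observed_cols),
        "type_changed": sorted(
            col for col in expected_cols & observed_cols
            if expected[col] != observed[col]
        ),
    }


expected_schema = {"order_id": "int", "amount": "float", "region": "str"}
observed_schema = {"order_id": "int", "amount": "str", "country": "str"}

drift = detect_schema_drift(expected_schema, observed_schema)
has_drift = any(drift.values())
```

A report like this is what an agent could use as the starting point for root cause analysis or a proposed fix, such as remapping a renamed column.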
The autonomous execution of pipeline fixes can be especially helpful at otherwise inconvenient times. “What if a nightly job fails? Instead of paging someone, the agent can retry the runs, scale up engines and adjust the flow logic automatically,” IBM Product Manager John Wen explained in an IBM Technology video.
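The retry-and-scale behavior described in that quote can be sketched as a small loop. Everything here is hypothetical: `nightly_job` simulates a job that only succeeds with enough workers, and doubling the worker count is just one possible scaling policy.

```python
from typing import Callable, List, Tuple


def run_with_recovery(job: Callable[[int], str], workers: int = 1,
                      max_attempts: int = 3) -> Tuple[str, List[tuple]]:
    """Retry a failed job, doubling the worker count after each failure."""
    attempts = []
    for attempt in range(1, max_attempts + 1):
        try:
            result = job(workers)
            attempts.append((attempt, workers, "ok"))
            return result, attempts
        except RuntimeError as exc:
            attempts.append((attempt, workers, f"failed: {exc}"))
            workers *= 2  # scale up before the next try
    # Only escalate to a person once automatic recovery is exhausted.
    raise RuntimeError(f"job failed after {max_attempts} attempts")


def nightly_job(workers: int) -> str:
    """Simulated job that runs out of memory with fewer than 4 workers."""
    if workers < 4:
        raise RuntimeError("out of memory")
    return f"loaded with {workers} workers"


result, attempts = run_with_recovery(nightly_job)
```

The attempt log doubles as an audit trail, so whoever reviews the run in the morning can see what the agent tried and why.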
Agentic AI data engineering provides a host of benefits to organizations, their data teams and their business users. These include:
A fundamental challenge facing data engineers today is wrangling data across complex and siloed environments: different clouds, data warehouses, data lakes, on-premises servers and more. Some data is organized in spreadsheets and SQL databases, but much of it is unstructured in documents, emails, transcripts and images. In an enterprise system, AI agents can connect to an array of data sources and integrate various data formats, creating unified data platforms that enable richer analytics and more accurate forecasting.
AI agents can automate data profiling, data validation, rule creation, monitoring and remediation. “The agents would be able to detect column changes or type mismatches early and propose fixes before jobs fail. Continuous checks for anomalies, automatic backfills and rerouting around failed data sources will help keep data trustworthy for downstream uses in AI systems,” Yan explained.
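A continuous anomaly check can be as simple as comparing today's row count with recent history. The sketch below uses a z-score with an assumed threshold of 3 standard deviations; the row counts are invented, and real systems typically profile many more metrics than volume.

```python
import statistics
from typing import List


def is_anomalous(history: List[int], today: int,
                 threshold: float = 3.0) -> bool:
    """Flag today's value if it is far from the historical mean."""
    mean = statistics.mean(history)
    spread = statistics.stdev(history)  # sample standard deviation
    if spread == 0:
        return today != mean  # any change from a constant history is unusual
    z_score = (today - mean) / spread
    return abs(z_score) > threshold


# Hypothetical daily row counts for one table.
row_counts = [1000, 1020, 980, 1010, 990]

sudden_drop = is_anomalous(row_counts, 400)
normal_day = is_anomalous(row_counts, 1005)
```

A flagged value would not fix anything by itself; it is the trigger for follow-up actions such as a backfill or rerouting around a failed source.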
AI agents can evaluate different execution strategies and identify potential bottlenecks and complications, such as hidden dependencies in different application stacks. By factoring this information into pipeline design, they can devise plans that minimize resource consumption and operational time while still achieving data goals.
In addition, as infrastructure or schemas change, agentic systems can adapt and reuse existing pipelines, helping enterprises avoid accumulating obsolete pipelines and technical debt.
Pipeline design and continuous monitoring by AI agents can help ensure that sensitive data is handled in compliance with data privacy laws such as the US Health Insurance Portability and Accountability Act of 1996 (HIPAA) and the European Union’s General Data Protection Regulation (GDPR). In addition, lineage tracking by AI agents can support transparency and auditability.
Business users with minimal or no technical expertise no longer have to rely exclusively on data professionals to help them meet their data needs. They can request the creation or delivery of datasets from AI agents rather than waiting for assistance from a data practitioner, helping them achieve key insights faster.
AI agents can design, build and execute fully functioning data pipelines in a fraction of the time it would take data teams to manually code such pipelines. AI agents can also make these pipelines adaptable and “self-healing”—that is, they can monitor and address issues before they disrupt downstream processes. Altogether, this means enterprises can confidently continue adding pipelines as their data estates and data needs grow and evolve.
By offloading pipeline design, maintenance and troubleshooting tasks to agentic AI systems, data engineers can enhance their productivity and gain more bandwidth to pursue high-value tasks and meaningful work, such as building and piloting new capabilities.
As with other AI use cases, enterprises should consider several potential challenges as they seek to deploy agentic AI for data engineering.
Software solutions and platforms can help enterprises address the challenges of incorporating agentic AI, including AI-driven systems for data engineering, into everyday workflows.
Robust AI governance tools enable the embedding of guardrails to limit unintended agent behaviors and the deployment of specialized metrics for evaluating agent performance. AI orchestration solutions can help bridge gaps between advanced AI technologies and older enterprise systems without protracted reengineering.