
What is the agent development lifecycle (ADLC)?

The agent development lifecycle, explained

The agent development lifecycle (ADLC) is a structured, scalable end-to-end methodology for building and managing enterprise AI agents. ADLC guidelines, guardrails and specifications enable reliable agentic systems that conform to common standards, facilitating interoperability while reducing cost, risk and operational burden.

The power and proliferation of AI agents—software systems that use large language models (LLMs) as a decision engine for autonomously planning and executing tasks necessary to achieve a prescribed goal—has precipitated a rapid transformation of enterprise workflows. The speed of that transformation has outpaced many organizations’ ability to adapt traditional IT structures to reflect the unique demands of agentic AI integration, yielding a fragmented ecosystem. The ADLC introduces common specifications and shared practices to facilitate reliable, agentic systems across different tools, platforms, vendors and enterprise environments.

Many of today’s standard IT processes evolved in the context of traditional software development and are tailored to the assumptions of static, deterministic systems. Such processes are often ill-suited to the dynamic, probabilistic nature of the LLMs that drive agent behavior: They’re called “AI agents” because they quite literally have agency to determine how to execute tasks. Shared norms and specifications that account for this shift can significantly reduce associated risks and expedite responsible agentic AI adoption.

 For agentic AI to scale sustainably and effectively, AI agents must integrate predictably across different models, platforms, vendors and industry ecosystems. At present, nearly every platform for building AI agents has its own format for agent definition, tool and function-calling schema, memory and state management model, test suite, deployment protocols and versioning system. This fragmentation hinders interoperability, increasing switching costs and vendor lock-in, which recent research found to be a primary concern—second only to security—of business and technical leaders navigating the AI agent vendor ecosystem.1 Operationally, that fragmentation also reduces the potential for transferable skills and workflows.

While standardized norms and practices can mitigate these inefficiencies, it’s important for organizations to embrace and enforce structural protocols that work with, not against, developers’ established tendencies and preferences. The ADLC therefore aims to translate emerging developer practices into best-in-class agent experiences.  

The ADLC integrates core DevSecOps principles to map AI agent development onto a series of interconnected and largely interdependent phases. The purpose and practices of each phase, as well as their relations to one another, are explored later in this article. Full details, suggestions and specifications are provided in the official IBM guide to the ADLC.

ADLC vs. SDLC

Some of the standard assumptions and best practices of the traditional software development lifecycle (SDLC) are ill-suited to building AI agents. For enterprise AI agent initiatives to succeed, organizations must understand and account for the fundamental differences between traditional software and agentic systems. 

  • Deterministic vs. probabilistic: Traditional software is provided explicit, deterministic instructions (in the form of imperative code) in pursuit of an end goal that is only implied; AI agents are provided an explicit end goal and behavioral guardrails, then tasked with using tools, data sources and autonomous reasoning to infer the best way to achieve it. In an agentic system, providing the same input twice might yield two different outputs.

  • Static vs. adaptive: Traditional software has fixed functionality, with behavior that changes only if the code governing that behavior is actively changed. An agent’s behavior may evolve based on feedback from its environment.

  • Code-driven vs. outcome-driven: The deterministic, linear nature of traditional software enables developers to predict a program’s success in terms of stable and (relatively) objective measures of code quality. The probabilistic nature of agentic AI means that optimal implementation might yield suboptimal agent performance and messy, suboptimal prompts might nevertheless yield accurate outputs. Evaluation of agentic systems therefore requires systematic measurement of business outcomes and agent behavior over time.

Perhaps most crucially, agentic systems and traditional software have very different failure modes.

Traditional software fails due to logic errors or edge cases that “break” the rigid instructions of the software’s code. These failures are generally obvious: the software crashes or yields nonsensical outputs. Because traditional software is deterministic, any failure can be traced back to a specific defect in code (which you can then debug).

Agentic systems, conversely, typically fail through hallucinations or problems with alignment. AI agents operate by probabilistically interpreting intent (provided through system prompts, guardrails and context), rather than by executing the strict rule-based logic of traditional software. An agent might ostensibly “solve” a problem by violating constraints or confidently providing an incorrect result. Such failures are easier to miss: a plausible but false output is harder to spot than a system crash. They’re also harder to trace: the overall failure of a complex multi-step agentic workflow might stem from the incorrect result of a single probabilistic tool call, and the offending error might not be reproduced upon subsequent evaluation.

The ADLC therefore builds observability, containment and ongoing evaluation into each phase. Agentic development must efficiently balance the need for thorough testing in real-world scenarios with the need to contain real-world risk.

Phases of the ADLC

The agent development lifecycle (ADLC) maps the process of building, deploying, optimizing and managing AI agents into distinct phases, some of which combine to form iterative loops.

  • Plan: Engage all relevant stakeholders to align on use cases, goals, success metrics and ideal business outcomes (to inform an evaluation framework). Establish and document desired agent behavior and standard operating procedures in natural language.

  • Code & Build: Develop agents (which entails, among other things, model selection, prompt design and orchestration). Identify relevant external services—such as tools, databases and APIs—and integrate them into an enterprise layer using the model context protocol (MCP). Enforce thoughtful version control, sandboxing and gateway patterns.

  • Test & Release: Run structured evals against predefined benchmarks, enforce policy checks, perform security testing and red teaming exercises, and certify agents into a governed catalog. Iteratively repeat Code & Build and Test & Release in a loop as necessary.

  • Deploy: Once certified, move agents into production environments, rolling out in progressive stages to manage risk. Adopt a gateway pattern to enable effective governance and policy enforcement. Ensure runtime governance through sandboxing, versioning, rollback strategies, security enforcement and performance throttling.

  • Operate: Continuously observe and optimize deployed agents, tracking real-time metrics (such as accuracy, latency, cost and user satisfaction) and remaining alert for model drift or performance regressions. Use these feedback loops to optimize prompts, tools, models and memory policies for performance and security, iteratively repeating the Deploy-Operate loop as necessary.

  • Monitor: Once the system has been fully validated and optimized, continue to monitor and conduct ongoing audits for fairness, transparency and regulatory compliance. Maintain a well-governed catalog of agents and tools to facilitate observability and reproducibility.

By adhering to these phases and the priorities they’re designed to address—each of which is explored in greater detail in the following sections—organizations can scale agents safely and confidently by keeping them trustworthy, auditable and aligned to business value.

Plan

The agent development process begins with use case alignment, from which all other planning considerations unfold.

The specific business outcomes your agents are to achieve will determine the key performance indicators (KPIs) and other success metrics that will be used to evaluate agent performance. Customer support automation might be evaluated primarily in terms of end user satisfaction and cost reduction, whereas a coding agent might be evaluated on latency and code quality. Selecting the specific mathematical formulation for these metrics is a critical architectural decision unto itself, as different calculation methods can yield different success signals and operational incentives.
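To make the point about metric formulation concrete, here is a minimal sketch (all names and data are hypothetical) showing how two reasonable definitions of "task success rate" can send different signals about the same agent:

```python
# Illustrative sketch: two plausible formulations of "task success rate"
# can yield different success signals for identical agent behavior.

def per_task_success_rate(tasks):
    """Fraction of tasks in which every step succeeded (strict view)."""
    completed = sum(1 for steps in tasks if all(steps))
    return completed / len(tasks)

def per_step_success_rate(tasks):
    """Fraction of individual steps that succeeded (lenient view)."""
    steps = [s for task in tasks for s in task]
    return sum(steps) / len(steps)

# Three tasks; each inner list holds per-step outcomes (True = success).
history = [
    [True, True, True],    # fully successful task
    [True, True, False],   # failed at the final step
    [True, False, True],   # recovered after one bad step
]

print(per_task_success_rate(history))  # 1/3 — strict formulation
print(per_step_success_rate(history))  # 7/9 — lenient formulation
```

An agent optimized against the lenient formulation could look healthy while a third of end-to-end tasks still fail, which is why the choice of formula deserves deliberate review.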

Specific agent-driven business outcomes call for specific processes, tasks and subtasks to be automated by agents, and many of those tasks will require AI agents to be provisioned with access to specific tools, datasets, knowledge bases and APIs. Compiling (and procuring) a list of all necessary resources in advance of the code and build phase is essential to an efficient and effective development process.

That said, the most important decision to be made in the planning phase is whether you should build an AI agent at all.

When to build an AI agent

IBM recommends finding the simplest solution that can address your specific business need. If a problem can be solved with traditional automation, retrieval systems or thoughtful prompting, agentic AI might introduce unnecessary complications. For instance, a system for automating responses to customer emails would require an AI agent, but a system for classifying emails needs only an LLM and a well-designed prompt.

Assuming the performance gains agentic AI provides over simpler solutions for your use case warrant the associated increases in cost and latency, an ideal enterprise implementation of agentic AI typically entails:

  • Well-defined product scope. Prioritize specific business problems that require contextual judgment, multistep reasoning or complex decision-making at scale—needs that can’t be adequately addressed with broad rule-based automation.

  • Clear success metrics. Agentic AI succeeds in scenarios where success can be objectively, quantitatively assessed and standards can be enforced. This provides not only the sound rationale needed from a business standpoint, but the optimization objectives that are needed from an operational standpoint.

  • Manageable complexity. The probabilistic nature of agentic AI means that multi-agent systems entail scaffolding risks in which the failure of a single step might have far-reaching consequences. The theoretical benefits of an agentic system should clearly outweigh operational complexity.

Thorough analysis of enterprise deployments of agentic AI has yielded specific patterns that present consistent value and manageable risk: document-heavy processes, customer support (or customer service) and documentable knowledge work. These are ideal places to start.

Code & Build

Once all relevant stakeholders have agreed on the goals, requirements, constraints and measurement criteria, teams move into the process of actually building AI agents: implementing prompts, memory strategies, orchestration logic and evaluation frameworks.

Agents must be integrated with enterprise systems, APIs and external tools and knowledge bases. These integrations should be designed with security and telemetry in mind. Observability hooks—bits of code that automatically capture instantaneous operational data and measurements—should be injected at key workflow junctures to record agent transcripts, including agent reasoning traces, tool calls and outputs.
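As a rough illustration of such an observability hook (the decorator, transcript store and tool are all hypothetical), a small wrapper can capture inputs, outputs and latency at each instrumented step:

```python
import time
from functools import wraps

TRANSCRIPT = []  # in a real system this would flush to a telemetry backend

def observe(step_name):
    """Hypothetical observability hook: records inputs, outputs and
    latency for each instrumented workflow step."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            result = fn(*args, **kwargs)
            TRANSCRIPT.append({
                "step": step_name,
                "args": repr(args),
                "output": repr(result),
                "latency_ms": round((time.perf_counter() - start) * 1000, 2),
            })
            return result
        return wrapper
    return decorator

@observe("tool:search_orders")
def search_orders(customer_id):
    # Stubbed tool call standing in for a real enterprise integration
    return [{"order_id": 42, "customer": customer_id}]

search_orders("c-123")
print(TRANSCRIPT[0]["step"])  # tool:search_orders
```

In production, the recorded entries would also carry trace and span identifiers (for example, via OpenTelemetry) so tool calls can be correlated with the agent's reasoning traces.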

At each development stage, teams should implement strict version control policies for both individual agent variants and (where relevant) the orchestration logic that coordinates their work within a multi-agent system.

Model selection

Choosing which LLM (or LLMs) will power your AI agents is one of the most important architectural decisions to be made. Using one model for every task and role is rarely the optimal arrangement in terms of performance—and even when it is, that incremental performance improvement comes with tradeoffs in cost-efficiency, latency or both.

Developers should draw from a portfolio of different models: frontier reasoning models for complex planning, domain-specific models (obtained directly from model providers or through your organization’s own fine-tuning efforts) for specialized tasks where appropriate, smaller models to minimize cost and latency for simpler, high-volume tasks.
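A simple routing layer can encode that portfolio approach. The model names, task types and complexity threshold below are illustrative assumptions, not recommendations:

```python
# Hypothetical model-routing sketch: map task profiles onto a portfolio
# of models rather than sending every request to one frontier model.

MODEL_PORTFOLIO = {
    "frontier": "large-reasoning-model",  # complex planning
    "domain":   "finance-tuned-model",    # specialized, fine-tuned tasks
    "small":    "small-fast-model",       # simple, high-volume tasks
}

def route_model(task_type: str, complexity: int) -> str:
    """Pick a model by task profile; threshold is an assumed heuristic."""
    if task_type == "domain_specific":
        return MODEL_PORTFOLIO["domain"]
    if complexity >= 7:  # multistep planning, heavy tool orchestration
        return MODEL_PORTFOLIO["frontier"]
    return MODEL_PORTFOLIO["small"]

print(route_model("classification", complexity=2))  # small-fast-model
print(route_model("planning", complexity=9))        # large-reasoning-model
```

Centralizing this decision in one function also makes it easy to rebalance the portfolio as model costs and capabilities change.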

Tools

All integrations—whether you’re integrating enterprise data, third-party applications or external systems—can be treated as tool integrations enabled by MCP servers. Ideally, your agentic engineering platform of choice enables you to tailor MCP behavior to the needs of your specific use case. Use an MCP Gateway pattern to secure and govern all such connections through your backend systems.
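The gateway pattern can be sketched as a single choke point that enforces an allowlist and records every call, which mirrors (in simplified form) what an MCP Gateway does for MCP servers. The class, tool names and payloads here are all hypothetical:

```python
# Sketch of a gateway pattern for tool calls: every integration passes
# through one mediating layer that enforces policy and keeps an audit log.

class ToolGateway:
    def __init__(self, allowed_tools):
        self.allowed = set(allowed_tools)
        self.audit_log = []  # (agent_id, tool_name, permitted) triples

    def call(self, agent_id, tool_name, payload):
        permitted = tool_name in self.allowed
        self.audit_log.append((agent_id, tool_name, permitted))
        if not permitted:
            raise PermissionError(f"{tool_name} is not allowed for this agent")
        # A real gateway would forward the call to the backing MCP server.
        return {"tool": tool_name, "result": f"executed with {payload}"}

gateway = ToolGateway(allowed_tools={"crm_lookup", "order_status"})
gateway.call("agent-1", "crm_lookup", {"customer": "c-123"})  # permitted
try:
    gateway.call("agent-1", "delete_records", {})             # blocked
except PermissionError as err:
    print(err)
```

Because every call flows through one point, policy changes and audit requirements are implemented once rather than per integration.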

Interoperability

Wherever possible, prioritize reproducibility and open standards, such as MCP for tools and resources, OpenTelemetry for observability, and reusable schemas for prompts. You should likewise adopt consistent patterns for storage and retrieval, tool access and task delegation.

Security and risk management

For enterprise systems that will be exposed to real-world risk, agentic AI security should be directly woven into each development step using secure-by-design principles, rather than retrofitted on after the fact.

Each AI agent should be issued a distinct identity tag to ensure that every action taken by an agent can be recorded, audited and properly attributed. This not only enables security issues to be reliably traced back to their source, but also facilitates compliance with regulatory frameworks that continue to evolve as agentic AI adoption matures.
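A minimal identity-and-attribution sketch (identity format and log schema are assumptions) shows how distinct agent identities make every recorded action traceable:

```python
import uuid

def issue_identity(role: str) -> str:
    """Mint a distinct, traceable identity tag for one agent instance."""
    return f"{role}-{uuid.uuid4().hex[:8]}"

class AuditLog:
    def __init__(self):
        self.entries = []

    def record(self, agent_id, action, detail):
        self.entries.append(
            {"agent_id": agent_id, "action": action, "detail": detail}
        )

    def actions_by(self, agent_id):
        return [e for e in self.entries if e["agent_id"] == agent_id]

log = AuditLog()
support_agent = issue_identity("support")
billing_agent = issue_identity("billing")

log.record(support_agent, "tool_call", "crm_lookup(c-123)")
log.record(billing_agent, "tool_call", "issue_refund(42)")

# Every action traces back to exactly one agent identity.
print(len(log.actions_by(support_agent)))  # 1
```

In an enterprise deployment, these identities would be issued and verified by the organization's identity and access management system rather than generated locally.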

Sandboxing and other containment practices are essential to constraining risk. An agent’s execution environment, network access and filesystem access should always operate on the principle of least privilege. Each component of an agentic system should be given the minimum permission necessary to achieve its designated tasks.

Test & Release

Testing agent prototypes to ensure that they’re ready for release into production requires more than the unit tests and static analysis of the traditional software design lifecycle. It must also entail extensive behavioral validation against real-world scenarios or high-fidelity simulations. Given the probabilistic nature of agentic systems, the sample size of these testing scenarios must be large and varied enough to provide reasonable confidence that all potential emergent agent behaviors have been observed and evaluated.

AI agents should be tested against predefined benchmarks and policy checks that accurately reflect and enforce desired behaviors. This might require the collection or creation of ground truth datasets that indicate the trajectory an agent should follow for each kind of input and situation. Both LLM-as-a-Judge and human-in-the-loop reviews should be used, balancing the scale enabled by the former with the confidence provided by the latter.
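One common way to balance the two review modes is to score every transcript automatically and escalate only borderline cases to humans. In this sketch the judge is a stub standing in for a real LLM-as-a-Judge call, and the threshold is an assumed tuning parameter:

```python
# Triage sketch: automated judging at scale, human review where
# confidence is low. The judge function is a deterministic stub.

def llm_judge(transcript: str) -> float:
    """Stub: a real implementation would prompt an LLM to grade the
    transcript against a rubric and return a score in [0, 1]."""
    return 0.4 if "refund denied" in transcript else 0.9

def triage(transcripts, human_review_threshold=0.7):
    auto_pass, needs_human = [], []
    for t in transcripts:
        if llm_judge(t) >= human_review_threshold:
            auto_pass.append(t)
        else:
            needs_human.append(t)
    return auto_pass, needs_human

ok, escalated = triage(["order shipped", "refund denied without reason"])
print(len(ok), len(escalated))  # 1 1
```

The threshold controls the tradeoff directly: lowering it trusts the judge more and shrinks the human review queue, at the cost of confidence.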

Before and after initial deployment, a robust continuous integration/continuous delivery (CI/CD) pipeline is crucial to running tests and evaluations at the necessary scale, automatically validating tool calls and enforcing safety guardrails. During testing, a continuous integration (CI) system helps ensure that an agent’s reasoning logic doesn’t break when its constituent models and prompts are updated. Even swapping in the latest version of the LLM you’re already using can have unpredictable effects in a dynamic environment.

Evaluation

Ongoing AI agent evaluation at every post-build phase of the ADLC is essential to the success of an agentic system. Offline evals during build and CI help benchmark overall agent behavior and results. In-the-loop evals are invoked at runtime to guide an agent’s individual decisions—for instance, in an agentic RAG application, your workflow could enforce the computation of a context relevance score to determine whether a retrieved source should be used to generate an output.
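The agentic RAG example can be sketched as an in-the-loop gate on retrieved passages. A production system would score relevance with an embedding model or an LLM grader; simple token overlap (and the 0.3 threshold) stands in here purely for illustration:

```python
# In-the-loop evaluation sketch: gate each retrieved passage on a
# context relevance score before it reaches the generation step.

def context_relevance(query: str, passage: str) -> float:
    """Toy relevance score: fraction of query tokens found in the passage."""
    q, p = set(query.lower().split()), set(passage.lower().split())
    return len(q & p) / len(q) if q else 0.0

def filter_context(query, passages, min_score=0.3):
    return [p for p in passages if context_relevance(query, p) >= min_score]

passages = [
    "file a warranty claim within 90 days",
    "our cafeteria menu changes weekly",
]
kept = filter_context("how do I file a warranty claim", passages)
print(len(kept))  # 1 — the off-topic passage is dropped before generation
```

The key property is that the evaluation happens inside the workflow, shaping an individual decision, rather than after the fact in a benchmark report.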

Your agentic evaluation framework should comprise multiple kinds of metrics, including:

  • Quality metrics, such as task success percentage, accuracy and tool-call success rate

  • Safety metrics, such as policy violations or sensitive data leakage rate

  • Operations metrics, such as latency, token consumption and cost per task

  • Business metrics, such as satisfaction scores or cost per outcome
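The four metric families above can be gathered into one evaluation report with an explicit release gate. The field names and thresholds below are assumptions for illustration, not a standard schema:

```python
from dataclasses import dataclass

@dataclass
class AgentEvalReport:
    task_success_rate: float   # quality
    policy_violations: int     # safety
    avg_latency_ms: float      # operations
    cost_per_task_usd: float   # operations
    csat_score: float          # business (e.g., 1-5 satisfaction)

    def meets_release_bar(self) -> bool:
        """Example gate: release only if quality, safety and cost
        clear assumed thresholds."""
        return (self.task_success_rate >= 0.9
                and self.policy_violations == 0
                and self.cost_per_task_usd <= 0.10)

report = AgentEvalReport(
    task_success_rate=0.94,
    policy_violations=0,
    avg_latency_ms=820.0,
    cost_per_task_usd=0.06,
    csat_score=4.4,
)
print(report.meets_release_bar())  # True
```

Encoding the release bar in code makes the certification step in Test & Release reproducible rather than a matter of reviewer judgment alone.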

Red teaming

Red teaming proactively identifies adversarial vulnerabilities and potential alignment failures. It simulates hostile conditions, such as prompt injection attacks and jailbreaking attempts, to test safety constraints in scenarios that standard behavioral testing might overlook.

Deploy

After they’ve been thoroughly tested, optimized and validated, AI agents are securely deployed into enterprise environments. The deployment phase should be understood as a deliberate, strategically tiered activation, rather than as a singular act akin to pressing a proverbial big red “DEPLOY” button. The ADLC ensures system safety at runtime through sandboxing, version control, rollback strategies and failsafes.

The rollout of your AI agents should be executed progressively to manage risk. Consider different rollout strategies, such as blue-green, rolling or canary deployments, to determine which is most conducive to your usage traffic patterns. In a live, real-world enterprise environment, stability must remain the top priority: carefully partitioning your rollout into stages allows you to verify your system’s resilience to meaningful updates.
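A canary rollout, for example, can deterministically route a percentage of traffic to the new agent version by hashing a stable request key. The stage percentages are illustrative:

```python
import hashlib

STAGES = [1, 5, 25, 100]  # percent of traffic on the canary, per stage

def routes_to_canary(request_key: str, canary_percent: int) -> bool:
    """Deterministically bucket a request into 0-99 and compare against
    the current canary percentage, so a given user sees one version."""
    bucket = int(hashlib.sha256(request_key.encode()).hexdigest(), 16) % 100
    return bucket < canary_percent

# At 100% every request hits the canary; at 0% none do.
print(routes_to_canary("user-42", 100), routes_to_canary("user-42", 0))
```

Hashing on a stable key (rather than sampling randomly per request) keeps each user's experience consistent within a stage, which simplifies comparing metrics between the canary and baseline cohorts.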

Sandboxing

Sandboxing is the practice of strictly limiting the reach and capabilities of agents and their tools by running them inside constrained execution environments that enforce least-privilege access to compute, storage, network and system APIs. Even if an agent fails or misbehaves, proper sandboxing minimizes the range and magnitude of potential issues. It’s a critical practice in any scenario wherein one agent’s tool misuse, code generation or data transformation might have consequences across your codebase, data integrity, customers or other agents.

Common implementation strategies for sandboxing include:

  • Lightweight virtualization

  • Container security profiles

  • Network controls (typically through an MCP Gateway)

  • Filesystem access policies

  • Gateway-level policy enforcement
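As one small example of a filesystem access policy (the workspace path and helper are hypothetical), an agent's file tool can be confined to an allowlisted root, with path resolution defeating `..` escape attempts:

```python
from pathlib import Path

# Least-privilege filesystem policy sketch: the agent's file tool may
# only touch paths under one allowlisted workspace root.

ALLOWED_ROOT = Path("/srv/agent-workspace").resolve()

def is_path_allowed(candidate: str) -> bool:
    """Reject any path that escapes the sandbox root, including
    traversal via '..' components."""
    resolved = (ALLOWED_ROOT / candidate).resolve()
    return resolved == ALLOWED_ROOT or ALLOWED_ROOT in resolved.parents

print(is_path_allowed("reports/q3.txt"))    # True — inside the workspace
print(is_path_allowed("../../etc/passwd"))  # False — escapes the root
```

In practice this check would sit alongside, not replace, OS-level containment such as container filesystems and seccomp profiles; application-level checks are one layer of defense in depth.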

Operate

The ADLC does not end once your AI agents have been fully and successfully deployed. Like the iterative loop formed by the Code & Build phase and Test & Release phase, the Deploy and Operate phases should be understood as two parts of an ongoing feedback loop. Whereas the end goal of the initial Build/Test loop is for agents to meet whatever minimum performance threshold is necessary to achieve desired business outcomes, the goal of the Deploy/Operate loop is optimization.

Following deployment, continuous operational oversight is necessary to ensure that your agent’s performance remains reliable, effective and secure in a real-world environment. Real-time metrics, compiled and readily accessible in a unified reporting dashboard, should be actively monitored for drift or performance regressions. Any notable regressions, whether in terms of operational efficiency or end user feedback, should be actively addressed. Any changes arising as solutions to those emergent problems should be tested thoroughly and deployed progressively.
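A minimal drift-watch sketch (window size and tolerance are assumed tuning values) compares a rolling window of a live metric against its baseline and flags meaningful regressions:

```python
from collections import deque
from statistics import mean

class RegressionMonitor:
    """Flags when the rolling mean of a metric drops meaningfully
    below its established baseline."""

    def __init__(self, baseline: float, window: int = 5,
                 tolerance: float = 0.05):
        self.baseline = baseline
        self.tolerance = tolerance
        self.recent = deque(maxlen=window)

    def record(self, value: float) -> bool:
        """Record one observation; return True if a regression is flagged."""
        self.recent.append(value)
        if len(self.recent) < self.recent.maxlen:
            return False  # not enough data to judge yet
        return mean(self.recent) < self.baseline - self.tolerance

monitor = RegressionMonitor(baseline=0.92)
for score in [0.91, 0.90, 0.85, 0.84, 0.83]:  # accuracy trending down
    flagged = monitor.record(score)
print(flagged)  # True — recent mean (0.866) fell below 0.92 - 0.05
```

A flagged regression would then feed the Deploy/Operate loop: investigate, adjust prompts, tools or models, retest, and roll the fix out progressively.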

Full-stack observability is critical to agentic systems that not only achieve optimal performance, safety and reliability, but also maintain it over time.


Monitor

Once your agentic implementation is fully validated and optimized in live production, conduct ongoing audits for fairness, transparency, security risks and regulatory compliance in addition to overall performance.

Industry standards and legal requirements continue to evolve, and a failure to actively keep up with both can result in regulatory consequences, competitive disadvantages or both. Model drift is an inevitable phenomenon that is best addressed with a proactive approach. Business needs change and the agentic systems in place to meet those needs will need to change accordingly.

For present and future needs, enterprises should run a clearly organized catalog of agents and tools that notes:

  • Ownership, to facilitate accountability and issue escalation

  • Versions, for a disciplined change management practice

  • Risk posture, to inform decision-making

  • All relevant environments, to maintain exhaustive operational oversight

  • Auditability, to expedite evidence-gathering, evaluations, approvals and red teaming
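A catalog entry covering those fields can be as simple as a structured record. The schema below is an illustrative assumption, not a standard:

```python
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    agent_name: str
    owner: str                 # accountability and issue escalation
    version: str               # disciplined change management
    risk_posture: str          # e.g., "low", "medium", "high"
    environments: list = field(default_factory=list)  # operational oversight
    audit_refs: list = field(default_factory=list)    # evidence for reviews

catalog = {
    "invoice-triage": CatalogEntry(
        agent_name="invoice-triage",
        owner="finance-platform-team",
        version="2.3.1",
        risk_posture="medium",
        environments=["staging", "prod-eu"],
        audit_refs=["red-team-2025-q2"],
    )
}
print(catalog["invoice-triage"].owner)  # finance-platform-team
```

Keeping the catalog machine-readable lets audits, approvals and red teaming exercises query it directly rather than relying on tribal knowledge.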

Author

Dave Bergmann

Senior Staff Writer, AI Models

IBM Think
