Explore how VAKRA can evaluate end-to-end agent behavior, where multi-step tasks span diverse data sources and require adherence to tool-usage guidelines.
VAKRA—eValuating API and Knowledge Retrieval Agents using multi-hop, multi-source dialogues—is a tool-grounded, executable benchmark designed to evaluate how well AI agents reason end-to-end in enterprise-like settings.
Rather than testing isolated skills, VAKRA measures compositional reasoning across APIs and documents, using full execution traces to assess whether agents can reliably complete multi-step workflows, not just individual steps.
VAKRA provides an executable environment where agents interact with over 8,000 locally hosted APIs backed by real databases spanning 62 domains, along with domain-aligned document collections. Tasks can require reasoning chains of three to seven steps that combine structured API interaction with unstructured retrieval under natural-language tool-use constraints.
Enterprise environments don’t resemble single-turn Q&A or one-off function calls. Workflows in areas such as customer support, business intelligence and compliance require agents to chain decisions, reconcile mismatched schemas and follow tool-use policies expressed in natural language. Failures arise not only during tool invocation, but also in the language-mediated reasoning between tools—including entity disambiguation, cross-source grounding and parameter or schema alignment.
Consider a delayed order complaint in an ecommerce operation. To resolve it, an agent must correctly connect information across systems—linking customer records, interpreting carrier documentation, aligning identifiers across logistics APIs and applying policies expressed in natural language. Each decision depends on the one before it, requiring sustained reasoning across tools, data sources and constraints.
VAKRA is designed to surface exactly where such multi-step reasoning succeeds or breaks down, reflecting the realities agents face in production environments.
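To make the delayed-order scenario concrete, here is a minimal Python sketch of such a chain. It is an illustration only, not VAKRA code: the tool names (`get_customer`, `get_order`, `get_shipment`, `search_policies`), the in-memory data and the keyword-overlap retrieval are all hypothetical stand-ins for real APIs and document indices. The point is that each call consumes an identifier produced by the previous one, so an early mistake propagates through the rest of the chain.

```python
# Hypothetical in-memory stand-ins for three APIs and one document collection.
# None of these names or records come from VAKRA.
CUSTOMERS = {"ana@example.com": {"customer_id": "C-17"}}
ORDERS = {"C-17": {"order_id": "O-204", "carrier_ref": "TRK-9", "promised_days": 3}}
SHIPMENTS = {"TRK-9": {"status": "in_transit", "days_elapsed": 6}}
POLICY_DOCS = [
    "Returns are accepted within 30 days of delivery.",
    "Orders delayed more than 2 days past the promised date qualify for a shipping refund.",
]


def get_customer(email):
    return CUSTOMERS[email]


def get_order(customer_id):
    return ORDERS[customer_id]


def get_shipment(carrier_ref):
    return SHIPMENTS[carrier_ref]


def search_policies(query):
    """Naive retrieval: return the document sharing the most words with the query."""
    query_words = set(query.lower().split())
    return max(POLICY_DOCS, key=lambda doc: len(query_words & set(doc.lower().split())))


def resolve_delay(email):
    """Chain the calls, recording every intermediate result in a trace."""
    trace = []

    customer = get_customer(email)
    trace.append(("get_customer", customer))

    # The customer ID from step 1 is the key for step 2.
    order = get_order(customer["customer_id"])
    trace.append(("get_order", order))

    # The order's carrier reference must be aligned with the logistics API.
    shipment = get_shipment(order["carrier_ref"])
    trace.append(("get_shipment", shipment))

    delay_days = shipment["days_elapsed"] - order["promised_days"]

    # Unstructured retrieval supplies the policy that governs the outcome.
    policy = search_policies("delayed orders shipping refund")
    trace.append(("search_policies", policy))

    return {"delay_days": delay_days, "policy": policy, "trace": trace}


result = resolve_delay("ana@example.com")
```

Keeping the full `trace` rather than only the final answer is what allows a failure to be attributed to a specific hop, such as a wrong identifier passed to the logistics lookup.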
Inspired by scenarios like the delayed-order complaint above, VAKRA organizes tasks into three tiers:
VAKRA runs in a self-hosted environment: APIs backed by persistent databases and retrieval indices are exposed via a standard interface, and agents can only interact through these tools. Evaluation replays entire trajectories to verify every intermediate step—not just final answers—so you can pinpoint where reasoning broke: entity disambiguation, cross-source mapping or policy interpretation.
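The idea of step-level verification can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the VAKRA harness: the `Step` record, the `replay` function and the toy `TOOLS` registry are hypothetical, and "verify" is assumed here to mean re-executing each recorded tool call and comparing the result with what the agent logged.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class Step:
    """One recorded tool call in an agent trajectory (hypothetical format)."""
    tool: str
    args: dict
    output: Any


# Toy tool registry standing in for the hosted APIs (assumption).
TOOLS = {
    "get_customer": lambda email: {"customer_id": "C-17"},
    "get_order": lambda customer_id: (
        {"order_id": "O-204"} if customer_id == "C-17" else None
    ),
}


def replay(trajectory, tools):
    """Re-execute every step; return (index, reason) for the first failure, else None."""
    for index, step in enumerate(trajectory):
        tool = tools.get(step.tool)
        if tool is None:
            return index, f"unknown tool {step.tool!r}"
        actual = tool(**step.args)
        if actual != step.output:
            return index, f"tool returned {actual!r}, trajectory recorded {step.output!r}"
    return None


good = [
    Step("get_customer", {"email": "ana@example.com"}, {"customer_id": "C-17"}),
    Step("get_order", {"customer_id": "C-17"}, {"order_id": "O-204"}),
]

# Same trajectory, but the agent transposed the customer ID in step 2.
bad = [
    good[0],
    Step("get_order", {"customer_id": "C-71"}, {"order_id": "O-204"}),
]

good_result = replay(good, TOOLS)
bad_result = replay(bad, TOOLS)
```

Returning the index of the first failing step, rather than a single pass/fail flag, is what makes it possible to attribute an error to a specific hop in the chain.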
VAKRA is designed for three groups of users:
VAKRA is publicly available today. The source code, task specifications and evaluation harness are open source on GitHub. The repository contains everything needed to reproduce results and run new agents end-to-end, including:
We are also launching a Hugging Face Space that will host the VAKRA public leaderboard. We invite researchers, practitioners and developers to submit results and to contribute feedback and extensions.
Additional authors:
The authors thank colleagues across research and engineering teams for their valuable feedback, discussions, and support in developing this benchmark.
We especially acknowledge our interns, Raavi Gupta and Abhinav Jain, for their efforts in benchmark generation and development. We also acknowledge Chulaka Gunasekara, Hamid Adebayo, Harold Ship, Himanshu Gupta, Huaiyu Zhu, Jaydeep Sen, Renuka Sindhgatta, Sameep Mehta, Sara Rosenthal and Segev Shlomov for their contributions and insights.