Introducing VAKRA: Benchmark for evaluating multi-hop, multi-source tool-calling capabilities in AI Agents

Explore how VAKRA can evaluate end-to-end agent behavior, where multi-step tasks span diverse data sources and require adherence to tool-usage guidelines.

VAKRA (eValuating API and Knowledge Retrieval Agents using multi-hop, multi-source dialogues) is a tool-grounded, executable benchmark designed to evaluate how well AI agents reason end-to-end in enterprise-like settings.

Rather than testing isolated skills, VAKRA measures compositional reasoning across APIs and documents, using full execution traces to assess whether agents can reliably complete multi-step workflows, not just individual steps.

VAKRA provides an executable environment where agents interact with over 8,000 locally hosted APIs backed by real databases spanning 62 domains, along with domain-aligned document collections. Tasks can require reasoning chains of 3–7 steps that combine structured API interaction with unstructured retrieval under natural-language tool-use constraints.

  • Locally hosted, database-backed tools ensure deterministic, verifiable responses at evaluation time.
  • Document retrieval is provided via domain-specific indices, enabling cross-source grounding and extraction.
  • Trajectory-level verification replays full agent traces against live tools, supporting multiple valid execution paths—crucial for enterprise workflows.
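To make the trajectory-level idea concrete, here is a minimal Python sketch. It is not VAKRA's evaluation harness: the tools (`find_customer`, `list_orders`, `order_status`), the in-memory data and the `replay` function are all hypothetical stand-ins. It illustrates how, because the tools are deterministic, an evaluator can re-execute each recorded call and accept any path whose logged outputs are reproducible and whose final answer is correct.

```python
from typing import Any, Callable, Dict, List, Optional

# Hypothetical in-memory stand-ins for database-backed tools (not VAKRA's APIs).
CUSTOMER_BY_EMAIL = {"ada@example.com": "c1"}
ORDERS_BY_CUSTOMER = {"c1": ["o9"]}
STATUS_BY_ORDER = {"o9": "delayed"}


def find_customer(email: str) -> Optional[str]:
    return CUSTOMER_BY_EMAIL.get(email)


def list_orders(customer_id: str) -> List[str]:
    return ORDERS_BY_CUSTOMER.get(customer_id, [])


def order_status(order_id: str) -> Optional[str]:
    return STATUS_BY_ORDER.get(order_id)


TOOLS: Dict[str, Callable[..., Any]] = {
    "find_customer": find_customer,
    "list_orders": list_orders,
    "order_status": order_status,
}


def replay(trajectory: Dict[str, Any], expected_answer: Any) -> bool:
    """Re-execute every recorded call and compare it with what the agent logged.

    The check does not prescribe a call sequence: any path passes as long as
    each logged output is reproduced by the tools and the final answer matches.
    """
    for step in trajectory["steps"]:
        tool = TOOLS.get(step["tool"])
        if tool is None:
            return False  # the agent referenced a tool that does not exist
        if tool(**step["args"]) != step["output"]:
            return False  # the logged output cannot be reproduced
    return trajectory["answer"] == expected_answer


# Two different but valid paths to the same answer.
path_a = {
    "steps": [
        {"tool": "find_customer", "args": {"email": "ada@example.com"}, "output": "c1"},
        {"tool": "list_orders", "args": {"customer_id": "c1"}, "output": ["o9"]},
        {"tool": "order_status", "args": {"order_id": "o9"}, "output": "delayed"},
    ],
    "answer": "delayed",
}
path_b = {
    "steps": [
        {"tool": "order_status", "args": {"order_id": "o9"}, "output": "delayed"},
    ],
    "answer": "delayed",
}
# A trace whose intermediate output was fabricated by the agent.
bad_path = {
    "steps": [
        {"tool": "list_orders", "args": {"customer_id": "c1"}, "output": ["o7"]},
    ],
    "answer": "delayed",
}

print(replay(path_a, "delayed"), replay(path_b, "delayed"), replay(bad_path, "delayed"))
```

This sketch only checks reproducibility and the final answer. A fuller verifier would presumably also confirm that the answer is actually supported by the calls that were made.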

Why multi-hop, multi-source reasoning matters

Enterprise environments don’t resemble single-turn Q&A or one-off function calls. Workflows in areas such as customer support, business intelligence and compliance require agents to chain decisions, reconcile mismatched schemas and follow tool-use policies expressed in natural language. Failures arise not only during tool invocation, but also in the language-mediated reasoning between tools—including entity disambiguation, cross-source grounding and parameter or schema alignment.

Consider a delayed order complaint in an ecommerce operation. To resolve it, an agent must correctly connect information across systems—linking customer records, interpreting carrier documentation, aligning identifiers across logistics APIs and applying policies expressed in natural language. Each decision depends on the one before it, requiring sustained reasoning across tools, data sources and constraints.

VAKRA is designed to surface exactly where such multi-step reasoning succeeds or breaks down, reflecting the realities agents face in production environments.

Use cases: Three progressively complex settings

Inspired by scenarios like the delayed-order complaint described above, VAKRA organizes tasks into three tiers:

  1. Diverse API interaction styles: Agents must adapt to different interface abstractions. These range from business-intelligence-style APIs that expose compositional or expanded function interfaces, which demand planning and careful tool selection, to query-aligned endpoints that encapsulate computation but still require accurate query interpretation and correct parameterization.
  2. Multi-hop reasoning over structured APIs: Tasks require 3–7 dependent API calls, where the output of earlier steps must be correctly interpreted, transformed, and reused to parameterize subsequent actions.
  3. Multi-hop, multi-source reasoning with tool-use policies: Tasks require multi-hop reasoning across unstructured documents and structured APIs. Agents must decide when to retrieve and how to ground retrieved information in downstream tool calls, all while complying with natural-language tool-use policies.
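As an illustration of the third tier, the following Python sketch chains a document retrieval into a structured API call and then checks one tool-use policy. Everything here is assumed for the example: the document text, the tools `retrieve` and `eta_for_code`, and the hand-written ordering rule "consult the documents before calling `eta_for_code`". VAKRA states its policies in natural language; how it verifies them is not described in this post.

```python
import re
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class Call:
    tool: str
    args: Dict[str, Any]
    output: Any


# Hypothetical document collection and API data.
DOCS = {
    "carrier-faq": "Parcels marked HOLD are released after customs code CX-17 is cleared.",
}
ETA_DAYS_BY_CODE = {"CX-17": 3}


def retrieve(query: str) -> str:
    """Toy retrieval: return the first document containing the query string."""
    for text in DOCS.values():
        if query.lower() in text.lower():
            return text
    return ""


def eta_for_code(code: str) -> Optional[int]:
    return ETA_DAYS_BY_CODE.get(code)


def run_task(log: List[Call]) -> Optional[int]:
    # Hop 1: unstructured source.
    text = retrieve("HOLD")
    log.append(Call("retrieve", {"query": "HOLD"}, text))

    # Grounding: extract the identifier that the next tool call needs.
    match = re.search(r"\b[A-Z]{2}-\d+\b", text)
    if match is None:
        return None
    code = match.group(0)

    # Hop 2: structured API, parameterized by the output of hop 1.
    eta = eta_for_code(code)
    log.append(Call("eta_for_code", {"code": code}, eta))
    return eta


def must_precede(log: List[Call], first: str, then: str) -> bool:
    """Policy check: no call to `then` may occur before a call to `first`."""
    seen_first = False
    for call in log:
        if call.tool == first:
            seen_first = True
        elif call.tool == then and not seen_first:
            return False
    return True


log: List[Call] = []
print(run_task(log), must_precede(log, "retrieve", "eta_for_code"))
```

The grounding step is where this kind of chain tends to be fragile: if the extracted identifier is wrong, the API call still succeeds syntactically but returns nothing useful.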

Built for executable, verifiable evaluation

VAKRA runs in a self-hosted environment: APIs backed by persistent databases and retrieval indices are exposed via a standard interface, and agents can only interact through these tools. Evaluation replays entire trajectories to verify every intermediate step—not just final answers—so you can pinpoint where reasoning broke: entity disambiguation, cross-source mapping or policy interpretation.
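The tools-only constraint can be pictured as a single entry point that records every call. The Python sketch below is an assumption-laden illustration, not VAKRA's interface: `ToolEnvironment`, `first_error` and the error labels are invented names, and it only catches schema-level mistakes (unknown tool, mismatched parameters) rather than the reasoning failures listed above.

```python
import inspect
from typing import Any, Callable, Dict, List, Optional, Tuple


class ToolEnvironment:
    """Hypothetical single entry point: every tool call goes through call()."""

    def __init__(self, tools: Dict[str, Callable[..., Any]]) -> None:
        self._tools = dict(tools)
        self.trace: List[Dict[str, Any]] = []

    def call(self, name: str, **args: Any) -> Any:
        record: Dict[str, Any] = {"tool": name, "args": args, "output": None, "error": None}
        self.trace.append(record)  # log the attempt even if it fails
        tool = self._tools.get(name)
        if tool is None:
            record["error"] = "unknown_tool"
            return None
        try:
            # Check the arguments against the tool's signature before running it.
            inspect.signature(tool).bind(**args)
        except TypeError:
            record["error"] = "bad_arguments"
            return None
        record["output"] = tool(**args)
        return record["output"]


def first_error(trace: List[Dict[str, Any]]) -> Optional[Tuple[int, str]]:
    """Return (step index, error label) for the earliest failing step, if any."""
    for index, record in enumerate(trace):
        if record["error"] is not None:
            return index, record["error"]
    return None


def get_order(order_id: str) -> Dict[str, str]:
    # Hypothetical tool with fixed data.
    return {"id": order_id, "status": "delayed"}


env = ToolEnvironment({"get_order": get_order})
env.call("get_order", order_id="o9")     # valid call
env.call("get_order", id="o9")           # wrong parameter name
env.call("cancel_order", order_id="o9")  # tool is not registered

print(first_error(env.trace))
```

Because every attempt is logged, including failed ones, the trace shows where a run first went wrong instead of only whether the final answer was right.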

VAKRA is designed for three audiences:

  • Researchers studying agentic reasoning, multi‑tool planning and grounding
  • Developers and engineering teams evaluating foundation models for production agent workflows
  • Leaders seeking benchmarks that reflect enterprise complexity, not toy tasks

Getting started and availability

VAKRA is publicly available today. The source code, task specifications and evaluation harness are open source on GitHub. The repository includes everything needed to reproduce results and run new agents end-to-end:

  • Locally hosted, executable API environments backed by real databases
  • Domain-specific document collections for retrieval-augmented reasoning
  • A self-contained evaluation runner that replays and verifies complete agent trajectories
  • Scripts for benchmarking new models across API-only, multi-hop, and multi-source task settings

We are also launching a Hugging Face Space that will host the VAKRA public leaderboard. We invite researchers, practitioners and developers to submit results and to contribute feedback and extensions.

Ankita Rajaram Naik

Research Data Scientist

Acknowledgements

The authors thank colleagues across research and engineering teams for their valuable feedback, discussions, and support in developing this benchmark.

We especially acknowledge our interns, Raavi Gupta and Abhinav Jain, for their efforts in benchmark generation and development. We also acknowledge Chulaka Gunasekara, Hamid Adebayo, Harold Ship, Himanshu Gupta, Huaiyu Zhu, Jaydeep Sen, Renuka Sindhgatta, Sameep Mehta, Sara Rosenthal and Segev Shlomov for their contributions and insights.