Raising the bar for enterprise Java modernization, ScarfBench introduces transparent and reproducible evaluation of AI-driven framework migration tools.
Today, we’re introducing ScarfBench (Self-Contained Application Refactoring Benchmark), an open benchmark suite and public leaderboard designed to evaluate automated and agentic enterprise Java migrations across Jakarta EE, Quarkus, and the Spring Framework.
As organizations modernize mission-critical systems, framework migration has become a strategic priority. At the same time, AI-assisted development tools are increasingly being used to accelerate these transitions.
ScarfBench provides a standardized, reproducible way to evaluate whether an AI-driven migration produces a working, reliable system—not just compilable code. It enables consistent evaluation using validated enterprise-style workloads and transparent scoring.
It is essential to ensure that, after migration, enterprise applications maintain functionality, quality, and performance consistent with the original application.
Code that compiles does not guarantee that an application will start correctly, maintain behavioral parity, or operate reliably in production-like environments. ScarfBench establishes a common evaluation foundation for enterprise-scale migration that verifies each of these criteria.
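The gap between "it compiles" and "it behaves the same" can be made concrete with a parity check that replays identical inputs against the original and migrated implementations and compares outputs. The sketch below is illustrative only (the class and method names are assumptions, not ScarfBench code):

```java
import java.util.List;
import java.util.Objects;
import java.util.function.Function;

// Minimal sketch: a migrated implementation can compile cleanly yet diverge
// in behavior. Replaying the same probe inputs against both versions and
// comparing outputs catches such divergence.
public class ParityCheck {

    /** Returns true only if both implementations agree on every probe input. */
    public static <I, O> boolean behaviorallyEquivalent(
            Function<I, O> original,
            Function<I, O> migrated,
            List<I> probeInputs) {
        for (I input : probeInputs) {
            if (!Objects.equals(original.apply(input), migrated.apply(input))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Toy stand-ins for an original and a migrated request handler.
        Function<Integer, String> original = n -> "order-" + n;
        Function<Integer, String> migratedOk = n -> "order-" + n;
        Function<Integer, String> migratedBad = n -> "ORDER-" + n; // compiles, but diverges

        System.out.println(behaviorallyEquivalent(original, migratedOk, List.of(1, 2, 3)));  // true
        System.out.println(behaviorallyEquivalent(original, migratedBad, List.of(1, 2, 3))); // false
    }
}
```

A real harness would probe HTTP endpoints, persisted state, and startup logs rather than pure functions, but the comparison principle is the same.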
ScarfBench provides a suite of Java applications across frameworks and enables systematic assessment of AI agents’ ability to migrate enterprise Java applications while preserving functionality, idiomatic patterns, and architectural integrity.
Specifically, it provides the following seven components:
Each workload has been manually implemented and validated by experienced developers to ensure functional equivalence and idiomatic framework usage across variants.
Large enterprise applications are organized into logical tiers (or layers) that separate concerns. Treating each layer as distinct allows modernization to proceed layer by layer. Our benchmark isolates the core technologies in each layer, yielding consistent, verifiable workloads that can be migrated and tested independently.
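Layer-by-layer migration works when each tier depends only on an interface to the tier below it. The hypothetical sketch below (the names are assumptions, not ScarfBench workload code) shows a service tier that stays untouched while the persistence tier behind the interface is swapped or migrated, for example from javax.persistence to jakarta.persistence:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Minimal sketch of tier isolation: the service depends only on the
// repository interface, so the persistence implementation can be migrated
// and re-verified on its own.
public class LayeredApp {

    /** Persistence-tier contract; implementations may use any framework. */
    interface CustomerRepository {
        void save(String id, String name);
        Optional<String> findName(String id);
    }

    /** In-memory stand-in, used to exercise the service tier in isolation. */
    static class InMemoryCustomerRepository implements CustomerRepository {
        private final Map<String, String> store = new HashMap<>();
        public void save(String id, String name) { store.put(id, name); }
        public Optional<String> findName(String id) { return Optional.ofNullable(store.get(id)); }
    }

    /** Service tier: unaware of which persistence technology sits below. */
    static class CustomerService {
        private final CustomerRepository repository;
        CustomerService(CustomerRepository repository) { this.repository = repository; }

        String greet(String id) {
            return repository.findName(id).map(n -> "Hello, " + n).orElse("Unknown customer");
        }
    }

    public static void main(String[] args) {
        CustomerRepository repo = new InMemoryCustomerRepository();
        repo.save("c1", "Ada");
        CustomerService service = new CustomerService(repo);
        System.out.println(service.greet("c1")); // Hello, Ada
        System.out.println(service.greet("c2")); // Unknown customer
    }
}
```

Because the contract is stable, a migration of only the repository implementation can be validated against the same service-tier tests.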
Together, these focused workloads and full-application scenarios enable both targeted experimentation and system-level evaluation of migration approaches. The benchmark is designed to expand over time with additional frameworks, more complex workloads, and community-contributed scenarios.
ScarfBench supports consistent, repeatable evaluation workflows. The public leaderboard aggregates performance metrics such as build success, startup validation and validation test outcomes, enabling objective comparison and measurable progress.
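One way to combine such metrics is to gate them: a migration earns test credit only if it builds and starts. The weights and gating order below are assumptions for illustration, not ScarfBench’s published scoring:

```java
// Hypothetical sketch of aggregating migration metrics into one score.
// The 0.25 / 0.25 / 0.5 weights and the build -> startup -> tests gating
// are illustrative assumptions, not the leaderboard's actual formula.
public class MigrationScore {

    /** Score in [0, 1]; later stages count only if earlier stages passed. */
    public static double score(boolean buildSucceeded, boolean startupValidated,
                               int testsPassed, int testsTotal) {
        if (!buildSucceeded) return 0.0;       // nothing compiles: no credit
        if (!startupValidated) return 0.25;    // built, but never came up
        double passRate = testsTotal == 0 ? 0.0 : (double) testsPassed / testsTotal;
        return 0.25 + 0.25 + 0.5 * passRate;   // remaining half from test outcomes
    }

    public static void main(String[] args) {
        System.out.println(score(false, false, 0, 10)); // 0.0
        System.out.println(score(true, false, 0, 10));  // 0.25
        System.out.println(score(true, true, 10, 10));  // 1.0
    }
}
```

Gating reflects the point above: a perfect test suite is meaningless against an application that never builds or starts.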
ScarfBench supports a broad set of technical communities with interest in AI-driven application development and transformation:
ScarfBench is more than a benchmark. By combining public benchmarks, reproducible tooling, transparent metrics, and a public leaderboard, ScarfBench provides the technical foundation you need to measure and compare agentic migration solutions with confidence.