
ScarfBench: A public benchmark for Java framework migration

Raising the bar for enterprise Java modernization, ScarfBench introduces transparent and reproducible evaluation of AI-driven framework migration tools.

Today, we’re introducing ScarfBench (Self-Contained Application Refactoring Benchmark), an open benchmark suite and public leaderboard designed to evaluate automated and agentic enterprise Java migrations across Jakarta EE, Quarkus and the Spring Framework.

As organizations modernize mission-critical systems, framework migration has become a strategic priority. At the same time, AI-assisted development tools are increasingly being used to accelerate these transitions.

ScarfBench provides a standardized, reproducible way to evaluate whether an AI-driven migration produces a working, reliable system—not just compilable code. It enables consistent evaluation using validated enterprise-style workloads and transparent scoring.

Why enterprise migration requires rigorous evaluation

It is essential that enterprise applications, after migration, preserve the functionality, quality and performance of the original applications.

Enterprise migration must preserve:
  • Business logic and domain behavior
  • Transaction boundaries and consistency guarantees
  • Dependency injection lifecycles and architectural structure
  • Persistence mappings and relational integrity
  • Security configurations and integration contracts

Code that compiles does not guarantee that an application will start correctly, maintain behavioral parity or operate reliably in production-like environments. ScarfBench establishes a common evaluation foundation for enterprise-scale migration to verify that these properties are preserved.
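The gap between “compiles” and “works” is why validation has to be staged. As a minimal plain-Java sketch (the stage names, gating order and class names here are invented for illustration, not ScarfBench’s actual implementation), a migration can be gated through build, startup and behavioral checks in sequence, so a failure at any stage short-circuits the rest:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BooleanSupplier;

// Hypothetical sketch: each stage must pass before the next runs,
// mirroring the idea that compilation alone proves very little.
public class MigrationValidator {

    // Ordered stages: build -> startup -> behavioral validation tests.
    private final Map<String, BooleanSupplier> stages = new LinkedHashMap<>();

    public MigrationValidator addStage(String name, BooleanSupplier check) {
        stages.put(name, check);
        return this;
    }

    // Returns the name of the first failing stage, or "passed" if all succeed.
    public String run() {
        for (Map.Entry<String, BooleanSupplier> stage : stages.entrySet()) {
            if (!stage.getValue().getAsBoolean()) {
                return "failed at: " + stage.getKey();
            }
        }
        return "passed";
    }

    public static void main(String[] args) {
        // A migration that compiles but fails to boot stops at "startup":
        // the validation tests never even run.
        String result = new MigrationValidator()
                .addStage("build", () -> true)     // code compiles
                .addStage("startup", () -> false)  // app fails to boot
                .addStage("validation-tests", () -> true)
                .run();
        System.out.println(result); // prints "failed at: startup"
    }
}
```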

What ScarfBench provides

ScarfBench provides a suite of Java applications across frameworks and enables systematic assessment of AI agents’ ability to migrate enterprise Java applications while preserving functionality, idiomatic patterns, and architectural integrity.

Specifically, it provides the following seven components:

  1. Developer-verified enterprise applications implemented across Jakarta EE, Quarkus and the Spring Framework
  2. Focused examples that isolate specific enterprise technology concerns
  3. Whole applications that combine multiple architectural layers into complete systems
  4. Automated build and startup validation workflows
  5. Validation tests that verify runtime behavior and functional equivalence
  6. A public leaderboard for side-by-side comparison of tools and agents
  7. Comprehensive documentation, runtime CLI companion, and a Quick Start guide

Each workload has been manually implemented and validated by experienced developers to ensure functional equivalence and idiomatic framework usage across variants.

Two use cases

Large enterprise applications are organized into logical tiers (or layers) that separate concerns. Treating each layer as distinct allows modernization to proceed layer by layer. Our benchmark isolates the core technologies in each layer into consistent, verifiable workloads that can be migrated and tested independently.

  • Focused workloads: These isolate enterprise concerns such as persistence behavior, dependency injection patterns, integration mechanisms, web interfaces and security configurations. They allow teams to assess how well a migration tool handles specific framework constructs in controlled scenarios.
  • Whole applications: These integrate multiple architectural layers into realistic systems. They evaluate whether a migration approach maintains build integrity, runtime stability and correct behavior across interacting layers.

Together, these focused workloads and full-application scenarios enable both targeted experimentation and system-level evaluation of migration approaches. The benchmark is designed to expand over time with additional frameworks, more complex workloads and community-contributed scenarios.
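One way to picture the validation tests that verify functional equivalence is a behavioral-parity check: the same probe inputs are run through the original workload and its migrated counterpart, and the outputs must match. This is an illustrative plain-Java sketch, not ScarfBench’s actual test harness; the class name, method and the stand-in services are invented for the example:

```java
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch of functional-equivalence checking between
// an original workload and its migrated counterpart.
public class ParityCheck {

    // True only if both variants produce identical output for every probe.
    static <I, O> boolean equivalent(Function<I, O> original,
                                     Function<I, O> migrated,
                                     List<I> probes) {
        return probes.stream()
                .allMatch(in -> original.apply(in).equals(migrated.apply(in)));
    }

    public static void main(String[] args) {
        // Stand-ins for, say, a Jakarta EE service and its Quarkus migration.
        Function<Integer, Integer> originalService = n -> n * n;
        Function<Integer, Integer> migratedService = n -> n * n;

        System.out.println(
                equivalent(originalService, migratedService, List.of(1, 2, 3)));
        // prints "true": behavior is preserved across the migration
    }
}
```

In a real harness the probes would be HTTP requests, persistence operations or transaction sequences rather than integers, but the gating idea is the same.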

Reproducible evaluation and transparent results

ScarfBench supports consistent, repeatable evaluation workflows. The public leaderboard aggregates performance metrics such as build success, startup validation and validation test outcomes, enabling objective comparison and measurable progress.
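As a rough sketch of how such metrics might roll up into a comparable number (the record shape and gating logic here are assumptions for illustration, not ScarfBench’s published scoring), build success and startup validation can act as gates in front of the validation-test pass rate:

```java
// Hypothetical scoring sketch: the fields and gating rules are
// assumptions for illustration, not ScarfBench's actual methodology.
public record MigrationResult(boolean buildOk, boolean startupOk,
                              int testsPassed, int testsTotal) {

    // Build and startup are gates: a run that does not build or start
    // scores 0 regardless of how many tests would have passed.
    public double score() {
        if (!buildOk || !startupOk) {
            return 0.0;
        }
        return testsTotal == 0 ? 0.0 : (double) testsPassed / testsTotal;
    }

    public static void main(String[] args) {
        System.out.println(new MigrationResult(true, true, 9, 10).score());
        System.out.println(new MigrationResult(true, false, 10, 10).score());
    }
}
```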

Who ScarfBench supports

ScarfBench supports a broad set of technical communities with interest in AI-driven application development and transformation:

  1. Research teams studying AI-assisted program transformation
  2. Tool builders assessing the effectiveness of automated modernization systems
  3. Enterprise architects evaluating migration strategies
  4. Open-source contributors interested in reproducible benchmarking

More than a benchmark

ScarfBench is more than a benchmark. By combining public benchmarks, reproducible tooling, transparent metrics and a public leaderboard, ScarfBench provides the technical foundation you need to measure and compare agentic solutions with confidence.

Explore ScarfBench

See an overview of the benchmark

Read the Quick Start guide