ScarfBench: New Benchmark Tests AI Agents on Java Migrations

HuggingFace

July 1, 2026

Original source: huggingface.co

HuggingFace Drops ScarfBench: 50 Tasks, 3 Frameworks

HuggingFace has released ScarfBench, a benchmark designed to evaluate AI agents on enterprise Java framework migrations. It covers three common migration paths: Struts to Spring Boot, EJBs to MicroProfile, and JSP to Thymeleaf. Each of the 50 tasks is a realistic, multi-step migration that involves refactoring code, updating configuration, and verifying behavior. The benchmark includes pre- and post-migration codebases, test suites, and human-validated expected outcomes. The idea is to test not just code generation but full workflow execution, something most existing benchmarks ignore. Agents can use any tools they want, such as LSP servers, grep, and test runners, and each task runs in its own sandboxed Docker container.
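To make the per-task setup concrete, here is a minimal Python sketch of how a harness might describe one task and build the `docker run` command for it. The announcement does not describe ScarfBench's actual task schema or harness, so `MigrationTask`, its fields, the image name, and the `/workspace` mount point are all hypothetical. Only the `docker run` flags (`--rm`, `-v`, `-w`) are standard Docker CLI options.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MigrationTask:
    """Hypothetical description of one benchmark task (not ScarfBench's real schema)."""
    task_id: str
    migration_path: str   # e.g. "struts-to-spring-boot"
    source_dir: str       # host path to the pre-migration codebase
    test_command: tuple   # command that runs the task's test suite


def docker_command(task: MigrationTask, image: str) -> list:
    """Build a `docker run` invocation that runs the task's tests in a fresh container."""
    return [
        "docker", "run", "--rm",                # remove the container when it exits
        "-v", f"{task.source_dir}:/workspace",  # mount the codebase (assumed mount point)
        "-w", "/workspace",                     # run from the mounted codebase
        image,
        *task.test_command,
    ]


# Illustrative values only; none of these names come from the benchmark.
task = MigrationTask(
    task_id="example-001",
    migration_path="struts-to-spring-boot",
    source_dir="/tmp/tasks/example-001",
    test_command=("mvn", "-q", "test"),
)
cmd = docker_command(task, image="example/java-sandbox:latest")
print(" ".join(cmd))
```

Building the command as a list rather than a shell string keeps it safe to pass to `subprocess.run` without quoting issues, and one container per task means a failed migration cannot contaminate the next task's workspace.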

Why Java Migration Benchmarks Were Left Behind

Most AI coding benchmarks, such as HumanEval or SWE-bench, focus on small, standalone problems: write a function, fix a bug. That's fine for measuring basic coding chops. But enterprise Java migrations are a different beast. They involve sprawling codebases, strict type systems, legacy patterns like JSP scriptlets or EJB entity beans, and dependency injection nuances. No benchmark has touched this space seriously. The closest is SWE-bench's repository-level issues, but those rarely span multiple files or require deep framework knowledge. Meanwhile, real enterprises spend millions of dollars and months of developer time migrating from Struts to something modern. Frameworks like Spring Boot have been around for over a decade, yet many banks and insurers are still on JSP. So the field needed this: a measure of whether agents can really handle the grimy, real-world work that makes developers cry.

What This Means for Enterprise AI Adoption

If agents score well on ScarfBench, that's a green light for enterprises to start trusting AI on their messy monoliths. A 40% cost reduction in migration projects is plausible if you can automate 70% of the boilerplate, provided boilerplate accounts for more than half of the total effort. But here's the kicker: the benchmark reveals whether agents can actually *reason* about framework idioms, not just paste Stack Overflow snippets. For example, migrating a JSP custom tag to a Thymeleaf dialect requires understanding both tag APIs and template engine semantics. That's a reasoning task, not a retrieval one. The big winners might be tooling vendors like JetBrains or IBM (via watsonx Code Assistant) who integrate agentic workflows directly into enterprise tooling. Of course, no agent has been reported to score above 60% accuracy yet. That's telling.
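The 40% figure depends on how much of a migration budget is boilerplate in the first place. Here is a back-of-the-envelope Python sketch, under the simplifying assumptions that automated work costs nothing and that review overhead is ignored. Neither assumption comes from the announcement, and the function names are illustrative.

```python
def cost_reduction(boilerplate_share: float, automated_fraction: float) -> float:
    """Fraction of total project cost saved when `automated_fraction` of the
    boilerplate (itself `boilerplate_share` of total cost) is automated.
    Assumes automated work is free, which is optimistic."""
    if not (0.0 <= boilerplate_share <= 1.0 and 0.0 <= automated_fraction <= 1.0):
        raise ValueError("both arguments must be between 0 and 1")
    return boilerplate_share * automated_fraction


def required_boilerplate_share(target_reduction: float, automated_fraction: float) -> float:
    """Boilerplate share of total cost needed to reach `target_reduction`."""
    return target_reduction / automated_fraction


share = required_boilerplate_share(target_reduction=0.40, automated_fraction=0.70)
print(f"boilerplate must be about {share:.0%} of project cost")
```

Under these assumptions, boilerplate would need to account for roughly 57% of total migration cost for 70% automation to yield a 40% saving. If boilerplate is a smaller share, or if reviewing agent output is expensive, the saving shrinks accordingly.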

Open Questions: Generalization, Cheating, and Real Costs

ScarfBench tests exactly three migration paths. What about AngularJS to React? .NET to Java? The benchmark is a start, but it's narrow. Also, the sandbox environment might leak information—agents could memorize task patterns rather than learn migration principles. The creators claim they have a held-out 'evaluation set' of unseen tasks, but no details yet on how they prevent overfitting. Then there's the cost: running each agent on 50 tasks with Docker and network access could rack up hundreds of dollars. That prices out hobbyists and startups. HuggingFace says they'll release a smaller subset, but that undermines statistical reliability. Finally, what constitutes a 'pass'? Is it exact byte-for-byte match of the reference migration, or just passing the test suite? The paper leans toward test-passing plus structural similarity, but the threshold is fuzzy.
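The pass criterion described above can be sketched as a two-part check: the test suite must pass, and the candidate must be structurally close enough to the reference migration. The similarity metric and the threshold are not specified in the source, so the line-based `difflib.SequenceMatcher` ratio and the `0.6` cutoff below are stand-in assumptions chosen only for illustration.

```python
import difflib


def structural_similarity(candidate: str, reference: str) -> float:
    """Stand-in metric: ratio of matching non-blank lines (0.0 to 1.0).
    A real benchmark might compare ASTs instead; this is only an illustration."""
    cand_lines = [line.strip() for line in candidate.splitlines() if line.strip()]
    ref_lines = [line.strip() for line in reference.splitlines() if line.strip()]
    return difflib.SequenceMatcher(None, cand_lines, ref_lines).ratio()


def task_passes(tests_passed: bool, candidate: str, reference: str,
                threshold: float = 0.6) -> bool:
    """Pass only if the tests pass AND similarity meets the (assumed) threshold."""
    return tests_passed and structural_similarity(candidate, reference) >= threshold


reference = """
@RestController
public class HelloController {
    @GetMapping("/hello")
    public String hello() { return "hi"; }
}
"""

# Same controller, but the agent chose a different method name.
candidate = reference.replace("hello()", "greet()")

print(task_passes(True, candidate, reference))   # tests pass, 4 of 5 lines match
print(task_passes(False, reference, reference))  # identical code, but tests fail
```

The sketch also shows why the threshold matters: a migration that passes every test but takes a legitimately different structure from the reference would fail under a strict cutoff, while a loose cutoff reduces the check to test-passing alone.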

Frequently Asked Questions

What is ScarfBench exactly?

ScarfBench is a benchmark suite from HuggingFace that tests AI agents on enterprise Java framework migrations. It includes 50 realistic tasks covering three major migrations: Struts to Spring Boot, EJBs to MicroProfile, and JSP to Thymeleaf. Each task requires the agent to refactor code, update configurations, and run tests in a sandboxed Docker environment.

Why target Java framework migrations specifically?

Because enterprise Java migrations are notoriously labor-intensive and error-prone. They involve deep knowledge of legacy APIs, modern frameworks, and the subtle differences between them. No existing benchmark measured an agent's ability to perform such multi-step, context-rich migrations, so ScarfBench fills that gap.

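How is performance measured in ScarfBench?

The criteria are not fully spelled out. Each task ships with a test suite and human-validated expected outcomes, and the paper leans toward counting a task as passed when the test suite passes and the result is structurally similar to the reference migration, though the exact threshold is unclear.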

Can any AI model run on ScarfBench?

Yes. Any agent, from open-source models like CodeLlama to proprietary ones like GPT-4, can be evaluated as long as it operates in the provided Docker environment. By default, the benchmark uses an agent that can execute shell commands, edit files, and run tests autonomously.

What are the biggest limitations of ScarfBench?

It only covers three migration paths, missing many common enterprise scenarios like REST API migrations or cloud-native adoption. The environment may also allow memorization if tasks are reused. And the evaluation cost could limit participation. It's a strong first step, but not yet a comprehensive enterprise benchmark.