HuggingFace Releases ScarfBench: 50 Tasks, 3 Migration Paths
HuggingFace just released ScarfBench, a benchmark designed to evaluate AI agents on enterprise Java framework migrations. It covers three common migration paths: Struts to Spring Boot, EJBs to MicroProfile, and JSP to Thymeleaf. Each of the 50 tasks is a realistic, multi-step migration that involves refactoring code, updating configuration, and verifying behavior. The benchmark includes pre- and post-migration codebases, test suites, and human-validated expected outcomes. The idea is to test not just code generation but full workflow execution, something most existing benchmarks ignore. Agents can use any tools they want, such as LSP servers, grep, and test runners, and each task runs in its own sandboxed Docker container.
Why Java Migration Benchmarks Were Left Behind
Most AI coding benchmarks, like HumanEval or SWE-bench, focus on small, standalone problems: write a function, fix a bug. That's fine for measuring basic coding chops. But enterprise Java migrations are a different beast. They involve sprawling codebases, strict type systems, legacy patterns like JSP scriptlets or EJB entity beans, and dependency injection nuances. No benchmark has touched this space seriously. The closest is SWE-bench's repository-level issues, but those rarely span multiple files or require deep framework knowledge. Meanwhile, real enterprises spend millions of dollars and months of developer time migrating from Struts to something modern. Frameworks like Spring Boot have been around for over a decade, yet many banks and insurers are still on JSP. So the field needed this: a measure of whether agents can really handle the grimy, real-world work that makes developers cry.
What This Means for Enterprise AI Adoption
If agents can score well on ScarfBench, that's a green light for enterprises to start trusting AI on their messy monoliths. A 40% cost reduction in migration projects is plausible if agents can automate 70% of the boilerplate and boilerplate makes up more than half of the project cost. But here's the kicker: the benchmark reveals whether agents can actually *reason* about framework idioms, not just paste Stack Overflow snippets. For example, migrating a JSP custom tag to a Thymeleaf dialect requires understanding both the tag APIs and the template engine's semantics. That's a reasoning task, not a retrieval one. The big winners might be tooling vendors like JetBrains or IBM (via watsonx Code Assistant) who integrate agentic workflows directly into enterprise tooling. Of course, no agent has been reported to score above 60% accuracy yet. That's telling.
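To see what the 40% figure quietly assumes, here is a back-of-the-envelope sketch in Python. The function names, the `review_overhead` parameter, and the simple linear cost model are all assumptions made for illustration; none of them come from ScarfBench or HuggingFace. The model treats savings as the boilerplate share of project cost times the fraction of that boilerplate an agent handles, optionally discounted by the human time spent reviewing agent output.

```python
def migration_cost_reduction(boilerplate_share: float,
                             automated_fraction: float,
                             review_overhead: float = 0.0) -> float:
    """Fraction of total project cost saved under a simple linear model.

    boilerplate_share:  share of project cost that is boilerplate work (0..1)
    automated_fraction: share of that boilerplate an agent automates (0..1)
    review_overhead:    share of the savings lost to human review (0..1),
                        a hypothetical knob, not a measured quantity
    """
    for value in (boilerplate_share, automated_fraction, review_overhead):
        if not 0.0 <= value <= 1.0:
            raise ValueError("all inputs must be fractions between 0 and 1")
    return boilerplate_share * automated_fraction * (1.0 - review_overhead)


def required_boilerplate_share(target_reduction: float,
                               automated_fraction: float) -> float:
    """Boilerplate share needed to hit a target saving with no review overhead."""
    if automated_fraction <= 0.0:
        raise ValueError("automated_fraction must be positive")
    return target_reduction / automated_fraction


# How much of the project must be boilerplate for 70% automation to save 40%?
print(round(required_boilerplate_share(0.40, 0.70), 3))

# If boilerplate is only half the cost, 70% automation saves less than 40%.
print(round(migration_cost_reduction(0.50, 0.70), 3))
```

Under this model, 70% automation only yields a 40% saving when boilerplate is roughly 57% of the total cost, and any review overhead pushes the required share higher. Real projects are unlikely to be this linear, so treat the numbers as a sanity check rather than a forecast.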
Open Questions: Generalization, Cheating, and Real Costs
ScarfBench tests exactly three migration paths. What about AngularJS to React? .NET to Java? The benchmark is a start, but it's narrow. Also, the sandbox environment might leak information, and agents could memorize task patterns rather than learn migration principles. The creators claim they have a held-out 'evaluation set' of unseen tasks, but there are no details yet on how they prevent overfitting. Then there's the cost: running each agent on 50 tasks with Docker and network access could rack up hundreds of dollars. That prices out hobbyists and startups. HuggingFace says they'll release a smaller subset, but fewer tasks means noisier scores, which undermines statistical reliability. Finally, what constitutes a 'pass'? Is it an exact byte-for-byte match with the reference migration, or just passing the test suite? The paper leans toward test-passing plus structural similarity, but the threshold is fuzzy.
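To make the 'pass' question concrete, here is one way a "tests pass plus structural similarity" rule could look. This is a sketch under stated assumptions, not ScarfBench's actual scoring: the function names, the line-based similarity measure (Python's standard `difflib.SequenceMatcher`), and the default threshold of 0.5 are all hypothetical stand-ins for whatever the benchmark really uses.

```python
import difflib


def structural_similarity(candidate: str, reference: str) -> float:
    """Line-level similarity in [0, 1], ignoring indentation and blank lines.

    An assumed stand-in for 'structural similarity'; a real scorer might
    compare ASTs or build artifacts instead of normalized source lines.
    """
    def normalize(source: str) -> list[str]:
        return [line.strip() for line in source.splitlines() if line.strip()]

    matcher = difflib.SequenceMatcher(
        None, normalize(candidate), normalize(reference), autojunk=False
    )
    return matcher.ratio()


def task_passes(tests_passed: bool, candidate: str, reference: str,
                min_similarity: float = 0.5) -> bool:
    """Pass only if the test suite is green AND the code is close enough
    to the reference. min_similarity is a hypothetical threshold."""
    return tests_passed and structural_similarity(candidate, reference) >= min_similarity


reference = """@RestController
public class HelloController {
    @GetMapping("/hello")
    public String hello() { return "hi"; }
}
"""

# Same code with different indentation and an extra blank line.
candidate = """@RestController
public class HelloController {

  @GetMapping("/hello")
  public String hello() { return "hi"; }
}
"""

print(candidate == reference)                      # byte-for-byte comparison
print(structural_similarity(candidate, reference))
print(task_passes(True, candidate, reference))
```

The example shows why the choice matters: a byte-for-byte rule rejects a migration that differs only in whitespace, while a test-plus-similarity rule accepts it. Where the threshold sits decides how much legitimate variation in style and structure an agent is allowed.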