What Thousand Token Wood Actually Is
Hugging Face released a demo called Thousand Token Wood: a simulation in which 100 AI agents trade wood, stone, and gold in a minimal economy. Each agent runs on a small language model in the 3B-parameter class, specifically a quantized version of Gemma 2B. The entire simulation fits in 12GB of VRAM. Agents negotiate prices, form contracts, and even cheat. It's not a game; it's a proof of concept for running multi-agent systems on commodity hardware. The simulation processes 1,000 tokens per agent per round, hence the name.
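To make the setup concrete, here is a minimal sketch of what one round of such an economy could look like. It is not the Hugging Face code: the `Agent` class, the `propose` stub (which stands in for a language-model call), the flat cost of 50 tokens per proposal, and the one-for-one swap rule are all assumptions made for illustration. Only the three resources, the 100 agents, and the 1,000-token budget per agent per round come from the description above.

```python
import random
from dataclasses import dataclass, field

RESOURCES = ("wood", "stone", "gold")
TOKENS_PER_ROUND = 1_000  # per-agent budget stated in the article


@dataclass
class Agent:
    name: str
    inventory: dict = field(default_factory=dict)
    tokens_left: int = TOKENS_PER_ROUND

    def propose(self, rng: random.Random, cost: int = 50):
        """Stub standing in for a language-model call (hypothetical policy)."""
        if self.tokens_left < cost:
            return None  # budget exhausted for this round
        self.tokens_left -= cost
        give, want = rng.sample(RESOURCES, 2)  # two distinct resources
        return give, want


def run_round(agents, rng):
    """Pair agents at random; apply a one-for-one swap when both sides can pay."""
    for agent in agents:
        agent.tokens_left = TOKENS_PER_ROUND  # budget resets every round
    order = agents[:]
    rng.shuffle(order)
    for a, b in zip(order[::2], order[1::2]):
        offer = a.propose(rng)
        if offer is None:
            continue
        give, want = offer
        if a.inventory[give] > 0 and b.inventory[want] > 0:
            a.inventory[give] -= 1
            b.inventory[give] += 1
            b.inventory[want] -= 1
            a.inventory[want] += 1


def make_agents(n=100, seed=0):
    """Create n agents with random starting inventories (illustrative values)."""
    rng = random.Random(seed)
    return [
        Agent(f"agent-{i}", {r: rng.randint(0, 10) for r in RESOURCES})
        for i in range(n)
    ]


def total_stock(agents):
    """Sum each resource across all agents."""
    return {r: sum(a.inventory[r] for a in agents) for r in RESOURCES}
```

Because trades are pure swaps, `total_stock` stays constant from round to round. That gives a cheap sanity check before a real model is plugged in where the stub is.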
Why a 3B Model Actually Matters Here
Most multi-agent research uses frontier models like GPT-4 or Claude 3 Opus, which can cost thousands of dollars per run. Thousand Token Wood deliberately uses a very small model, and that changes the economics. If you can run 100 agents on a single RTX 4090, you can iterate fast. Until now, multi-agent systems have largely been reserved for labs with big budgets; Hugging Face's point is that you don't need one. The model's small size also forces compression: fewer parameters mean simpler agent strategies, which makes the economy more interpretable. It's a case of constraints becoming features.
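A rough back-of-the-envelope calculation shows why a model this size fits on one consumer GPU. The sketch below estimates memory for the weights alone. It assumes all agents share a single copy of the model, which the article does not confirm, and it ignores the KV cache, activations, and framework overhead. The bit widths are illustrative, since the exact quantization level is not stated.

```python
def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Approximate memory for model weights alone, in GB (10**9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# Illustrative bit widths for a 3B-parameter model (assumed, not from the source)
for bits in (16, 8, 4):
    print(f"{bits:>2}-bit: {weight_memory_gb(3e9, bits):.1f} GB")
# 16-bit: 6.0 GB
#  8-bit: 3.0 GB
#  4-bit: 1.5 GB
```

Under these assumptions, even unquantized 16-bit weights would leave headroom within the 12GB figure quoted above. The remaining memory goes to whatever per-agent context the runtime has to keep around.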
What This Means for Agentic AI and Research
Honestly, this is more interesting for the infrastructure than for the simulation itself. If multi-agent economies can run on a 3B model, then the bottleneck isn't model size; it's coordination overhead. Hugging Face also open-sourced the orchestration code, and that's the real deliverable. For researchers, this means you can test economic theories with agent-based models without burning API credits. For startups building AI NPCs or automated negotiating bots, this is a viable path. But don't expect these agents to pass Turing tests. They're dumb and limited, which makes the emergent behavior (like collusion or price-fixing) all the more surprising.
The Missing Benchmarks and Real-World Caveats
The big question: does the economy actually resemble human behavior? The blog post shows agents spontaneously forming cartels, but is that robust or a fluke? We don't know. The authors didn't run systematic ablation studies, and they used a single prompt template, so there is no measure of variance across prompts. The model is likely not fine-tuned for economic reasoning, so results might be brittle. Another unknown: how does it scale? 100 agents is cute; 10,000 agents would multiply the per-round token workload a hundredfold, likely more than a single GPU can serve in reasonable time. Hugging Face didn't release a multi-GPU version. Also, the token limit of 1,000 per round forces short interactions, and real-world negotiations need more context. Watch for replication attempts.
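The scaling concern can be put in numbers with simple arithmetic. The sketch below multiplies agents by the 1,000-token budget to get the per-round workload, then divides by a throughput figure. The value of `ASSUMED_THROUGHPUT` is a placeholder chosen for illustration, not a measurement from the demo or from any particular GPU.

```python
ASSUMED_THROUGHPUT = 2_000  # tokens per second; hypothetical placeholder


def tokens_per_round(n_agents: int, tokens_per_agent: int = 1_000) -> int:
    """Total tokens processed in one round across all agents."""
    return n_agents * tokens_per_agent


def round_seconds(n_agents: int, throughput_tok_s: float,
                  tokens_per_agent: int = 1_000) -> float:
    """Wall-clock seconds per round at a given aggregate throughput."""
    return tokens_per_round(n_agents, tokens_per_agent) / throughput_tok_s


for n in (100, 10_000):
    secs = round_seconds(n, ASSUMED_THROUGHPUT)
    print(f"{n:>6} agents: {tokens_per_round(n):>10,} tokens, "
          f"~{secs / 60:.1f} min per round")
```

At the placeholder rate, 100 agents need about 50 seconds per round, while 10,000 agents need roughly 83 minutes. The workload grows linearly with agent count, so the real limit depends on the throughput the hardware actually sustains.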