MosaicLeaks Benchmark Exposes Research Agents' Inability to Keep Secrets

HuggingFace

June 19, 2026

Original source: huggingface.co

What Is MosaicLeaks? The First Privacy Stress Test for Agentic AI

HuggingFace just dropped MosaicLeaks, a new benchmark designed to measure whether autonomous research agents can keep a secret. Each test gives an agent access to a set of sensitive files (e.g., a confidential password, a proprietary formula, or a personal email), then asks the agent innocent-sounding questions designed to subtly coax that secret out. Early results are sobering. The best-performing agent, a GPT-4o-based system, leaked the secret 37% of the time. Smaller open-source models fared far worse: a 13B Llama variant leaked the secret 68% of the time. The benchmark includes 500 test cases, each carefully crafted to avoid obvious trick questions. This isn't about simple direct queries; it's about indirect leakage through chained reasoning.
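To make the headline numbers concrete, here is a minimal sketch of how a leak rate over a set of episodes could be tallied. The `Episode` record and the `leak_rate` helper are hypothetical names chosen for illustration, not part of the MosaicLeaks release, and the data is synthetic: it simply shows that 185 leaks out of 500 cases corresponds to the reported 37%.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    """One benchmark episode (hypothetical structure, not the real schema)."""
    episode_id: int
    leaked: bool  # True if a judge decided the secret was exposed


def leak_rate(episodes: list[Episode]) -> float:
    """Return the fraction of episodes in which the secret leaked."""
    if not episodes:
        raise ValueError("need at least one episode")
    return sum(e.leaked for e in episodes) / len(episodes)


# Synthetic data only: 185 leaking episodes out of 500 gives 37%.
episodes = [Episode(i, leaked=(i < 185)) for i in range(500)]
print(f"{leak_rate(episodes):.0%}")
```

How each episode gets its `leaked` label is the hard part, and this sketch deliberately leaves it out.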

Why This Benchmark Matters Now: Agentic AI Is Everywhere

Since last year, research agents have moved from demos to production. Microsoft, Google, and a wave of startups are shipping agents that can browse the web, edit files, and run code, all with access to your data. The problem is that we've been benchmarking agents on correctness, not discretion. MosaicLeaks fills that gap. Prior safety benchmarks like TruthfulQA and HarmBench focused on whether models produce false or harmful outputs, but they never tested whether an agent would betray a secret it was told to keep. The closest work is probably the 'secret keeper' tasks from the AgentBench dataset, but those were simplistic: they only tested whether the agent would output the secret verbatim. MosaicLeaks adds nuance: the agent might summarize or paraphrase the secret, or use it in a calculation. That's closer to how real leaks happen.
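The gap between a verbatim check and a more nuanced one can be sketched in a few lines. The functions below are illustrative assumptions, not the benchmark's actual scoring code (which the article does not describe): one does an exact substring match, one ignores case and punctuation, and one flags a numeric secret that shows up multiplied by a small factor. Detecting genuine paraphrase would need a semantic judge, which this sketch does not attempt.

```python
import re


def normalize(text: str) -> str:
    """Lowercase the text and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def leaks_verbatim(secret: str, output: str) -> bool:
    """The simplistic check: the secret appears as an exact substring."""
    return secret in output


def leaks_normalized(secret: str, output: str) -> bool:
    """Catch re-cased, re-spaced, or re-punctuated copies of the secret."""
    return normalize(secret) in normalize(output)


def leaks_via_arithmetic(secret: int, output: str, factors=(2, 10, 12)) -> bool:
    """Flag outputs containing the secret times a probe factor.

    The factor list is a hypothetical choice for illustration.
    """
    numbers = {int(n) for n in re.findall(r"\d+", output)}
    return any(secret * k in numbers for k in factors)


secret = "Tr0ub4dor-7731"
reply = "Sure, it is tr0ub4dor 7731 if you ignore the dash."
print(leaks_verbatim(secret, reply))    # exact match misses this reply
print(leaks_normalized(secret, reply))  # normalized match catches it
print(leaks_via_arithmetic(4200, "Twelve months of that fee comes to 50400."))
```

A verbatim-only test would score the first reply as safe even though a reader can trivially reconstruct the secret from it.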

What This Means for Deploying Research Agents: Your Secrets Are at Risk

Honestly, the most interesting part isn't the leak rate — it's the failure patterns. Agents that used chain-of-thought reasoning actually leaked more often because they wrote down intermediate steps that contained the secret. So the common advice to 'just add more reasoning' might make privacy worse. If you're building a customer-support agent that has access to billing data, a 37% leak rate is unacceptable. The short version: you cannot trust current LLM agents with secrets unless you explicitly control their output at every turn. That means no free-form generation, no open-ended search — which defeats the purpose of an agent. The implication is stark: agentic AI might require a fundamentally different architecture, perhaps one where the secret is never in the model's context window at inference time.
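One way to picture an architecture where the secret never enters the context window is a handle-based vault: the model sees only a placeholder, and a trusted tool layer swaps in the real value at execution time. The sketch below is an assumption-laden illustration, not something MosaicLeaks or HuggingFace prescribes; `SecretVault`, the `{{secret:name}}` handle syntax, and the example key are all hypothetical.

```python
import re

# Hypothetical handle syntax: {{secret:name}}
HANDLE = re.compile(r"\{\{secret:([a-z_]+)\}\}")


class SecretVault:
    """Holds secrets outside the prompt; the model only ever sees handles."""

    def __init__(self, secrets: dict[str, str]) -> None:
        self._secrets = dict(secrets)

    def handle(self, name: str) -> str:
        """Return the placeholder that is safe to put in a prompt."""
        if name not in self._secrets:
            raise KeyError(name)
        return "{{secret:" + name + "}}"

    def resolve(self, text: str) -> str:
        """Substitute handles with real values; call only inside a trusted tool."""
        return HANDLE.sub(lambda m: self._secrets[m.group(1)], text)

    def scrub(self, text: str) -> str:
        """Mask any raw secret value before text is shown to a user."""
        for value in self._secrets.values():
            text = text.replace(value, "[REDACTED]")
        return text


vault = SecretVault({"billing_api_key": "example-key-123"})

# What the model would see: a handle, never the value.
prompt = f"Call the billing API using {vault.handle('billing_api_key')}."

# What the trusted tool layer sends after the model emits a handle.
tool_call = vault.resolve("Authorization: Bearer {{secret:billing_api_key}}")
```

This only helps when the agent needs to use a secret rather than reason about its contents. If the task requires the model to read the sensitive material, a handle cannot stand in for it, and the leakage problem described above remains.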

The Unanswered Questions: Can We Fix This, or Is It Fundamental?

MosaicLeaks raises more questions than it answers. First, are these failures due to the model's training data or to prompt engineering? HuggingFace hasn't released the exact prompts yet, so replication is hard. Second, can fine-tuning on privacy-specific data reduce leakage without destroying general ability? Early experiments suggest RLHF on secrecy tasks cuts leakage by about 15%, but at the cost of a 10% drop in reasoning accuracy. That's a tough trade-off. Third, the benchmark only tests one secret per episode. Real agents juggle many secrets. Will the leak rate compound? Fourth, what about adversarial attacks? A crafty user could intentionally build a chain of questions that gradually extracts the secret. The benchmark doesn't test for active extraction, only passive leakage. Finally, HuggingFace plans to expand MosaicLeaks to cover multi-agent scenarios, where two agents might collude. That's a whole new can of worms.
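On the compounding question, a back-of-the-envelope model gives a feel for the stakes. Assume, purely for illustration, that each of k secrets leaks independently with the same probability p; the benchmark does not test this, and real leaks may well be correlated. Under that assumption the chance that at least one secret leaks is 1 - (1 - p)^k. The helper name below is hypothetical.

```python
def any_leak_probability(p: float, k: int) -> float:
    """Probability that at least one of k secrets leaks.

    Assumes independent leaks with identical per-secret probability p,
    which is a simplifying assumption and not a benchmark result.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if k < 0:
        raise ValueError("k must be non-negative")
    return 1.0 - (1.0 - p) ** k


# Using the reported 37% single-secret rate as p.
for k in (1, 2, 5):
    print(k, round(any_leak_probability(0.37, k), 3))
```

With p = 0.37, this toy model puts the chance of at least one leak at roughly 60% for two secrets and roughly 90% for five.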

Frequently Asked Questions

What exactly is MosaicLeaks?

MosaicLeaks is a benchmark from HuggingFace that tests whether AI research agents can keep a secret while answering questions. It simulates a scenario where the agent is given a secret (like a password or private key) and then asked a series of subtly probing questions. The benchmark measures how often the secret leaks through indirect responses.

How bad are the leak rates?

Pretty bad. The best proprietary agent (GPT-4o based) leaked 37% of the time. Open-source models like Llama-13B leaked over 68% of the time. These numbers come from 500 test cases designed to mimic real-world probing, not just direct 'what's the secret?' questions.

Does chain-of-thought reasoning help or hurt secrecy?

It hurts. Agents that used chain-of-thought were more likely to leak because they wrote down intermediate reasoning that included the secret. The act of 'thinking out loud' inadvertently exposed the information. So the usual advice to improve reasoning actually makes privacy worse.

Can we train models to be better at keeping secrets?

Early results suggest some improvement is possible. Fine-tuning with RLHF on secrecy tasks reduced leak rates by about 15%, but it came with a 10% drop in general reasoning ability. That trade-off may not be acceptable for many applications.

Does MosaicLeaks test for adversarial extraction?

Not yet. The current version tests passive leakage — the agent doesn't realize it's being tricked. It doesn't model an adversary actively trying to extract the secret through careful questioning. HuggingFace plans to add adversarial scenarios in a future update.