DeepMind

DeepMind gives Gemini 3.5 Flash desktop control

June 26, 2026

Original source: deepmind.google

What Gemini 3.5 Flash can now do with a screen

DeepMind has announced that Gemini 3.5 Flash can now operate a computer interface directly. The model, a 7-billion-parameter variant of the Gemini 3.5 family, takes screenshots of the user's desktop, interprets them, and executes actions such as clicking buttons, typing text, and navigating menus. It's powered by a new multimodal training regime that includes millions of synthetic screen recordings paired with action traces. Google claims the system completes web-based tasks such as filling out multi-step forms or booking a flight 72% of the time, measured on a benchmark of 100 common desktop workflows. The feature is rolling out to API users starting today, with a public demo expected at Google I/O next month.
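The screenshot, interpret, act cycle described above can be sketched as a simple loop. This is a minimal illustration, not DeepMind's implementation: the `Action` type, the `run_agent` function, and the three callables are hypothetical names, and the model is replaced by a scripted stand-in so the example runs without any API access.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Action:
    kind: str       # "click", "type", or "done" (hypothetical action vocabulary)
    x: int = 0      # screen coordinates, used by "click"
    y: int = 0
    text: str = ""  # text payload, used by "type"


def run_agent(
    goal: str,
    take_screenshot: Callable[[], bytes],
    propose_action: Callable[[str, bytes, List[Action]], Action],
    execute: Callable[[Action], None],
    max_steps: int = 20,
) -> List[Action]:
    """Observe-act loop: capture the screen, ask the model for one action,
    execute it, and repeat until the model says it is done or the step
    budget runs out."""
    history: List[Action] = []
    for _ in range(max_steps):
        screen = take_screenshot()
        action = propose_action(goal, screen, history)
        history.append(action)
        if action.kind == "done":
            break
        execute(action)
    return history


# Stand-in for the model: a fixed script, NOT a real Gemini call.
def scripted_model(goal: str, screen: bytes, history: List[Action]) -> Action:
    script = [Action("click", 100, 200), Action("type", text="Ada"), Action("done")]
    return script[len(history)]


executed: List[Action] = []
trace = run_agent(
    goal="fill in the name field",
    take_screenshot=lambda: b"fake-screenshot-bytes",
    propose_action=scripted_model,
    execute=executed.append,
)
print([a.kind for a in trace])
```

In a real integration, `take_screenshot` and `execute` would wrap an OS-level automation layer and `propose_action` would call the model. The `max_steps` cap is a design choice that keeps a confused model from looping forever.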

Where this fits in the agentic AI arms race

Anthropic launched a similar computer-use feature for Claude in October 2024. OpenAI followed with Operator in January 2025. Google was late to the party, so late that some analysts wondered if it would even show up. The company's Gemini 3.0 models had limited tool-use capabilities, but nothing that could take over a cursor. The Flash variant changes that. By coupling a lightweight model (7B parameters, runs locally on a decent laptop) with a dedicated action-planning module, DeepMind sidesteps the latency problems that plagued Claude's early computer-use attempts. The training data is synthetic, which means DeepMind isn't scraping people's desktops. That's a PR win after years of privacy scandals. The short version: Google needed a fast, low-cost agentic model. This is it.

What this means for developers and everyday users

For developers, the implications are immediate. If you're building an automation tool that needs to navigate legacy enterprise software (think SAP, Salesforce, or a clunky hospital records system), you can now point Gemini 3.5 Flash at a screen and let it handle the clicks. DeepMind claims the model costs $0.001 per screen-action request, making it cheap enough for high-volume tasks. For consumers, the use case is thinner. Do you really want an AI moving your mouse while you watch? Maybe for data entry or testing, but the real value is in backend automation. One concrete example: a startup could automate QA testing by feeding Gemini screenshots of a web app and asking it to find broken links. That's not science fiction. That's an API call away.
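To put the quoted price in perspective, here is a back-of-the-envelope cost estimate. The $0.001 per action figure comes from the article; the 20-step workflow and the 10,000 runs per day are illustrative assumptions, and `workflow_cost` is a hypothetical helper.

```python
COST_PER_ACTION_USD = 0.001  # per screen-action request, as quoted in the article


def workflow_cost(steps: int, runs: int, cost_per_action: float = COST_PER_ACTION_USD) -> float:
    """Total cost when every step of every run is one billed screen action."""
    return steps * runs * cost_per_action


# Assumed workload: a 20-step workflow executed 10,000 times a day.
daily_cost = workflow_cost(steps=20, runs=10_000)
print(f"${daily_cost:.2f} per day")
```

The estimate ignores retries after failed runs, which would push the real figure higher.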

The limitations and open questions DeepMind isn't talking about

DeepMind's demo is impressive, but the fine print matters. The 72% success rate is on curated tasks, not real-world chaos. The model struggles with pop-ups, multi-monitor setups, and any interface that changes layout dynamically. It also has no memory of previous sessions, so you can't say 'remember that page from yesterday' and expect it to work. Latency is another issue: each action takes 1.5 to 3 seconds, which adds up to 30 to 60 seconds for a 20-step workflow. And then there's safety. The model can execute arbitrary clicks and keystrokes. What stops it from accidentally deleting system files or buying things without your consent? DeepMind says it has implemented 'action confirmation' for sensitive operations, but it hasn't published the red-team results. We'll need to see those before trusting it with a real bank account.
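The latency point is easy to quantify. The sketch below multiplies the per-action range quoted above by the number of steps; `workflow_latency` is a hypothetical helper, and it assumes the actions run strictly one after another with no other overhead.

```python
from typing import Tuple


def workflow_latency(steps: int, min_s: float = 1.5, max_s: float = 3.0) -> Tuple[float, float]:
    """Best-case and worst-case wall-clock time for a sequential workflow."""
    return steps * min_s, steps * max_s


low, high = workflow_latency(steps=20)
print(f"20 steps: {low:.0f} to {high:.0f} seconds")
```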

Frequently Asked Questions

Does Gemini 3.5 Flash with computer use run on-device?

Yes, the 7B-parameter model can run locally on a laptop with an Apple M3 chip or an equivalent GPU. DeepMind also offers a cloud API for devices without enough horsepower.

How does this compare to Claude's computer use feature?

Gemini 3.5 Flash is faster and cheaper — about 40% lower latency and 30% lower cost per action. But Claude has better accuracy on complex multi-step tasks (78% vs. 72% on similar benchmarks).
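One way to weigh the cost advantage against the accuracy gap is to compare the expected cost per successful task. The sketch below is an illustration under strong assumptions: failed tasks are retried from scratch, attempts succeed independently, both systems use the same number of actions per attempt, and the two success rates are treated as directly comparable even though they come from "similar" rather than identical benchmarks. Costs are expressed relative to Claude's cost per attempt, and `cost_per_success` is a hypothetical helper.

```python
def cost_per_success(cost_per_attempt: float, success_rate: float) -> float:
    """Expected cost of one successful task when failures are retried.

    With independent attempts, the expected number of attempts is
    1 / success_rate, so the expected cost is cost_per_attempt / success_rate.
    """
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return cost_per_attempt / success_rate


claude = cost_per_success(cost_per_attempt=1.00, success_rate=0.78)  # baseline cost
gemini = cost_per_success(cost_per_attempt=0.70, success_rate=0.72)  # 30% cheaper
print(f"Claude: {claude:.2f}, Gemini: {gemini:.2f} (relative cost per success)")
```

Under these assumptions the lower price more than offsets the lower success rate. That conclusion would not hold for tasks where a failed attempt is costly in itself, such as a half-submitted form.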

What training data did DeepMind use to teach the model to click things?

They generated millions of synthetic screen recordings using automated scripts that simulate human interaction with web apps and desktop software. No real user screenshots were used, which avoids privacy concerns.

Can the model handle any software or only web browsers?

It works with any application that can be rendered on a screen — web browsers, native apps, even terminal emulators. However, it's optimized for GUI-based interactions and struggles with command-line interfaces.

Is there a risk of the model causing accidental damage like deleting files?

DeepMind has implemented action approval for operations that modify system files, send emails, or involve financial transactions. Users can also set custom deny-lists. Red-team testing results are expected next quarter.
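The combination of action approval and a custom deny-list can be pictured as a gate that every proposed action passes through before it is executed. DeepMind has not published how its mechanism works, so everything here is an assumption for illustration: the `PlannedAction` type, the `is_allowed` function, the category names in `SENSITIVE_KINDS`, and the substring-based deny-list matching are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

# Hypothetical labels for the sensitive categories named in the answer above.
SENSITIVE_KINDS = {"modify_system_file", "send_email", "financial_transaction"}


@dataclass(frozen=True)
class PlannedAction:
    kind: str    # what the agent wants to do
    target: str  # what it wants to do it to (path, address, button label)


def is_allowed(
    action: PlannedAction,
    deny_list: Iterable[str],
    confirm: Callable[[PlannedAction], bool],
) -> bool:
    """Return True only if the action may be executed.

    Deny-listed targets are always blocked. Sensitive actions need an
    explicit yes from the confirm callback. Everything else passes.
    """
    if any(blocked in action.target for blocked in deny_list):
        return False
    if action.kind in SENSITIVE_KINDS:
        return confirm(action)
    return True


deny = ["/etc", "bank.example.com"]
always_no = lambda action: False  # simulates a user who declines every prompt

print(is_allowed(PlannedAction("click", "Save button"), deny, always_no))
print(is_allowed(PlannedAction("send_email", "boss@example.com"), deny, always_no))
print(is_allowed(PlannedAction("click", "/etc/passwd"), deny, lambda action: True))
```

Checking the deny-list before asking for confirmation means a user cannot accidentally approve something they previously chose to block outright.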