Gemini’s efficiency play

Gemini 3.5 Flash might be fast enough for gen AI to make sense

Google says its more efficient Gemini 3.5 Flash is the key to your agentic AI future.

Ryan Whitwam – May 19, 2026 2:11 pm

Credit: Aurich Lawson

At last year’s I/O event, Google was still talking about the 2.5 branch of Gemini, and what a difference a year makes. We’ve gone through the 3.0 and 3.1 families since then, and now it’s on to version 3.5. Gemini 3.5 Flash is rolling out across a wide range of Google products starting today, and Google again claims this model is even better than its last-gen Pro model.

That has been a trend with Google’s tick-tock model updates over the past year, but the team says this release is special. Gemini 3.5 Flash supposedly offers frontier-level intelligence while also being efficient enough that it may finally make complex agentic tasks worth doing at scale. Tulsee Doshi, senior director of product management for Gemini, explains that the innovations of Gemini 3.5 Flash are woven through multiple Google products, and this is just the start.

It’s no secret that generative AI is currently a money pit, and all the major AI players are trying to find paths to greater efficiency. The problem is magnified when you start building agentic experiences that are supposed to run for extended periods to complete complex tasks. Gemini 3.5 Flash may be a big step toward making that viable. The new model can output nearly 300 tokens per second, yet its benchmark scores are similar to those of larger frontier models (like 3.1 Pro) that generate output at a quarter of that speed.
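To put that throughput gap in concrete terms, here is a back-of-the-envelope sketch in Python. The roughly 300 tokens-per-second figure comes from Google; the 75 tokens-per-second rate for 3.1 Pro is simply a quarter of that, and the 30-step agent task emitting about 1,500 tokens per step is a made-up workload for illustration, not anything Google described.

```python
# Decode rates: ~300 tokens/s is the figure quoted for Gemini 3.5 Flash;
# the 3.1 Pro rate assumes "a quarter of that speed", i.e. ~75 tokens/s.
FLASH_TOKENS_PER_SEC = 300.0
PRO_TOKENS_PER_SEC = FLASH_TOKENS_PER_SEC / 4


def generation_seconds(output_tokens: int, tokens_per_sec: float) -> float:
    """Time spent streaming output at a constant decode rate.

    Ignores time-to-first-token, prompt processing, and tool-call latency,
    so a real agent run would take longer than this estimate.
    """
    return output_tokens / tokens_per_sec


if __name__ == "__main__":
    # Hypothetical agent task: 30 steps, each emitting ~1,500 output tokens.
    total_output = 30 * 1_500
    flash_s = generation_seconds(total_output, FLASH_TOKENS_PER_SEC)
    pro_s = generation_seconds(total_output, PRO_TOKENS_PER_SEC)
    print(f"3.5 Flash: {flash_s:.0f} s")  # 150 s
    print(f"3.1 Pro:   {pro_s:.0f} s")    # 600 s
```

For that hypothetical task, the model spends about two and a half minutes generating text on 3.5 Flash versus ten minutes on 3.1 Pro, and the gap compounds for agents that run many such tasks or spawn sub-agents in parallel.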

Google now says that the companies using the most AI tokens could save a billion dollars per year by shifting to the more efficient Gemini 3.5 Flash. API pricing for the new model is 25 percent lower than that of the Pro model it apes: Gemini 3.5 Flash clocks in at $1.50 per 1M input tokens and $9 per 1M output tokens, while 3.1 Pro starts at $2 and $12, respectively, and costs more for prompts longer than 200k tokens.
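The pricing comparison is easy to check with a few lines of Python. This is an illustrative sketch, not an official calculator: the per-million-token prices are the ones quoted above, only 3.1 Pro's starting tier is modeled (its higher long-prompt rate is ignored), and the 100M-input, 20M-output monthly workload is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """API prices in USD per 1M tokens."""
    input_per_m: float
    output_per_m: float


# Prices as quoted in the article. The 3.1 Pro figures are its starting
# tier; the higher rate for long prompts is not modeled here.
GEMINI_35_FLASH = Pricing(input_per_m=1.50, output_per_m=9.00)
GEMINI_31_PRO = Pricing(input_per_m=2.00, output_per_m=12.00)


def cost_usd(p: Pricing, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of a workload, given raw input and output token counts."""
    return (input_tokens * p.input_per_m + output_tokens * p.output_per_m) / 1_000_000


if __name__ == "__main__":
    # Hypothetical monthly workload: 100M input tokens, 20M output tokens.
    inp, out = 100_000_000, 20_000_000
    flash = cost_usd(GEMINI_35_FLASH, inp, out)
    pro = cost_usd(GEMINI_31_PRO, inp, out)
    print(f"3.5 Flash: ${flash:,.2f}")         # $330.00
    print(f"3.1 Pro:   ${pro:,.2f}")           # $440.00
    print(f"Savings:   {1 - flash / pro:.0%}")  # 25%
```

Because each Flash price is exactly 75 percent of the corresponding 3.1 Pro starting rate, the 25 percent saving holds for any mix of input and output tokens; workloads that push Pro into its long-prompt tier would see a wider gap.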

According to Doshi, the team made numerous pre-training improvements for Gemini 3.5 Flash, but it is in post-training, where insights gleaned from how devs use Gemini models come into play, that the work is really paying off.

“With post-training, we’re really starting to unlock some of the value of the feedback we’re getting from users, for example, from Antigravity,” said Doshi. “That’s really what you’re seeing play out in terms of the code performance and the tool use performance. And then, the hope is that you’ll continue to see the step change where 3.5 Pro will be better, and the next Flash meets Pro performance with that series.”

Google is focused on code generation with the new model, since coding is a core use case for AI agents. Both Terminal Bench and SWE-Bench Pro show substantial improvements: 3.5 Flash clobbers older Flash models and shows a small but measurable improvement over Gemini 3.1 Pro. Its scores are in the same neighborhood as those of OpenAI’s much larger and more expensive GPT 5.5.

A major barrier for agentic workflows is getting generative models to use interfaces designed for humans. It’s not an easy problem to solve, Doshi said. “Certain things like UI control are expensive to do because the model has to search the page, it has to know where to click, it has to act through multiple steps. I think Flash is able to do that well because of that combination of quality and cost.”

Google’s AI evaluations show these improvements, too. Among Google’s current collection of benchmarks is OSWorld-Verified, which tests how models handle general tasks in real computing environments. The results mirror the coding gains: Gemini 3.5 Flash substantially outperforms older Flash models and scores a bit higher than Gemini 3.1 Pro. It’s essentially tied with GPT 5.5.

Google’s new Flash model is, again, a little better than the last-gen Pro. Credit: Google

Gemini 3.5 Flash has been rolled out internally at Google, and Doshi noted that it’s having a big impact. “We have a set of internal metrics we’ve been evaluating that measures how Googlers code, so looking at our own code bases and how well the models perform on that,” Doshi said. “And you can see a massive, massive jump between where 3.1 Pro was and where 3.5 Flash is.”

Google unveiled the Antigravity IDE last year, and it’s being upgraded to version 2.0 with support for Gemini 3.5 Flash. The update adds multiple parallel workflows, which are essentially sub-agents spawned by Gemini 3.5 Flash. Again, Google says this is only possible because the new model is so efficient at spitting out tokens.

In addition to Antigravity, Gemini 3.5 Flash is coming to the Gemini app, the API, AI Studio, Android Studio, and all of Google’s enterprise products. As for the Pro variant, Google says that’s already in internal testing, and it should be ready for release next month.

Gemini Spark is 3.5 Flash in agent form

Companies are moving on from “AI” as their primary buzzword to “agents.” With Gemini Spark, Google is offering its first dedicated agent to users. Spark runs 24/7 in Google’s cloud, so it doesn’t use any of your computing resources and isn’t tied to any specific device or browser tab. Instead, it spans your entire Google footprint, using Gemini 3.5 Flash to run multiple agentic workflows at your command.

Google doesn’t always explain its buzzwords very well. So what is an AI agent anyway? Google’s Doshi explains: “I think of agents as being able to take a model plus a harness [software interface] such that the combination can actually take action on your behalf.”

With Spark, you can give the AI instructions, and it handles the task. This can take place over time as the agent grabs context from your Drive files, Gmail, and more. You could have it watch for certain emails and integrate them into daily digests or have it monitor your meetings and generate summaries and action items. Spark can send you notifications or ask follow-up questions to better meet your needs, too, and Google stresses that it’s designed to ask for your approval before undertaking “high-stakes actions.”

Doshi says she has been a daily user of Gemini Spark during internal testing over the past few weeks, using it for personal and professional tasks. She provided two examples of Spark agents she uses. In the run-up to I/O, she used Spark to pull together evaluations and other stats on 3.5 Flash to build a slide deck for Google higher-ups. “It turned out beautifully,” she said. “Probably better and in much less time than I would have been able to do.”

On the personal side, she created an agent that tracks developmental milestones for her new child. The agent provides insights into the data and suggests other metrics worth tracking. “I’m treating my child like an AI model,” Doshi joked. “I realize that, but it has been very helpful.”

A lot of people may turn up their noses at providing so much personal data to an AI model running in Google’s cloud, but sensibilities may adjust if this stuff becomes truly useful. Many of the ways people share data with Google today would have been unthinkable 10 or 15 years ago.

Spark is rolling out to AI Ultra subscribers starting next week. Google has added a new Ultra tier that gives you access to the latest features. It costs $100 per month, which most would still consider an astronomical amount for AI tools, but the $200-per-month tier ($50 less than before) remains for those who want higher token limits. Google says it plans to roll Spark out to all users (even those who don’t pay for Gemini) down the road.

Gemini Omni: an everything model (eventually)

Veo 3, Google’s concerningly good video model, debuted at last year’s I/O, but there’s a new video generator in town this year. Gemini Omni Flash will replace Veo in products like the Gemini app, YouTube, and Flow. Google says Omni was designed to be truly multimodal, so it can accept any kind of input data and produce anything you want: images, text, video, or audio. However, it doesn’t do most of that right now. Google is starting with video, hence the swap with Veo.

While it’s similar to the new Gemini 3.5 models, Omni Flash is not explicitly part of that branch. This is something unique at Google, and it could represent a new direction for the company’s AI products. “The vision for Gemini has always been that it would be multimodal in, multimodal out,” Doshi said. “Omni is a step toward that vision.”

An example of AI video created by Gemini Omni.

Right now, you have to connect to the model that does what you want. For images, Google routes your prompt to Nano Banana. If you want music, your input goes to Lyria. Developers must plug in to the right API, and not all models are available in all tools. The day could be coming that everything passes through a unified model like Omni, but it’s still early days, and the Gemini team isn’t yet sure how Omni will develop.

The next few months will be telling as Google looks at opening the Omni model up to more output types to see how it performs compared to Google’s other models. “We might find that there are certain use cases that really benefit from their own custom model and specific focus,” said Doshi. “It’s not fully proven out yet that in the next few months we can pull everything into one experience.”

The first Omni release is a Flash model, meaning it’s smaller than the frontier Pro models. Google does intend to release an Omni Pro model at some point, but there’s no timeline for that. If multimodality in Omni comes together, these models may eventually form the basis for future Gemini releases to simplify Google’s AI ecosystem.

Updated 5/19 6PM ET with API pricing.

Ryan Whitwam Senior Technology Reporter

Ryan Whitwam is a senior technology reporter at Ars Technica, covering the ways Google, AI, and mobile technology continue to change the world. Over his 20-year career, he's written for Android Police, ExtremeTech, Wirecutter, NY Times, and more. He has reviewed more phones than most people will ever own. You can follow him on Bluesky, where you will see photos of his dozens of mechanical keyboards.
