Look who’s talking

The debut of Gemini 3.1 Flash Live could make it harder to know if you’re talking to a robot

Google’s new conversational audio AI is rolling out in search, Gemini, and developer tools today.

Ryan Whitwam
Credit: Google

Text generated by artificial intelligence often has a particular vibe that gives it away as machine-generated, but it has become harder to pick out those idiosyncrasies as the tech has improved. We may be seeing a similar evolution of generative AI audio. Google has announced a new AI audio model called Gemini 3.1 Flash Live—as the name implies, it’s designed for real-time conversation. It’s rolling out in some Google products starting today, and developers will be able to start building their own chatty robots with the model, too.

Google says this AI is much faster and produces speech with a more natural cadence, aiming to solve a long-running issue with AI-generated speech. As with a text chatbot, there's always a delay between input and output in generative audio systems, and longer delays and unnatural inflection make conversations feel sluggish and harder to follow. Researchers generally believe 300 milliseconds of latency is about the limit for natural-feeling speech, but Google has not specified a latency figure for Gemini 3.1 Flash Live. The company just vaguely promises it has the speed you need.

But benchmark numbers? Google has plenty of those, which it claims show that 3.1 Flash Live will be a more reliable way to have audio-to-audio AI conversations. For example, a big gain on the ComplexFuncBench Audio benchmark suggests the new model is better at complex, multi-step tasks. Gemini 3.1 Flash Live also tops the charts on the Big Bench Audio test, which evaluates reasoning with a set of 1,000 audio questions.

Meanwhile, a strong showing on Scale AI's Audio MultiChallenge means the new Gemini model is better able to cope with hesitation and interruptions in the audio input. Although it outpaces other real-time audio models, Gemini 3.1 Flash Live only manages 36.1 percent on this test. Audio models that are not designed to operate conversationally can reach scores over 50 percent on MultiChallenge.

Credit: Google

The upshot is that Gemini 3.1 Flash Live should sound more like a person, to the point that Google felt it was time to add AI disclosure measures. Outputs from this model carry SynthID watermarks, which are imperceptible to human listeners but can be detected if someone tries to pass off Gemini AI speech as the real thing.

Google has partnered with companies like Home Depot, Verizon, and others to test the model, and all of them offer glowing reports in Google's blog post on how well 3.1 Flash Live can mimic human speech. So the next AI assistant you encounter on a phone call might sound much more realistic. Maybe you'll even think you're talking to a person, and SynthID can't help with that.

Developers can now access the model in AI Studio, the Gemini API, and Gemini Enterprise for Customer Experience. The latter is essentially a toolkit for agentic shopping. Gemini 3.1 Flash Live will be seen most prominently in Gemini Live and Search Live (a feature of AI Mode). The new conversational AI is rolling out in those products starting today.

Ryan Whitwam Senior Technology Reporter
Ryan Whitwam is a senior technology reporter at Ars Technica, covering the ways Google, AI, and mobile technology continue to change the world. Over his 20-year career, he's written for Android Police, ExtremeTech, Wirecutter, NY Times, and more. He has reviewed more phones than most people will ever own. You can follow him on Bluesky, where you will see photos of his dozens of mechanical keyboards.