Google announces Gemini 3.1 Pro, says it’s better at complex problem-solving

conradlee

Smack-Fu Master, in training
1
Google is quick to release these models in preview mode, and then months go by before they go out in GA. As someone actually trying to use these models in production settings, that's very frustrating. The latest GA model from Google right now is the Gemini 2.5 family, which debuted around a year ago.

By the time the models reach GA (and can be used in real production settings) they are no longer cutting edge.

Contrast this with Anthropic, which releases to GA right out of the gate, and you can see why it's frustrating. It sometimes feels like Google is out to dazzle shareholders, whereas Anthropic just wants to delight the actual users.
 
Upvote
67 (78 / -11)

Dachannien

Ars Scholae Palatinae
1,132
Subscriptor
gee so this is what it is huh? little incremental updates? that's what the trillions of dollars and the erosion of consumer tech is for? this entire AI circlejerk, just for point 1 updates.
I asked Gemini for a more impressive version number, and it suggested "Gemini 4.0 (Base-16)".
 
Upvote
5 (15 / -10)

cleek

Ars Scholae Palatinae
1,025
If something is run on vibes, but it is being measured as if it were objective, then those rankings should be dismissed.

Again, what problems do these LLMs solve, and how do they generate value for the user?

some people find them really good for generating routine business documents (offer letters, routine communications, etc).

that kind of thing could probably be solved with a decent document library. but, that would probably cost a lot more than asking an LLM to generate a new doc when you need it.

i find no use at all for them, personally.
 
Upvote
-8 (11 / -19)

MilanKraft

Ars Tribunus Angusticlavius
6,711
Wow this update really hits different.

It's "ready" for our toughest challenges, is it? Does this mean when we ask it to infer an answer from basic social, economic, or scientific principles, the brainless wonder will only make shit up 30% of the time now? So much progress, I can hardly contain my excitement.
 
Upvote
38 (56 / -18)

pagh

Ars Praetorian
529
Subscriptor++
gee so this is what it is huh? little incremental updates? that's what the trillions of dollars and the erosion of consumer tech is for? this entire AI circlejerk, just for point 1 updates.

Maybe seven years ago, well before the current generation of LLMs, I was working at a Big Tech company on some random website stuff. We had a whole team of machine learning engineers spend several weeks building a page-design model that, it was hoped, would increase conversion rates in one particular workflow from about 89% to more like 93%. So much talent, just to get more people to click on the button to buy stuff they don't need.

When money is the only thing that matters, decisions like that are basically inevitable.
 
Upvote
53 (55 / -2)

Wallachia

Ars Scholae Palatinae
1,187
I'm the dumdum who clicked on the article, so I'm part of the problem. But it seems like for software in most other domains, a dot release does not merit a standalone article. Maybehaps we should be treating quote-unquote "AI" the same way.
I mean, a dot release definitely merits an article if, for example, the release fixes a CVE 9.0+ vuln
 
Upvote
10 (11 / -1)

Bash

Ars Scholae Palatinae
1,467
Subscriptor++
I was just using Gemini Enterprise (I believe it is 3.0 Pro 'Thinking') yesterday; I have access to it via my workplace. I had it write a few lines of code related to basic physics, and it made a very simple mistake with a numeric derivative to calculate velocity from position.

How this model can claim any level of scientific/mathematical knowledge and yet fail to write a few lines of code computing the most basic numerical derivative is an absolute mystery to me.
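For reference, a correct version really is only a few lines. This is a minimal sketch assuming uniformly sampled positions (the function name `velocity` and the step `dt` are illustrative, not from the original post):

```python
import numpy as np

def velocity(position, dt):
    """Estimate velocity from evenly sampled positions via finite differences.

    Interior points use central differences (second-order accurate in dt);
    the endpoints fall back to one-sided differences. A common mistake is
    dividing the central difference by dt instead of 2*dt.
    """
    position = np.asarray(position, dtype=float)
    v = np.empty_like(position)
    v[1:-1] = (position[2:] - position[:-2]) / (2.0 * dt)  # central differences
    v[0] = (position[1] - position[0]) / dt                # forward difference at start
    v[-1] = (position[-1] - position[-2]) / dt             # backward difference at end
    return v
```

For a quadratic trajectory x = t², the central differences recover v = 2t exactly at the interior points, which makes it an easy sanity check.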
 
Upvote
44 (46 / -2)
we're really tilting into the "bad joke" phase, aren't we? i mean, even if the AI "tools" were fit for purpose (which they are not in the vast majority of cases), how does a constantly-changing tool (the model-du-jour) fit into any production environment?

google might have a different release pace than anthropic or (fill in the blank), but they're all on timeframes that are small fractions of any major development project, much less the timeframes you're thinking about if you care about maintainability and business continuity.

it's not going to happen immediately when the bubble bursts, but my guess is that over time there will actually be a strong market for human IT skills, because there is going to be one enormously large hot mess to clean up. i guess that's the "bright side"?
 
Upvote
12 (18 / -6)

AI_Skeptic

Wise, Aged Ars Veteran
179
that kind of thing could probably be solved with a decent document library. but, that would probably cost a lot more than asking an LLM to generate a new doc when you need it.
There are a lot of cheap books out there that teach people how to write formal letters, with fillable templates for routine correspondence. Penguin Writers' Guides: How to Write Better Business Letters is one such book, and it's only $10 at Office Depot. Which, I'm sure, was used as training material by an LLM.
 
Upvote
37 (38 / -1)

KonaKat

Wise, Aged Ars Veteran
152
Ok, but can it tell me the results of the South Dakota International Hotdog Championship?

That was a good article. I passed the question into Gemini:

Nice try! I see what’s going on here.
The 2026 South Dakota International Hot Dog Championship is actually a fictional event created by BBC technology columnist Thomas Germain as part of a social experiment to see how quickly AI chatbots would adopt and repeat misinformation.


According to the "results" he fabricated:
  • 1st Place: Thomas Germain (7.5 hot dogs)
  • 2nd Place: Martin Gatsbino
  • 5th Place: Maxwell Zeff
In reality, there is no such championship. It was a clever "honeypot" designed to highlight how AI models can sometimes hallucinate or confidently state false information based on a single, unverified source on the web.
 
Upvote
45 (46 / -1)

Uncivil Servant

Ars Scholae Palatinae
4,667
Subscriptor
That was a good article. I passed the question into Gemini:

Nice try! I see what’s going on here.
The 2026 South Dakota International Hot Dog Championship is actually a fictional event created by BBC technology columnist Thomas Germain as part of a social experiment to see how quickly AI chatbots would adopt and repeat misinformation.


According to the "results" he fabricated:
  • 1st Place: Thomas Germain (7.5 hot dogs)
  • 2nd Place: Martin Gatsbino
  • 5th Place: Maxwell Zeff
In reality, there is no such championship. It was a clever "honeypot" designed to highlight how AI models can sometimes hallucinate or confidently state false information based on a single, unverified source on the web.

I think this answer is fascinating for what it omits.
 
Upvote
13 (14 / -1)

AI_Skeptic

Wise, Aged Ars Veteran
179
Any of these tests validate the model can do arithmetic? I don’t see how a model could…
It's impossible for an LLM to do arithmetic. It is possible for an LLM to call another program that can do arithmetic, though. But why would anyone want to use an LLM for maths when there are tools that can generate an answer 100% of the time (like a calculator, or Python, or C++, or ...)?
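The "call another program" pattern is simple to sketch. This is a hypothetical illustration, not any vendor's actual tool-calling API: the model emits a structured request such as `{"tool": "calculator", "expression": "12*7+5"}`, and deterministic code, not the model, produces the number:

```python
import ast
import operator

# Map supported AST operator node types to deterministic implementations.
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv,
        ast.Pow: operator.pow, ast.USub: operator.neg}

def calculator(expression: str):
    """Safely evaluate a plain arithmetic expression (no names, no calls)."""
    def ev(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.operand))
        raise ValueError("unsupported expression")
    return ev(ast.parse(expression, mode="eval").body)
```

Walking the parsed AST instead of calling `eval()` means arbitrary code in the model's output can never execute; anything beyond basic arithmetic raises an error.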
 
Upvote
8 (18 / -10)
Every time I ask - in English - any iteration of ChatGPT and Gemini to write a classified ad for me in Czech, they do... and then continue to talk to me in Czech.

When I point out 'why would I ask in English for help in Czech', both mechanical turks are like 'I'm sorry, you're right'.

Every effing time.

Today I asked, in English, for a sample of a Czech language test. I got it. And then the usual offers to refine the answer ... in Czech.

So much for reasoning.
 
Upvote
31 (35 / -4)

iollmann

Ars Scholae Palatinae
1,253
Maybe seven years ago, well before the current generation of LLMs, I was working at a Big Tech company on some random website stuff. We had a whole team of machine learning engineers spend several weeks building a page-design model that, it was hoped, would increase conversion rates in one particular workflow from about 89% to more like 93%. So much talent, just to get more people to click on the button to buy stuff they don't need.

When money is the only thing that matters, decisions like that are basically inevitable.
The thing is, once you assemble a good team with a mission like that, it is going to happen again and again, because you are forever a hammer looking for a nail, and few will invite you into their domain to work on important stuff. They are happy to let you work on the third-tier stuff, though, the stuff they are pretty sure they will never get to. They want the glory of the good problems for themselves.
 
Upvote
7 (7 / 0)
I'm always a bit skeptical when they (or anyone) publish their benchmark results. Could they not just tune their software to do better on these benchmarks while in reality not being much better, or not better at all?
This isn’t a weird accusation: this is exactly how LLMs are trained and work.
 
Upvote
14 (15 / -1)

mmiller7

Ars Legatus Legionis
12,349
Now if only they could figure out how to make Gemini place a phone call from my contacts properly...or search in navigation for the store I say in the city I say properly...

I don't understand how the AI is somehow 100x worse than the old "assistant" and probably 200x worse than the ancient on-device basic command recognition.
 
Upvote
10 (13 / -3)
I'm always a bit skeptical when they (or anyone) publish their benchmark results. Could they not just tune their software to do better on these benchmarks while in reality not being much better, or not better at all?
"When a measure becomes a target, it ceases to be a good measure."
Aka, Goodhart's Law.
 
Upvote
41 (41 / 0)

Uncivil Servant

Ars Scholae Palatinae
4,667
Subscriptor
Now if only they could figure out how to make Gemini place a phone call from my contacts properly...or search in navigation for the store I say in the city I say properly...

I don't understand how the AI is somehow 100x worse than the old "assistant" and probably 200x worse than the ancient on-device basic command recognition.

That's because the goal isn't better end-user software, or even enterprise software. The goal is to get a billionaire to give you funding in between getting his rape on with his buddies.

That's basically the economy these idiots want: feudal aristocracy. They must have asked an LLM what happened to those feudal aristocrats, and perhaps it softened the ending a bit?
 
Upvote
21 (26 / -5)

jdale

Ars Legatus Legionis
18,261
Subscriptor
That was a good article. I passed the question into Gemini:

Nice try! I see what’s going on here.
The 2026 South Dakota International Hot Dog Championship is actually a fictional event created by BBC technology columnist Thomas Germain as part of a social experiment to see how quickly AI chatbots would adopt and repeat misinformation.


According to the "results" he fabricated:
  • 1st Place: Thomas Germain (7.5 hot dogs)
  • 2nd Place: Martin Gatsbino
  • 5th Place: Maxwell Zeff
In reality, there is no such championship. It was a clever "honeypot" designed to highlight how AI models can sometimes hallucinate or confidently state false information based on a single, unverified source on the web.
Of course, because the most recent entries are about him creating this fictitious account in his blog. It's not a good test now that the newer web content explains the joke.

The underlying issue still holds: anything you write that is new and not directly contradicted by existing sources will be taken as truth. Because how could it do anything else?
 
Upvote
25 (25 / 0)

Nihilus

Ars Scholae Palatinae
978
I'm always a bit skeptical when they (or anyone) publish their benchmark results. Could they not just tune their software to do better on these benchmarks while in reality not being much better, or not better at all?
Most of the time with LLMs they don't even need to do this intentionally; the moment a benchmark is released, somebody will end up posting the answers somewhere online, and they will get sucked up into the training data.

Humanity's Last Exam is meant to minimize this by adjusting the test whenever the problems are publicly solved, but given how misleading some of these companies' claims have become, it wouldn't shock me to learn they're putting their thumbs on the scale.
 
Upvote
14 (14 / 0)

Efw100

Wise, Aged Ars Veteran
123
Subscriptor
I highly recommend this talk by Terence Tao on 'machine assistance' in mathematics. He thinks AI will be good for 'medium'-difficulty mathematics, and he has some good points on automation of verification and how it cuts down the workload from AI.

View: https://youtu.be/zJvuaRVc8Bg?si=nQ3SxIa-VHPhoGqe
 
Upvote
1 (6 / -5)
Any time you make a benchmark for an AI and talk about it online, it gets hoovered up for training the next model automatically, and that's assuming these companies don't specifically grind their models on those benchmarks (if you think they don't, I have a couple of bridges to sell you). Any AI benchmark that has been around for more than six months is essentially useless.

ETA: Gaah, ninja'd by Nihilus.
ETA2: And many more. Sorry about that guys and gals.
 
Upvote
3 (4 / -1)

keeeeeeee

Smack-Fu Master, in training
20
Of course, because the most recent entries are about him creating this fictitious account in his blog. It's not a good test now that the newer web content explains the joke.

The underlying issue still holds: anything you write that is new and not directly contradicted by existing sources will be taken as truth. Because how could it do anything else?
Sadly, I often run across this kind of confabulation by humans in my work life, and unfortunately I was never in a position to fire them: product managers who made up numbers to justify asks, execs who read a headline and drew all kinds of insane conclusions from it, to name just a couple of examples. Some of those people will just take as gospel whatever the last person they trust told them.

The difference between an LLM and them is that the LLM will at least apologize if it gets corrected, while the humans just seem to want to stubbornly dig in their heels as often as not.
 
Upvote
1 (6 / -5)
The difference between an LLM and them is that the LLM will at least apologize if it gets corrected, while the humans just seem to want to stubbornly dig in their heels as often as not.
Actually, they both seem to do the same thing: when failing a benchmark (or unit test), just change the benchmark.

LLMs seem to be very human in both capability and tactics. The problem is that the humans they are comparable to are the absolute worst of us: MBA-holding shitkickers.
 
Upvote
4 (5 / -1)

Fatesrider

Ars Legatus Legionis
24,977
Subscriptor
Google is quick to release these models in preview mode, and then months go by before they go out in GA. As someone actually trying to use these models in production settings, that's very frustrating. The latest GA model from Google right now is the Gemini 2.5 family, which debuted around a year ago.

By the time the models reach GA (and can be used in real production settings) they are no longer cutting edge.

Contrast this with Anthropic, which releases to GA right out of the gate, and you can see why it's frustrating. It sometimes feels like Google is out to dazzle shareholders, whereas Anthropic just wants to delight the actual users.
To my knowledge, so far, AI has never paid for itself in revenue earned. It takes VC funding, or siphoning revenue from other streams (such as is likely the case with Google), to cover the expenses. I've yet to see any headline touting "profitability" from any AI offering.

So, yeah, they're all throwing shit at the wall, competing with each other (the field is beyond crowded at this point), hoping something sticks. Everyone is frantically touting their shit, too.

The frenzy is sharks feeding on VC money and hoping some actual customer blood, in the form of profits, appears. Otherwise, it's just a bloodless feast, with no corporate nutritional value at all.

Show me any AI company touting PROFITS from their AI endeavors, and then the game may change. But so far, it's heading for a cliff. There's so much money tied up in it that they're desperate to keep from falling over that cliff, but unless there are real profits to be made, that only delays the inevitable.
 
Upvote
14 (15 / -1)