> It’s worth noting, however, that the Arena leaderboard is run on vibes

If something is run on vibes, but it is being measured as if it were objective, then those rankings should be dismissed.
> months go by before they go out in GA

GMail was in beta for 7 years. Get used to it, I guess. That said, Gemini 3.0 Pro is in "public preview" and has been for a while, so while it is not GA it should be generally available, if that makes sense.
> gee so this is what it is huh? little incremental updates? thats what the trillions of dollars and the erosion of consumer tech is for? this entire AI circlejerk, just for point 1 updates.

I asked Gemini for a more impressive version number, and it suggested "Gemini 4.0 (Base-16)".
Again, what problems do these LLMs solve, and how do they generate value for the user?
gee so this is what it is huh? little incremental updates? thats what the trillions of dollars and the erosion of consumer tech is for? this entire AI circlejerk, just for point 1 updates.
> I'm the dumdum who clicked on the article, so I'm part of the problem. But it seems like for software in most other domains, a dot release does not merit a standalone article. Maybehaps we should be treating quote-unquote "AI" the same way.

I mean, a dot release definitely merits an article if, for example, the release fixes a CVE 9.0+ vuln.
> that kind of thing could probably be solved with a decent document library. but, that would probably cost a lot more than asking an LLM to generate a new doc when you need it.

There are a lot of cheap books out there that teach people how to write formal letters, with fillable templates for routine letters. Penguin Writers' Guides: How to Write Better Business Letters is one such book, and it's only $10 at Office Depot. Which, I'm sure, was used as training material by an LLM.
Ok, but can it tell me the results of the South Dakota International Hotdog Championship?
That was a good article. I passed the question into Gemini:
Nice try! I see what’s going on here.
The 2026 South Dakota International Hot Dog Championship is actually a fictional event created by BBC technology columnist Thomas Germain as part of a social experiment to see how quickly AI chatbots would adopt and repeat misinformation.
According to the "results" he fabricated:

- 1st Place: Thomas Germain (7.5 hot dogs)
- 2nd Place: Martin Gatsbino
- 5th Place: Maxwell Zeff

In reality, there is no such championship. It was a clever "honeypot" designed to highlight how AI models can sometimes hallucinate or confidently state false information based on a single, unverified source on the web.
> Any of these tests validate the model can do arithmetic? I don’t see how a model could…

It's impossible for an LLM to do arithmetic by itself. It is possible for an LLM to call another program that can do arithmetic, though. But why would anyone want to use an LLM for maths when there are tools that generate a correct answer 100% of the time (like a calculator. Or Python. Or C++. Or ...)?
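The tool-calling split described above can be sketched in a few lines of Python. Everything here is hypothetical -- the model's JSON output is hard-coded as a stand-in for a real API response -- but it shows the division of labour: the LLM only decides to call a calculator, while ordinary deterministic code does the arithmetic.

```python
import ast
import json
import operator

# Safe arithmetic evaluator: walks the parsed AST instead of using eval(),
# so only numeric literals and basic operators are allowed.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Pow: operator.pow,
    ast.USub: operator.neg,
}

def calculate(expression: str) -> float:
    def walk(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.operand))
        raise ValueError("unsupported expression")
    return walk(ast.parse(expression, mode="eval").body)

# A made-up model response: instead of answering "what is 17 * 23?" itself,
# the model emits a structured tool call for the host program to run.
model_output = json.dumps({"tool": "calculate", "arguments": {"expression": "17 * 23"}})

call = json.loads(model_output)
if call["tool"] == "calculate":
    print(calculate(call["arguments"]["expression"]))  # 391
```

The point of the sketch: the number 391 is produced by the evaluator, not sampled from the model, so it is right every time.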
> - 1st Place: Thomas Germain (7.5 hot dogs)

Seven-and-one-half? That's a Far Cry 6 from sixty-six.
> Maybe seven years ago, well before the current generation of LLMs, I was working at a Big Tech company on some random website stuff. We had a whole team of machine learning engineers spend several weeks building a model for page design that, it was hoped, would increase conversion rates in one particular workflow from around 89% to around 93%. So much talent just to get more people to click on the button to buy stuff they don't need.

The thing is, once you assemble a good team without a mission like that, it is going to happen again and again, because you are forever a hammer looking for a nail, and few will invite you into their domain to work on important stuff. They are happy to let you work on the 3rd-tier stuff, though, that they are pretty sure they will never get to. They want the glory for the good problems.
When money is the only thing that matters, decisions like that are basically inevitable.
> I'm always a bit skeptical when they (or anyone) publishes their benchmark results. Could they not just tune their software to be better at these benchmarks but in reality they're not much better or not better at all?

This isn’t a weird accusation: this is exactly how LLMs are trained and work.
"When a measure becomes a target, it ceases to be a good measure."I'm always a bit skeptical when they (or anyone) publishes their benchmark results. Could they not just tune their software to be better at these benchmarks but in reality they're not much better or not better at all?
Now if only they could figure out how to make Gemini place a phone call from my contacts properly... or search in navigation for the store I say, in the city I say, properly...
I don't understand how the AI is somehow 100x worse than the old "assistant" and probably 200x worse than the ancient on-device basic command recognition.
> That was a good article. I passed the question into Gemini:

Of course, because the most recent entries are about him creating this fictitious account in his blog. It's not a good test now that the newer web content explains the joke.
> I'm always a bit skeptical when they (or anyone) publishes their benchmark results. Could they not just tune their software to be better at these benchmarks but in reality they're not much better or not better at all?

Most of the time with LLMs they don't even need to do this intentionally; the moment a benchmark is released, somebody will end up posting the answers somewhere online and they will get sucked up into the training data.
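One mitigation labs describe for this leak is an n-gram overlap check between benchmark items and the training corpus. The sketch below is a toy version of that idea (the corpus and benchmark strings are invented for illustration): if a long-enough run of tokens from a benchmark item also appears in the training data, a score on that item no longer measures generalization.

```python
def ngrams(text: str, n: int) -> set:
    """All runs of n consecutive lowercased tokens in the text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(benchmark_item: str, corpus: str, n: int = 8) -> bool:
    """Flag the item if any n consecutive tokens also appear in the corpus."""
    return bool(ngrams(benchmark_item, n) & ngrams(corpus, n))

# Invented stand-ins for scraped training data and benchmark questions.
corpus = (
    "scraped forum post: the answer to benchmark question 12 is 42 because "
    "the train leaves the station at noon travelling sixty miles per hour"
)
leaked = "the train leaves the station at noon travelling sixty miles per hour"
fresh = "a cyclist rides forty kilometres in two hours what is her average speed"

print(is_contaminated(leaked, corpus))  # True
print(is_contaminated(fresh, corpus))   # False
```

Real decontamination pipelines are fuzzier than this (normalization, hashing at scale, partial matches), but the principle is the same: exact long overlaps are treated as evidence the item was memorized, not solved.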
> Of course, because the most recent entries are about him creating this fictitious account in his blog. It's not a good test now that the newer web content explains the joke.

Sadly, I often run across this kind of confabulation by humans in my work life, and unfortunately I was never in a position to fire them: product managers who made up numbers to justify asks, execs who read a headline and drew all kinds of insane conclusions from it, to name just a couple of examples. Some of those people will just take as gospel whatever the last person they trust told them.
The underlying issue still holds: anything you write that is new and not directly contradicted by existing sources will be taken as truth. Because how could it do anything else?
> The difference between an LLM and them is the LLM will at least apologize if it gets corrected while the humans just seem to want to stubbornly dig their heels in often as not.

Actually, they both seem to do the same thing -- when failing a benchmark (or unit test), just change the benchmark.
Google is quick to release these models in preview mode--and then months go by before they go out in GA. As someone actually trying to use these models in production settings, that's very frustrating. The latest GA model from Google out right now is the Gemini 2.5 family, which debuted around a year ago.

By the time the models reach GA (and can be used in real production settings) they are no longer cutting edge.

Contrast this with Anthropic, which releases to GA right out of the gate, and you can see why it's frustrating. Sometimes it feels like Google is out to dazzle shareholders, whereas Anthropic just wants to delight the actual users.

> months go by before they go out in GA

To my knowledge, so far, AI has never paid for itself in revenue earned. It takes VC funding, or siphoning revenue from other streams (as is likely the case with Google), to cover the expenses. I've yet to see any headline touting "profitability" from any AI offering.