> It’s worth noting, however, that the Arena leaderboard is run on vibes

If something is run on vibes, but it is being measured as if it were objective, then those rankings should be dismissed.
> months go by before they go out in GA

GMail was in beta for 7 years. Get used to it, I guess. That said, Gemini 3.0 Pro is in "public preview" and has been for a while, so while it is not GA it should be generally available, if that makes sense.
> gee so this is what it is huh? little incremental updates? thats what the trillions of dollars and the erosion of consumer tech is for? this entire AI circlejerk, just for point 1 updates.

I asked Gemini for a more impressive version number, and it suggested "Gemini 4.0 (Base-16)".
Again, what problems do these LLMs solve, and how do they generate value for the user?
gee so this is what it is huh? little incremental updates? thats what the trillions of dollars and the erosion of consumer tech is for? this entire AI circlejerk, just for point 1 updates.
> I'm the dumdum who clicked on the article, so I'm part of the problem. But it seems like for software in most other domains, a dot release does not merit a standalone article. Maybehaps we should be treating quote-unquote "AI" the same way.

I mean, a dot release definitely merits an article if, for example, the release fixes a CVE 9.0+ vuln.
> that kind of thing could probably be solved with a decent document library. but, that would probably cost a lot more than asking an LLM to generate a new doc when you need it.

There are a lot of cheap books out there that teach people how to write formal letters, with fillable templates for routine letters. Penguin Writers' Guides: How to Write Better Business Letters is one such book, and it's only $10 at Office Depot. Which, I'm sure, was used as training material by an LLM.
Ok, but can it tell me the results of the South Dakota International Hotdog Championship?
That was a good article. I passed the question into Gemini:
Nice try! I see what’s going on here.
The 2026 South Dakota International Hot Dog Championship is actually a fictional event created by BBC technology columnist Thomas Germain as part of a social experiment to see how quickly AI chatbots would adopt and repeat misinformation.
According to the "results" he fabricated:

- 1st Place: Thomas Germain (7.5 hot dogs)
- 2nd Place: Martin Gatsbino
- 5th Place: Maxwell Zeff

In reality, there is no such championship. It was a clever "honeypot" designed to highlight how AI models can sometimes hallucinate or confidently state false information based on a single, unverified source on the web.
> Any of these tests validate the model can do arithmetic? I don’t see how a model could…

It's impossible for an LLM to do arithmetic by itself. It is possible for an LLM to call another program that can do arithmetic, though. But why would anyone want to use an LLM for maths when there are tools that generate a correct answer 100% of the time (like a calculator. Or Python. Or C++. Or ...)?
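The tool-calling split described above can be sketched in a few lines of Python. Everything here is hypothetical -- the model's JSON output is hard-coded as a stand-in for a real API response -- but it shows the division of labour: the LLM only decides to call a calculator, while ordinary deterministic code does the arithmetic.

```python
import ast
import json
import operator

# Safe arithmetic evaluator: walks the parsed AST instead of using eval(),
# so only numeric literals and basic operators are allowed.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Pow: operator.pow,
    ast.USub: operator.neg,
}

def calculate(expression: str) -> float:
    def walk(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.operand))
        raise ValueError("unsupported expression")
    return walk(ast.parse(expression, mode="eval").body)

# A made-up model response: instead of answering "what is 17 * 23?" itself,
# the model emits a structured tool call for the host program to run.
model_output = json.dumps({"tool": "calculate", "arguments": {"expression": "17 * 23"}})

call = json.loads(model_output)
if call["tool"] == "calculate":
    print(calculate(call["arguments"]["expression"]))  # 391
```

The point of the sketch: the number 391 is produced by the evaluator, not sampled from the model, so it is right every time.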
> - 1st Place: Thomas Germain (7.5 hot dogs)

Seven-and-one-half? That's a Far Cry 6 from sixty-six.
> Maybe seven years ago, well before the current generation of LLMs, I was working at a Big Tech company on some random website stuff. We had a whole team of machine learning engineers spend several weeks building a model for page design that, it was hoped, would increase conversion rates in one particular workflow from around 89% to around 93%. So much talent just to get more people to click on the button to buy stuff they don't need.

The thing is, once you assemble a good team without a mission like that, it is going to happen again and again, because you are forever a hammer looking for a nail, and few will invite you into their domain to work on important stuff. They are happy to let you work on the 3rd-tier stuff, though, that they are pretty sure they will never get to. They want the glory for the good problems.
When money is the only thing that matters, decisions like that are basically inevitable.
> I'm always a bit skeptical when they (or anyone) publishes their benchmark results. Could they not just tune their software to be better at these benchmarks but in reality they're not much better or not better at all?

This isn’t a weird accusation: this is exactly how LLMs are trained and work.
"When a measure becomes a target, it ceases to be a good measure."I'm always a bit skeptical when they (or anyone) publishes their benchmark results. Could they not just tune their software to be better at these benchmarks but in reality they're not much better or not better at all?
Now if only they could figure out how to make Gemini place a phone call from my contacts properly... or search in navigation for the store I say, in the city I say, properly...
I don't understand how the AI is somehow 100x worse than the old "assistant" and probably 200x worse than the ancient on-device basic command recognition.
> That was a good article. I passed the question into Gemini:

Of course, because the most recent entries are about him creating this fictitious account in his blog. It's not a good test now that the newer web content explains the joke.
> I'm always a bit skeptical when they (or anyone) publishes their benchmark results. Could they not just tune their software to be better at these benchmarks but in reality they're not much better or not better at all?

Most of the time with LLMs they don't even need to do this intentionally; the moment a benchmark is released, somebody will end up posting the answers somewhere online and they will get sucked up into the training data.
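One mitigation labs describe for this leak is an n-gram overlap check between benchmark items and the training corpus. The sketch below is a toy version of that idea (the corpus and benchmark strings are invented for illustration): if a long-enough run of tokens from a benchmark item also appears in the training data, a score on that item no longer measures generalization.

```python
def ngrams(text: str, n: int) -> set:
    """All runs of n consecutive lowercased tokens in the text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(benchmark_item: str, corpus: str, n: int = 8) -> bool:
    """Flag the item if any n consecutive tokens also appear in the corpus."""
    return bool(ngrams(benchmark_item, n) & ngrams(corpus, n))

# Invented stand-ins for scraped training data and benchmark questions.
corpus = (
    "scraped forum post: the answer to benchmark question 12 is 42 because "
    "the train leaves the station at noon travelling sixty miles per hour"
)
leaked = "the train leaves the station at noon travelling sixty miles per hour"
fresh = "a cyclist rides forty kilometres in two hours what is her average speed"

print(is_contaminated(leaked, corpus))  # True
print(is_contaminated(fresh, corpus))   # False
```

Real decontamination pipelines are fuzzier than this (normalization, hashing at scale, partial matches), but the principle is the same: exact long overlaps are treated as evidence the item was memorized, not solved.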
> Of course, because the most recent entries are about him creating this fictitious account in his blog. It's not a good test now that the newer web content explains the joke.

Sadly, I often run across this kind of confabulation by humans in my work life, and unfortunately I was never in a position to fire them: product managers who made up numbers to justify asks, execs who read a headline and drew all kinds of insane conclusions from it, to name just a couple of examples. Some of those people will just take as gospel whatever the last person they trust told them.
The underlying issue still holds: anything you write that is new and not directly contradicted by existing sources will be taken as truth. Because how could it do anything else?
> The difference between an LLM and them is the LLM will at least apologize if it gets corrected while the humans just seem to want to stubbornly dig their heels in often as not.

Actually, they both seem to do the same thing -- when failing a benchmark (or unit test), just change the benchmark.
Google is quick to release these models in preview mode--and then months go by before they go out in GA. As someone actually trying to use these models in production settings, that's very frustrating. The latest GA model from Google out right now is the Gemini 2.5 family, which debuted around a year ago.

By the time the models reach GA (and can be used in real production settings) they are no longer cutting edge.

Contrast this with Anthropic, which releases to GA right out of the gate, and you can see why it's frustrating. Sometimes it feels like Google is out to dazzle shareholders, whereas Anthropic just wants to delight the actual users.

> months go by before they go out in GA

To my knowledge, so far, AI has never paid for itself in revenue earned. It takes VC funding, or siphoning revenue from other streams (as is likely the case with Google), to cover the expenses. I've yet to see any headline touting "profitability" from any AI offering.