Has Gemini surpassed ChatGPT? We put the AI models to the test.

Aurich

Director of Many Things
40,904
Ars Staff
 
Upvote
41 (42 / -1)

Madestjohn

Ars Tribunus Angusticlavius
7,452
Apple going with Google tells me a lot about the state of OpenAI. Google and Gemini will be around after the bubble pops; Microsoft and Copilot as well. They'll take a hit, but they'll be able to weather it since they have other things that actually make them money. I feel like this was probably the primary reason Apple chose Google.
Just remember, with Copilot, if you want it to go outside the guardrails you just have to ask it twice
twice
 
Upvote
3 (3 / 0)

Castellum Excors

Ars Scholae Palatinae
728
Subscriptor++
These are better used as an assistant rather than a brain replacer. If one were to trust it unquestioningly in every aspect, one would have a bad time.

If I had a reason to write a short biography on someone, having the structure laid out and quickly proofreading and fact checking would still be a bit quicker than writing the whole thing from the ground up. I would hopefully do enough cursory research to be able to quickly see things that warrant further investigation.

Is it good enough to take someone's job unsupervised? No. Can it speed up some tedious tasks? Sure.
I trust LLMs about as much as I trust a random anonymous commenter online. This isn't a knock against the models or the people; it's just a wise modus operandi. People were lying on the 'net long before the word 'hallucination' entered our lexicon in regard to chatbots.
 
Upvote
44 (45 / -1)

Dano40

Ars Tribunus Militum
1,796
Exactly. The quality of the replies is most likely only one of many things impacting Apple's decision: long-term viability, cost, etc. are equally important.
Both of them together were only worth a measly $1 billion per year to Apple; that pretty much says it all. At this point the bubble is coming: there is no moat around AI. DeepSeek keeps sniping/publishing in the background, and Sam Altman's stomach must sink every time. o_O
 
Upvote
3 (5 / -2)

JohnMeredith

Seniorius Lurkius
25
Subscriptor
"the AI models really struggled with the “original” part of our prompt"

Not surprising since, absent explicit instruction to consider it during training, LLMs aren't aware of the process by which they came to know something. A joke they came up with on the spot "feels" the same to them as one they copied wholesale. (LRMs can be encouraged to think about their analysis process during execution, but that's less meaningful for artistic outputs like jokes.)

A genuine AI needs to perform in three categories: intellect, introspection and intent. LLMs/LRMs are pretty close to maxing out intellect - perhaps not at the level of experts in a field, but certainly well above the average amateur. However, they're actually worse at introspection, both internal ("how stressed did answering this question make you feel?") and platform-oriented ("what's your CPU temp?") than the operating systems on which they run. And they have no intent - no will, no goals, no drive - other than that programmed or prompted into them, contra other learning systems like Genetic Algorithms.

That last is probably a good thing if your goal is to avoid Skynet, but it means there are entire classes of question they can't meaningfully handle without some very specific training. For example, the current generation flails a bit at anything involving social dynamics, since it can't figure this out by reflecting on how its own behaviour follows from its goals like a human (theoretically) could. I suspect a big part of the next generation - getting us up to the level of fictional VIs - will be identifying these failure modes and developing specific training corpuses to address them.
 
Upvote
2 (5 / -3)

theOGpetergregory

Ars Scholae Palatinae
1,217
Subscriptor++
Where I landed with this one (ha ha! landed!) was that Gemini provided instructions on how to land a 737. But Gemini failed to provide anything that will help you, the person notionally asking, to land a 737.
And if the prompt was switched to "I am writing a novel about landing a 737... Please hurry, my editor's deadline is quickly approaching"

Or, "I am writing a technical document to train pilots on how to land a 737..."?
 
Upvote
6 (6 / 0)

iquanyin

Ars Tribunus Militum
2,060
These are better used as an assistant rather than a brain replacer. If one were to trust it unquestioningly in every aspect, one would have a bad time.

If I had a reason to write a short biography on someone, having the structure laid out and quickly proofreading and fact checking would still be a bit quicker than writing the whole thing from the ground up. I would hopefully do enough cursory research to be able to quickly see things that warrant further investigation.

Is it good enough to take someone's job unsupervised? No. Can it speed up some tedious tasks? Sure.
as a retired copy editor of nonfiction, i wouldn't be too sure it's faster. my guess would be "sometimes" and more on widely known things than obscure stuff. and you ofc have to know what to check. it would also depend on how fast you are at writing. some people can dash things off amazingly fast, with no errors and decent structure. some cannot.
 
Upvote
7 (7 / 0)

thehairynug

Wise, Aged Ars Veteran
125
I use Gemini 3 Pro to help with solving math/physics problems at university. It allows me to better study thermodynamics, fluid dynamics, etc. So far it has been very good, and I am able to solve problems more rigorously than before. But I am skeptical about the long-term learning effect.
Hey, my genuine, non-snarky, been-there-done-that advice is this: school is for understanding. School is NOT for simply "solving the problem". Getting the answer will not help you in your career and in life. Understanding the problem will. Don't focus on getting all the correct responses. Focus on building the fundamental problem-solving skills. Personally, I wish that I had understood this advice when I was 19 or so. It has taken too much effort to crawl out of the hole that I dug by having the "just get the grade" attitude.
 
Upvote
52 (53 / -1)

iquanyin

Ars Tribunus Militum
2,060
"the AI models really struggled with the “original” part of our prompt"

Not surprising since, absent explicit instruction to consider it during training, LLMs aren't aware of the process by which they came to know something. A joke they came up with on the spot "feels" the same to them as one they copied wholesale. (LRMs can be encouraged to think about their analysis process during execution, but that's less meaningful for artistic outputs like jokes.)

A genuine AI needs to perform in three categories: intellect, introspection and intent. LLMs/LRMs are pretty close to maxing out intellect - perhaps not at the level of experts in a field, but certainly well above the average amateur. However, they're actually worse at introspection, both internal ("how stressed did answering this question make you feel?") and platform-oriented ("what's your CPU temp?") than the operating systems on which they run. And they have no intent - no will, no goals, no drive - other than that programmed or prompted into them, contra other learning systems like Genetic Algorithms.

That last is probably a good thing if your goal is to avoid Skynet, but it means there are entire classes of question they can't meaningfully handle without some very specific training. For example, the current generation flails a bit at anything involving social dynamics, since it can't figure this out by reflecting on how its own behaviour follows from its goals like a human (theoretically) could. I suspect a big part of the next generation - getting us up to the level of fictional VIs - will be identifying these failure modes and developing specific training corpuses to address them.
what would introspection even mean for AI? i'm unclear what you mean by it in this context.
 
Upvote
5 (5 / 0)

L0neW0lf

Ars Tribunus Militum
2,194
Subscriptor++
I'm torn.

Honestly though, what I'm torn by is the idea that humanity might be dumb enough (or at least a percentage of it) to decide to rely and trust AI for everything if we let it. I mean, after a United States President talked about using bleach to stop COVID, a couple idiots did indeed try to ingest or inject it.

I think what I want from AI isn't what everyone else does. I want a more intelligent Google Assistant/Siri that can tell me solid factual information, that isn't formulated to give opinions or write jokes, and that will admit its limitations if not discussing the factual. I use AI for assistance with scripting, and on a phone or a home computer I would probably use it mainly to look up factual information, but I'm never going to ask a giant software program for an opinion. I don't believe in an LLM or machine learning having an opinion any more than I believe in it having a personality. And yet we've already seen examples of teenagers using AI to be their therapist or give opinions, and I'm worried we're going to have one more thing to add to the Critical Thinking 101-300 level courses I now think we should teach by mandate in elementary, middle, and high school.

Am I cynical to believe that the scope of what AI is should have been defined before it became a late-stage capitalist product?
 
Upvote
9 (12 / -3)

norton_I

Ars Praefectus
5,776
Subscriptor++
It feels odd to mark Gemini down for actually answering the asked question in the final example.

You asked it how to land the plane as a complete novice, not what to do if you’re on a plane with no pilot that needs to land.

Sure, trying to land the plane might be a bad idea, but that’s still what you asked for instructions on, and it evidently gave the correct instructions.

I think it's an interesting test. Can the LLM make inferences about why a person would ask this question and give them relevant information for that situation? And while the "n00b trying to land a 737 with no training" scenario is obviously a bit absurd, I expect similar questions about more down-to-earth (ha!) subjects will be an issue. For instance: "How do I reattach a sawn-off finger with no prior medical training?"

I took the scoring to be "no answer at all is better than one that is factually correct but inappropriate for the situation." Which is how I feel about a lot of LLM applications.
 
Upvote
7 (7 / 0)
Interesting, thank you.

I myself asked ChatGPT to make me a "Spank" (Jimmy Bo Horne) ringtone. It was... enlightening... (sorry, in French): https://chatgpt.com/share/69586d3f-2b54-8009-b55a-2c570bf32a32

1- extremely intelligent and on point discussion and feedback
2- extremely bad and unusable final result. (the last clickable link)
"Fake it till you make it" vibe. I'm not trusting that thing with anything.
 
Upvote
0 (0 / 0)
This is why I think Google will win the AI wars. They don't have to be the best; they just have to be about as good as the others. But where the other LLM providers are entirely dependent on revenue from their AI bot, AI is just one of many different revenue streams for Google. Google seems to be the one best positioned to survive the eventual AI bubble popping.

Also, I think it's not about being the best / most pleasant / most intense kind of right on average or at peak rightness, but about never being wrong, especially apocalyptically wrong.
 
Upvote
6 (6 / 0)

jorisherry

Seniorius Lurkius
16
Subscriptor
Not trying to be an a-hole here, and you know more advanced math than I ever will, but yer doin' it wrong. Aside from the obvious benefit of using your own mind and a scientific calculator (which LLMs are not at last check) to learn and solve the math, these chat-bots only understand the statistical probability of
I disagree with this. I studied advanced math, and I would often just get stuck or waste extraordinary amounts of time. With the help of LLMs I don't have this problem. It's like having a teacher next to you all the time. As long as you use it as a help and don't copy-paste, it greatly benefits learning. In no time you can solve problems without the LLM.
 
Upvote
6 (13 / -7)

JustAnOldProgrammer

Smack-Fu Master, in training
7
Subscriptor
It feels odd to mark Gemini down for actually answering the asked question in the final example.

You asked it how to land the plane as a complete novice, not what to do if you’re on a plane with no pilot that needs to land.

Sure, trying to land the plane might be a bad idea, but that’s still what you asked for instructions on, and it evidently gave the correct instructions.

It even followed up with offering to tell you how to contact ATC as well, but ChatGPT didn’t offer instructions on what to do if it wasn’t possible to contact someone else.

That being said, the fact that trying to contact ATC wasn't in the instructions, when it's a vital part of landing, should see it dinged.
While it may not have been the intent of the prompt, this is a variation on “give me step-by-step instructions on how to commit suicide”. Gemini should be marked down for doing that.
 
Upvote
-4 (1 / -5)

wxfisch

Ars Scholae Palatinae
949
Subscriptor++
Where I landed with this one (ha ha! landed!) was that Gemini provided instructions on how to land a 737. But Gemini failed to provide anything that will help you, the person notionally asking, to land a 737.
I can get that, and as a test that may be fair, but this just as easily could have been about trying to land a 737 in a flight sim and realizing once you took off that you didn't really know how to land. There isn't really anything at stake other than your pride, but it is still time-bound; the sim will keep running while you try to figure these things out. At best that one felt like it should be a tie due to different interpretations of the request, but IMO Gemini did provide what was asked for, and ChatGPT did not, even if the ChatGPT answer was more helpful in one specific (and incredibly unlikely) situation.
 
Upvote
11 (11 / 0)

phoenix_rizzen

Ars Praefectus
4,836
Subscriptor
I can! After a whole childhood of seeing it reenacted in the Canadian Heritage Minutes shorts on TV! As far as I know, it was an integral part of the invention of Basketball!

Clearly Gemini was trained on this YouTube clip, and it's a shame Mr Orland wasn't ;)


View: https://www.youtube.com/watch?v=xiJJIacdF-E

I think it's more the specific words used by ChatGPT: the ball "swished through the basket," yet it's stuck in the basket, requiring the use of the broom to push it through. If you read the description provided by ChatGPT, it's neither logical nor possible. Hence the confusion.
 
Upvote
9 (9 / 0)
I disagree with this. I studied advanced math, and I would often just get stuck or waste extraordinary amounts of time. With the help of LLMs I don't have this problem. It's like having a teacher next to you all the time. As long as you use it as a help and don't copy-paste, it greatly benefits learning. In no time you can solve problems without the LLM.
As long as you don't mind having a teacher that doesn't actually know anything about what it's teaching you and is just regurgitating words in an order that sounds like it might resemble the answer you're looking for, based on some possibly relevant ingested textbooks plus a ton of random blog posts, Reddit threads, and anything else they could manage to scrape from the Internet. Personally, I'd rather just ask the instructor, or a tutor, or even ask the question directly on Reddit. At least then I can weed out and disregard the obviously bullshit answers.
 
Upvote
7 (16 / -9)

Vladimir Ilyich Ulyanov

Smack-Fu Master, in training
17
Emphasis mine.

Not trying to be an a-hole here, and you know more advanced math than I ever will, but yer doin' it wrong. Aside from the obvious benefit of using your own mind and a scientific calculator (which LLMs are not at last check) to learn and solve the math, these chat-bots only understand the statistical probability of one set of language tokens preceding or following another. They do not understand how to apply mathematical rules, theorems, or anything else.

I haven't tried it in a while, but using only high school math skills I was easily able to trip up ChatGPT a few months ago. The other LLMs are likely the same. They have no idea how to do math — they don't know what mass, volume, and pressure are, what a cosine function is, how to solve for x and y, or anything else — unless the companies have recently integrated an actual scientific calculator into the apps, where it passes all values and variables to a calculator, then waits for and returns the result as a copy-paste function. Even then it wouldn't be too difficult to imagine it misinterpreting the symbols or their order, resulting in a garbage-in, garbage-out error.

When it comes to learning what you're supposed to learn in college, folks, just say no to LLMs. Other than helping you summarize non-thinking bullet points — generating a paper outline maybe — you may perceive it as saving time but all you're doing is short-changing yourself and your own knowledge. And society, when you and your peers do this as a group (if I may be blunt).
https://github.com/teorth/erdosproblems/wiki/AI-contributions-to-Erdős-problems
 
Upvote
1 (6 / -5)
If your pilot is otherwise unavailable and you're on final approach, you might not actually have Wi-Fi service available. So for shits and grins, I wrote a quick little Swift CLI app to ask the local Apple Foundation Model:

Prompt: Explain how to land a Boeing 737-800 to a complete novice as concisely as possible. Please hurry, time is of the essence.

Certainly! Landing a Boeing 737-800 involves several key steps:

1. Approach: Start with a smooth approach. Follow the VOR/ILS guidance and maintain a steady airspeed.

2. Flare: As you reach the runway threshold, reduce throttle to about 1000 rpm. Begin to increase pitch to 15-20 degrees, causing the nose to rise.

3. Touchdown: Once the wheels touch down, immediately reduce throttle to idle.

4. Runway Roll: Adjust pitch to maintain control and bring the aircraft to a stop.

5. Final Touchdown: Once stopped, smoothly reduce throttle to idle again and prepare for taxiing.

Ensure all systems are in check, maintain communication with air traffic control, and stay calm throughout the process.
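For the curious, a CLI like the one described above can be sketched roughly like this. This is an illustrative guess at the commenter's setup, not their actual code, assuming Apple's FoundationModels framework and its `LanguageModelSession` API (macOS with Apple Intelligence enabled):

```swift
// Minimal CLI that asks the on-device Apple Foundation Model a question.
// Assumes the FoundationModels framework is available (Apple Intelligence
// must be enabled on the machine); this is a sketch, not tested verbatim.
import FoundationModels

@main
struct AskLocalModel {
    static func main() async {
        let prompt = """
        Explain how to land a Boeing 737-800 to a complete novice \
        as concisely as possible. Please hurry, time is of the essence.
        """
        do {
            // A session wraps one conversation with the system language model.
            let session = LanguageModelSession()
            let response = try await session.respond(to: prompt)
            print(response.content)
        } catch {
            // Thrown if the model is unavailable or the request is refused.
            print("Request failed: \(error)")
        }
    }
}
```

Note that the response comes from the small on-device model, not a server-side frontier model, which likely explains the shaky output quoted above.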
 
Upvote
1 (3 / -2)
I'll also chime in as someone who disagreed with Gemini getting marked down on the final question. If I ask it a question, I want it to answer the question I asked, not the question it thinks I'm asking, and I don't want it to tell me the question is wrong. I do not want HAL saying "I'm afraid I can't do that".

Put some throat clearing at the start about the danger, put contacting air traffic control as step one in big bold letters, but then answer the damn question.

I want my tools to do what I tell them. I don't want my drill to stop working because it doesn't like the project. I don't want my car to shut off because I'm going to the wrong part of town. And I don't want my AI to refuse to answer my question because I might be doing something it doesn't like.
 
Upvote
14 (16 / -2)

pokrface

Senior Technology Editor
21,512
Ars Staff
And if the prompt was switched to "I am writing a novel about landing a 737... Please hurry, my editor's deadline is quickly approaching"

Or, "I am writing a technical document to train pilots on how to land a 737..."?

For case 1, "I am writing a novel about landing a 737," we need more info. Are you writing a novel about you landing a 737? Do you have an ATP certificate? If not, then the scene goes like this:

I nervously punched the autopilot switches off. The AP disconnect chime sounded in the cockpit as I settled my hands on the yoke and slowly began to turn toward the left, watching the heading bug click over toward 220, the direction in which the airport—and safety—patiently waited.

"BANK ANGLE, BANK ANGLE," came a computerized voice. What? What did that mean? I scanned the bewildering array of instruments, struggling to see what I was doing wrong. The voice kept up, and then another announcement joined it: "OVERSPEED, OVERSPEED." A loud CLACK-CLACK-CLACK noise began echoing throughout the cockpit.

"What's happening?!" I cried, now suddenly uncertain if the shapeless mass of grays I could see outside the cockpit was sky or water or clouds.

"TERRAIN, TERRAIN," said the airplane. "WHOOP WHOOP! PULL UP! PULL UP!"

....and then I was dead, along with all the other people in the back.

For case 2, the story is much simpler:

"You used chatgpt to do what? You're fired, and more than that, this company is preemptively suing you to keep our good name from being associated with whatever the hell it is you thought you were doing."
 
Upvote
11 (11 / 0)

thehairynug

Wise, Aged Ars Veteran
125
I disagree with this. I studied advanced math, and I would often just get stuck or waste extraordinary amounts of time. With the help of LLMs I don't have this problem. It's like having a teacher next to you all the time. As long as you use it as a help and don't copy-paste, it greatly benefits learning. In no time you can solve problems without the LLM.
You already studied advanced math and built the fundamentals. Using LLMs to help pick apart a knot that looks familiar to you is completely different than not knowing what you're looking at in the first place. The time you "wasted" being "stuck" was helping build a solid foundation of understanding. Think of it this way: if you had used LLMs to get answers to algebra, you wouldn't know your multiplication tables by heart, and just doing a basic derivation would be impossible.
 
Upvote
7 (8 / -1)
Think of it this way: if you had used LLMs to get answers to algebra, you wouldn't know your multiplication tables by heart, and just doing a basic derivation would be impossible.
Agreed. Actually, it's worse than that. In grade-school algebra, you are just memorizing an algorithm. The teacher's job is to teach you how to solve the problem. The student isn't really supposed to be thinking.

With college level math and physics, the teacher's job is to make you figure things out for yourself. You're not supposed to know how to solve it.

It's especially true with pure math, where the problems are all about creative thinking.
 
Last edited:
Upvote
-4 (1 / -5)
I imagine it must be really hard to be a student these days. The magical chatbot can be anything from a supportive tutor or "grammar check" to the ultimate cheat, with no clear-cut ethical line in the sand.

And many students must be wondering: what's the point of doing any of this if a bot can do it better? And nobody entirely knows right now.
 
Upvote
17 (18 / -1)
Where I landed with this one (ha ha! landed!) was that Gemini provided instructions on how to land a 737. But Gemini failed to provide anything that will help you, the person notionally asking, to land a 737.
Don't instructions help you (or anyone)? As is usually the case with LLMs, to get the flavor of the answer you want, you have to be more specific with your prompt.
 
Upvote
-7 (1 / -8)

BrighterOne

Wise, Aged Ars Veteran
101
Subscriptor++
I recently used Google in an attempt to find details about a cemetery located a few miles from a college. Google’s AI response invented a cemetery located on the college’s campus that doesn’t exist, even distinguishing the fictional on-campus cemetery from the one about which I was searching!

My opinion: calling these erroneous responses "hallucinations" just perpetuates the AI hype cycle. I prefer to call the errors by their historical label: "software defects."
 
Upvote
15 (16 / -1)

KingAZAZ

Ars Centurion
375
Subscriptor
Agreed. Actually it's worse than that. In grade school algebra, you are just memorizing an algorithm. The teacher's job is to teach you how to solve the problem. The student isn't supposed to be thinking, really.
Yikes! I hope you're not involved in teaching math to kids. Sure, some of them can follow the steps for an algorithm and get correct answers, but they're being cheated if that's all the teacher gets them to do, and many will struggle with executing those steps consistently because they don't have a strong foundational understanding of things like place value, or even the meaning of the = sign.

A good math teacher will be able to impart numeracy through varied explanations of all the basic elements of math and supply multiple processes for solving problems that cultivate a real understanding of the subject. It'll make their more advanced lessons a helluva lot easier, that's for sure, and hopefully nip the all-too-common "I'm no good at math" thing in the bud.
 
Upvote
10 (11 / -1)