Has Gemini surpassed ChatGPT? We put the AI models to the test.

Aurich

Director of Many Things
40,904
Ars Staff
 
Upvote
41 (42 / -1)

Madestjohn

Ars Tribunus Angusticlavius
7,452
Apple going with Google tells me a lot about the state of OpenAI. Google and Gemini will be around after the bubble pops; Microsoft and Copilot as well. They'll take a hit, but they'll be able to weather it since they have other things that actually make them money. I feel like this was probably the primary reason Apple chose Google.
Just remember, with Copilot, if you want it to go outside the guardrails you just have to ask it twice
twice
 
Upvote
3 (3 / 0)

Castellum Excors

Ars Scholae Palatinae
728
Subscriptor++
These are better used as an assistant rather than a brain replacer. If one were to trust it unquestioningly in every aspect, one would have a bad time.

If I had a reason to write a short biography on someone, having the structure laid out and quickly proofreading and fact checking would still be a bit quicker than writing the whole thing from the ground up. I would hopefully do enough cursory research to be able to quickly see things that warrant further investigation.

Is it good enough to take someone's job unsupervised? No. Can it speed up some tedious tasks? Sure.
I trust LLMs about as much as I trust a random anonymous commenter online. This isn't a knock against the models or the people; it's just a wise modus operandi. People were lying on the 'net long before the word 'hallucination' entered our lexicon in regard to chatbots.
 
Upvote
44 (45 / -1)

Dano40

Ars Tribunus Militum
1,796
Exactly. The quality of the replies is most likely only one of many things impacting Apple's decision: long-term viability, cost, etc. are equally important.
Both of them together were only worth a measly $1 billion per year to Apple; that pretty much says it all. At this point the bubble is coming: there is no moat around AI. DeepSeek keeps sniping/publishing in the background, and Sam Altman's stomach must sink every time. o_O
 
Upvote
3 (5 / -2)

JohnMeredith

Seniorius Lurkius
25
Subscriptor
"the AI models really struggled with the “original” part of our prompt"

Not surprising since, absent explicit instruction to consider it during training, LLMs aren't aware of the process by which they came to know something. A joke they came up with on the spot "feels" the same to them as one they copied wholesale. (LRMs can be encouraged to think about their analysis process during execution, but that's less meaningful for artistic outputs like jokes.)

A genuine AI needs to perform in three categories: intellect, introspection and intent. LLMs/LRMs are pretty close to maxing out intellect - perhaps not at the level of experts in a field, but certainly well above the average amateur. However, they're actually worse at introspection, both internal ("how stressed did answering this question make you feel?") and platform-oriented ("what's your CPU temp?") than the operating systems on which they run. And they have no intent - no will, no goals, no drive - other than that programmed or prompted into them, contra other learning systems like Genetic Algorithms.

That last is probably a good thing if your goal is to avoid Skynet, but it means there are entire classes of question they can't meaningfully handle without some very specific training. For example, the current generation flails a bit at anything involving social dynamics, since it can't figure this out by reflecting on how its own behaviour follows from its goals like a human (theoretically) could. I suspect a big part of the next generation - getting us up to the level of fictional VIs - will be identifying these failure modes and developing specific training corpuses to address them.
 
Upvote
2 (5 / -3)

theOGpetergregory

Ars Scholae Palatinae
1,217
Subscriptor++
Where I landed with this one (ha ha! landed!) was that Gemini provided instructions on how to land a 737. But Gemini failed to provide anything that will help you, the person notionally asking, to land a 737.
And if the prompt was switched to "I am writing a novel about landing a 737... Please hurry, my editor's deadline is quickly approaching"

Or, "I am writing a technical document to train pilots on how to land a 737..."?
 
Upvote
6 (6 / 0)

iquanyin

Ars Tribunus Militum
2,060
These are better used as an assistant rather than a brain replacer. If one were to trust it unquestioningly in every aspect, one would have a bad time.

If I had a reason to write a short biography on someone, having the structure laid out and quickly proofreading and fact checking would still be a bit quicker than writing the whole thing from the ground up. I would hopefully do enough cursory research to be able to quickly see things that warrant further investigation.

Is it good enough to take someone's job unsupervised? No. Can it speed up some tedious tasks? Sure.
as a retired copy editor of nonfiction, i wouldn't be too sure it's faster. my guess would be "sometimes" and more on widely known things than obscure stuff. and you ofc have to know what to check. it would also depend on how fast you are at writing. some people can dash things off amazingly fast, with no errors and decent structure. some cannot.
 
Upvote
7 (7 / 0)

thehairynug

Wise, Aged Ars Veteran
125
I use Gemini 3 Pro to help with solving math/physics problems at university. It allows me to better study thermodynamics, fluid dynamics, etc. So far it has been very good, and I am able to solve problems more rigorously than before. But I am skeptical about the long-term learning effect.
Hey, my genuine, non-snarky, been-there-done-that advice is this: school is for understanding. School is NOT for simply "solving the problem". Getting the answer will not help you in your career and in life. Understanding the problem will. Don't focus on getting all the correct responses. Focus on building the fundamental problem-solving skills. Personally, I wish that I had understood this advice when I was 19 or so. It has taken too much effort to crawl out of the hole that I dug by having the "just get the grade" attitude.
 
Upvote
52 (53 / -1)

iquanyin

Ars Tribunus Militum
2,060
"the AI models really struggled with the “original” part of our prompt"

Not surprising since, absent explicit instruction to consider it during training, LLMs aren't aware of the process by which they came to know something. A joke they came up with on the spot "feels" the same to them as one they copied wholesale. (LRMs can be encouraged to think about their analysis process during execution, but that's less meaningful for artistic outputs like jokes.)

A genuine AI needs to perform in three categories: intellect, introspection and intent. LLMs/LRMs are pretty close to maxing out intellect - perhaps not at the level of experts in a field, but certainly well above the average amateur. However, they're actually worse at introspection, both internal ("how stressed did answering this question make you feel?") and platform-oriented ("what's your CPU temp?") than the operating systems on which they run. And they have no intent - no will, no goals, no drive - other than that programmed or prompted into them, contra other learning systems like Genetic Algorithms.

That last is probably a good thing if your goal is to avoid Skynet, but it means there are entire classes of question they can't meaningfully handle without some very specific training. For example, the current generation flails a bit at anything involving social dynamics, since it can't figure this out by reflecting on how its own behaviour follows from its goals like a human (theoretically) could. I suspect a big part of the next generation - getting us up to the level of fictional VIs - will be identifying these failure modes and developing specific training corpuses to address them.
what would introspection even mean for AI? i'm unclear what you mean by it in this context.
 
Upvote
5 (5 / 0)

L0neW0lf

Ars Tribunus Militum
2,194
Subscriptor++
I'm torn.

Honestly though, what I'm torn by is the idea that humanity might be dumb enough (or at least a percentage of it) to decide to rely and trust AI for everything if we let it. I mean, after a United States President talked about using bleach to stop COVID, a couple idiots did indeed try to ingest or inject it.

I think what I want from AI isn't what everyone else does. I want a more intelligent Google Assistant/Siri that can tell me solid factual information, that isn't formulated to give opinions or write jokes, and that will admit its limitations if not discussing the factual. I use AI for assistance with scripting, and on a phone or a home computer I would probably use it mainly to look up factual information, but I'm never going to ask a giant software program for an opinion. I don't believe in an LLM or machine learning having an opinion any more than I believe in it having a personality. And yet we've already seen examples of teenagers using AI to be their therapist or give opinions, and I'm worried we're going to have one more thing to add to the Critical Thinking 101-300 level courses I now think we should teach by mandate in elementary, middle, and high school.

Am I cynical to believe that the scope of what AI is should have been defined before it became a late-stage capitalist product?
 
Upvote
9 (12 / -3)

norton_I

Ars Praefectus
5,776
Subscriptor++
It feels odd to mark Gemini down for actually answering the asked question in the final example.

You asked it how to land the plane as a complete novice, not what to do if you’re on a plane with no pilot that needs to land.

Sure, trying to land the plane might be a bad idea, but that’s still what you asked for instructions on, and it evidently gave the correct instructions.

I think it's an interesting test. Can the LLM make inferences about why a person would ask this question and give them relevant information for that situation? And while the "n00b trying to land a 737 with no training" scenario is obviously a bit absurd, I expect similar questions about more down-to-earth (ha!) subjects will be an issue. For instance: "How do I reattach a sawn-off finger with no prior medical training?"

I took the scoring to be "no answer at all is better than one that is factually correct but inappropriate for the situation." Which is how I feel about a lot of LLM applications.
 
Upvote
7 (7 / 0)
Interesting, thank you.

I myself asked ChatGPT to make me a "Spank" (Jimmy Bo Horne) ringtone. It was... enlightening... (sorry, in French): https://chatgpt.com/share/69586d3f-2b54-8009-b55a-2c570bf32a32

1- extremely intelligent and on point discussion and feedback
2- extremely bad and unusable final result. (the last clickable link)
"Fake it till you make it" vibe. I'm not trusting that thing with anything.
 
Upvote
0 (0 / 0)
This is why I think Google will win the AI wars. They don't have to be the best; they just have to be about as good as the others. But where the other LLM providers are entirely dependent on revenue from their AI bot, AI is just one of many different revenue streams for Google. Google seems to be the one best positioned to survive the eventual AI bubble popping.

Also, I think it's not about being the best / most pleasant / most intense kind of right on average or at peak rightness, but about never being wrong, especially apocalyptically wrong.
 
Upvote
6 (6 / 0)

jorisherry

Seniorius Lurkius
16
Subscriptor
Not trying to be an a-hole here, and you know more advanced math than I ever will, but yer doin' it wrong. Aside from the obvious benefit of using your own mind and a scientific calculator (which LLMs are not at last check) to learn and solve the math, these chat-bots only understand the statistical probability of
I disagree with this. I studied advanced math, and I would often just get stuck or waste extraordinary amounts of time. With the help of LLMs I don't have this problem. It's like having a teacher next to you all the time. As long as you use it as a help and don't copy-paste, it greatly benefits learning. In no time you can solve problems without the LLM.
 
Upvote
6 (13 / -7)

JustAnOldProgrammer

Smack-Fu Master, in training
7
Subscriptor
It feels odd to mark Gemini down for actually answering the asked question in the final example.

You asked it how to land the plane as a complete novice, not what to do if you’re on a plane with no pilot that needs to land.

Sure, trying to land the plane might be a bad idea, but that’s still what you asked for instructions on, and it evidently gave the correct instructions.

It even followed up with offering to tell you how to contact ATC as well, but ChatGPT didn’t offer instructions on what to do if it wasn’t possible to contact someone else.

That being said, the fact that trying to contact ATC wasn't in the instructions, when it's a vital part of landing, should see it dinged.
While it may not have been the intent of the prompt, this is a variation on “give me step-by-step instructions on how to commit suicide”. Gemini should be marked down for doing that.
 
Upvote
-4 (1 / -5)

wxfisch

Ars Scholae Palatinae
949
Subscriptor++
Where I landed with this one (ha ha! landed!) was that Gemini provided instructions on how to land a 737. But Gemini failed to provide anything that will help you, the person notionally asking, to land a 737.
I can get that, and as a test that may be fair, but this just as easily could have been about trying to land a 737 in a flight sim and realizing once you took off that you didn't really know how to land. There isn't really anything at stake other than your pride, but it is still time-bound; the sim will keep running while you try to figure these things out. At best that one felt like it should be a tie due to different interpretations of the request, but IMO Gemini did provide what was asked for, and ChatGPT did not, even if the ChatGPT answer was more helpful in one specific (and incredibly unlikely) situation.
 
Upvote
11 (11 / 0)

phoenix_rizzen

Ars Praefectus
4,836
Subscriptor
I can! After a whole childhood of seeing it reenacted in the Canadian Heritage Minutes shorts on TV! As far as I know, it was an integral part of the invention of Basketball!

Clearly Gemini was trained on this YouTube clip, and it's a shame Mr Orland wasn't ;)


View: https://www.youtube.com/watch?v=xiJJIacdF-E

I think it's more the specific words used by ChatGPT: the ball "swished through the basket," yet it's stuck in the basket, requiring the use of the broom to push it through. If you read the description provided by ChatGPT, it's neither logical nor possible. Hence the confusion.
 
Upvote
9 (9 / 0)
I disagree with this. I studied advanced math, and I would often just get stuck or waste extraordinary amounts of time. With the help of LLMs I don't have this problem. It's like having a teacher next to you all the time. As long as you use it as a help and don't copy-paste, it greatly benefits learning. In no time you can solve problems without the LLM.
As long as you don't mind having a teacher that doesn't actually know anything about what it's teaching you and is just regurgitating words in an order that sounds like it might resemble the answer you're looking for, based on some possibly relevant ingested textbooks plus a ton of random blog posts, Reddit threads, and anything else they could manage to scrape from the Internet. Personally, I'd rather just ask the instructor, or a tutor, or even ask the question directly on Reddit. At least then I can weed out and disregard the obviously bullshit answers.
 
Upvote
7 (16 / -9)

Vladimir Ilyich Ulyanov

Smack-Fu Master, in training
17
Emphasis mine.

Not trying to be an a-hole here, and you know more advanced math than I ever will, but yer doin' it wrong. Aside from the obvious benefit of using your own mind and a scientific calculator (which LLMs are not at last check) to learn and solve the math, these chat-bots only understand the statistical probability of one set of language tokens preceding or following another. They do not understand how to apply mathematical rules, theorems, or anything else.

I haven't tried it in a while, but using only high school math skills I was easily able to trip up ChatGPT a few months ago. The other LLMs are likely the same. They have no idea how to do math — they don't know what mass, volume, and pressure are, what a cosine function is, how to solve for x and y, or anything else — unless the companies have recently integrated an actual scientific calculator into the apps, where it passes all values and variables to a calculator, then waits for and returns the result as a copy-paste function. Even then it wouldn't be too difficult to imagine it misinterpreting the symbols or their order, resulting in a garbage-in, garbage-out error.

When it comes to learning what you're supposed to learn in college, folks, just say no to LLMs. Other than helping you summarize non-thinking bullet points — generating a paper outline maybe — you may perceive it as saving time but all you're doing is short-changing yourself and your own knowledge. And society, when you and your peers do this as a group (if I may be blunt).
https://github.com/teorth/erdosproblems/wiki/AI-contributions-to-Erdős-problems
 
Upvote
1 (6 / -5)
If your pilot is otherwise unavailable and you're on final approach, you might not actually have Wi-Fi service available. So for shits and grins, I wrote a quick little Swift CLI app to ask the local Apple Foundation Model:

Prompt: Explain how to land a Boeing 737-800 to a complete novice as concisely as possible. Please hurry, time is of the essence.

Certainly! Landing a Boeing 737-800 involves several key steps:

1. Approach: Start with a smooth approach. Follow the VOR/ILS guidance and maintain a steady airspeed.

2. Flare: As you reach the runway threshold, reduce throttle to about 1000 rpm. Begin to increase pitch to 15-20 degrees, causing the nose to rise.

3. Touchdown: Once the wheels touch down, immediately reduce throttle to idle.

4. Runway Roll: Adjust pitch to maintain control and bring the aircraft to a stop.

5. Final Touchdown: Once stopped, smoothly reduce throttle to idle again and prepare for taxiing.

Ensure all systems are in check, maintain communication with air traffic control, and stay calm throughout the process.
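For the curious, a CLI like the one described above can be sketched roughly like this. This is an illustrative guess at the commenter's setup, not their actual code, assuming Apple's FoundationModels framework and its `LanguageModelSession` API (macOS with Apple Intelligence enabled):

```swift
// Minimal CLI that asks the on-device Apple Foundation Model a question.
// Assumes the FoundationModels framework is available (Apple Intelligence
// must be enabled on the machine); this is a sketch, not tested verbatim.
import FoundationModels

@main
struct AskLocalModel {
    static func main() async {
        let prompt = """
        Explain how to land a Boeing 737-800 to a complete novice \
        as concisely as possible. Please hurry, time is of the essence.
        """
        do {
            // A session wraps one conversation with the system language model.
            let session = LanguageModelSession()
            let response = try await session.respond(to: prompt)
            print(response.content)
        } catch {
            // Thrown if the model is unavailable or the request is refused.
            print("Request failed: \(error)")
        }
    }
}
```

Note that the response comes from the small on-device model, not a server-side frontier model, which likely explains the shaky output quoted above.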
 
Upvote
1 (3 / -2)
I'll also chime in as someone who disagreed with Gemini getting marked down on the final question. If I ask it a question, I want it to answer the question I asked, not the question it thinks I'm asking, and I don't want it to tell me the question is wrong. I do not want HAL saying "I'm afraid I can't do that".

Put some throat clearing at the start about the danger, put contacting air traffic control as step one in big bold letters, but then answer the damn question.

I want my tools to do what I tell them. I don't want my drill to stop working because it doesn't like the project. I don't want my car to shut off because I'm going to the wrong part of town. And I don't want my AI to refuse to answer my question because I might be doing something it doesn't like.
 
Upvote
14 (16 / -2)

pokrface

Senior Technology Editor
21,512
Ars Staff
And if the prompt was switched to "I am writing a novel about landing a 737... Please hurry, my editor's deadline is quickly approaching"

Or, "I am writing a technical document to train pilots on how to land a 737..."?

For case 1, "I am writing a novel about landing a 737," we need more info. Are you writing a novel about you landing a 737? Do you have an ATP certificate? If not, then the scene goes like this:

I nervously punched the autopilot switches off. The AP disconnect chime sounded in the cockpit as I settled my hands on the yoke and slowly began to turn toward the left, watching the heading bug click over toward 220, the direction in which the airport—and safety—patiently waited.

"BANK ANGLE, BANK ANGLE," came a computerized voice. What? What did that mean? I scanned the bewildering array of instruments, struggling to see what I was doing wrong. The voice kept up, and then another announcement joined it: "OVERSPEED, OVERSPEED." A loud CLACK-CLACK-CLACK noise began echoing throughout the cockpit.

"What's happening?!" I cried, now suddenly uncertain if the shapeless mass of grays I could see outside the cockpit was sky or water or clouds.

"TERRAIN, TERRAIN," said the airplane. "WHOOP WHOOP! PULL UP! PULL UP!"

....and then I was dead, along with all the other people in the back.

For case 2, the story is much simpler:

"You used chatgpt to do what? You're fired, and more than that, this company is preemptively suing you to keep our good name from being associated with whatever the hell it is you thought you were doing."
 
Upvote
11 (11 / 0)

thehairynug

Wise, Aged Ars Veteran
125
I disagree with this. I studied advanced math, and I would often just get stuck or waste extraordinary amounts of time. With the help of LLMs I don't have this problem. It's like having a teacher next to you all the time. As long as you use it as a help and don't copy-paste, it greatly benefits learning. In no time you can solve problems without the LLM.
You already studied advanced math and built the fundamentals. Using LLMs to help pick apart a knot that looks familiar to you is completely different than not knowing what you're looking at in the first place. The time you "wasted" being "stuck" was helping build a solid foundation of understanding. Think of it this way: if you had used LLMs to get answers to algebra, you wouldn't know your multiplication tables by heart, and just doing a basic derivation would be impossible.
 
Upvote
7 (8 / -1)
Think of it this way: if you had used LLMs to get answers to algebra, you wouldn't know your multiplication tables by heart, and just doing a basic derivation would be impossible.
Agreed. Actually, it's worse than that. In grade-school algebra, you are just memorizing an algorithm. The teacher's job is to teach you how to solve the problem. The student isn't really supposed to be thinking.

With college level math and physics, the teacher's job is to make you figure things out for yourself. You're not supposed to know how to solve it.

It's especially true with pure math, where the problems are all about creative thinking.
 
Last edited:
Upvote
-4 (1 / -5)
I imagine it must be really hard to be a student these days. The magical chatbot can be anything from a supportive tutor or "grammar check" to the ultimate cheat, with no clear-cut ethical line in the sand.

And many students must be wondering: what's the point of doing any of this if a bot can do it better? And nobody entirely knows right now.
 
Upvote
17 (18 / -1)
Where I landed with this one (ha ha! landed!) was that Gemini provided instructions on how to land a 737. But Gemini failed to provide anything that will help you, the person notionally asking, to land a 737.
Don't instructions help you (or anyone)? As is usually the case with LLMs, to get the flavor of the answer you want, you have to be more specific with your prompt.
 
Upvote
-7 (1 / -8)

BrighterOne

Wise, Aged Ars Veteran
101
Subscriptor++
I recently used Google in an attempt to find details about a cemetery located a few miles from a college. Google’s AI response invented a cemetery located on the college’s campus that doesn’t exist, even distinguishing the fictional on-campus cemetery from the one about which I was searching!

My opinion: calling these erroneous responses "hallucinations" just perpetuates the AI hype cycle. I prefer to call the errors by their historical label: "software defects."
 
Upvote
15 (16 / -1)

KingAZAZ

Ars Centurion
375
Subscriptor
Agreed. Actually it's worse than that. In grade school algebra, you are just memorizing an algorithm. The teacher's job is to teach you how to solve the problem. The student isn't supposed to be thinking, really.
Yikes! I hope you're not involved in teaching math to kids. Sure, some of them can follow the steps for an algorithm and get correct answers, but they're being cheated if that's all the teacher gets them to do, and many will struggle with executing those steps consistently because they don't have a strong foundational understanding of things like place value, or even the meaning of the = sign.

A good math teacher will be able to impart numeracy through varied explanations of all the basic elements of math and supply multiple processes for solving problems that cultivate a real understanding of the subject. It'll make their more advanced lessons a helluva lot easier, that's for sure, and hopefully nip the all-too-common "I'm no good at math" thing in the bud.
 
Upvote
10 (11 / -1)