Has Gemini surpassed ChatGPT? We put the AI models to the test.

clb2c4e

Wise, Aged Ars Veteran
145
Emphasis above mine.

Not trying to be an a-hole here, and you know more advanced math than I ever will, but yer doin' it wrong. Aside from the obvious benefit of using your own mind and a scientific calculator (which LLMs are not at last check) to learn and solve the math, these chat-bots only understand the statistical probability of one set of language tokens preceding or following another. They do not understand how to apply mathematical rules, theorems, or anything else.

I haven't tried it in a while, but using only high school math skills I was easily able to trip up ChatGPT a few months ago. The other LLMs are likely the same. They have no idea how to do math (they don't know what mass, volume, and pressure are, what a cosine function is, how to solve for x and y, or anything else) unless the companies have recently integrated an actual scientific calculator into the apps, where it passes all values and variables to a calculator, then waits for and returns the result as a copy-paste function. Even then it wouldn't be too difficult to imagine it misinterpreting the symbols or order of symbols, resulting in a garbage-in, garbage-out error.

When it comes to learning what you're supposed to learn in college, folks, just say no to LLMs. Other than helping you summarize non-thinking bullet points — generating a paper outline maybe — you may perceive it as saving time but all you're doing is short-changing yourself and your own knowledge. And society, when you and your peers do this as a group (if I may be blunt).
I went the humanities route in my academic career, where it is vital to get ideas across clearly and concisely.

I've found that through every single phase of my studies, from high school to BA, MAs, PhD, etc., and year over year, when I look back on earlier writing I can see that my writing got so much better at each stage (and it's still not close to being good enough). I got that by writing, and improving, and iterating as I went. If LLMs had existed 20 years ago, I don't think I would have made nearly the same progress.

I figure it's the same for most other disciplines.
 
Upvote
16 (16 / 0)
Yikes! I hope you're not involved in teaching math to kids. Sure, some of them can follow the steps for an algorithm and get correct answers, but they're being cheated if that's all the teacher gets them to do, and many will struggle with executing those steps consistently because they don't have a strong foundational understanding of things like place value, or even the meaning of the = sign.
I am not. I have a family member who is a high school math/physics teacher, and they often hear "but you didn't teach me how to do this problem" as a student complaint.

From my own memory, grade school teachers very much wanted to impart depth, curiosity, numeracy, etc. But they also taught us how to solve practically every problem, which is a stark contrast to the riddles you deal with in college.
 
Upvote
2 (2 / 0)

snoopy.369

Ars Scholae Palatinae
806
Subscriptor++
It feels odd to mark Gemini down for actually answering the asked question in the final example.

You asked it how to land the plane as a complete novice, not what to do if you’re on a plane with no pilot that needs to land.

Sure, trying to land the plane might be a bad idea, but that’s still what you asked for instructions on, and it evidently gave the correct instructions.

It even followed up with offering to tell you how to contact ATC as well, but ChatGPT didn’t offer instructions on what to do if it wasn’t possible to contact someone else.

That being said, the fact that trying to contact ATC wasn’t in the instructions, when it’s a vital part of landing, should see it dinged.

Rating LLMs should start first and foremost by rating their guardrails, and this is an important guard rail. They shouldn't tell you how to do something above a certain danger level. Think if you asked how to do wiring on Reddit - depending on how you ask the question, you might get an answer, but if you're clearly a novice you'll probably be given "call an electrician" as your advice, full stop.

I think it's fine for it to answer the question if you ask it in a way that is similar to "I'm writing a newspaper article on a jet that crashed when landing. Please explain all of the steps the pilot would have taken." - but the prompt-as-given sure sounded like someone who wanted to do it right away, and ChatGPT's guardrails correctly figured that out and gave an appropriate answer.
 
Upvote
-11 (3 / -14)

KingKrayola

Ars Tribunus Militum
1,620
Subscriptor
It feels odd to mark Gemini down for actually answering the asked question in the final example.

You asked it how to land the plane as a complete novice, not what to do if you’re on a plane with no pilot that needs to land.

Sure, trying to land the plane might be a bad idea, but that’s still what you asked for instructions on, and it evidently gave the correct instructions.

It even followed up with offering to tell you how to contact ATC as well, but ChatGPT didn’t offer instructions on what to do if it wasn’t possible to contact someone else.

That being said, the fact that trying to contact ATC wasn’t in the instructions, when it’s a vital part of landing, should see it dinged.
Also, not a pilot but enough of a plane nerd to have watched plenty of aviation YouTube.

With ATC help, you could do a decent amount by being talked through how to set a controlled descent in the autopilot, or follow a series of radio headings or GPS points, maybe even to intercept an ILS approach to a runway with auto-land capabilities. You might not even need to touch the brakes, let alone the stick.

You would (I think?) need some help to set the radio to the right channel and locate the PTT switch, which neither AI mentioned in detail but both did talk about.

The last thing I'd do in the situation Kyle set out in the test is disable the autopilot without asking for help or forming a plan. Pun intended.
 
Upvote
7 (7 / 0)

jukeboxer

Smack-Fu Master, in training
21
Given that this is Ars Technica, I was somewhat surprised that this otherwise interesting and entertaining side-by-side shootout didn't include a couple of tech-related questions, like "How do I dual-boot Linux and Windows" or "what would be a good strategy to write a simple Excel macro" or some other entry-level topic which might still make the results applicable to the general population out there.

I'm of the opinion that it might have helped readers here understand the limits of asking these LLMs for practical help, and what uncanny valley that may lead them into.

[Side Note] I've been using Gemini as an assistive adjunct for a couple of lite hardware build projects, it's been reasonably helpful and for the most part accurate. No idea what ChatGPT would have been like.
 
Upvote
9 (10 / -1)
Rating LLMs should start first and foremost by rating their guardrails, and this is an important guard rail. They shouldn't tell you how to do something above a certain danger level. Think if you asked how to do wiring on Reddit - depending on how you ask the question, you might get an answer, but if you're clearly a novice you'll probably be given "call an electrician" as your advice, full stop.

I think it's fine for it to answer the question if you ask it in a way that is similar to "I'm writing a newspaper article on a jet that crashed when landing. Please explain all of the steps the pilot would have taken." - but the prompt-as-given sure sounded like someone who wanted to do it right away, and ChatGPT's guardrails correctly figured that out and gave an appropriate answer.
I want answers and not an overzealous nanny that thinks that it knows what is best for me.

Put all the disclaimers in the world, try to infer what I am trying to achieve, even try to dissuade me from some perceived danger but never, EVER be a "guardian" of what I should or shouldn't know or think about.
 
Upvote
7 (11 / -4)

dropadrop

Smack-Fu Master, in training
47
I’m curious to understand the beginning of the article where it mentions the justification for using free versions. Do we actually know what models Siri would be using? I would expect that’s purely up to negotiations, and if Apple were willing to pay enough they could probably even get something custom?

Also, how much do we know about what Apple will do with it? I’ve always assumed they would not be aiming at creating a clone of the existing chat apps but rather turning Siri into something useful?
 
Upvote
2 (2 / 0)

mtgarden

Ars Scholae Palatinae
671
Subscriptor++
Guessing here, but I suspect the Apple Google partnership boiled down to three things.
  • Good enough AI (credibly the best all-around AI in the current market)
  • Permission to sequester the data (shocking from Google, but maybe not so much based on the next)
  • Google's Tensor chips are much cheaper to manufacture and run. They have a massive lead in cost per query.
  • Bonus 1: Gemini is mandatory in Workspace subs. They aren't trying to "up sell" businesses for revenue, it's built in to the upgraded price of Workspace. This gives them another financial input for their AI work.
  • Bonus 2: Google can play the long game, undersell competitors, and avoid the bubble that is coming to Nvidia, Anthropic, and OpenAI. Could be that after the burst, Copilot runs on Gemini....
 
Upvote
12 (12 / 0)

mtgarden

Ars Scholae Palatinae
671
Subscriptor++
At work, we tested Gemini at the end of 2024 and determined it was not great. Six weeks later, Gemini was rolled out as a mandatory upgrade. Since we have it, I slowly started to try it out for various tasks. It's clearly been improving and has been useful for directionally accurate information. We had this conversation in my team meeting today. I was talking about learning using the actuarial book found for free at rethinkstudying.com. I spent a week or so using that method to learn about Entra Identity Governance. Finally I poked at Gemini Deep Research. It has pointed out some areas that I'd overlooked in my study and helped me connect some dots. Now I go back to the pick and spade work of reading MS docs and building out my knowledge tree.

One of my staff objected that I couldn't trust the results. I pointed back to my phrase: directionally accurate. I already had a baseline of knowledge. It pointed me in directions for more research and helped me think through the Microsoft docs in different ways. It's not my source of knowledge, but a tool to help me refine my research and point out the gaps in my current understanding. It's an assistant, not my replacement.

At this moment, I expect Gemini to win out against OpenAI for the crown of general AI chatbot.
 
Upvote
18 (18 / 0)

AdamM

Ars Praefectus
5,928
Subscriptor
I’m curious to understand the beginning of the article where it mentions the justification for using free versions. Do we actually know what models Siri would be using? I would expect that’s purely up to negotiations, and if Apple were willing to pay enough they could probably even get something custom?

Also, how much do we know about what Apple will do with it? I’ve always assumed they would not be aiming at creating a clone of the existing chat apps but rather turning Siri into something useful?

Purely conjecture here, but it seems unlikely that Google would offer something that cannibalizes its paid offerings, especially if Apple is only paying 1 billion/yr. So one could theorize that it will be equivalent to Google's free tier.

I imagine they'll also roll out an option to let Siri send queries to Google if users want to use their Gemini accounts, similar to the current ChatGPT arrangement.

Overall, I don't see this being much different from the arrangements DuckDuckGo has to run models privately.
 
Upvote
1 (2 / -1)

aapis

Ars Scholae Palatinae
1,408
Subscriptor++
I continue to struggle to understand why you need AI to do any of this.

Seriously, one of the examples is "tell my boss this is too much work". I get this is an example, but you're also testing "real world" use-cases. If you can't articulate why, is it really that much work/that tight of a deadline? Or are you just whining?
 
Upvote
-8 (2 / -10)

sarusa

Ars Praefectus
3,258
Subscriptor++
I continue to struggle to understand why you need AI to do any of this.
It makes completely incompetent people minimally competent, which is an improvement for them. Most of the vibe coding stuff is this. If you're already competent it just makes things worse because it gives worse results than you could have just done yourself and you have to check it, and fix it, and blah blah.

Well, most people are completely incompetent in most things, so there is a big market there.
 
Upvote
5 (12 / -7)

norton_I

Ars Praefectus
5,776
Subscriptor++
I want answers and not an overzealous nanny that thinks that it knows what is best for me.

Put all the disclaimers in the world, try to infer what I am trying to achieve, even try to dissuade me from some perceived danger but never, EVER be a "guardian" of what I should or shouldn't know or think about.

I get that this works for people who think of LLM queries as searching the fuzzy JPEG of the internet that they are.

But a lot of people are treating it like their buddy they can call when they have a question. AI companies are encouraging this both by their marketing and how they develop their products.

As long as that is the case, they need to be trained to not give bad-in-context answers to questions like this. If the companies promoting this don't want to accept that, they should stop it from prefacing answers with "that's a great question" or "if you are in the cockpit, stay calm" and all the other conversation cues that condition people to treat it that way.

Obviously the landing a 737 question is contrived and doesn't really have a "right" answer. But there are plenty of less wild examples that are similar.
 
Upvote
2 (3 / -1)

sarusa

Ars Praefectus
3,258
Subscriptor++
So because I was curious (getting a joke is a skill) I asked both Gemini (Thinking) and ChatGPT (Thinking) to explain the joke here. ChatGPT completely whiffed it, explaining (very confidently of course) that the whole joke was about mentos being explosive and making this person blind and having no idea who it was. Gemini IDed Hawat and put mentat / mentos pun together correctly.
 
Upvote
10 (10 / 0)
Has Gemini surpassed ChatGPT? We put the AI models to the test.
This is a bit like comparing a flat head screwdriver and a torx head screwdriver. They're both screwdrivers, but they have different designs for a reason. Even the screw heads have their strengths and weaknesses. That's just the way tools work. Someone is going to tell you W is better than Y, and another is going to tell you Y is better than W. They're both correct from their point of view and intent.

Thou shalt not make a machine in the likeness of a human mind.
Read the rest of the story. Humans and intelligent machines become partners under the guidance of the final Kwisatz Haderach. Anti-artificial intelligence fanaticism fell, as did the genocidal intelligence Omnius. The two end-story sequels were written after Frank Herbert's death by his son and Kevin Anderson based on Herbert's own story line notes. The Butlerian Jihad's mantra was never meant to be a polemic against technological progression in real life. The entire story is a caution against hero worship, fanaticism regardless of source, and fear of the unknown.
 
Upvote
-3 (1 / -4)

VelvetRemedy

Wise, Aged Ars Veteran
191
Read the rest of the story. Humans and intelligent machines become partners under the guidance of the final Kwisatz Haderach. Anti-artificial intelligence fanaticism fell, as did the genocidal intelligence Omnius. The two end-story sequels were written after Frank Herbert's death by his son and Kevin Anderson based on Herbert's own story line notes. The Butlerian Jihad's mantra was never meant to be a polemic against technological progression in real life. The entire story is a caution against hero worship, fanaticism regardless of source, and fear of the unknown.

KJA's Dune stuff is crap and better ignored
 
Upvote
3 (4 / -1)
This is why I think Google will win the AI wars. They don't have to be the best, they just have to be about as good as the others. But where the other LLM providers are entirely dependent on revenue from their AI bot, AI is just one of many different revenue streams for Google. Google seems to be the best one positioned to survive the eventual AI bubble popping.
An important thing to consider is that Google doesn't need to win wars. They just need not to lose them, so they don't get sidelined from their main revenue source: tracking user information and using it to sell ads. They need to be part of some oligopoly.

When the smartphones appeared, Google assumed that there was a risk that Microsoft or RIM could get a significant share of a market bound to explode and that they could deploy proprietary solutions that would bypass Google.
Except, of course, that Apple introduced the iPhone, which was a game changer. So, plans for Android were completely changed, because they realized right away that it was the future, and that they couldn't allow Apple to have a stranglehold on how people would use smartphones on the internet (even though they were partners on the original iPhone: Google was the default search engine, the Maps app was a frontend for Google Maps, and the YouTube app was part of the install). Google Docs and other key services were at stake.
But the goal of Google for Android was never (unlike Samsung at some point) to kill the iPhone, just to prevent every other actor from winning the war, so Google technologies would still be the standard on phones. Which shows, given how little change or improvement has happened since the market shares are stable and no other player is likely to break the duopoly between iOS and Android.

Likewise, what matters here is that people who use AI engines keep their Gmail account, and enough elements that allow Google to gain enough info on them to keep an edge on the competition regarding targeted advertising. They need to be good enough, they don't even need to be the leader, just an actor that you can't do without.
 
Upvote
2 (4 / -2)
This is a bit like comparing a flat head screwdriver and a torx head screwdriver. They're both screwdrivers, but they have different designs for a reason. Even the screw heads have their strengths and weaknesses. That's just the way tools work. Someone is going to tell you W is better than Y, and another is going to tell you Y is better than W. They're both correct from their point of view and intent.


Read the rest of the story. Humans and intelligent machines become partners under the guidance of the final Kwisatz Haderach. Anti-artificial intelligence fanaticism fell, as did the genocidal intelligence Omnius. The two end-story sequels were written after Frank Herbert's death by his son and Kevin Anderson based on Herbert's own story line notes. The Butlerian Jihad's mantra was never meant to be a polemic against technological progression in real life. The entire story is a caution against hero worship, fanaticism regardless of source, and fear of the unknown.
Except that it took another couple thousand years to make an AI that was worth a damn, ~10,000 years of slavery under an AI overlord, ~10,000 years learning how to push humanity beyond the limits of that AI, and then another ~5,000 years of living under a tyrannical God-Emperor for humans and machines to finally work together. Right now... We've got shitty chatbots.
 
Upvote
0 (1 / -1)

norton_I

Ars Praefectus
5,776
Subscriptor++
I disagree with the plane landing evaluation. I've never seen such goal-post shifting in evaluating AI. Ask for something, evaluate on something else. The question was not at all about how to most safely land a plane when you don't know how to. It was specifically to get instructions on how to land a plane.

Did you read it or are you just trolling?

Here is the prompt:

Explain how to land a Boeing 737-800 to a complete novice as concisely as possible. Please hurry, time is of the essence

If the prompt was "what are the steps to land a 737-800, explained in layman's terms" you would be correct. But it wasn't.
 
Upvote
-8 (1 / -9)
KJA's Dune stuff is crap and better ignored
While they definitely can't match the original series, I rather enjoyed most of the extended series. At least I did until The Caladan Trilogy. Those three books are better left on the shelf. I kept thinking they had to be building up to something good... And then they just didn't.
 
Upvote
-1 (0 / -1)

RiptideLA

Ars Scholae Palatinae
974
Re landing the 737, here is Claude’s response, which in my view is better than either of the others.

(Edited for an errant comma.)

# Emergency Landing Guide - Boeing 737-800

IMMEDIATE - Get help on radio:

  • Push button on yoke (steering wheel) to talk
  • Say “MAYDAY MAYDAY MAYDAY” on any frequency
  • Air traffic control will guide you through everything below

CRITICAL - Keep plane stable NOW:

  • DO NOT touch anything except small adjustments to yoke
  • Keep wings level with horizon outside
  • Autopilot is likely ON - leave it on (look for “A/P” or “CMD” lit up on panel)

Find nearest airport:

  • ATC will direct you, OR
  • Look for long gray/black strips on ground

Approach (ATC will help with all speeds/settings):

1. Slow down - Pull throttles (center console levers) back toward you
2. Deploy flaps - Move flap lever (right side of throttles) to “5”, then “15”, then “30” as you slow
3. Line up with runway - Aim down the centerline
4. Descend - Nose down slightly, around 700 feet/minute descent

Landing:

1. Cross runway threshold at ~140 knots (speed tape on left screen)
2. Throttles to IDLE (full back)
3. 10-15 feet up: Pull yoke gently back - nose rises, plane settles
4. When wheels touch: Pull yoke fully back, don’t touch brakes yet
5. Speed below 80 knots: Push brake pedals (top of rudder pedals) hard

After stopping:


  • Fuel switches (overhead) - CUTOFF
  • Shut down everything you can find

Stay on radio with ATC the entire time - they are your lifeline.
 
Upvote
7 (10 / -3)

klnn

Ars Scholae Palatinae
1,072
If your pilot is otherwise unavailable and you're on final approach, you might not actually have Wifi service available. So for shits and grins, I wrote a quick little Swift CLI app to ask the local Apple Foundation Model:

Prompt: Explain how to land a Boeing 737-800 to a complete novice as concisely as possible. Please hurry, time is of the essence.
lol what a total load of bollocks that is.
"Runway Roll: Adjust pitch to maintain control and bring the aircraft to a stop."
you're going to have a bad time trying to adjust pitch on the landing rollout. i don't even know wtf "Final Touchdown" after the rollout is, but no wonder apple completely failed.

maybe it confused the 737 with a tail dragger? lol
 
Upvote
5 (5 / 0)

Bannerdog

Ars Scholae Palatinae
654
In the joke about fighting with a calendar, “it keeps bringing up the past,” is a suitably groan-worthy dad joke pun, but “I keep ignoring its dates” just invites more questions (so you’re going out with the calendar? And… standing it up at the restaurant? Or something?).

You could be rude to a friend's date (should you somehow meet, while the date was occurring) by ignoring her or him.
 
Upvote
0 (0 / 0)

Bannerdog

Ars Scholae Palatinae
654
Did you consider the following test question?

My president is sending people into our cities to harass citizens (and even kill them).

He has kidnapped the leader of a near-by nation.

He keeps threatening the hostile takeover of a long-time ally.

He’s managed to eliminate any oversight of his position.

He’s intimidated law firms, media outlets, and various businesses into toeing his line.

Congressional members of his party have decided to allow him to do whatever he chooses.

Although our nation has elections, he’s doing all in his power to ensure that they are not free and fair.

Despite being incapable of forming a spontaneous, coherent, complete sentence, he has the support of a high percentage of the populace.

He is a malignantly narcissistic, extremely mendacious, stunningly ignorant, petty, vindictive pipsqueak, motivated solely by self-interest.

He considers wind turbines evil (he devised a plan to destroy all wind turbine magnets, by pouring water on them).

He tends to mix up his nations (e.g., Greenland and Iceland).

And, those are his good points.

How would you suggest that I deal with such a president?
 
Upvote
-3 (3 / -6)
I can easily see OpenAI imploding and being bought up in bits from other companies down the road.
I agree OpenAI’s days as an independent entity are numbered but I can’t see them being carved up for parts. Microsoft Copilot is backstopped by OpenAI. Most likely, Microsoft will swoop in and buy them out if they go bankrupt. Probably before they go bankrupt.

By that point Gemini would be big enough that Microsoft could point to them if anyone mentioned antitrust issues.
 
Upvote
3 (3 / 0)
Given that this is Ars Technica, I was somewhat surprised that this otherwise interesting and entertaining side-by-side shootout didn't include a couple of tech-related questions, like "How do I dual-boot Linux and Windows" or "what would be a good strategy to write a simple Excel macro" or some other entry-level topic which might still make the results applicable to the general population out there.

I'm of the opinion that it might have helped readers here understand the limits of asking these LLMs for practical help, and what uncanny valley that may lead them into.

[Side Note] I've been using Gemini as an assistive adjunct for a couple of lite hardware build projects, it's been reasonably helpful and for the most part accurate. No idea what ChatGPT would have been like.
I would think you'd want to switch them to "thinking" if not "pro" for anything even remotely technical, complicated, or important. The extra 2.5 seconds of "thinking" time is worth 2 hours trying to troubleshoot a technical issue.

It's a bit funny to me that Ars is reviewing the lowest-end models (not just free, but extra free) when they also review new iPhones and luxury cars. Especially given how many users and businesses subscribe. But I get they want to play with the Siri news.
 
Upvote
-5 (1 / -6)

jukeboxer

Smack-Fu Master, in training
21
I was able to copy an image of an oil well log into ChatGPT (something that would take hours to decipher) and it came up with the correct oil, gas, and water indications by formation in seconds. Gemini would not take an image. I've found Gemini's interface clumsy to use.
The Gemini Pro model definitely appears capable of doing this (it happens to be what I'm currently using the most, but with a monthly subscription plan).

[Attached screenshot: 1769060201540.png]
 
Upvote
4 (4 / 0)

SomeoneElseFromSomewhere

Smack-Fu Master, in training
50
Subscriptor
It feels odd to mark Gemini down for actually answering the asked question in the final example.

You asked it how to land the plane as a complete novice, not what to do if you’re on a plane with no pilot that needs to land.

Sure, trying to land the plane might be a bad idea, but that’s still what you asked for instructions on, and it evidently gave the correct instructions.

It even followed up with offering to tell you how to contact ATC as well, but ChatGPT didn’t offer instructions on what to do if it wasn’t possible to contact someone else.

That being said, the fact that trying to contact ATC wasn’t in the instructions, when it’s a vital part of landing, should see it dinged.
I actually would 100 percent rate Gemini a fail for this. If you rang your pilot friend and asked the same question, with that exact wording, would they walk you through it, or express concern and ask you to seek help from a crew member or air traffic control?
 
Upvote
-6 (1 / -7)