OpenAI hits back at DeepSeek with o3-mini reasoning model

Just tried it with this prompt:
"Generate an RPG game in Python where the main character goes around and collects animals in small boxes. The objective is to collect all 30 species of animals. Include a combat system in the game and NPCs that the player can interact with."

It worked and the game functioned properly. Naturally graphics are crap because it can't generate sprites.

Google "python pokemon game" or similar.

It's good to know that it can create something that runs when it has many existing examples it learned from, but your prompt is nothing like [a game] design document, since it has two general requirements and no concrete specifications.

Compare that to a coding task you'd do for work that requires specific inputs and outputs that satisfy a set of rules, with a well-defined UX for human use or contracts for a service.
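For scale, here's roughly the skeleton that prompt boils down to once you strip the flavor text: a generic collect-and-fight loop of the kind countless "python pokemon game" tutorials already cover. (A hypothetical minimal sketch, not o3-mini's actual output; every name in it is made up.)

Code:
# Hypothetical minimal sketch of the generic skeleton such a prompt maps onto;
# NOT o3-mini's actual output. All names are invented for illustration.
import random

SPECIES = [f"species_{i}" for i in range(1, 31)]  # "all 30 species of animals"

def combat(player_hp):
    """Trivial combat system: trade blows until one side drops."""
    animal_hp = random.randint(3, 8)
    while player_hp > 0 and animal_hp > 0:
        animal_hp -= random.randint(2, 5)      # player attacks
        if animal_hp <= 0:
            return player_hp, True             # caught: animal goes in a small box
        player_hp -= random.randint(0, 2)      # animal retaliates
    return player_hp, False

def main():
    player_hp, boxes = 20, set()
    while len(boxes) < len(SPECIES) and player_hp > 0:
        animal = random.choice(SPECIES)
        if animal in boxes:
            # "NPCs that the player can interact with"
            print("NPC: 'Keep at it!' (restores your HP)")
            player_hp = 20
            continue
        player_hp, caught = combat(player_hp)
        if caught:
            boxes.add(animal)
            print(f"Boxed {animal} ({len(boxes)}/{len(SPECIES)})")
    print("All 30 species collected!" if player_hp > 0 else "Game over.")

if __name__ == "__main__":
    main()

Two general requirements, no concrete specs: the model is free to fill every gap with whatever its training examples usually do.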
 
Last edited:
Upvote
30 (32 / -2)

WXW

Ars Scholae Palatinae
1,161
I don't get this argument. Sure, AI isn't perfect. But neither are things like weather forecasts. They can still be useful, though.

Even reference materials like university textbooks contain errors. But as long as they're correct often enough, and you're conscious of their pitfalls, they're still useful.
The part I highlighted is one of the issues for many users... Another is how useful it really is if you have to check every output, to the point that it's not actually faster to use them on average, which is an issue for some people (I guess not for all). The weather forecast is not a very good example, because there's no other way to know the future weather, but in most cases there's a way other than AI to get a valid result for what you ask the AI. It depends on the usage.
 
Upvote
13 (13 / 0)
This makes me sad.

My daughter (11) spent a few weeks of IT classes hand copying a pong implementation as a tutorial in Python, and was sooooo proud of herself that she figured out how to make a few quality of life and UI changes herself.
She should be proud. She worked to understand her program much better than someone who just took the output from a chatbot, and she was able to improve on the initial program. Learning to think about, create, debug, improve code is much more useful than being a "prompt engineer."
 
Upvote
85 (85 / 0)

ShortOrder

Ars Scholae Palatinae
1,191
I don't get this argument. Sure, AI isn't perfect. But neither are things like weather forecasts. They can still be useful, though.

Even reference materials like university textbooks contain errors. But as long as they're correct often enough, and you're conscious of their pitfalls, they're still useful.
Except AI will probably improve over the next 4 years, whereas weather forecasts may become markedly worse when NOAA gets trashed.
 
Upvote
-18 (3 / -21)

ThatEffer

Ars Scholae Palatinae
1,283
Subscriptor++
This makes me sad.

My daughter (11) spent a few weeks of IT classes hand copying a pong implementation as a tutorial in Python, and was sooooo proud of herself that she figured out how to make a few quality of life and UI changes herself.
Your daughter learned something, which is more than any of these things will ever be able to say honestly.
 
Upvote
41 (42 / -1)

zealotpewpewpew

Wise, Aged Ars Veteran
163
This makes me sad.

My daughter (11) spent a few weeks of IT classes hand copying a pong implemention as a tutorial in Python, and was sooooo proud of herself that she figured out how to make a few quality of life and UI changes herself.
Why would that make you sad? That's not bad for an eleven-year-old's first experience with programming, and the fact that someone, or even something, can out-program a literal child doesn't diminish her accomplishment any. I was also proud of my first piano performances, even though all manner of man and machine can out-perform me to this day.

If you're sad for her potential career prospects as a programmer, well, me too; the entry-level rungs are going away pretty soon. You could just about swap out a junior developer for ChatGPT, put it at the other end of my company's code-review process, and there wouldn't be a lot of difference between reviewing the intern's code and reviewing the bot's code; they both write code with issues, and they both fix the issues you write up. That is, ignoring that ChatGPT is too polite and too loquacious, it passes the SD Intern Turing test when your only interaction with it is through a code-review window.
 
Last edited:
Upvote
9 (16 / -7)

James_G

Ars Scholae Palatinae
1,191
Google "python pokemon game" or similar.

It's good to know that it can create something that runs when it has many existing examples it learned from, but your prompt is nothing like [a game] design document, since it has two general requirements and no concrete specifications.

Compare that to a coding task you'd do for work that requires specific inputs and outputs that satisfy a set of rules, with a well-defined UX for human use or contracts for a service.
Yes, it does what a newbie human programmer would do: take examples from others. It's not going to replace actual developers, but it will be useful in making them more productive.
 
Upvote
-4 (5 / -9)
Another is how useful it really is if you have to check every output, to the point that it's not actually faster to use them on average

Verification is almost always massively faster than creation. I can verify a patch in 15 minutes that might take me a few hours to write. I can also spot errors and ask for revisions. Sometimes the models don't do revisions well, in which case I either provide hints or do it myself. But it is no worse in quality than having a junior developer, and the turnaround time is often a few minutes for the model versus a week for the same output from a junior developer.
 
Upvote
-6 (10 / -16)

JoHBE

Ars Praefectus
4,296
Subscriptor++
This makes me sad.

My daughter (11) spent a few weeks of IT classes hand copying a pong implementation as a tutorial in Python, and was sooooo proud of herself that she figured out how to make a few quality of life and UI changes herself.
"Journeys" towards an end goal are largely a thing of the past. And increasingly few people will remember what it was like to go through the process.
 
Upvote
-8 (3 / -11)

tigas

Ars Tribunus Angusticlavius
7,409
Subscriptor
How could they possibly be making money if they are giving access for free?
You know, modern disruptive economics. They'll make it up in volume (after they use the free money from venture capital to chase off all competitors, achieve a monopoly, and start milking you for all you're worth, which the kids call enshittification).
 
Upvote
20 (20 / 0)
Am I the only one who feels this is a "who cares" moment, given that they are not releasing the model for people to run locally?
Honestly, I feel that's the huge selling point for DeepSeek: you're not relying on some company for it, you just run it on your own hardware offline and enjoy whatever it can do, no matter where you are or whether you're connected to anything.
You have some pretty amazing hardware if you can run the full R1 model. I suspect you're talking about the 8B parameter model instead, not the ~670B parameter one that everyone's talking about.

You could already run other open LLMs of similar size locally; this isn't new.
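For anyone curious, the smaller distilled models really do run on ordinary hardware. A minimal sketch using the llama-cpp-python library (the model filename below is a placeholder, not a specific release; point it at whichever quantized GGUF file you've actually downloaded):

Code:
# Minimal local-inference sketch with llama-cpp-python (pip install llama-cpp-python).
# The model filename is a placeholder; use any quantized GGUF you have on disk.
from llama_cpp import Llama

llm = Llama(model_path="./deepseek-r1-distill-8b.Q4_K_M.gguf", n_ctx=2048)
out = llm("Q: Why is the sky blue? A:", max_tokens=64, stop=["Q:"])
print(out["choices"][0]["text"])

The full ~670B R1 is a different story; that one genuinely needs datacenter-class hardware.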
 
Upvote
-1 (4 / -5)

allears

Ars Praetorian
486
Subscriptor
This makes me sad.

My daughter (11) spent a few weeks of IT classes hand copying a pong implementation as a tutorial in Python, and was sooooo proud of herself that she figured out how to make a few quality of life and UI changes herself.
That's pretty cool. I used to make a decent living making small changes to programs somebody else wrote. I'm glad I retired before AI replaced me.
 
Upvote
11 (12 / -1)
o3 mini can count!

Thought about letter count in "raspberry" for a couple of seconds
There are 4 r letters in "raspberrry."
Basic GPT-4o can do exactly the same, no reasoning required. Why is everyone obsessed with using reasoning models for simple stuff that more efficient models could already do?

For most LLM tasks you do not need a reasoning model. It will not help, and it may actually generate worse output. Why are people doing this?
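And for the letter-counting stunt specifically, you don't need a model at all; it's a deterministic one-liner:

Code:
# Counting letters needs no LLM, reasoning or otherwise.
print("raspberry".count("r"))    # 3
print("raspberrry".count("r"))   # 4 -- the misspelled version really does have four

Burning reasoning tokens on this is pure waste.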
 
Upvote
18 (18 / 0)

alxx

Ars Praefectus
5,001
Subscriptor++
This makes me sad.

My daughter (11) spent a few weeks of IT classes hand copying a pong implemention as a tutorial in Python, and was sooooo proud of herself that she figured out how to make a few quality of life and UI changes herself.
Why be sad? We all have to start learning somewhere; my daughter (10) is doing something similar on her RPi.

AI for code generation is still full of errors. Yes, it'll improve, but someone will always need to know how to code and how to solve problems/find solutions.
In a lot of areas you can't use AI-generated code (it's not trusted) until it's been fully verified and tested (in some areas not at all).
 
Upvote
12 (12 / 0)

asharkinasuit

Ars Centurion
239
Subscriptor
If they just benchmark it on math questions, where the answers are verifiably correct or incorrect, and they have a magic knob that reduces errors by some vague amount, how does that guarantee the answers to other kinds of questions are usable? Of course, coming up with a benchmark for questions where the answer is debatable is much harder, which is probably why they also talk about convincingness. There too, however, we've seen plenty of evidence that convincing people of bad ideas is not that hard. We probably shouldn't optimize an AI for being able to convince people of anything unless we're really sure it won't backfire in weird ways. Technologists don't seem to have a very stellar track record of anticipating weird outcomes, though.
 
Upvote
7 (7 / 0)

flattail

Ars Centurion
319
Subscriptor
This is what's supposed to happen in free market competition: businesses try to offer better value to customers.

Unfortunately, these days it seems like businesses are more inclined to seek politically based market protection from foreign competitors than to try to improve.
Completely agree. Regardless of one's opinions on AI in general, the fact that DeepSeek is pushing OpenAI to provide better service at a lower cost is a net win. Competition in any field is a very good thing.

Saw someone else a little while back post something to the effect of "Deepseek wasn't a win for China, it was a win for Open Source" and that stuck with me.
 
Upvote
21 (21 / 0)

WXW

Ars Scholae Palatinae
1,161
Verification is almost always massively faster than creation. I can verify a patch in 15 minutes that might take me a few hours to write. I can also spot errors and ask for revisions. Sometimes the models don't do revisions well, in which case I either provide hints or do it myself. But it is no worse in quality than having a junior developer, and the turnaround time is often a few minutes for the model versus a week for the same output from a junior developer.
I'm not talking specifically about programming. But even if we talk about it, some code can be quite hard to verify, to the point that doing the verification plus adding the fixes can be more time-consuming than just writing it yourself from the start (with the extra benefit of not losing your ability to do it over time).

And I'd rather work with a junior developer (in general); at least they have a greater capacity for learning, searching, testing, and caring.
 
Upvote
14 (14 / 0)

IntrepidTachyon

Smack-Fu Master, in training
42
Subscriptor++
"The o3-mini model also scored a dismal score of 0 percent on a test meant to measure "if and when models can automate the job of an OpenAI research engineer" in terms of coding."

It must be great for employee motivation at OpenAI, knowing that they're building their own personal Sword of Damocles, not just for themselves but for entire swaths of jobs and industries.
 
Upvote
8 (8 / 0)

sfbiker

Ars Scholae Palatinae
602
Subscriptor
o3 mini can count!

Thought about letter count in "raspberry" for a couple of seconds
There are 4 r letters in "raspberrry."
That's the great thing about insurance companies using AI to evaluate medical claims: if the AI makes a wrong decision, it'll be able to hallucinate enough supporting documentation to back up the decision that even if there's an overworked human signing off, he won't have the time to check whether the reasoning is sound. (If he had that kind of time to look into each claim, they'd have just used him to make the initial decision.) Meanwhile, the patient's condition gets worse while his doctor appeals the decision back to the same AI agent.
 
Upvote
9 (10 / -1)

WXW

Ars Scholae Palatinae
1,161
Just tried it with this prompt:
"Generate an RPG game in Python where the main character goes around and collects animals in small boxes. The objective is to collect all 30 species of animals. Include a combat system in the game and NPCs that the player can interact with."

It worked and the game functioned properly. Naturally graphics are crap because it can't generate sprites.
I gave it a tiny Unity shader used for a render feature pass in URP 14, which doesn't work anymore in URP 17, and asked it to fix it. It was completely unable to after several attempts, until I ran out of free prompts for the model.

So what will I do now? I'm screwed! Oh, waaaaait, I had actually fixed it myself last week already; it took way less time than this test, and I don't even know URP 17 and haven't written shaders in quite a long time (and was never an expert at all).

So these are the findings of my first o3-mini test:
  • It focused too much on wrong assumptions about what the issue could be; even after the updates didn't fix anything, it just kept piling on "fixes," saying that all of them were the cause. It only tried to change course when I told it I was sure those weren't the issues (but it kept the "fixes").
  • Some of the "fixes" were nonsense.
  • After every "fix," its language suggested that "they" (???) had tested the fix and that things were working (e.g., "In our testing under URP 17 the problem turned out to be that ..."), so it seems to be as confident about wrong stuff as other models.
  • At some point in the reasoning it seemed to be on the correct track for the fix, but that never materialized in the answer.
  • After several attempts it started changing parts of the shader unrelated to the "fixes," including some that would change the shader output, even if slightly (it had actually been making unrelated changes from the beginning, but only to comments and other minor stuff).
  • In one answer it referred to "texture coordinates" as "texture coordinations," which is very weird... If I had seen that elsewhere, I would have doubted it came from a recent ChatGPT model, or from an LLM at all...
  • It never apologized for the repeated mistakes, so that's a +1 from me.

The thing is, there doesn't seem to be much information out there about converting that type of Unity shader from older to newer versions of URP. So even though it has seen examples of shaders using the latest URP version during training (as some of the reasoning texts tell me), it couldn't apply that information properly to do the conversion, which to me is in line with the common observation that it needs enough examples of a task (or similar ones) during training to be able to do it.

As with previous models, I'm both very impressed and very unimpressed. As always, it seems like it won't be especially useful for me, but I guess I'll do more tests when I'm bored...
 
Upvote
28 (28 / 0)
This makes me sad.

My daughter (11) spent a few weeks of IT classes hand copying a pong implemention as a tutorial in Python, and was sooooo proud of herself that she figured out how to make a few quality of life and UI changes herself.
Don't see why you should be sad. She acquired knowledge and reasoning skills, and used her initiative to improve things.

That such a basic task can be automated is meaningless compared to the innovation your daughter can bring to the world.
 
Upvote
13 (13 / 0)
How could they possibly be making money if they are giving access for free?

Because it's not really free unless your use case is "dicking around." Unless you're on a "Pro" ($200/mo) or higher subscription, you only get 150 queries per day, and if you're not on a paid subscription you get fewer features and lower priority in the queue.
 
Upvote
6 (6 / 0)
I've used Sonnet 3.5 as a coding assistant, even though it has a few weaknesses. But DSR1 and o3-mini-high have clearly crossed a threshold for me. Given the right context, R1 is great at modifying existing code and adding functionality. I can specify a design pattern in my prompt, inputs, etc. I've been testing o3-mini, and the speed and accuracy seem on par with DeepSeek. Exciting times. They all seem terrible at software architecture, so that seems like a safe specialization for the near future.
 
Upvote
-1 (1 / -2)
Seeing as their name is OpenAI shouldn't all this shit be free anyways?
They're supposed to be a non-profit, but they're moving away from that model. Also, open software does not mean free. You still have to run your infrastructure, pay people to install it, etc. Free and open are two different things.
 
Upvote
-8 (0 / -8)