OpenAI hits back at DeepSeek with o3-mini reasoning model

Just tried it with this prompt:
"Generate an RPG game in Python where the main character goes around and collects animals in small boxes. The objective is to collect all 30 species of animals. Include a combat system in the game and NPCs that the player can interact with."

It worked and the game functioned properly. Naturally graphics are crap because it can't generate sprites.

Google "python pokemon game" or similar.

It's good to know that it can create something that runs when it has many existing examples it learned from, but your prompt is nothing like [a game] design document, since it has two general requirements and no concrete specifications.

Compare that to a coding task you'd do for work that requires specific inputs and outputs that satisfy a set of rules, with a well-defined UX for human use or contracts for a service.
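For scale, here's roughly the skeleton that prompt boils down to once you strip the flavor text: a generic collect-and-fight loop of the kind countless "python pokemon game" tutorials already cover. (A hypothetical minimal sketch, not o3-mini's actual output; every name in it is made up.)

Code:
# Hypothetical minimal sketch of the generic skeleton such a prompt maps onto;
# NOT o3-mini's actual output. All names are invented for illustration.
import random

SPECIES = [f"species_{i}" for i in range(1, 31)]  # "all 30 species of animals"

def combat(player_hp):
    """Trivial combat system: trade blows until one side drops."""
    animal_hp = random.randint(3, 8)
    while player_hp > 0 and animal_hp > 0:
        animal_hp -= random.randint(2, 5)      # player attacks
        if animal_hp <= 0:
            return player_hp, True             # caught: animal goes in a small box
        player_hp -= random.randint(0, 2)      # animal retaliates
    return player_hp, False

def main():
    player_hp, boxes = 20, set()
    while len(boxes) < len(SPECIES) and player_hp > 0:
        animal = random.choice(SPECIES)
        if animal in boxes:
            # "NPCs that the player can interact with"
            print("NPC: 'Keep at it!' (restores your HP)")
            player_hp = 20
            continue
        player_hp, caught = combat(player_hp)
        if caught:
            boxes.add(animal)
            print(f"Boxed {animal} ({len(boxes)}/{len(SPECIES)})")
    print("All 30 species collected!" if player_hp > 0 else "Game over.")

if __name__ == "__main__":
    main()

Two general requirements, no concrete specs: the model is free to fill every gap with whatever its training examples usually do.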
 
Last edited:
Upvote
30 (32 / -2)

WXW

Ars Scholae Palatinae
1,161
I don't get this argument. Sure, AI isn't perfect. But neither are things like weather forecasts. They can still be useful, though.

Even reference materials like university textbooks contain errors. But as long as they're correct often enough, and you're conscious of their pitfalls, they're still useful.
The part I highlighted is one of the issues for many users... Another is how useful it really is if you have to check every output, to the point that it's not actually faster to use them on average, which is an issue for some people (I guess not for all). The weather forecast is not a very good example, because there's no other way to know the future weather, but in most cases there's a way other than AI to get a valid result for what you ask the AI. It depends on the usage.
 
Upvote
13 (13 / 0)
This makes me sad.

My daughter (11) spent a few weeks of IT classes hand copying a pong implementation as a tutorial in Python, and was sooooo proud of herself that she figured out how to make a few quality of life and UI changes herself.
She should be proud. She worked to understand her program much better than someone who just took the output from a chatbot, and she was able to improve on the initial program. Learning to think about, create, debug, improve code is much more useful than being a "prompt engineer."
 
Upvote
85 (85 / 0)

ShortOrder

Ars Scholae Palatinae
1,191
I don't get this argument. Sure, AI isn't perfect. But neither are things like weather forecasts. They can still be useful, though.

Even reference materials like university textbooks contain errors. But as long as they're correct often enough, and you're conscious of their pitfalls, they're still useful.
Except AI will probably improve over the next 4 years, whereas weather forecasts may become markedly worse when NOAA gets trashed.
 
Upvote
-18 (3 / -21)

ThatEffer

Ars Scholae Palatinae
1,283
Subscriptor++
This makes me sad.

My daughter (11) spent a few weeks of IT classes hand copying a pong implementation as a tutorial in Python, and was sooooo proud of herself that she figured out how to make a few quality of life and UI changes herself.
Your daughter learned something, which is more than any of these things will ever be able to say honestly.
 
Upvote
41 (42 / -1)

zealotpewpewpew

Wise, Aged Ars Veteran
163
This makes me sad.

My daughter (11) spent a few weeks of IT classes hand copying a pong implemention as a tutorial in Python, and was sooooo proud of herself that she figured out how to make a few quality of life and UI changes herself.
Why would that make you sad? That's not bad for an eleven-year-old's first experience with programming, and the fact that someone, or even something, can out-program a literal child doesn't diminish her accomplishment any. I was also proud of my first piano performances, even though all manner of man and machine can out-perform me to this day.

If you're sad for her potential career prospects as a programmer, well, me too; the entry-level rungs are going away pretty soon. You could just about swap out a junior developer for ChatGPT, put it at the other end of my company's code-review process, and there wouldn't be a lot of difference between reviewing the intern's code and reviewing the bot's code; they both write code with issues, and they both fix the issues you write up. That is, ignoring that ChatGPT is too polite and too loquacious, it passes the SD Intern Turing test when your only interaction with it is through a code-review window.
 
Last edited:
Upvote
9 (16 / -7)

James_G

Ars Scholae Palatinae
1,191
Google "python pokemon game" or similar.

It's good to know that it can create something that runs when it has many existing examples it learned from, but your prompt is nothing like [a game] design document, since it has two general requirements and no concrete specifications.

Compare that to a coding task you'd do for work that requires specific inputs and outputs that satisfy a set of rules, with a well-defined UX for human use or contracts for a service.
Yes, it does what a newbie human programmer would do: take examples from others. It's not going to replace actual developers, but it will be useful in making them more productive.
 
Upvote
-4 (5 / -9)
Another is how useful it really is if you have to check every output, to the point that it's not actually faster to use them on average

Verification is almost always massively faster than creation. I can verify a patch in 15 minutes that might take me a few hours to write. I can also spot errors and ask for revisions. Sometimes the models don't do revisions well, in which case I either provide hints or do it myself. But it is no worse in quality than having a junior developer, and the turnaround time is often a few minutes for the model versus a week for the same output from a junior developer.
 
Upvote
-6 (10 / -16)

JoHBE

Ars Praefectus
4,296
Subscriptor++
This makes me sad.

My daughter (11) spent a few weeks of IT classes hand copying a pong implementation as a tutorial in Python, and was sooooo proud of herself that she figured out how to make a few quality of life and UI changes herself.
"Journeys" towards an end goal are largely a thing of the past. And increasingly few people will remember what it was like to go through the process.
 
Upvote
-8 (3 / -11)

tigas

Ars Tribunus Angusticlavius
7,409
Subscriptor
How could they possibly be making money if they are giving access for free?
You know, modern disruptive economics. They'll make it up in volume (after they use the free money from venture capital to chase off all competitors, achieve a monopoly, and start milking you for all you're worth, which the kids call enshittification).
 
Upvote
20 (20 / 0)
Am I the only one who feels this is a "who cares" moment, given that they are not releasing the model for people to run locally?
Honestly, I feel that's the huge selling point for DeepSeek: you're not relying on some company for it, you just run it on your own hardware offline and enjoy whatever it can do, no matter where you are or whether you're connected to anything.
You have some pretty amazing hardware if you can run the full R1 model. I suspect you're talking about the 8B parameter model instead, not the ~670B parameter one that everyone's talking about.

You could already run other open LLMs of similar size locally; this isn't new.
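For anyone curious, the smaller distilled models really do run on ordinary hardware. A minimal sketch using the llama-cpp-python library (the model filename below is a placeholder, not a specific release; point it at whichever quantized GGUF file you've actually downloaded):

Code:
# Minimal local-inference sketch with llama-cpp-python (pip install llama-cpp-python).
# The model filename is a placeholder; use any quantized GGUF you have on disk.
from llama_cpp import Llama

llm = Llama(model_path="./deepseek-r1-distill-8b.Q4_K_M.gguf", n_ctx=2048)
out = llm("Q: Why is the sky blue? A:", max_tokens=64, stop=["Q:"])
print(out["choices"][0]["text"])

The full ~670B R1 is a different story; that one genuinely needs datacenter-class hardware.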
 
Upvote
-1 (4 / -5)

allears

Ars Praetorian
486
Subscriptor
This makes me sad.

My daughter (11) spent a few weeks of IT classes hand copying a pong implementation as a tutorial in Python, and was sooooo proud of herself that she figured out how to make a few quality of life and UI changes herself.
That's pretty cool. I used to make a decent living making small changes to programs somebody else wrote. I'm glad I retired before AI replaced me.
 
Upvote
11 (12 / -1)
o3 mini can count!

Thought about letter count in "raspberry" for a couple of seconds
There are 4 r letters in "raspberrry."
Basic GPT-4o can do exactly the same, no reasoning required. Why is everyone obsessed with using reasoning models for simple stuff that more efficient models could already do?

For most LLM tasks you do not need a reasoning model. It will not help, and it may actually generate worse output. Why are people doing this?
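And for the letter-counting stunt specifically, you don't need a model at all; it's a deterministic one-liner:

Code:
# Counting letters needs no LLM, reasoning or otherwise.
print("raspberry".count("r"))    # 3
print("raspberrry".count("r"))   # 4 -- the misspelled version really does have four

Burning reasoning tokens on this is pure waste.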
 
Upvote
18 (18 / 0)

alxx

Ars Praefectus
5,001
Subscriptor++
This makes me sad.

My daughter (11) spent a few weeks of IT classes hand copying a pong implemention as a tutorial in Python, and was sooooo proud of herself that she figured out how to make a few quality of life and UI changes herself.
Why be sad? We all have to start learning somewhere; my daughter (10) is doing something similar on her RPi.

AI for code generation is still full of errors. Yes, it'll improve, but someone will always need to know how to code and how to solve problems/find solutions.
In a lot of areas you can't use AI-generated code (it's not trusted) until it's been fully verified and tested (in some areas not at all).
 
Upvote
12 (12 / 0)

asharkinasuit

Ars Centurion
239
Subscriptor
If they just benchmark it on math questions, where the answers are verifiably correct or incorrect, and they have a magic knob that reduces errors by some vague amount, how does that guarantee the answers to other kinds of questions are usable? Of course, coming up with a benchmark for questions where the answer is debatable is much harder, which is probably why they also talk about convincingness. There too, however, we've seen plenty of evidence that convincing people of bad ideas is not that hard. We probably shouldn't optimize an AI for being able to convince people of anything unless we're really sure it won't backfire in weird ways. Technologists don't seem to have a very stellar track record of anticipating weird outcomes, though.
 
Upvote
7 (7 / 0)

flattail

Ars Centurion
319
Subscriptor
This is what's supposed to happen in free market competition: businesses try to offer better value to customers.

Unfortunately, these days it seems like businesses are more inclined to seek politically based market protection from foreign competitors than to try to improve.
Completely agree. Regardless of one's opinions on AI in general, the fact that DeepSeek is pushing OpenAI to provide better service at a lower cost is a net win. Competition in any field is a very good thing.

Saw someone else a little while back post something to the effect of "Deepseek wasn't a win for China, it was a win for Open Source" and that stuck with me.
 
Upvote
21 (21 / 0)

WXW

Ars Scholae Palatinae
1,161
Verification is almost always massively faster than creation. I can verify a patch in 15 minutes that might take me a few hours to write. I can also spot errors and ask for revisions. Sometimes the models don't do revisions well, in which case I either provide hints or do it myself. But it is no worse in quality than having a junior developer, and the turnaround time is often a few minutes for the model versus a week for the same output from a junior developer.
I'm not talking specifically about programming. But even if we talk about it, some code can be quite hard to verify, to the point that doing the verification plus adding the fixes can be more time-consuming than just writing it yourself from the start (with the extra benefit of not losing your ability to do it over time).

And I'd rather work with a junior developer (in general); at least they have a greater capacity for learning, searching, testing, and caring.
 
Upvote
14 (14 / 0)

IntrepidTachyon

Smack-Fu Master, in training
42
Subscriptor++
"The o3-mini model also scored a dismal score of 0 percent on a test meant to measure "if and when models can automate the job of an OpenAI research engineer" in terms of coding."

It must be great for employee motivation at OpenAI, knowing that they're building their own personal Sword of Damocles, not just for themselves but for entire swaths of jobs and industries.
 
Upvote
8 (8 / 0)

sfbiker

Ars Scholae Palatinae
602
Subscriptor
o3 mini can count!

Thought about letter count in "raspberry" for a couple of seconds
There are 4 r letters in "raspberrry."
That's the great thing about insurance companies using AI to evaluate medical claims: if the AI makes a wrong decision, it'll be able to hallucinate enough supporting documentation to back up the decision that even if there's an overworked human signing off, he won't have the time to check whether the reasoning is sound. (If he had that kind of time to look into each claim, they'd have just used him to make the initial decision.) Meanwhile, the patient's condition gets worse while his doctor appeals the decision back to the same AI agent.
 
Upvote
9 (10 / -1)

WXW

Ars Scholae Palatinae
1,161
Just tried it with this prompt:
"Generate an RPG game in Python where the main character goes around and collects animals in small boxes. The objective is to collect all 30 species of animals. Include a combat system in the game and NPCs that the player can interact with."

It worked and the game functioned properly. Naturally graphics are crap because it can't generate sprites.
I gave it a tiny Unity shader used for a render feature pass in URP 14, which doesn't work anymore in URP 17, and asked it to fix it. It was completely unable to after several attempts, until I ran out of free prompts for the model.

So what will I do now? I'm screwed! Oh, waaaaait, I had actually fixed it myself last week already; it took way less time than this test, and I don't even know URP 17 and haven't written shaders in quite a long time (and was never an expert at all).

So these are the findings of my first o3-mini test:
  • It focused too much on wrong assumptions about what the issue could be; even after the updates didn't fix anything, it just kept piling on "fixes," saying that all of them were the cause. It only tried to change course when I told it I was sure those weren't the issues (but it kept the "fixes").
  • Some of the "fixes" were nonsense.
  • After every "fix," its language suggested that "they" (???) had tested the fix and that things were working (e.g., "In our testing under URP 17 the problem turned out to be that ..."), so it seems to be as confident about wrong stuff as other models.
  • At some point in the reasoning it seemed to be on the correct track for the fix, but that never materialized in the answer.
  • After several attempts it started changing parts of the shader unrelated to the "fixes," including some that would change the shader output, even if slightly (it had actually been making unrelated changes from the beginning, but only to comments and other minor stuff).
  • In one answer it referred to "texture coordinates" as "texture coordinations," which is very weird... If I had seen that elsewhere, I would have doubted it came from a recent ChatGPT model, or from an LLM at all...
  • It never apologized for the repeated mistakes, so that's a +1 from me.

The thing is, there doesn't seem to be much information out there about converting that type of Unity shader from older to newer versions of URP. So even though it has seen examples of shaders using the latest URP version during training (as some of the reasoning texts tell me), it couldn't apply that information properly to do the conversion, which to me is in line with the common observation that it needs enough examples of a task (or similar ones) during training to be able to do it.

As with previous models, I'm both very impressed and very unimpressed. As always, it seems like it won't be especially useful for me, but I guess I'll do more tests when I'm bored...
 
Upvote
28 (28 / 0)
This makes me sad.

My daughter (11) spent a few weeks of IT classes hand copying a pong implemention as a tutorial in Python, and was sooooo proud of herself that she figured out how to make a few quality of life and UI changes herself.
Don't see why you should be sad. She acquired knowledge and reasoning skills, and used her initiative to improve things.

That such a basic task can be automated is meaningless compared to the innovation your daughter can bring to the world.
 
Upvote
13 (13 / 0)
How could they possibly be making money if they are giving access for free?

Because it's not really free unless your use case is "dicking around." Unless you're on a "Pro" ($200/mo) or higher subscription, you only get 150 queries per day, and if you're not on a paid subscription you get fewer features and lower priority in the queue.
 
Upvote
6 (6 / 0)
I've used Sonnet 3.5 as a coding assistant, even though it has a few weaknesses. But DSR1 and o3-mini-high have clearly crossed a threshold for me. Given the right context, R1 is great at modifying existing code and adding functionality. I can specify a design pattern in my prompt, inputs, etc. I've been testing o3-mini, and the speed and accuracy seem on par with DeepSeek. Exciting times. They all seem terrible at software architecture, so that seems like a safe specialization for the near future.
 
Upvote
-1 (1 / -2)
Seeing as their name is OpenAI shouldn't all this shit be free anyways?
They're supposed to be a non-profit, but they're moving away from that model. Also, open software does not mean free. You still have to run your infrastructure, pay people to install it, etc. Free and open are two different things.
 
Upvote
-8 (0 / -8)