Can today’s AI video models accurately model how the real world works?

Waco · Oct 1, 2025

They do not "reason" about the world at large at all. No extension of this current tech will get them to reasoning.

Lexus Lunar Lorry · Oct 1, 2025

Experience with confabulating LLMs has also shown there's often a large gap between a model generating a correct result some of the time and an upgraded model generating a correct result all of the time.

As a friend said, LLMs are great for demos but terrible for production. Some coworkers proudly showed off their vibe-coded app, but then quietly admitted that it would take a year of work to get the app into a state where they would be willing to carry an on-call pager to support it.

gothmog1114 · Oct 1, 2025

Feels a bit like someone gave the infinite monkeys on infinite typewriters a bit of a nudge in the right direction, but still not going to get Shakespeare

MichaelLC · Oct 1, 2025

Can I have these folks do my next performance review? I'd be a rockstar.

picklefactory · Oct 1, 2025

Lexus Lunar Lorry said:
an upgraded model generating a correct result all of the time.

And has this happened at some point?

Hypatia · Oct 1, 2025

One of the many crucial gaps between any “AI” and actually intelligent creatures is embodiment. This is more than simply “simulating a body” or even “having a body”. This is about being anchored to the world such that visceral changes to the world (which are constant) produce meaningful feedback to the entity in question.

This embedded feedback is vital because our intelligence isn’t just our conscious thought process. It flowers from our social and physical interactions with the actual world.

“AI” has nothing like that at all.

*edit for typo

Chuckstar · Oct 1, 2025

So their attitude is that if it’s right one in twelve times, it “understands” the world? What about the other eleven times…?

DaveSimmons · Oct 1, 2025

Yes, yes, and we'll have AGI by 2024 along with Full Self Driving by 2018.

XKCD Extrapolating: https://xkcd.com/605

"are on a path to becoming unified, generalist vision foundation models." But digging into the actual results of those experiments, the researchers seem to be grading today's video models on a bit of a curve and assuming future progress will smooth out many of today's highly inconsistent results.

Nope, "we'll improve from 8% to 99% any day now" is wildly overhyping what these LLMs will ever be able to do in a generalized way.

Yes, the next version will do better, but probably most of that will come from updating the training data and/or code managing the LLM to cover "rs in strawberry" directly. Not the mythical "zero shot" generalist reasoning.

Chuckstar · Oct 1, 2025

Hypatia said:
One of the many crucial gaps between any “AI” and actually intelligent creatures is embodiment. This is more than simply “simulating a body” or even “having a body”. This is about being anchored to the world such that visceral changes to the world (which are constant) produce meaningful feedback to the entity in question.

This embedded feedback is vital because our intelligence isn’t just our conscious thought process. It flowers from our social and physical interactions with the actual world.

“AI” has nothing like that at all.

*edit for typo

To me, the missing aspect is having a non-black-box model of the world, that can inform results and can be updated as necessary. It doesn’t have to interact with the world, just be able to understand it.

Missing Minute · Oct 1, 2025

It's a crazy bubble full of accounting tricks to keep it afloat, there's a flowchart of the relationships between AI companies but it is so absurdly complex that it's unreadable, here's a table:

It's all money going in circles and there isn't enough available VC money or budgets available for these companies to build the infrastructure they say they need to build.

Fred Duck · Oct 1, 2025

Kyle Orland said:
For the rest, the researchers write that "a success rate greater than 0 suggests that the model possesses the ability to solve the task."

If you give me twelve chances to answer a maths problem and I only solve it once, does that suggest I have the ability to solve the task?

What if I calculated it correctly once then missed the next eleven times?

Never bet against Ethan Nutterbutter.

BobbyBadoing · Oct 1, 2025

At this rate they’ll never be able to kill all humans

Chuckstar · Oct 1, 2025

omniron said:
if it was thought impossible a few years ago you would ever solve a math problem, then yeah actually solving it once is a huge advancement

Wow… sprinting with those goalposts, and still can’t get there before the ball does.

akw0088 · Oct 1, 2025

Think we need an ILSVRC equivalent for jar opening?

Emotion_ology · Oct 1, 2025

Waco said:
They do not "reason" about the world at large at all. No extension of this current tech will get them to reasoning.

Arguably, reasoning is an emergent property we don't understand, and so it might well emerge from whatever additional things they decide to do with the current tech.

That said, on a practical level I'm right there with you.

ChronocidalManiac · Oct 1, 2025

gothmog1114 said:
Feels a bit like someone gave the infinite monkeys on infinite typewriters a bit of a nudge in the right direction, but still not going to get Shakespeare

Which is unfortunate for the AI, since his works are public domain.

tigas · Oct 1, 2025

Kids spend their first years playing with balls, cubes, shapes, jugs of water, learning about all that stuff, unprompted.

kaleberg · Oct 1, 2025

This is right out of Danny Dunn and the Homework Machine, a 1958 children's book. Danny invents a machine to do his and his friend's homework, but, as it turns out, it takes more time and work to teach / program the machine than it saves time and work to just do the homework. (It's like the protagonist in The Happy Years, a 1950 movie, working so hard to cheat on his gerund versus gerundive Latin test that he actually learns some Latin.)

Maybe you can save time and effort using these AI systems, but it's going to take a surprising amount of time and effort to do so. If they count success one in twelve times as satisfactory for a tool, that suggests that the mean number of trials to success is about twelve, assuming each attempt is independent. That means using an AI tool would typically require around a dozen trials to get right, and you'd have to realize the earlier ones were wrong, which might take some doing. Such systems are non-deterministic and, unlike, let's say, Python, completely opaque in operation. If you read the Python documentation - and I'm not picking on Python here - odds are you can figure out which edge case or conversion rule you are bumping into. You can't say RTFM for an LLM.
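
For anyone who wants to check the arithmetic, here is a minimal sketch. It assumes every attempt is independent and succeeds with the same fixed probability, which is my simplification and not something the paper claims. Under that assumption the number of attempts until the first success follows a geometric distribution. The function name is made up for illustration.

```python
import math

def attempts_until_first_success(successes, trials):
    """Mean and median attempts until the first success.

    Assumes independent attempts with a fixed success probability
    p = successes / trials, where 0 < p < 1 (geometric distribution).
    """
    p = successes / trials
    mean = trials / successes  # equals 1 / p
    # Smallest k such that the chance of at least one success in k tries is >= 50%
    median = math.ceil(math.log(0.5) / math.log(1 - p))
    return mean, median

mean, median = attempts_until_first_success(1, 12)
print(mean, median)  # prints: 12.0 8
```

The mean and the median differ because the distribution has a long tail: a few unlucky runs take far more attempts than the typical one.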

cleek · Oct 1, 2025

omniron said:
if it was thought impossible a few years ago you would ever solve a math problem, then yeah actually solving it once is a huge advancement

the fact that "34 + 23 = $rand()" is sometimes correct doesn't mean rand() solved the problem.
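
To put a rough number on that, here is a small sketch. It assumes a hypothetical guesser that picks uniformly among 100 possible answers; the 100 is my own illustrative figure, and only the 12 tries come from the paper's setup. It computes how often such a guesser would clear a "success rate greater than 0" bar.

```python
def chance_of_lucky_hit(n_possible_answers, n_trials):
    """Probability that a uniform random guesser is right at least once.

    Assumes independent guesses, each correct with probability
    1 / n_possible_answers (an illustrative assumption).
    """
    p_single = 1 / n_possible_answers
    return 1 - (1 - p_single) ** n_trials

print(round(chance_of_lucky_hit(100, 12), 3))  # prints: 0.114
```

So under these assumptions a pure guesser would count as "possessing the ability" on roughly one task in nine.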

Joel622 · Oct 1, 2025

Betteridge's law stays winning.

kaleberg · Oct 1, 2025

Missing Minute said:
It's a crazy bubble full of accounting tricks to keep it afloat, there's a flowchart of the relationships between AI companies but it is so absurdly complex that it's unreadable, here's a table:
It's all money going in circles and there isn't enough available VC money or budgets available for these companies to build the infrastructure they say they need to build.

Does this have eigenvalues and eigenvectors?

atomic.banjo · Oct 1, 2025

picklefactory said:
And has this happened at some point?

It’s coming soon, we promise! We just need another $500 billion to get there.

kaleberg · Oct 1, 2025

JMTronicHobbyist said:
I agree with you I just want to play devil's advocate here - do human beings perform complex tasks correctly the first time, every time without instructions? Imagine asking a five-year-old to fry an egg without breaking the yolk, for instance. Would they get it right 3/5 times? 3/10? How many times would they break the yolk when they crack the egg? And you know even the successful ones will probably have some shell.

Five-year-olds can generalize, and they can learn by watching others. Also, once five-year-olds learn to do something, they tend to be able to do it repeatedly. There is a developmental effect where people backslide in certain skills, but if the failure is stochastic, they should see a developmental expert or a neurologist.

Decadre · Oct 1, 2025

"For the researchers, though, all of the above examples aren't evidence of failure but instead a sign of the model's capabilities. To be listed under the paper's 'failure cases,' Veo 3 had to fail a tested task across all 12 trials, which happened in 16 of the 62 tasks tested. For the rest, the researchers write that 'a success rate greater than 0 suggests that the model possesses the ability to solve the task.'"

I think that sums up not only the researchers' thoughts on AI, but also those of a rather large segment of the population that views AI favorably.

Hey it got it right a couple of times, you just have to spam that CREATE button in Suno or reprompt Midjourney 132 times to get what you want....

Bongle · Oct 1, 2025

Decadre said:
Hey it got it right a couple of times, you just have to spam that CREATE button in Suno or reprompt Midjourney 132 times to get what you want....

AI CEO: "Man, we'll be as profitable as casinos! An addictive product that only produces a happy output intermittently, but keeps people clicking and clicking!"
Flunky: "Sir we lose money every time they press the button."
AI CEO: "Don't tell our investors"

CheyenneWills · Oct 1, 2025

I prompted Gemini (veo) with the following:

"Create a photo realistic video of a person shaking a rope. The rope is tied to a tree at waist height and is about 20 feet away"

Here is a description of the result.

-- There was a segment of rope tied to a tree at roughly waist height. The "other end" of the rope was lying on the ground not connected to anything.
-- There was a person holding a segment of rope that was stretched out in front of them and they were standing about 20 feet away from the tree. The "other end" of their rope was also lying on the ground not connected to anything.
-- The tree was orthogonal to the direction the person was facing.
-- As the person shook their segment of rope (which basically just flopped around on the ground in front of them), the segment tied to the tree also was flopping around.

So much for physical accuracy...

Postscript.. I tried my prompt a 2nd time and it got it correct... so much for consistency

radulov · Oct 1, 2025

No, they are 2d models that we extrapolate 3d from. (well, kind of). Here is a good test: "a man punching his fist through a monitor." Try it in a few models. You might find it difficult to make "an arrow going through a laptop monitor" or similar, too. It can layer things on top of each other, but not intersect, since it's... a 2d model.
My oversimplification.

hillspuck · Oct 1, 2025

omniron said:
if it was thought impossible a few years ago you would ever solve a math problem, then yeah actually solving it once is a huge advancement

This is "solving" the problem like Google's old "I'm feeling lucky" button is "solving" search queries. Only it's like a search engine that fails 11 out of 12 times.

GrimPloughman · Oct 1, 2025

Imo, Vijayaraghavan’s robot (covered in another Ars article) is the closest to what we consider reasoning.

The robot has an arm, a gripper and a simple camera. It can manipulate blocks of different colors in response to simple prompts like “move red left,” “move blue right,” or “put red on blue.”

It's controlled by four interconnected neural networks that combine the information about language prompts, the arm's current position and the visual input from the camera to perform the actions.

This is imo a very, very simplified model of actual reasoning, which is basically a real-life simulation of the world performed by our brains. We can simulate the world, e.g. to check what the most expected results of different actions are (this is what we consider logic or common sense), or we can examine the model itself to figure things out (we consider it introspection).

Fatesrider · Oct 1, 2025

While the researchers acknowledge that Veo 3's performance is "not yet perfect," they point to "consistent improvement from Veo 2 to Veo 3" in suggesting that future video models "will become general-purpose foundation models for vision, just as LLMs have for language." And the researchers do have some data on their side for this argument.

Oh, fuck you...

That exactly describes what proponents are saying about LLMs, their path, trajectory and very likely the same outcomes. A lot of work to produce mediocre to subpar results while cooking half the planet at the same time.

When asked to model a Bunsen burner turning on and burning a piece of paper, it similarly failed nine out of 12 times.

That video was patently wrong, too. Paper burns from the source of flame, meaning the middle to the outsides, not evenly from the bottom up, and not that slowly, and with a lot more ash and blow-off from the hot gasses in the center.

The hands with the ball? Seriously, one of the most telling giveaways it's AI is the slow speed of motion. AI can't handle 4K video, or even video at any resolution, in REAL TIME SPEED. So you get "robotic hands tossing ball in the air on the moon" footage, which would also be nonsense.

Not to mention the missing background details there are in all the videos.

IMHO, whatever progress they have now will do the same thing as LLMs and get slower and slower because there are ever more fine and complicated issues to overcome along the way. Much like the issue with AV cars, that last 2% is a killer, literally and figuratively. Those pesky minor details that are barely noticed provide the vast majority of reality to reality. And, to date, AI always fucks it up badly. They're still trying to make the major shit look real (and not really hitting that mark). That minor shit is what will do them in.

Betteridge's law applies here. Having seen a lot of AI-generated slop, it's ALWAYS easy to spot the mistakes. So using AI to model how the real world works will end in tears, too.

Hypatia · Oct 1, 2025

Chuckstar said:
To me, the missing aspect is having a non-black-box model of the world, that can inform results and can be updated as necessary. It doesn’t have to interact with the world, just be able to understand it.

From my perspective, it seems that interacting with the world just is part of understanding the world. I totally agree about the non-black-box aspect, however.

tigas · Oct 1, 2025

radulov said:
No, they are 2d models that we extrapolate 3d from. (well, kind of). Here is a good test: "a man punching his fist through a monitor." Try it in a few models. You might find it difficult to make "an arrow going through a laptop monitor" or similar, too. It can layer things on top of each other, but not intersect, since it's... a 2d model.
My oversimplification.

They were mostly trained on pictures, not on the real world of which the pictures are a flat projection. They have no actual concept of what the objects look like in 3D space.

Steve austin · Oct 1, 2025

I didn’t sink any baskets last week. I managed 1 this week. I figure in 2 weeks I’ll be ready for the NBA!

S-T-R · Oct 1, 2025

Fatesrider said:
The hands with the ball? Seriously, one of the most telling giveaways it's AI is the slow speed of motion. AI can't handle 4K video, or even video at any resolution, in REAL TIME SPEED. So you get "robotic hands tossing ball in the air on the moon" footage, which would also be nonsense.

It also doesn't treat the two hands identically. The wrists operate differently (also nonsensically), with the right one (from the viewer's perspective) flexing what looks like a fixed support element. The left not so much. It doesn't have a concept of how robotic hands would actually work, so it subs in how human hands appear. Human skin flexes when arms and hands move, so it applied that to a machine.

Classic AI miscategorization/generalization errors. It both undergeneralizes (the hands should work the same but don't) and overgeneralizes (robot hands work like human hands).

To your point, I have seen near-real-time image generators. They make all of these issues worse because they simply don't have enough time. They're stuck at the original "Will Smith Eating Spaghetti" level, except they decohere within seconds too.

Nilt · Oct 1, 2025

Missing Minute said:
It's a crazy bubble full of accounting tricks to keep it afloat, there's a flowchart of the relationships between AI companies but it is so absurdly complex that it's unreadable, here's a table:
It's all money going in circles and there isn't enough available VC money or budgets available for these companies to build the infrastructure they say they need to build.

This is the thing that really pisses me off about all this. As I've mentioned here before, I own and operate an IT consulting business, albeit one that's just me nowadays, and used to own and operate a small beverage company with a couple dozen employees including some drivers. In either of these businesses, if I got caught pulling the same sort of circular bullshit with finances, I'd have been charged with fraud and rightfully so. I really don't understand how this stuff is different. The claimed end game of making literally more money than the entire world's GDP is literally not possible! That should absolutely be prosecuted as fraud. It's OK, though, because the folks doing it are wealthy?! Christ, I hate this timeline so much.

Don Reba · Oct 1, 2025

CheyenneWills said:
I prompted Gemini (veo) with the following:

"Create a photo realistic video of a person shaking a rope. The rope is tied to a tree at waist height and is about 20 feet away"

Here is a description of the result.

-- There was a segment of rope tied to a tree at roughly waist height. The "other end" of the rope was lying on the ground not connected to anything.
-- There was a person holding a segment of rope that was stretched out in front of them and they were standing about 20 feet away from the tree. The "other end" of their rope was also lying on the ground not connected to anything.
-- The tree was orthogonal to the direction the person was facing.
-- As the person shook their segment of rope (which basically just flopped around on the ground in front of them), the segment tied to the tree also was flopping around.

So much for physical accuracy...

Postscript.. I tried my prompt a 2nd time and it got it correct... so much for consistency

Well, those are hard instructions to follow. You asked for a person shaking a rope that is 20 feet away.

Chuckstar · Oct 1, 2025

JMTronicHobbyist said:
I agree with you I just want to play devil's advocate here - do human beings perform complex tasks correctly the first time, every time without instructions? Imagine asking a five-year-old to fry an egg without breaking the yolk, for instance. Would they get it right 3/5 times? 3/10? How many times would they break the yolk when they crack the egg? And you know even the successful ones will probably have some shell.

Except fixed neural nets like these don't learn from their mistakes. They entirely lack that feedback loop.
