Can today’s AI video models accurately model how the real world works?

Lexus Lunar Lorry

Ars Scholae Palatinae
942
Subscriptor++
Experience with confabulating LLMs has also shown there's often a large gap between a model generating a correct result some of the time and an upgraded model generating a correct result all of the time.
As a friend said, LLMs are great for demos but terrible for production. Some coworkers proudly showed off their vibe-coded app, but then quietly admitted that it would take a year of work to get the app into a state where they would be willing to carry an on-call pager to support it.
 
Upvote
94 (96 / -2)

Hypatia

Ars Centurion
281
Subscriptor
One of the many crucial gaps between any “AI” and actually intelligent creatures is embodiment. This is more than simply “simulating a body” or even “having a body”. This is about being anchored to the world such that visceral changes to the world (which are constant) produce meaningful feedback to the entity in question.

This embedded feedback is vital because our intelligence isn’t just our conscious thought process. It flowers from our social and physical interactions with the actual world.

“AI” has nothing like that at all.

*edit for typo
 
Upvote
25 (28 / -3)

DaveSimmons

Ars Legatus Legionis
11,021
Yes, yes, and we'll have AGI by 2024 along with Full Self Driving by 2018.

XKCD Extrapolating: https://xkcd.com/605

"are on a path to becoming unified, generalist vision foundation models." But digging into the actual results of those experiments, the researchers seem to be grading today's video models on a bit of a curve and assuming future progress will smooth out many of today's highly inconsistent results.

Nope, "we'll improve from 8% to 99% any day now" is wildly overhyping what these LLMs will ever be able to do in a generalized way.

Yes, the next version will do better, but probably most of that will come from updating the training data and/or code managing the LLM to cover "rs in strawberry" directly. Not the mythical "zero shot" generalist reasoning.
 
Upvote
41 (44 / -3)

Chuckstar

Ars Legatus Legionis
37,478
Subscriptor
One of the many crucial gaps between any “AI” and actually intelligent creatures is embodiment. This is more than simply “simulating a body” or even “having a body”. This is about being anchored to the world such that visceral changes to the world (which are constant) produce meaningful feedback to the entity in question.

This embedded feedback is vital because our intelligence isn’t just our conscious thought process. It flowers from our social and physical interactions with the actual world.

“AI” has nothing like that at all.

*edit for typo
To me, the missing aspect is having a non-black-box model of the world, that can inform results and can be updated as necessary. It doesn’t have to interact with the world, just be able to understand it.
 
Upvote
6 (7 / -1)

Missing Minute

Wise, Aged Ars Veteran
1,386
It's a crazy bubble full of accounting tricks to keep it afloat. There's a flowchart of the relationships between AI companies, but it is so absurdly complex that it's unreadable, so here's a table:
1759338833401.png

It's all money going in circles, and there isn't enough VC money or budget available for these companies to build the infrastructure they say they need to build.
 
Upvote
60 (62 / -2)

Fred Duck

Ars Tribunus Angusticlavius
7,437
Kyle Orland said:
For the rest, the researchers write that "a success rate greater than 0 suggests that the model possesses the ability to solve the task."
If you give me twelve chances to answer a maths problem and I only solve it once, does that suggest I have the ability to solve the task?

What if I calculated it correctly once then missed the next eleven times?

Never bet against Ethan Nutterbutter.
 
Upvote
8 (10 / -2)
Post content hidden for low score. Show…
They do not "reason" about the world at large at all. No extension of this current tech will get them to reasoning.
Arguably, reasoning is an emergent property we don't understand, and so it might well emerge from whatever additional things they decide to do with the current tech.

That said, on a practical level I'm right there with you.
 
Upvote
0 (12 / -12)

kaleberg

Ars Scholae Palatinae
1,270
Subscriptor
This is right out of Danny Dunn and the Homework Machine, a 1958 children's book. Danny invents a machine to do his and his friend's homework, but, as it turns out, it takes more time and work to teach / program the machine than it saves time and work to just do the homework. (It's like the protagonist in The Happy Years, a 1950 movie, working so hard to cheat on his gerund versus gerundive Latin test that he actually learns some Latin.)

Maybe you can save time and effort using these AI systems, but it's going to take a surprising amount of time and effort to do so. If they count success one in twelve times as satisfactory for a tool, that suggests that the mean number of trials to success could be as high as twelve. That means using an AI tool at that success rate would typically require around a dozen trials to get right, and you'd have to realize the earlier ones were wrong, which might take some doing. Such systems are non-deterministic and, unlike, let's say, Python, completely opaque in operation. If you read the Python documentation - and I'm not picking on Python here - odds are you can figure out which edge case or conversion rule you are bumping into. You can't say RTFM for an LLM.
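
As a rough check on the trial-count arithmetic, here is a minimal sketch. It assumes each attempt is independent and succeeds with the same fixed probability p (a geometric model, which is my assumption and not something the paper states), and the function names are made up for illustration.

```python
import math


def expected_trials(p: float) -> float:
    """Mean number of independent attempts up to and including the first success."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return 1.0 / p


def median_trials(p: float) -> int:
    """Smallest n such that P(at least one success in n attempts) >= 0.5."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    if p == 1.0:
        return 1
    return math.ceil(math.log(0.5) / math.log(1.0 - p))


if __name__ == "__main__":
    p = 1 / 12  # assumed worst case: one success in twelve attempts
    print(f"mean attempts:   {expected_trials(p):.1f}")
    print(f"median attempts: {median_trials(p)}")
```

Under those assumptions a 1-in-12 success rate works out to a mean of 12 attempts and a median of 8. A tool that succeeds more often than the minimum bar would of course need fewer.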
 
Upvote
31 (33 / -2)

kaleberg

Ars Scholae Palatinae
1,270
Subscriptor
It's a crazy bubble full of accounting tricks to keep it afloat. There's a flowchart of the relationships between AI companies, but it is so absurdly complex that it's unreadable, so here's a table:
View attachment 119359
It's all money going in circles, and there isn't enough VC money or budget available for these companies to build the infrastructure they say they need to build.
Does this have eigenvalues and eigenvectors?
 
Upvote
1 (2 / -1)

kaleberg

Ars Scholae Palatinae
1,270
Subscriptor
I agree with you; I just want to play devil's advocate here - do human beings perform complex tasks correctly the first time, every time, without instructions? Imagine asking a five-year-old to fry an egg without breaking the yolk, for instance. Would they get it right 3/5 times? 3/10? How many times would they break the yolk when they crack the egg? And you know even the successful ones will probably have some shell.
Five-year-olds can generalize, and they can learn by watching others. Also, once five-year-olds learn to do something, they tend to be able to do it repeatedly. There is a developmental effect where people backslide in certain skills, but if the failure is stochastic, they should see a developmental expert or a neurologist.
 
Upvote
42 (42 / 0)

Decadre

Smack-Fu Master, in training
91
"For the researchers, though, all of the above examples aren't evidence of failure but instead a sign of the model's capabilities. To be listed under the paper's 'failure cases,' Veo 3 had to fail a tested task across all 12 trials, which happened in 16 of the 62 tasks tested. For the rest, the researchers write that 'a success rate greater than 0 suggests that the model possesses the ability to solve the task.'"

I think that sums up not only the researchers' thoughts on AI, but also those of a rather large segment of the population that views AI favorably.

Hey it got it right a couple of times, you just have to spam that CREATE button in Suno or reprompt Midjourney 132 times to get what you want....
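
To put rough numbers on both the "greater than 0" bar and the re-prompting, here is a small sketch. It assumes independent attempts with a fixed per-attempt success probability p, and the function names and the example values of p are hypothetical, not taken from the paper.

```python
import math


def prob_at_least_one(p: float, k: int) -> float:
    """Chance of seeing at least one success in k independent attempts."""
    return 1.0 - (1.0 - p) ** k


def attempts_for_confidence(p: float, confidence: float = 0.95) -> int:
    """Smallest number of attempts giving at least `confidence` odds of one success."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be in (0, 1)")
    if p == 1.0:
        return 1
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p))


if __name__ == "__main__":
    # A model that is right only 5% of the time (assumed value) still
    # clears the "at least one success in 12 trials" bar fairly often.
    print(f"{prob_at_least_one(0.05, 12):.2f}")
    # Re-prompts needed for 95% odds of one good output at a 1-in-12 rate.
    print(attempts_for_confidence(1 / 12, 0.95))
```

Under these assumptions, a model that succeeds on 5% of attempts would still clear the bar a bit under half the time, and a 1-in-12 tool would need about 35 presses of the button for 95% odds of one usable result.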
 
Upvote
24 (25 / -1)

Bongle

Ars Praefectus
4,497
Subscriptor++
Hey it got it right a couple of times, you just have to spam that CREATE button in Suno or reprompt Midjourney 132 times to get what you want....
AI CEO: "Man, we'll be as profitable as casinos! An addictive product that only produces a happy output intermittently, but keeps people clicking and clicking!"
Flunky: "Sir we lose money every time they press the button."
AI CEO: "Don't tell our investors"
 
Upvote
34 (34 / 0)

CheyenneWills

Wise, Aged Ars Veteran
119
I prompted Gemini (veo) with the following:

"Create a photo realistic video of a person shaking a rope. The rope is tied to a tree at waist height and is about 20 feet away"

Here is a description of the result.

-- There was a segment of rope tied to a tree at roughly waist height. The "other end" of the rope was lying on the ground not connected to anything.
-- There was a person holding a segment of rope that was stretched out in front of them and they were standing about 20 feet away from the tree. The "other end" of their rope was also lying on the ground not connected to anything.
-- The tree was orthogonal to the direction the person was facing.
-- As the person shook their segment of rope (which basically just flopped around on the ground in front of them), the segment tied to the tree also was flopping around.

So much for physical accuracy...

Postscript.. I tried my prompt a 2nd time and it got it correct... so much for consistency o_O
 
Upvote
13 (13 / 0)

radulov

Wise, Aged Ars Veteran
181
Subscriptor++
No, they are 2d models that we extrapolate 3d from. (well, kind of). Here is a good test: "a man punching his fist through a monitor." Try it in a few models. You might find it difficult to make "an arrow going through a laptop monitor" or similar, too. It can layer things on top of each other, but not intersect, since it's... a 2d model.
My oversimplification.
 
Upvote
6 (6 / 0)

hillspuck

Ars Scholae Palatinae
2,179
if it was thought impossible a few years ago you would ever solve a math problem, then yeah actually solving it once is a huge advancement
This is "solving" the problem like Google's old "I'm feeling lucky" button is "solving" search queries. Only it's like a search engine that fails 11 out of 12 times.
 
Upvote
15 (17 / -2)

GrimPloughman

Wise, Aged Ars Veteran
158
IMO, Vijayaraghavan’s robot (covered in another Ars article) is the closest thing to what we consider reasoning.

The robot has an arm, a gripper and a simple camera. It can manipulate blocks of different colors in response to simple prompts like “move red left,” “move blue right,” or “put red on blue.”

It's controlled by four interconnected neural networks that combine information from the language prompts, the arm's current position and the visual input from the camera to perform the actions.

This is, IMO, a very, very simplified model of actual reasoning, which is basically a real-life simulation of the world performed by our brains. We can simulate the world, e.g. to check what the most likely results of different actions are (this is what we consider logic or common sense), or we can examine the model itself to figure things out (what we consider introspection).
 
Upvote
2 (2 / 0)

Fatesrider

Ars Legatus Legionis
25,499
Subscriptor
While the researchers acknowledge that Veo 3's performance is "not yet perfect," they point to "consistent improvement from Veo 2 to Veo 3" in suggesting that future video models "will become general-purpose foundation models for vision, just as LLMs have for language." And the researchers do have some data on their side for this argument.
Oh, fuck you...

That exactly describes what proponents are saying about LLMs, their path and their trajectory, and very likely the same outcomes. A lot of work to produce mediocre to subpar results while cooking half the planet at the same time.
When asked to model a Bunsen burner turning on and burning a piece of paper, it similarly failed nine out of 12 times.
That video was patently wrong, too. Paper burns from the source of flame, meaning from the middle to the outsides, not evenly from the bottom up, and not that slowly, and with a lot more ash and blow-off from the hot gases in the center.

The hands with the ball? Seriously, one of the most telling giveaways it's AI is the slow speed of motion. AI can't handle 4K video, or even video at any resolution, in REAL TIME SPEED. So you get "robotic hands tossing ball in the air on the moon" footage, which would also be nonsense.

Not to mention the background details missing from all the videos.

IMHO, whatever progress they have now will do the same thing as LLMs and get slower and slower because there are ever more fine and complicated issues to overcome along the way. Much like the issue with AV cars, that last 2% is a killer, literally and figuratively. Those pesky minor details that are barely noticed provide the vast majority of reality to reality. And, to date, AI always fucks it up badly. They're still trying to make the major shit look real (and not really hitting that mark). That minor shit is what will do them in.

Betteridge's law applies here. Having seen a lot of AI-generated slop, it's ALWAYS easy to spot the mistakes. So using AI to model how the real world works will end in tears, too.
 
Upvote
16 (19 / -3)

Hypatia

Ars Centurion
281
Subscriptor
To me, the missing aspect is having a non-black-box model of the world, that can inform results and can be updated as necessary. It doesn’t have to interact with the world, just be able to understand it.
From my perspective, it seems that interacting with the world just is part of understanding the world. I totally agree about the non-black-box aspect, however.
 
Upvote
0 (0 / 0)

tigas

Ars Tribunus Angusticlavius
7,420
Subscriptor
No, they are 2d models that we extrapolate 3d from. (well, kind of). Here is a good test: "a man punching his fist through a monitor." Try it in a few models. You might find it difficult to make "an arrow going through a laptop monitor" or similar, too. It can layer things on top of each other, but not intersect, since it's... a 2d model.
My oversimplification.
They were mostly trained on pictures, not on the real world of which the pictures are a flat projection. They have no actual concept of what the objects look like in 3D space.
 
Upvote
5 (5 / 0)

S-T-R

Ars Scholae Palatinae
609
The hands with the ball? Seriously, one of the most telling giveaways it's AI is the slow speed of motion. AI can't handle 4K video, or even video at any resolution, in REAL TIME SPEED. So you get "robotic hands tossing ball in the air on the moon" footage, which would also be nonsense.

It also doesn't treat the two hands identically. The wrists operate differently (also nonsensically), with the right one (from the viewer's perspective) flexing what looks like a fixed support element. The left not so much. It doesn't have a concept of how robotic hands would actually work, so it subs in how human hands appear. Human skin flexes when arms and hands move, so it applied that to a machine.

Classic AI miscategorization/generalization errors. It both undergeneralizes (the hands should work the same but don't) and overgeneralizes (robot hands work like human hands).

To your point, I have seen near-real-time image generators. They make all of these issues worse because they simply don't have enough time. They're stuck at the original "Will Smith Eating Spaghetti" level, except they decohere within seconds too.
 
Upvote
8 (8 / 0)

Nilt

Ars Legatus Legionis
21,841
Subscriptor++
It's a crazy bubble full of accounting tricks to keep it afloat. There's a flowchart of the relationships between AI companies, but it is so absurdly complex that it's unreadable, so here's a table:
View attachment 119359
It's all money going in circles, and there isn't enough VC money or budget available for these companies to build the infrastructure they say they need to build.
This is the thing that really pisses me off about all this. As I've mentioned here before, I own and operate an IT consulting business, albeit one that's just me nowadays, and used to own and operate a small beverage company with a couple dozen employees including some drivers. In either of these businesses, if I got caught pulling the same sort of circular bullshit with finances, I'd have been charged with fraud and rightfully so. I really don't understand how this stuff is different. The claimed end game of making literally more money than the entire world's GDP is literally not possible! That should absolutely be prosecuted as fraud. It's OK, though, because the folks doing it are wealthy?! Christ, I hate this timeline so much.
 
Upvote
20 (20 / 0)

Don Reba

Ars Praefectus
3,349
Subscriptor++
I prompted Gemini (veo) with the following:

"Create a photo realistic video of a person shaking a rope. The rope is tied to a tree at waist height and is about 20 feet away"

Here is a description of the result.

-- There was a segment of rope tied to a tree at roughly waist height. The "other end" of the rope was lying on the ground not connected to anything.
-- There was a person holding a segment of rope that was stretched out in front of them and they were standing about 20 feet away from the tree. The "other end" of their rope was also lying on the ground not connected to anything.
-- The tree was orthogonal to the direction the person was facing.
-- As the person shook their segment of rope (which basically just flopped around on the ground in front of them), the segment tied to the tree also was flopping around.

So much for physical accuracy...

Postscript.. I tried my prompt a 2nd time and it got it correct... so much for consistency o_O
Well, those are hard instructions to follow. You asked for a person shaking a rope that is 20 feet away.
 
Upvote
8 (8 / 0)

Chuckstar

Ars Legatus Legionis
37,478
Subscriptor
I agree with you; I just want to play devil's advocate here - do human beings perform complex tasks correctly the first time, every time, without instructions? Imagine asking a five-year-old to fry an egg without breaking the yolk, for instance. Would they get it right 3/5 times? 3/10? How many times would they break the yolk when they crack the egg? And you know even the successful ones will probably have some shell.
Except fixed neural nets like these don't learn from their mistakes. They entirely lack that feedback loop.
 
Upvote
8 (9 / -1)