In June, Runway debuted a new text-to-video synthesis model called Gen-3 Alpha. It converts written descriptions called “prompts” into HD video clips without sound. We’ve since had a chance to use it and wanted to share our results. Our tests show that careful prompting isn’t as important as matching concepts likely found in the training data, and that achieving amusing results likely requires many generations and selective cherry-picking.
An enduring theme of all generative AI models we’ve seen since 2022 is that they can be excellent at mixing concepts found in training data but are typically very poor at generalizing (applying learned “knowledge” to new situations the model has not explicitly been trained on). That means they can excel at stylistic and thematic novelty but struggle at fundamental structural novelty that goes beyond the training data.
What does all that mean? In the case of Runway Gen-3, lack of generalization means you might ask for a sailing ship in a swirling cup of coffee, and provided that Gen-3’s training data includes video examples of sailing ships and swirling coffee, that’s an “easy” novel combination for the model to make fairly convincingly. But if you ask for a cat drinking a can of beer (in a beer commercial), it will generally fail because the training data likely doesn’t contain many videos of photorealistic cats drinking human beverages. Instead, the model will pull from what it has learned about videos of cats and videos of beer commercials and combine them. The result is a cat with human hands pounding back a brewsky.
(Update: Runway has not revealed where it got its training data, but after the publication of this article, 404 Media posted a report that seems to show that much of the video data came from an unauthorized scrape of YouTube videos.)
A few basic prompts
During the Gen-3 Alpha testing phase, we signed up for Runway’s Standard plan, which provides 625 credits for $15 a month, plus some bonus free trial credits. Each generation costs 10 credits per second of video, and we created 10-second videos for 100 credits apiece. So the number of generations we could make was limited.
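To make that budget concrete, here is a rough sketch of the credit arithmetic in Python. The plan figures come from the paragraph above; the function names are ours, and the calculation assumes the base monthly allotment only, ignoring the bonus trial credits, whose amount isn't stated.

```python
# Figures quoted in the article for Runway's Standard plan.
PLAN_CREDITS = 625        # credits included per month
PLAN_PRICE_USD = 15       # monthly price in dollars
CREDITS_PER_SECOND = 10   # cost of one second of generated video


def clip_cost(seconds: int) -> int:
    """Credits needed for a single clip of the given length."""
    return seconds * CREDITS_PER_SECOND


def clips_per_month(seconds: int, credits: int = PLAN_CREDITS) -> int:
    """Whole clips of the given length that fit in a credit balance."""
    return credits // clip_cost(seconds)


def dollars_per_clip(seconds: int) -> float:
    """Effective dollar price of one clip at the plan's credit rate."""
    return PLAN_PRICE_USD * clip_cost(seconds) / PLAN_CREDITS


if __name__ == "__main__":
    length = 10  # seconds, the clip length used in our tests
    print(f"Credits per {length}-second clip: {clip_cost(length)}")
    print(f"Full clips per month: {clips_per_month(length)}")
    print(f"Approximate cost per clip: ${dollars_per_clip(length):.2f}")
```

Under those assumptions, the base allotment covers six full 10-second clips a month at roughly $2.40 each, which leaves little room for the repeated attempts that cherry-picking requires.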

So that's at least one thing this video generator is good for, right there: manufacturing goofy memes.
AI finds (and adversarial generative algorithms replicate) patterns in data, but it has no internal understanding of what that data represents beyond "tags." Cat tag + drinking soda tag = cats with hands. It's really that simple, and it will never change without a substantial (not incremental) restructuring of ML algorithms.