rubber reality

Is China pulling ahead in AI video synthesis? We put Minimax to the test.

With China’s AI video generators pushing memes into weird territory, it was time to test one out.

Benj Edwards – Oct 9, 2024 5:33 pm

A still shot from an AI-generated Minimax video-01 video with the prompt: "A highly-intelligent person reading 'Ars Technica' on their computer when the screen explodes" Credit: Minimax

If 2022 was the year AI image generators went mainstream, 2024 has arguably been the year that AI video synthesis models exploded in capability. These models, while not yet perfect, can generate new videos from text descriptions called prompts, still images, or existing videos. After OpenAI made waves with Sora in February, two major AI models emerged from China: Kuaishou Technology’s Kling and Minimax’s video-01.

Both Chinese models have already powered numerous viral AI-generated video projects, accelerating meme culture in weird new ways. Examples include a recent shot-for-shot translation of the Princess Mononoke trailer using Kling, which inspired death threats, and a series of videos created with Minimax’s platform that show a synthesized version of TV chef Gordon Ramsay doing ridiculous things.

After 22 million views and thousands of death threats, I felt like I needed to take this post down for my own mental health.
This trailer was an EXPERIMENT to show my 300 friends on X how far we've coming in 16 months.
I'm putting it back up to keep the conversation going. 🧵 pic.twitter.com/tFpRPm9BMv
— PJ Ace (@PJaccetturo) October 8, 2024

Kling first emerged in June, and it can generate two minutes of 1080p HD video at 30 frames per second with a level of detail and coherency that some think surpasses Sora. It’s currently only available to people with a Chinese telephone number, and we have not yet used it ourselves.

Around September 1, Minimax debuted the aforementioned video-01 as part of its Hailuo AI platform. That site lets anyone generate videos based on a prompt, and initial results seemed similar to Kling, so we decided to run some of our Runway Gen-3 prompts through it to see what would happen.

Putting Minimax to the test

We generated each of the six-second-long 720p videos seen below using Minimax’s free Hailuo AI platform. Each video generation took five to 10 minutes to complete, likely because our requests sat in a queue with those of other free users. (At one point, the whole thing froze up on us for a few days, so we didn’t get a chance to generate a flaming cheeseburger.)

In the spirit of not cherry-picking any results, everything you see was the first generation we received for the prompt listed above it.

“A highly intelligent person reading ‘Ars Technica’ on their computer when the screen explodes”

“A cat in a car drinking a can of beer, beer commercial”

“Will Smith eating spaghetti”

“Robotic humanoid animals with vaudeville costumes roam the streets collecting protection money in tokens”

“A basketball player in a haunted passenger train car with a basketball court, and he is playing against a team of ghosts”

“A herd of one million cats running on a hillside, aerial view”

“Video game footage of a dynamic 1990s third-person 3D platform game starring an anthropomorphic shark boy”

“A muscular barbarian breaking a CRT television set with a weapon, cinematic, 8K, studio lighting”

Limitations of video synthesis models

Overall, the Minimax video-01 results seen above feel fairly similar to Gen-3’s outputs, with some differences, like the lack of a celebrity filter on Will Smith (who sadly did not actually eat the spaghetti in our tests), and the more realistic cat hands and licking motion. Some results were far worse, like the 1 million cats and the Ars Technica reader.

As we explained in our hands-on test for Runway’s Gen-3 Alpha, text-to-video models typically excel at combining concepts present in their training data (existing video samples used to create the model), allowing for creative mashups of existing themes and styles. However, these AI models often struggle with generalization, meaning they have difficulty applying learned information to entirely novel scenarios not represented in their training data.

This limitation can lead to unexpected or unintended results when users request scenarios that deviate too far from the model’s training examples. While we saw a very comical result for the cat drinking beer in the Gen-3 test, Minimax rendered a more realistic-looking result. That difference could come down to better parsing of the prompt, different training data, more compute in training the model, or a different model architecture. Ultimately, there’s still a lot of trial and error in generating a coherent result.

It’s worth noting that while China’s models seem to match US video synthesis models from earlier this year, American tech companies aren’t standing still. Google showed off Veo in May with some very impressive-looking demos. And last week, we reported on Meta’s Movie Gen model, which, though we have not used it ourselves, appears to be potentially a step ahead of Minimax and Kling. But China’s servers are doubtless cranking away at training new AI video models as we speak, so this deepfake arms race probably won’t slow down any time soon.

Benj Edwards Senior AI Reporter

Benj Edwards was a reporter at Ars Technica covering artificial intelligence and technology history.
