nVidia’s new text-to-video AI shows an insane rate of progress

A stormtrooper vacuums at the beach … except the vacuum head is a pool cleaner and it’s plugged into his butt. We live in interesting times. (Image: nVidia)

Presented at the IEEE Conference on Computer Vision and Pattern Recognition 2023, nVidia’s new video generator starts out as a Latent Diffusion Model (LDM) trained to generate images from text, and then introduces an extra step in which it attempts to animate the image using what it’s learned from studying thousands of existing videos.

This adds time as a tracked dimension, and the LDM is tasked with estimating what’s likely to change in each area of an image over a certain period. It creates a number of keyframes throughout the sequence, then uses another LDM to interpolate the frames between those keyframes, so that every frame in the sequence is of similar quality.
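To make the keyframe-then-interpolate idea a little more concrete, here is a toy Python sketch. It is not nVidia’s code: the `Frame` type, the function name and the `steps_between` parameter are all made up for illustration, and simple linear blending stands in for the second, learned interpolation model, which is far more sophisticated.

```python
from typing import List

# Toy stand-in for one frame's latent representation (an assumption for
# illustration; real latents are large multi-dimensional tensors).
Frame = List[float]


def interpolate_between_keyframes(keyframes: List[Frame], steps_between: int) -> List[Frame]:
    """Fill in `steps_between` frames between each adjacent pair of keyframes.

    Linear blending is used purely as a placeholder for the learned
    interpolation model described in the article.
    """
    if not keyframes:
        return []
    sequence: List[Frame] = []
    for start, end in zip(keyframes, keyframes[1:]):
        sequence.append(start)
        for step in range(1, steps_between + 1):
            t = step / (steps_between + 1)  # fraction of the way from start to end
            sequence.append([(1 - t) * a + t * b for a, b in zip(start, end)])
    sequence.append(keyframes[-1])  # the final keyframe closes the sequence
    return sequence


# Three sparse keyframes, with three in-between frames per gap.
keyframes = [[0.0, 0.0], [1.0, 2.0], [0.0, 4.0]]
video = interpolate_between_keyframes(keyframes, steps_between=3)
print(len(video))  # 9 frames: 3 keyframes + 2 gaps x 3 in-betweens
```

The point of the two-stage structure is that the sparse keyframes pin down the overall motion first, and the in-between frames only have to bridge short gaps.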

nVidia tested the system using low-quality dashcam-style footage, and found that it was capable of generating several minutes’ worth of this kind of video in a “temporally coherent” fashion, at 512 x 1024-pixel resolution – an unprecedented feat in this fast-moving field.

But it’s also capable of operating at much higher resolutions and across an enormous range of other visual styles. The team used the system to generate a plethora of sample videos in 1280 x 2048-pixel resolution, simply from text prompts. These videos each contain 113 frames, and are rendered at 24 fps, so they’re about 4.7 seconds long. Pushing much further than that in terms of total time seems to break things, and introduces a lot more weirdness.
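For anyone who wants to check the numbers, the clip length and the jump in resolution follow directly from the figures quoted above. The variable names in this quick Python sketch are just illustrative.

```python
# Figures quoted in the article
frames = 113
fps = 24

duration_seconds = frames / fps
print(round(duration_seconds, 1))  # 4.7

# Pixel counts per frame for the two resolutions mentioned
dashcam_pixels = 512 * 1024     # 524,288
sample_pixels = 1280 * 2048     # 2,621,440

pixel_ratio = sample_pixels / dashcam_pixels
print(pixel_ratio)  # 5.0
```

In other words, each frame of the higher-resolution samples carries five times as many pixels as a frame of the dashcam-style footage.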

They’re still clearly AI-generated, and there are still plenty of weird mistakes to be found. It’s also kind of obvious where the keyframes are in many of the videos, with some odd speeding and slowing of motion around them. But in sheer image quality, these are an incredible leap forward from what we saw with ModelScope at the start of this month.

It’s pretty incredible to watch these amazing AI systems in these formative days, beginning to understand how images and videos work. Think of all the things they need to figure out – three-dimensional space, for one, and how a realistic parallax effect might follow if a camera is moved. Then there’s how liquids behave, from the spray-flinging spectacle of waves crashing against rocks at sunset, to the gently expanding wake left by a swimming duck, to the way steamed milk mingles and foams as you pour it into coffee.

Then there’s the subtly shifting reflections on a rotating bowl of grapes. Or the way a field of flowers moves in the wind. Or the way flames propagate along logs in a campfire and lick upwards at the sky. That’s to say nothing of the massive variety of human and animal behaviors they need to recreate.