Nvidia Demos Personalized Text-To-Video Creation Via Dreambooth

Dreambooth has been one of the most popular ways to create personalized images through Stable Diffusion, but it might be about to get a lot more exciting.

Nvidia has demonstrated the ability to create videos of personalized subjects through Dreambooth. In a paper titled “Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models,” Nvidia describes its text-to-video approach. “We turn the publicly available, state-of-the-art text-to-image LDM Stable Diffusion into an efficient and expressive text-to-video model with resolution up to 1280 x 2048. We show the first results for personalized text-to-video generation, opening exciting directions for future content creation,” the paper says.

“The generated videos have a resolution of 1280 x 2048 pixels, consist of 113 frames and are rendered at 24 fps, resulting in 4.7 second long clips. Our Video LDM for text-to-video generation is based on Stable Diffusion and has a total of 4.1B parameters, including all components except the CLIP text encoder. Only 2.7B of these parameters are trained on videos. This means that our models are significantly smaller than those of several concurrent works. Nevertheless, we can produce high-resolution, temporally consistent and diverse videos. This can be attributed to the efficient LDM approach,” the paper adds.
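To make that recipe concrete, here is a minimal PyTorch sketch of the idea (all class names, layer choices, and shapes are hypothetical; Nvidia has not released its code): the spatial layers of a pretrained image LDM stay frozen and are applied to each frame independently, while newly inserted temporal layers, the only parameters trained on video, mix information across frames. This is how only a fraction of the total parameters ends up video-trained.

```python
import torch
import torch.nn as nn

class VideoLDMBlock(nn.Module):
    """One illustrative denoiser block: a frozen spatial layer taken from
    a pretrained image LDM, followed by a trainable temporal layer that
    mixes information across frames. Names and shapes are hypothetical."""
    def __init__(self, channels: int):
        super().__init__()
        # Spatial layer: pretrained on images, kept frozen.
        self.spatial = nn.Conv2d(channels, channels, 3, padding=1)
        for p in self.spatial.parameters():
            p.requires_grad = False
        # Temporal layer: newly added, trained on videos (1D conv over time).
        self.temporal = nn.Conv1d(channels, channels, 3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, channels, height, width)
        b, t, c, h, w = x.shape
        # Apply the spatial layer to every frame independently.
        x = self.spatial(x.reshape(b * t, c, h, w)).reshape(b, t, c, h, w)
        # Apply the temporal layer along the frame axis at each pixel.
        x = x.permute(0, 3, 4, 2, 1).reshape(b * h * w, c, t)
        x = self.temporal(x)
        return x.reshape(b, h, w, c, t).permute(0, 4, 3, 1, 2)

block = VideoLDMBlock(channels=8)
x = torch.randn(1, 16, 8, 32, 32)  # batch, frames, channels, height, width
y = block(x)
trainable = sum(p.numel() for p in block.parameters() if p.requires_grad)
total = sum(p.numel() for p in block.parameters())
print(f"trainable: {trainable}/{total}")  # only the temporal layer trains
```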

The paper not only demos text-to-video use cases, such as a Stormtrooper vacuuming a beach or a teddy bear playing the guitar, but also shows personalized video generation through Dreambooth. “We insert the temporal layers that were trained for our Video LDM for text-to-video synthesis into image LDM backbones that we previously fine-tuned on a set of images following DreamBooth. The temporal layers generalize to the DreamBooth checkpoints, thereby enabling personalized text-to-video generation,” the paper says.
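The swap the paper describes can be pictured as state-dict surgery. Here is a toy sketch under the same caveats (hypothetical names, not Nvidia’s actual code): the DreamBooth-fine-tuned image backbone’s weights are loaded into the video model’s spatial layers, while the video-trained temporal layers are left untouched.

```python
import torch
import torch.nn as nn

class ToyVideoLDM(nn.Module):
    """Toy stand-in: 'spatial' weights come from the image LDM backbone,
    'temporal' weights from the video training stage. Hypothetical names."""
    def __init__(self):
        super().__init__()
        self.spatial = nn.Linear(4, 4)   # image-LDM backbone weights
        self.temporal = nn.Linear(4, 4)  # video-trained temporal layers

def personalize(video_model: nn.Module, dreambooth_state: dict) -> nn.Module:
    """Overwrite the spatial backbone with DreamBooth-fine-tuned weights
    while keeping the temporal layers, mirroring the described recipe."""
    state = video_model.state_dict()
    for name, weight in dreambooth_state.items():
        if name in state and not name.startswith("temporal"):
            state[name] = weight  # replace spatial weights only
    video_model.load_state_dict(state)
    return video_model

model = ToyVideoLDM()
# A DreamBooth run would produce fine-tuned image weights like these:
dreambooth_ckpt = {"spatial.weight": torch.zeros(4, 4),
                   "spatial.bias": torch.zeros(4)}
model = personalize(model, dreambooth_ckpt)
print(model.spatial.weight.abs().sum().item())  # 0.0: backbone replaced
```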

Thus, after being supplied just five images of a cat, the approach can generate 4-second-long videos of the cat playing in grass or getting up. Similarly, given five photos of a building, it can generate videos of the building standing next to the Eiffel Tower, or of the building on an island with waves crashing nearby.

This is an interesting result, and will likely lead to the creation of hundreds of personalized text-to-video tools. Dreambooth’s personalized text-to-image capabilities spawned many sites that allowed for the generation of AI avatars, and it’s likely that Dreambooth’s text-to-video personalization could lead to a similar gold rush. The code for Nvidia’s approach isn’t currently open-source, but more developments in personalized text-to-video are clearly coming in the following weeks.