Abstract

This paper presents a novel method for building scalable 3D generative models using pre-trained video diffusion models. The primary obstacle in developing foundation 3D generative models is the limited availability of 3D data. Unlike images, text, or videos, 3D data are not readily accessible and are difficult to acquire, so the available 3D data are far smaller in scale than the vast quantities of other types of data. To address this issue, we propose using a video diffusion model, trained on extensive volumes of text, images, and videos, as a knowledge source for 3D data. By unlocking its multi-view generative capabilities through fine-tuning, we generate a large-scale synthetic multi-view dataset to train a feed-forward 3D generative model. The proposed model, VFusion3D, trained on nearly 3M synthetic multi-view samples, can generate a 3D asset from a single image in seconds and outperforms current state-of-the-art (SOTA) feed-forward 3D generative models, with users preferring our results over 90% of the time.

Overall pipeline

The pipeline of VFusion3D. We first use a small amount of 3D data to fine-tune a video diffusion model, transforming it into a multi-view video generator that functions as a data engine. We then use the large amount of synthetic data it generates to train VFusion3D, which produces a 3D representation and renders novel views.

Results

Generated Images (Text-Image-3D)

Single Image 3D Reconstruction

User Study

Scaling!

Scaling with the amount of synthetic data

The left and right figures show the LPIPS and CLIP image similarity scores, respectively, as a function of dataset size. The generation quality consistently improves as the dataset size increases.
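For readers unfamiliar with the second metric, CLIP image similarity is commonly computed as the cosine similarity between the CLIP image embeddings of a rendered view and its reference view, averaged over all pairs. The sketch below illustrates only that final step, under stated assumptions: the embeddings are taken to be already extracted by some CLIP image encoder, the toy vectors are placeholders rather than real embeddings, and the function names are hypothetical. This page does not specify the exact evaluation protocol, so this is not necessarily how the reported scores were produced.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def mean_image_similarity(
    rendered: Sequence[Sequence[float]],
    reference: Sequence[Sequence[float]],
) -> float:
    """Average cosine similarity over paired (rendered, reference) embeddings."""
    if len(rendered) != len(reference) or len(rendered) == 0:
        raise ValueError("need the same non-zero number of embeddings on both sides")
    scores = [cosine_similarity(r, g) for r, g in zip(rendered, reference)]
    return sum(scores) / len(scores)


# Toy 2-D vectors standing in for CLIP image embeddings (assumption:
# real embeddings would come from a CLIP image encoder and be much larger).
rendered_views = [[1.0, 0.0], [1.0, 0.0]]
reference_views = [[1.0, 0.0], [0.0, 1.0]]
score = mean_image_similarity(rendered_views, reference_views)
```

Higher values mean the rendered views are semantically closer to the references. LPIPS works in the opposite direction: it is a perceptual distance, so lower is better.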

Scaling with other factors

Our approach can also scale and improve with several other factors. These include the development of stronger video diffusion models, the availability of more 3D data for fine-tuning the video diffusion model and the pre-trained 3D generative model, and the advancement of large 3D feed-forward generative models. All these factors contribute to the scalability of our model, positioning it as a promising avenue for foundation 3D generative models.

Paper

Han et al.
VFusion3D: Learning Scalable 3D Generative Models from Video Diffusion Models
preprint
(hosted on arXiv)


Acknowledgements

Junlin Han is supported by Meta. We would like to thank Jianyuan Wang, Luke Melas-Kyriazi, Yawar Siddiqui, Quankai Gao, Yanir Kleiman, Roman Shapovalov, Natalia Neverova, and Andrea Vedaldi for the insightful discussions and invaluable support.