Summary: Flex3D is a two-stage pipeline that generates high-quality 3D assets from single images or text prompts.

Figure: a diagram giving a broad overview of the method, as described in the caption.
Flex3D comprises two stages: (1) candidate view generation and selection, and (2) 3D reconstruction using FlexRM. In the first stage, an input image or text prompt drives the generation of a diverse set of candidate views through fine-tuned multi-view and video diffusion models. A view selection mechanism then filters these views by quality and consistency. In the second stage, the selected high-quality views are fed to FlexRM, which reconstructs the 3D object using a tri-plane representation decoded into 3D Gaussians.
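The two-stage control flow described in the caption can be sketched in a few lines of Python. This is only an illustrative outline, not the authors' implementation: the `CandidateView` fields, the threshold-based `select_views` rule, the threshold values, and the `generate` / `reconstruct` callables are all assumptions made for this example, and the diffusion models and FlexRM are replaced by toy stand-ins.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class CandidateView:
    """One generated view. All fields are hypothetical, for illustration only."""
    view_id: int
    quality: float      # assumed quality score in [0, 1]
    consistency: float  # assumed consistency score in [0, 1]


def select_views(
    candidates: Sequence[CandidateView],
    min_quality: float = 0.5,
    min_consistency: float = 0.5,
) -> List[CandidateView]:
    """Keep views that pass both thresholds (an assumed selection rule)."""
    return [
        v for v in candidates
        if v.quality >= min_quality and v.consistency >= min_consistency
    ]


def run_pipeline(
    generate: Callable[[], Sequence[CandidateView]],
    reconstruct: Callable[[Sequence[CandidateView]], object],
) -> object:
    """Stage 1: generate and filter views. Stage 2: reconstruct from the survivors."""
    selected = select_views(generate())
    if not selected:
        raise ValueError("no candidate view passed selection")
    return reconstruct(selected)


# Toy stand-in for the candidate view generators.
def fake_generate() -> List[CandidateView]:
    return [
        CandidateView(0, quality=0.9, consistency=0.8),
        CandidateView(1, quality=0.3, consistency=0.9),  # fails quality
        CandidateView(2, quality=0.7, consistency=0.4),  # fails consistency
        CandidateView(3, quality=0.6, consistency=0.6),
    ]


# Toy stand-in for the reconstruction model: it only records which views it received.
def fake_reconstruct(views: Sequence[CandidateView]) -> dict:
    return {"used_view_ids": [v.view_id for v in views]}


result = run_pipeline(fake_generate, fake_reconstruct)
print(result)  # {'used_view_ids': [0, 3]}
```

The point of the sketch is that the reconstruction stage receives a variable number of views, depending on how many candidates survive selection, so whatever plays the role of `reconstruct` must accept an input set of flexible size.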


Junlin Han is supported by Meta. We would like to thank Luke Melas-Kyriazi, Runjia Li, Yawar Siddiqui, Minghao Chen, David Novotny, and Natalia Neverova for the helpful discussions and support.