Zero-1-to-3: Zero-shot One Image to 3D Object

Columbia University

Toyota Research Institute

TL;DR: We learn to control the camera perspective in large-scale diffusion models, enabling zero-shot novel view synthesis and 3D reconstruction from a single image.

We introduce Zero-1-to-3, a framework for changing the camera viewpoint of an object given just a single RGB image. To perform novel view synthesis in this under-constrained setting, we capitalize on the geometric priors that large-scale diffusion models learn about natural images. Our conditional diffusion model uses a synthetic dataset to learn controls of the relative camera viewpoint, which allow new images to be generated of the same object under a specified camera transformation. Even though it is trained on a synthetic dataset, our model retains a strong zero-shot generalization ability to out-of-distribution datasets as well as in-the-wild images, including impressionist paintings. Our viewpoint-conditioned diffusion approach can further be used for the task of 3D reconstruction from a single image. Qualitative and quantitative experiments show that our method significantly outperforms state-of-the-art single-view 3D reconstruction and novel view synthesis models by leveraging Internet-scale pre-training.
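To make the idea of a "specified camera transformation" concrete, here is a minimal sketch of how a relative viewpoint between two cameras could be computed and turned into a conditioning vector. It assumes cameras are described in spherical coordinates (polar angle, azimuth angle, radius) around the object; the names `Viewpoint`, `relative_viewpoint`, and `conditioning_vector`, as well as the four-element encoding, are illustrative assumptions and not taken from this page.

```python
import math
from typing import List, NamedTuple


class Viewpoint(NamedTuple):
    """Hypothetical camera pose on a sphere around the object."""
    polar_deg: float    # angle from the vertical axis, in degrees
    azimuth_deg: float  # angle around the vertical axis, in degrees
    radius: float       # distance from the object center


def wrap_degrees(angle: float) -> float:
    """Wrap an angle in degrees to the interval [-180, 180)."""
    return (angle + 180.0) % 360.0 - 180.0


def relative_viewpoint(src: Viewpoint, dst: Viewpoint) -> Viewpoint:
    """Relative transformation that moves the camera from src to dst."""
    return Viewpoint(
        polar_deg=dst.polar_deg - src.polar_deg,
        azimuth_deg=wrap_degrees(dst.azimuth_deg - src.azimuth_deg),
        radius=dst.radius - src.radius,
    )


def conditioning_vector(rel: Viewpoint) -> List[float]:
    """Assumed encoding: the azimuth is passed as (sin, cos) so that
    -180 and +180 degrees map to the same point."""
    azimuth = math.radians(rel.azimuth_deg)
    return [
        math.radians(rel.polar_deg),
        math.sin(azimuth),
        math.cos(azimuth),
        rel.radius,
    ]


if __name__ == "__main__":
    source = Viewpoint(polar_deg=90.0, azimuth_deg=350.0, radius=1.5)
    target = Viewpoint(polar_deg=60.0, azimuth_deg=20.0, radius=1.5)
    rel = relative_viewpoint(source, target)
    print(rel)
    print(conditioning_vector(rel))
```

Encoding the azimuth through its sine and cosine avoids a discontinuity where the angle wraps around; the other two components are passed through unchanged in this sketch.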


We learn a view-conditioned diffusion model that can subsequently control the viewpoint of an image containing a novel object (left). Such a diffusion model can also be used to train a NeRF for 3D reconstruction (right). Please refer to our paper for more details or check out our code for the implementation.

Novel View Synthesis

Here are some uncurated inference results from in-the-wild images we tried, along with images from the Google Scanned Objects and RTMV datasets. Note that the demo offers only a limited selection of rotation angles, quantized in 30-degree steps, because of limited storage space on the hosting server. If you want to try a fully custom demo running on a GPU server that lets you upload your own image, please check the live demo above or run one locally with our code!
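As an illustration of what quantizing rotation angles in 30-degree steps means, the sketch below snaps an arbitrary angle to the nearest available step. The function name `snap_angle` and the assumption that angles cover the full circle from 0 to 330 degrees are hypothetical; how the demo on this page actually selects its precomputed results is not stated.

```python
import math

STEP_DEG = 30  # step size mentioned in the text above


def snap_angle(angle_deg: float, step_deg: int = STEP_DEG) -> int:
    """Snap an angle to the nearest multiple of step_deg, in [0, 360).

    Ties are rounded up (45 -> 60) by using floor(x + 0.5) instead of
    Python's round(), which rounds halves to the nearest even integer.
    """
    steps = math.floor(angle_deg / step_deg + 0.5)
    return int(steps * step_deg) % 360


# All angles available under the assumed full-circle, 30-degree grid.
AVAILABLE_ANGLES = list(range(0, 360, STEP_DEG))

if __name__ == "__main__":
    for requested in (44, 45, 350, -20):
        print(requested, "->", snap_angle(requested))
```

With a 30-degree step, only 12 distinct angles per rotation axis need to be stored, which is consistent with the storage constraint described above.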

Text to Image to Novel Views

Here are results of applying Zero-1-to-3 to images generated by DALL-E 2.

Single-View 3D Reconstruction (Geometry)

Here are results of applying Zero-1-to-3 to obtain a full 3D reconstruction from the input image shown on the left. We compare our reconstruction with state-of-the-art models in single-view 3D reconstruction.

Single-View 3D Reconstruction (Texture)

