RealFusion：

360° Reconstruction of Any Object from a Single Image

• Oxford University

• 2023.2.23

Demo

https://lukemelas.github.io/realfusion/

Motivation: Single-View 3D Reconstruction

• Reconstructing the 3D structure of an object from a single 2D view is a fundamental challenge in computer vision.
• In the case of a single view, the reconstruction problem is highly ill-posed. As a result, the task requires semantic understanding obtained by learning. Despite the difficulty of this task, humans are adept at using a range of monocular cues to infer the 3D structures of objects from single views.

Background

Category-level 3D Reconstruction

• Most prior work tackles the problem of category-specific single-view 3D reconstruction by training a category-level reconstruction model.
• This work: going beyond category-level 3D reconstruction
• This work aims to go beyond category-specific images to images of arbitrary objects. This setting is highly challenging, but humans perform it effortlessly when they observe new objects.

Single-View 3D Reconstruction

• Arbitrary-object 3D reconstruction has been challenging because the problem fundamentally requires the use of large-scale 3D priors over object shapes, which have not been available.
• With the recent rise of large-scale pretraining, this problem has become tractable. Examples include:
• Contrastive: CLIP
• Autoregressive: DALL-E / Parti
• Diffusion Models: DALL-E 2 / Imagen / Stable Diffusion
• These pretrained models may be used as priors for a variety of vision tasks, and we are particularly interested in 3D reconstruction.
• At a high level, you can think of these models as a tool for optimizing the realism of an input image.
• In this way, they enable an elegant approach to 3D generation and reconstruction: using these large-scale pretrained models to enforce that a differentiable scene looks realistic from random views.
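The idea of "optimizing the realism of an image" can be made concrete with a toy sketch. Below, a simple Gaussian over pixel values stands in for the pretrained diffusion prior (the real method uses Stable Diffusion's denoiser to supply the score); gradient ascent on the prior's log-likelihood pushes a random image toward something the prior considers realistic. All names here are illustrative, not from the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a pretrained prior: an isotropic Gaussian over 8x8 images
# whose mean is a "realistic" image. A real diffusion model would supply the
# score (gradient of log-likelihood) via its denoising network instead.
mu = np.full((8, 8), 0.5)
sigma2 = 0.1

def prior_score(img):
    """Gradient of log p_prior(img) for the toy Gaussian prior."""
    return -(img - mu) / sigma2

# Gradient ascent on log-likelihood: the image is nudged toward high
# probability under the prior, mirroring how rendered views are pushed
# toward realism in diffusion-guided 3D generation.
img = rng.normal(size=(8, 8))
lr = 0.01
for _ in range(500):
    img += lr * prior_score(img)
```

In the actual method this score comes from the frozen diffusion model and the gradient flows back into the parameters of a differentiable 3D scene rather than into raw pixels.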

Proposal

1. We propose RealFusion, a method that can extract from a single image of an object a 360° photographic 3D reconstruction without assumptions on the type of object imaged or 3D supervision of any kind;

2. We do so by leveraging an existing 2D diffusion image generator via a new single image variant of textual inversion;

3. We also introduce new regularizers and provide an efficient implementation using InstantNGP;

4. We demonstrate state-of-the-art reconstruction results on a number of in-the-wild images and images from existing datasets when compared to alternative approaches.

Related Work

• Image-based reconstruction of appearance and geometry
• Few-view reconstruction
• Single-view reconstruction
• Extracting 3D models from 2D generators
• Diffusion Models

Method

• This approach forms the backbone of our method, RealFusion.
1. [Init] We are given a single image and a function $\boldsymbol{p}_{\text {prior }}(\cdot)$ which computes the likelihood of an input image $\boldsymbol{I}$. We choose a camera view and represent our scene with a differentiably-renderable representation $\boldsymbol{x}$, for example a NeRF.
2. [Reconstruction] We render $\boldsymbol{x}$ from our given view and minimize the loss with respect to the real input image $\boldsymbol{I}$.
3. [Prior] We render images $\boldsymbol{I}_{\text {prior }}$ of $\boldsymbol{x}$ from randomly-chosen views on a hemisphere surrounding the origin, and we optimize $\boldsymbol{p}_{\text {prior }}\left(\boldsymbol{I}_{\text {prior }}\right)$ to enforce that $\boldsymbol{x}$ looks realistic from all directions.
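The three steps above can be sketched as a single optimization loop. In this toy version (purely illustrative, not the paper's implementation), the "scene" is a 4×4 density grid, "rendering from a view" is modelled as a rotation, and the prior simply prefers pixel values near a target constant; RealFusion instead uses an InstantNGP-backed NeRF and a diffusion prior.

```python
import numpy as np

rng = np.random.default_rng(0)

# [Init] A toy differentiable "scene": a 4x4 grid of values. Rendering from
# one of four views is modelled as a quarter-turn rotation (a stand-in for
# NeRF volume rendering).
scene = rng.normal(size=(4, 4))
target = np.full((4, 4), 0.8)          # the single input image (toy)

def render(x, view):
    """Render the scene from view in {0,1,2,3} (quarter-turns)."""
    return np.rot90(x, view)

lr = 0.05
for _ in range(400):
    # [Reconstruction] match the render from the input view to the image:
    # gradient of ||render(x, 0) - I||^2 with respect to the scene.
    recon_grad = 2 * (render(scene, 0) - target)

    # [Prior] a randomly chosen other view should look "realistic"; here
    # the prior just prefers values near 0.8 (stand-in for the diffusion
    # prior). The gradient is rotated back into scene coordinates.
    v = rng.integers(1, 4)
    prior_grad_view = 2 * (render(scene, v) - 0.8)
    prior_grad = np.rot90(prior_grad_view, -v)

    scene -= lr * (recon_grad + 0.1 * prior_grad)
```

The key structural point carried over from the method: one loss is anchored to the fixed input view, while the prior loss is applied to renders from randomly sampled views, and both gradients update the same scene parameters.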
• Prior work has explored this question in the domain of 3D generation
• Dreamfields: CLIP prior
• DreamFusion: Diffusion model prior
• In our work, we adopt a diffusion model prior using Stable Diffusion, a text-conditional latent diffusion model.
• As currently stated, our setup combines a reconstruction objective with a latent diffusion-based prior objective, conditioned on a manual text prompt (e.g. "An image of a fish.").
• However, we found that these results were lacking.
• In particular, the 3D shapes that are generated look like the input object from the input view, but do not look like the input object from other views.
• To fix this, we need to modify the prior to place a high likelihood on our input object, rather than a generic object with the same description.
• We do so by performing textual inversion.
• We optimize a text embedding $\mathbf{e}$ in the text encoder of the diffusion model to match our input image.
• Usually textual inversion is performed with multiple views of an object, but we substitute these views with heavy image augmentations.
• We also add other pieces of regularization:
1. A regularization on rendered normals
2. A coarse-to-fine training setup
• However, the key piece of the puzzle is the textual inversion.
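The single-image textual inversion step can be illustrated with a minimal sketch. Here a toy frozen "feature extractor" stands in for the diffusion model's conditioning pathway, and random flips stand in for the heavy augmentations that substitute for multiple views; the real method optimizes a token embedding through Stable Diffusion's denoising loss. The function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# The single input image.
image = rng.random((8, 8))

def features(img):
    """Frozen 'encoder': per-row means (toy stand-in for the model's
    image-conditioning features)."""
    return img.mean(axis=1)

def augment(img):
    """Heavy augmentations substituting for multiple views: random flips."""
    if rng.random() < 0.5:
        img = np.flip(img, axis=0)
    if rng.random() < 0.5:
        img = np.flip(img, axis=1)
    return img

# Optimize the pseudo-word embedding e so that it matches features of
# augmented copies of the single input image (SGD on ||e - f||^2).
emb = np.zeros(8)
lr = 0.1
for _ in range(300):
    f = features(augment(image))
    emb -= lr * 2 * (emb - f)
```

The embedding converges toward the average feature over augmentations of the one image, which is the sense in which augmentations substitute for the multiple views that textual inversion normally uses.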

Limitations

• Requires per-image optimization
• Both the textual inversion and the 3D optimization procedure must be performed separately for each input image.
• As a result, the process is relatively slow and difficult to apply to large datasets
• In some cases, reconstruction fails to produce a solid shape
• Perhaps this could be alleviated with better inductive biases or regularization terms
• In some cases, reconstruction produces two-headed objects
• This is known as the Janus Problem

RealFusion
http://enderfga.cn/2023/03/01/Realfusion/

Enderfga

March 1, 2023