GraphDreamer: Compositional 3D Scene Synthesis from Scene Graphs

¹Max Planck Institute for Intelligent Systems, Tübingen; ²ETH Zürich; ³University of Tübingen; ⁴Tübingen AI Center; ⁵University of Cambridge

CVPR 2024

[Figure: teaser image]

GraphDreamer takes scene graphs as input and generates object-compositional 3D scenes.

Abstract

We present GraphDreamer, the first framework capable of generating compositional 3D scenes from scene graphs, where objects are represented as nodes and their interactions as edges.

As pretrained text-to-image diffusion models become increasingly powerful, recent work has sought to distill knowledge from these models to optimize a text-guided 3D model. Most existing methods generate a holistic 3D model from a plain text input. This can be problematic when the text describes a complex scene with multiple objects, because the vectorized text embeddings are inherently unable to capture a complex description with multiple entities and relationships. Holistic 3D modeling of the entire scene further prevents accurate grounding of text entities and concepts.

Our method makes better use of the pretrained text-to-image diffusion model by exploiting node and edge information in scene graphs, and is able to fully disentangle different objects without image-level supervision. To facilitate modeling of object-wise relationships, we use signed distance fields as the representation and impose a constraint that prevents objects from inter-penetrating. To avoid manual scene graph creation, we design a text prompt for ChatGPT to generate scene graphs based on text inputs. We conduct both qualitative and quantitative experiments to validate the effectiveness of GraphDreamer in generating high-fidelity compositional 3D scenes with disentangled object entities.

Method

[Figure: method pipeline]

GraphDreamer decomposes the scene graph into global, node-wise, and edge-wise text descriptions, and represents and optimizes the SDF-based objects individually with identity-aware object fields.
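The decomposition into global, node-wise, and edge-wise descriptions can be illustrated with a small data structure. The sketch below is only an illustration under stated assumptions, not the authors' implementation: the class names `SceneGraph` and `Edge`, the method names, and the rule for forming the global description (joining edge phrases and appending unconnected nodes) are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    """A directed interaction between two nodes (hypothetical structure)."""
    subject: str   # name of the node that acts
    relation: str  # the interaction, e.g. "standing in front of"
    target: str    # name of the node acted upon


@dataclass
class SceneGraph:
    nodes: list[str]
    edges: list[Edge]

    def node_prompts(self) -> dict[str, str]:
        # Node-wise descriptions: one text per object, describing it alone.
        return {name: name for name in self.nodes}

    def edge_prompts(self) -> list[str]:
        # Edge-wise descriptions: one text per pairwise interaction.
        return [f"{e.subject} {e.relation} {e.target}" for e in self.edges]

    def global_prompt(self) -> str:
        # Global description (assumed rule): all edge phrases, followed by
        # any nodes that take part in no edge.
        linked = {name for e in self.edges for name in (e.subject, e.target)}
        isolated = [name for name in self.nodes if name not in linked]
        return ", ".join(self.edge_prompts() + isolated)


graph = SceneGraph(
    nodes=["a wizard", "a wooden desk", "a crystal ball"],
    edges=[Edge("a wizard", "standing in front of", "a wooden desk")],
)
print(graph.global_prompt())
```

Keeping the three kinds of description separate lets each one guide a different part of the scene: a node text for a single object, an edge text for a pair of objects, and the global text for the scene as a whole.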

Leveraging Scene Graphs For Text Grounding

To save the effort of building a scene graph from scratch, a language model (e.g., ChatGPT) can generate it from a user's text input.
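As a rough sketch of this step, the code below builds a request for a language model and validates its reply. The prompt wording, the JSON reply format, and the function names `build_prompt` and `parse_scene_graph` are assumptions made for illustration; the prompt actually designed by the authors is not reproduced here, and the call to the language model itself is left out.

```python
import json

# Hypothetical prompt template; the authors' actual prompt is not shown here.
PROMPT_TEMPLATE = (
    "Convert the scene description below into a scene graph. "
    'Reply with JSON only, using the keys "nodes" (a list of object names) '
    'and "edges" (a list of [subject, relation, object] triples).\n'
    "Description: {text}"
)


def build_prompt(text: str) -> str:
    """Fill the template with the user's scene description."""
    return PROMPT_TEMPLATE.format(text=text)


def parse_scene_graph(reply: str) -> tuple[list[str], list[tuple[str, str, str]]]:
    """Parse the model's JSON reply and check that edges only use known nodes."""
    data = json.loads(reply)
    nodes = [str(name) for name in data["nodes"]]
    edges = []
    for subject, relation, target in data["edges"]:
        if subject not in nodes or target not in nodes:
            raise ValueError(f"edge refers to unknown node: {subject!r} -> {target!r}")
        edges.append((subject, relation, target))
    return nodes, edges


# Example reply in the assumed format (written by hand, not model output).
reply = (
    '{"nodes": ["a wizard", "a wooden desk"], '
    '"edges": [["a wizard", "standing in front of", "a wooden desk"]]}'
)
nodes, edges = parse_scene_graph(reply)
```

Validating the reply before use guards against a common failure, where the model names an object in an edge that it never listed as a node.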

Inverse Semantics: A New Paradigm

We also present a new paradigm of semantic reconstruction with GPT4V-guided GraphDreamer.


Extract scene graph from images

Using GraphDreamer, we can invert the semantics of a given image into a 3D scene by extracting a scene graph directly from the input image with GPT4V, restricting the nodes to the most salient ones.

[Figure: inverse semantics example 1]

Identify object centers for initialization

To invert more complex semantics with more salient nodes in an image, we can ask GPT to provide center coordinates for each object and initialize the SDF-based objects as spheres centered at these coordinates.
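The sphere initialization can be written down directly, since the signed distance from a point p to a sphere with center c and radius r is ‖p − c‖ − r. The sketch below is a minimal illustration under stated assumptions: the function names, the example coordinates, and the shared default radius of 0.5 are hypothetical, and in the actual method each object's SDF is a learned field that is only initialized to such a sphere.

```python
import math
from typing import Callable

Point = tuple[float, float, float]
SDF = Callable[[Point], float]


def sphere_sdf(center: Point, radius: float) -> SDF:
    """Signed distance to a sphere: negative inside, zero on the surface."""
    def sdf(p: Point) -> float:
        return math.dist(p, center) - radius
    return sdf


def init_object_sdfs(centers: dict[str, Point], radius: float = 0.5) -> dict[str, SDF]:
    """Create one sphere SDF per object, centered at its given coordinates.

    The shared radius is an assumed default, not a value from the paper.
    """
    return {name: sphere_sdf(center, radius) for name, center in centers.items()}


# Hypothetical center coordinates, standing in for those returned by GPT.
centers = {
    "a wizard": (0.0, 0.0, 0.0),
    "a wooden desk": (1.5, 0.0, 0.0),
}
sdfs = init_object_sdfs(centers)

origin = (0.0, 0.0, 0.0)
inside_wizard = sdfs["a wizard"](origin) < 0       # origin is the wizard's center
inside_desk = sdfs["a wooden desk"](origin) < 0    # origin lies outside the desk sphere
```

Starting each object at a distinct location gives the optimization a spatial layout that already matches the image, which matters more as the number of nodes grows.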

[Figure: inverse semantics example 2]

Acknowledgment

The authors extend their thanks to Zehao Yu and Stefano Esposito for their invaluable feedback on the initial draft. Our thanks also go to Yao Feng, Zhen Liu, Zeju Qiu, Yandong Wen, and Yuliang Xiu for their proofreading of the final draft and for their insightful suggestions, which enhanced the quality of this paper. Additionally, we appreciate the assistance of those who participated in our user study. Weiyang Liu and Bernhard Schölkopf were supported by the German Federal Ministry of Education and Research (BMBF): Tübingen AI Center, FKZ: 01IS18039B, and by the Machine Learning Cluster of Excellence, the German Research Foundation (DFG): SFB 1233, Robust Vision: Inference Principles and Neural Mechanisms, TP XX, project number: 276693517. Andreas Geiger and Anpei Chen were supported by the ERC Starting Grant LEGO-3D (850533) and the DFG EXC number 2064/1 - project number 390727645.

BibTeX

@Inproceedings{gao2024graphdreamer,
  author    = {Gao, Gege and Liu, Weiyang and Chen, Anpei and Geiger, Andreas and Schölkopf, Bernhard},
  title     = {GraphDreamer: Compositional 3D Scene Synthesis from Scene Graphs},
  booktitle = {Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2024},
}