A3D: Does Diffusion Dream about 3D Alignment?

Abstract

We tackle the problem of text-driven 3D generation from a geometry alignment perspective. Given a set of text prompts, we aim to generate a collection of objects with semantically corresponding parts aligned across them. Recent methods based on Score Distillation have succeeded in distilling knowledge from 2D diffusion models into high-quality representations of 3D objects. These methods handle multiple text prompts separately, so the resulting objects vary widely in pose and structure. However, in some applications, such as 3D asset design, it may be desirable to obtain a set of objects aligned with each other. To align the corresponding parts of the generated objects, we propose to embed these objects into a common latent space and to optimize the continuous transitions between them. We enforce two properties of these transitions: smoothness of the transition and plausibility of the intermediate objects along it. We demonstrate that both properties are essential for good alignment. We present several practical scenarios that benefit from alignment between objects, including 3D editing and object hybridization, and experimentally demonstrate the effectiveness of our method.
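To make the two transition properties concrete, here is a minimal sketch of how such an objective could be scored, assuming a latent-conditioned 3D representation with a differentiable render function and an SDS-style sds_loss; these names and the prompt-blending scheme are our illustrative assumptions, not the paper's implementation.

```python
# A minimal sketch (not the authors' code) of the joint transition objective:
# plausibility of intermediate objects plus smoothness along the latent path.
import torch

def transition_losses(render, sds_loss, z1, z2, prompt1, prompt2, n_steps=8):
    """Sample points on the latent path z(t) = (1 - t) * z1 + t * z2 and score:
      - plausibility: intermediate objects should look like valid objects,
        enforced here with an SDS loss under a blend of the two prompts
        (one reasonable choice; the paper may condition differently);
      - smoothness: renderings of nearby latents should change gradually,
        enforced with a finite-difference penalty along the path."""
    ts = torch.rand(n_steps).sort().values        # random positions on the path
    plausibility, smoothness = 0.0, 0.0
    prev_img = None
    for t in ts:
        z_t = (1 - t) * z1 + t * z2               # linear latent interpolation
        img = render(z_t)                          # differentiable rendering
        plausibility = plausibility + (1 - t) * sds_loss(img, prompt1) \
                                    + t * sds_loss(img, prompt2)
        if prev_img is not None:
            smoothness = smoothness + (img - prev_img).abs().mean()
        prev_img = img
    return plausibility / n_steps, smoothness / max(n_steps - 1, 1)
```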

Our method A3D enables conditioning the text-to-3D generation process on a set of text prompts to jointly generate a set of 3D objects with a shared structure (top). This allows a user to build "hybrids" that combine parts of multiple aligned objects (middle), or to perform a text-driven, structure-preserving transformation of an input 3D model (bottom).

Motivation

Collections of objects generated with existing text-to-3D methods lack structural consistency (top). Shapes obtained with existing text-driven 3D editing methods lack text-to-asset alignment and visual quality (middle). In contrast, our method enables the generation of structurally coherent, text-aligned assets with high visual quality (bottom).

Generation of multiple aligned 3D objects

We evaluate our method on the generation of sets of aligned objects using 15 pairs of prompts, each describing two objects with similar morphology but different geometry and appearance, such as a car and a carriage. The prompts cover various object categories: animals, humanoids, plants, vehicles, furniture, and buildings.

Below, we show pairs of objects generated with existing methods and with our method. Each pair of rows shows the results for one pair of prompts, written below the images. For each method, we show one of the generated objects in the pair and a rendering of its geometry. Use the slider below to switch between the two objects in the pair, and the controls further down to switch between examples.

Our method generates both objects in a pair simultaneously, while the other methods first generate one object and then derive the other from it. We therefore show two sets of results: one with object 1 generated first, denoted p1→p2, and one with object 2 generated first, denoted p1←p2.

[Interactive viewer: slider between Object 1 and Object 2 for each example]

Below, we show more examples of sets of aligned objects generated with our method.

Hybridization: combining the aligned 3D objects

We show examples of hybrid objects that combine parts of aligned objects produced by our method, and illustrate the process of obtaining these hybrids below. For some examples, we intentionally generate the objects with different hyperparameters than those used for the pairs above, to increase the visual difference between the objects and make the hybridization easier to see.

We show pairs or triplets of the generated objects in the first two or three columns, and different variants of the hybrid models in the next column. To choose which part of each object to use, we assign several anchor points to each object and manually place these points in the objects' common 3D space. We show these points in the last column, with different colors corresponding to different objects. We define the spatial distribution of the latent code (shown in the second-to-last column) via linear interpolation between the latent codes of the objects associated with the two closest anchors. Use the controls below to switch between different examples.
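As a concrete illustration of this anchor-based interpolation, below is a short sketch assuming per-object latent codes and manually placed anchors as described above; the function names and the distance-based weighting between the two nearest anchors are our assumptions, not the released code.

```python
# A hedged sketch of building a spatially varying latent field for hybrids:
# each query point receives a latent code interpolated between the codes of
# the objects associated with its two closest anchors.
import torch

def hybrid_latent(points, anchor_pos, anchor_obj, object_codes):
    """points:       (N, 3) query positions in the shared 3D space
    anchor_pos:   (M, 3) manually placed anchor positions, M >= 2
    anchor_obj:   (M,)   long tensor, index of the object each anchor belongs to
    object_codes: (K, D) per-object latent codes
    Returns (N, D) spatially varying latent codes."""
    dists = torch.cdist(points, anchor_pos)            # (N, M) point-anchor distances
    d2, idx2 = dists.topk(2, dim=1, largest=False)     # two closest anchors per point
    # Weight of the closest anchor grows as the point approaches it:
    # w0 = d1 / (d0 + d1), the standard two-point linear interpolation weight.
    w = d2[:, 1:] / (d2.sum(dim=1, keepdim=True) + 1e-8)
    z0 = object_codes[anchor_obj[idx2[:, 0]]]          # (N, D) code of nearest anchor's object
    z1 = object_codes[anchor_obj[idx2[:, 1]]]          # (N, D) code of second-nearest
    return w * z0 + (1 - w) * z1                       # linear interpolation
```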

Structure-preserving transformation of 3D models

We evaluate the capability of our method to transform an initial 3D model while preserving its structure on 26 text prompts. For each prompt, we either find a coarse initial model with the desired structure on the web or use the SMPL parametric human body model in a desired pose.

Below, we show the results obtained with existing methods and with our method. Each pair of rows shows the results for the text prompt written below the images. For each method, we show the transformed object and a rendering of its geometry. Use the slider below to switch between the initial and the transformed 3D model, and the controls further down to switch between examples. LucidDreamer diverged on the examples on slides 7–13.

[Interactive viewer: slider between the initial and the transformed model for each example]

Ablation

We compare our method with two groups of baselines for generating pairs of objects. We refer to the baselines in the first group as (A), (B), and (C), and to those in the second group as (E) and (F); (D) is our complete method. See the paper for descriptions of the baselines.

Below, we show the generated pairs of objects. Each pair of rows shows the results for one pair of prompts, written below the images. For each method, we show one of the generated objects in the pair; use the slider below to switch to the other object. Additionally, we show the silhouettes of the objects, which demonstrate the alignment of their structural parts. Use the controls further down to switch between examples.

[Interactive viewer: slider between Object 1 and Object 2 for each example]

BibTeX

@misc{ignatyev2024a3d,
    title         = {{A3D}: Does Diffusion Dream about 3D Alignment?},
    author        = {Savva Ignatyev and Nina Konovalova and Daniil Selikhanovych and Oleg Voynov and Nikolay Patakin and Ilya Olkov and Dmitry Senushkin and Alexey Artemov and Anton Konushin and Alexander Filippov and Peter Wonka and Evgeny Burnaev},
    year          = {2024},
    eprint        = {2406.15020},
    archivePrefix = {arXiv}
}