A3D: Does Diffusion Dream about 3D Alignment?

Abstract

We tackle the problem of text-driven 3D generation from a geometry alignment perspective. Given a set of text prompts, we aim to generate a collection of objects with semantically corresponding parts aligned across them. Recent methods based on Score Distillation have succeeded in distilling knowledge from 2D diffusion models into high-quality representations of 3D objects. These methods handle multiple text prompts separately, so the resulting objects vary widely in pose and structure. However, in some applications, such as 3D asset design, it may be desirable to obtain a set of objects aligned with each other. To align the corresponding parts of the generated objects, we propose to embed these objects into a common latent space and to optimize the continuous transitions between them. We enforce two properties of these transitions: smoothness of the transition and plausibility of the intermediate objects along it. We demonstrate that both properties are essential for good alignment. We present several practical scenarios that benefit from alignment between objects, including 3D editing and object hybridization, and experimentally demonstrate the effectiveness of our method.
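To make the two transition properties concrete, here is a minimal sketch of how such an objective could be scored, assuming a latent-conditioned 3D representation with a differentiable render function and an SDS-style sds_loss; these names and the prompt-blending scheme are our illustrative assumptions, not the paper's implementation.

```python
# A minimal sketch (not the authors' code) of the joint transition objective:
# plausibility of intermediate objects plus smoothness along the latent path.
import torch

def transition_losses(render, sds_loss, z1, z2, prompt1, prompt2, n_steps=8):
    """Sample points on the latent path z(t) = (1 - t) * z1 + t * z2 and score:
      - plausibility: intermediate objects should look like valid objects,
        enforced here with an SDS loss under a blend of the two prompts
        (one reasonable choice; the paper may condition differently);
      - smoothness: renderings of nearby latents should change gradually,
        enforced with a finite-difference penalty along the path."""
    ts = torch.rand(n_steps).sort().values        # random positions on the path
    plausibility, smoothness = 0.0, 0.0
    prev_img = None
    for t in ts:
        z_t = (1 - t) * z1 + t * z2               # linear latent interpolation
        img = render(z_t)                          # differentiable rendering
        plausibility = plausibility + (1 - t) * sds_loss(img, prompt1) \
                                    + t * sds_loss(img, prompt2)
        if prev_img is not None:
            smoothness = smoothness + (img - prev_img).abs().mean()
        prev_img = img
    return plausibility / n_steps, smoothness / max(n_steps - 1, 1)
```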

Our method A3D enables conditioning the text-to-3D generation process on a set of text prompts to jointly generate a set of 3D objects with a shared structure (top). This allows a user to build "hybrids" that combine parts of multiple aligned objects (middle), or to perform a text-driven, structure-preserving transformation of an input 3D model (bottom).

Motivation

Collections of objects generated with existing text-to-3D methods lack structural consistency (top). Shapes obtained with existing text-driven 3D editing methods lack text-to-asset alignment and visual quality (middle). In contrast, our method enables the generation of structurally coherent, text-aligned assets with high visual quality (bottom).

Generation of multiple aligned 3D objects

We evaluate our method on the generation of sets of aligned objects using 15 pairs of prompts, each describing two objects with similar morphology but different geometry and appearance, such as a car and a carriage. The prompts cover various object categories: animals, humanoids, plants, vehicles, furniture, and buildings.

Below, we show pairs of objects generated with existing methods and with our method. Each pair of rows shows the results for one pair of prompts, written below the images. For each method, we show one of the generated objects in the pair and a rendering of its geometry. Use the slider below to switch between the two objects in the pair, and the controls further down to switch between examples.

Our method generates both objects in a pair simultaneously, while the other methods first generate one object and then derive the other from it. We therefore show two sets of results: one with object 1 generated first, denoted p1→p2, and one with object 2 generated first, denoted p1←p2.

[Interactive viewer: slider between Object 1 and Object 2 for each example]

Below, we show more examples of sets of aligned objects generated with our method.

Hybridization: combining the aligned 3D objects

We show examples of hybrid objects that combine parts of aligned objects produced by our method, and illustrate the process of obtaining these hybrids below. For some examples, we intentionally generate the objects with different hyperparameters than those used for the pairs above, to increase the visual difference between the objects and make the hybridization easier to see.

We show pairs or triplets of the generated objects in the first two or three columns, and different variants of the hybrid models in the next column. To choose which part of each object to use, we assign several anchor points to each object and manually place these points in the objects' common 3D space. We show these points in the last column, with different colors corresponding to different objects. We define the spatial distribution of the latent code (shown in the second-to-last column) via linear interpolation between the latent codes of the objects associated with the two closest anchors. Use the controls below to switch between different examples.
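As a concrete illustration of this anchor-based interpolation, below is a short sketch assuming per-object latent codes and manually placed anchors as described above; the function names and the distance-based weighting between the two nearest anchors are our assumptions, not the released code.

```python
# A hedged sketch of building a spatially varying latent field for hybrids:
# each query point receives a latent code interpolated between the codes of
# the objects associated with its two closest anchors.
import torch

def hybrid_latent(points, anchor_pos, anchor_obj, object_codes):
    """points:       (N, 3) query positions in the shared 3D space
    anchor_pos:   (M, 3) manually placed anchor positions, M >= 2
    anchor_obj:   (M,)   long tensor, index of the object each anchor belongs to
    object_codes: (K, D) per-object latent codes
    Returns (N, D) spatially varying latent codes."""
    dists = torch.cdist(points, anchor_pos)            # (N, M) point-anchor distances
    d2, idx2 = dists.topk(2, dim=1, largest=False)     # two closest anchors per point
    # Weight of the closest anchor grows as the point approaches it:
    # w0 = d1 / (d0 + d1), the standard two-point linear interpolation weight.
    w = d2[:, 1:] / (d2.sum(dim=1, keepdim=True) + 1e-8)
    z0 = object_codes[anchor_obj[idx2[:, 0]]]          # (N, D) code of nearest anchor's object
    z1 = object_codes[anchor_obj[idx2[:, 1]]]          # (N, D) code of second-nearest
    return w * z0 + (1 - w) * z1                       # linear interpolation
```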

Structure-preserving transformation of 3D models

We evaluate the capability of our method to transform an initial 3D model while preserving its structure on 26 text prompts. For each prompt, we either find a coarse initial model with the desired structure on the web or use the SMPL parametric human body model in a desired pose.

Below, we show the results obtained with existing methods and with our method. Each pair of rows shows the results for the text prompt written below the images. For each method, we show the transformed object and a rendering of its geometry. Use the slider below to switch between the initial and the transformed 3D model, and the controls further down to switch between examples. LucidDreamer diverged on the examples on slides 7–13.

[Interactive viewer: slider between the initial and the transformed model for each example]

Ablation

We compare our method with two groups of baselines for generating pairs of objects. We refer to the baselines in the first group as (A), (B), and (C), and to those in the second group as (E) and (F); (D) is our complete method. See the paper for descriptions of the baselines.

Below, we show the generated pairs of objects. Each pair of rows shows the results for one pair of prompts, written below the images. For each method, we show one of the generated objects in the pair; use the slider below to switch to the other object. Additionally, we show the silhouettes of the objects, which demonstrate the alignment of their structural parts. Use the controls further down to switch between examples.

[Interactive viewer: slider between Object 1 and Object 2 for each example]

BibTeX

@misc{ignatyev2024a3d,
    title         = {{A3D}: Does Diffusion Dream about 3D Alignment?},
    author        = {Savva Ignatyev and Nina Konovalova and Daniil Selikhanovych and Oleg Voynov and Nikolay Patakin and Ilya Olkov and Dmitry Senushkin and Alexey Artemov and Anton Konushin and Alexander Filippov and Peter Wonka and Evgeny Burnaev},
    year          = {2024},
    eprint        = {2406.15020},
    archivePrefix = {arXiv}
}