ATATA: One Algorithm to Align Them All

Boyi Pang¹,*
Savva Ignatyev²,*
Vladimir Ippolitov²,*
Ramil Khafizov²
Yurii Melnik²
Oleg Voynov²
Maksim Nakhodnov³,⁴
Aibek Alanov³,⁵
Xiaopeng Fan¹,⁶,⁷,†
Peter Wonka⁸,†
Evgeny Burnaev²

¹Harbin Institute of Technology
²Applied AI Institute
³FusionBrain Lab
⁴MSU
⁵HSE University
⁶The Peng Cheng Laboratory
⁷HIT Suzhou Research Institute
⁸KAUST

*Boyi, Savva and Vladimir contributed equally
†Corresponding authors

Abstract

We propose a new multi-modal algorithm for joint inference of paired, structurally aligned samples with Rectified Flow models. While some existing methods propose a codependent generation process, they do not view the problem of joint generation from a structural alignment perspective. Recent work uses Score Distillation Sampling (SDS) to generate aligned 3D models, but SDS is known to be time-consuming, prone to mode collapse, and often produces cartoonish results. By contrast, our approach relies on the joint transport of a segment in the sample space, yielding faster computation at inference time. It can be built on top of an arbitrary Rectified Flow model operating on a structured latent space. We show the applicability of our method to image, video, and 3D shape generation using state-of-the-art baselines and evaluate it against both editing-based and joint-inference-based competing approaches. We demonstrate a high degree of structural alignment for the sample pairs obtained with our method, together with high visual quality of the samples. Our method improves on the state of the art for image and video generation pipelines. For 3D generation, it achieves comparable quality while running orders of magnitude faster.

Running dog ↔ Running tiger

Formula 1 car taking a tight corner on a wet track ↔ Kayak taking a tight corner on a mountain river

Wide valley confrontation featuring a medieval knight ↔ Wide valley confrontation featuring a hairy Neanderthal

Ancient amphitheatre ↔ Modern football stadium

Space marine ↔ WW2 soldier

Duel of two knights ↔ Duel of two samurais

Atakebune wooden ship ↔ Modern yacht

Telescope ↔ Cannon

Gothic cathedral ↔ Hindu temple

Our algorithm enables structurally aligned outputs across different modalities (such as image, video, 3D) when integrated into generative Flow Matching pipelines (FLUX.1, Wan2.1, Trellis). Our training-free method modifies only the denoising stage of pre-trained models and yields generations that smoothly transition from one to another.

Comparison: Video

We evaluate our method using a set of prompt pairs. These pairs describe dynamic or static scenes that share structural similarity while still differing in content, e.g., "medieval market - cyberpunk market". We use these pairs to generate aligned videos with our method and with competing approaches.

A visual comparison is shown below. Each example shows the results produced by our method on the left and by one of the competing methods on the right. For each method, we show the generated pair of videos corresponding to the pair of prompts given with the example. Our method generates structurally and geometrically aligned videos, whereas competing approaches struggle to preserve this structural consistency, resulting in greater pose differences between objects.

Ours vs. MatchDiffusion: Ancient Roman market ↔ cyberpunk market

Ours vs. VACE: cooking woman ↔ cooking man

Ours vs. MatchDiffusion: man climbing on a cliff ↔ man climbing on a building

Ours vs. LucyEdit: walking dinosaur ↔ walking bird

Ours vs. MatchDiffusion: running tiger ↔ running dog

Comparison: Images

In this section, we demonstrate the superiority of our method in achieving structural alignment in the image generation task. As in the video generation task, we use a set of prompt pairs (such as "ant - crab") to assess the ability of our method to generate aligned images.

Below we show a visual comparison of our method against competing methods. The first column shows the results produced by our method, the second column shows the results produced by Qwen-Image-Edit, and the third column shows the results produced by RF-Inversion. For each method, we show the generated pair of images corresponding to the pair of prompts shown below. Our method generates pairs of images with significantly better structural alignment.

Ours

Qwen-Image-Edit

RF-Inversion

Ant ↔ Crab

Horse ↔ Horse skeleton

Gopher ↔ Kangaroo

Blending: Videos

We demonstrate the structural similarity of pairs of videos generated using our method by showing a transition from one video to another. To obtain this transition, we merge the two generated latents into one by taking their linear combination, with a blending coefficient that increases from 0 to 1 as the video progresses from the first frame to the last. As a result, the objects in the videos smoothly transform into one another, revealing their structural similarity.
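The latent blending described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function name `blend_video_latents`, the `frame_axis` parameter, and the linear ramp of the coefficient are assumptions (the text only states that the coefficient increases from 0 to 1), and the layout of the latent tensor depends on the video pipeline in use.

```python
import numpy as np


def blend_video_latents(latent_a, latent_b, frame_axis=0):
    """Blend two aligned video latents with a per-frame coefficient.

    Hypothetical helper: assumes both latents share the same shape and
    that `frame_axis` indexes time. The coefficient ramps linearly from
    0 at the first frame to 1 at the last (the ramp shape is an assumption).
    """
    latent_a = np.asarray(latent_a, dtype=np.float64)
    latent_b = np.asarray(latent_b, dtype=np.float64)
    if latent_a.shape != latent_b.shape:
        raise ValueError("latents must have the same shape")

    num_frames = latent_a.shape[frame_axis]
    alpha = np.linspace(0.0, 1.0, num_frames)

    # Reshape alpha so it broadcasts over every axis except the frame axis.
    broadcast_shape = [1] * latent_a.ndim
    broadcast_shape[frame_axis] = num_frames
    alpha = alpha.reshape(broadcast_shape)

    # Linear combination: first frame equals latent_a, last equals latent_b.
    return (1.0 - alpha) * latent_a + alpha * latent_b
```

The blended latent would then be decoded by the pipeline's own decoder; that step is omitted here because it is specific to each model.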

Flying eagle ↔ Dragonfly

Neanderthal ↔ King

Human boxer and human karatist fighting ↔ Robot boxer and human karatist fighting

Running dog animal ↔ Running dog robot

Seven zebras running in a circle ↔ Seven tigers running in a circle

Man climbing a cliff ↔ Man climbing a building

Blending: Images

We also demonstrate the structural similarity of pairs of images generated using our method by showing a blend between the two images. To obtain it, we use an alpha-blending coefficient that depends on the column index and increases from 0 to 1 from left to right.
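A column-dependent alpha blend of this kind might look as follows. This is a sketch under stated assumptions: it reads "column index" as the pixel column of the image, uses a linear ramp, and expects arrays of shape (height, width) or (height, width, channels). The function name `blend_images_by_column` is hypothetical.

```python
import numpy as np


def blend_images_by_column(image_a, image_b):
    """Alpha-blend two aligned images with a coefficient tied to the column.

    Hypothetical helper: the left-most column shows image_a, the
    right-most column shows image_b, and columns in between are mixed
    with a linearly increasing coefficient (the linear ramp is an assumption).
    """
    image_a = np.asarray(image_a, dtype=np.float64)
    image_b = np.asarray(image_b, dtype=np.float64)
    if image_a.shape != image_b.shape:
        raise ValueError("images must have the same shape")

    width = image_a.shape[1]
    alpha = np.linspace(0.0, 1.0, width)

    # Broadcast alpha along the width axis (axis 1) only.
    broadcast_shape = [1] * image_a.ndim
    broadcast_shape[1] = width
    alpha = alpha.reshape(broadcast_shape)

    return (1.0 - alpha) * image_a + alpha * image_b
```

The result is a float array; converting back to 8-bit for display (for example with rounding and clipping to 0..255) is left to the caller.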

Smooth transitions appear not only in object-centric generations such as "dog-robot" or "spider-octopus", but also in scene-centric generations where the entire scene is changed from one prompt to another (e.g., "ancient amphitheatre - modern stadium").

Running dog animal ↔ Running dog robot

Ancient amphitheatre ↔ Modern stadium

Ant ↔ Crab

Bird ↔ Dinosaur

Spider ↔ Octopus

Steampunk ↔ Cyberpunk

Horse ↔ Horse skeleton

Snake ↔ Rope

Guinea pig ↔ Pig

More image results

Here we show more pairs of images generated using our method combined with the FLUX.1 image generation pipeline.

A zebra is drinking ↔ A tiger is drinking

Comb jelly undulating in bioluminescent threads ↔ larval octopus undulating in bioluminescent threads, pigeons startled

Concert stage with a pianist and a cellist performing in counterpoint ↔ Concert stage with a harpist and a flutist performing in counterpoint

Farm lane with two tractors rolling past hay bales ↔ Farm lane with two horses pulling wagons past hay bales

Limestone cave chamber ↔ Glacial ice cave chamber

Open steppe with a sweeping herd of antelope crossing a river braid ↔ Open steppe with a sweeping herd of wild horses crossing a river braid

Rainy gutter stream with paper boats drifting past curb leaves ↔ Rainy gutter stream with wooden toy boats drifting past curb leaves

Art tabletop with ink plumes blooming in a paper marbling bath ↔ Art tabletop with paint plumes blooming in a paper marbling bath

Tundra thermals with a snowy owl hovering above frost grass ↔ Tundra thermals with a frost wyvern hovering above frost grass

More 3D generation results

Here we show more pairs of 3D objects generated using our method combined with the Trellis 3D generation pipeline.

Spider with eight articulated legs, rounded segmented body, detailed joints, poised stance ↔ Octopus with rounded central body, eight flexible tentacles, smooth mantle, dynamic spread pose

Jellyfish with dome-shaped bell, trailing tentacles, smooth translucent surface, gentle flowing pose ↔ Parachute with dome-shaped canopy, cords hanging down, smooth fabric surface, inflated structure

Snail with coiled spiral shell, soft body extended, small tentacles, glossy shell surface ↔ Hermit crab carrying coiled shell, segmented legs and claws extended, textured exoskeleton

Brass-and-wood telescope on simple tripod, polished tube, clean joints, subtle metallic reflections, smooth wooden struts ↔ Dark metal cannon on wooden carriage, smooth barrel, simple wheels, subtle surface wear, matching clean metallic reflections

Triceratops with large frilled head, three facial horns, bulky muscular body, four sturdy legs, textured skin ↔ Rhino with thick muscular body, prominent horn, sturdy legs, textured skin, grounded stance

Helicopter with central fuselage, elongated body, main rotor on top, tail rotor, landing skids ↔ Dragonfly with slender elongated body, four translucent wings spread, segmented tail, large compound eyes
