ATATA: One Algorithm to Align Them All

Abstract

We propose a new multi-modal algorithm for the joint inference of paired, structurally aligned samples with Rectified Flow models. While some existing methods propose a codependent generation process, they do not view the problem of joint generation from a structural alignment perspective. Recent work uses Score Distillation Sampling (SDS) to generate aligned 3D models, but SDS is known to be time-consuming, prone to mode collapse, and often produces cartoonish results. By contrast, our approach relies on the joint transport of a segment in the sample space, yielding faster computation at inference time. It can be built on top of an arbitrary Rectified Flow model operating on a structured latent space. We show the applicability of our method to image, video, and 3D shape generation using state-of-the-art baselines, and evaluate it against both editing-based and joint-inference-based competing approaches. We demonstrate a high degree of structural alignment for the sample pairs obtained with our method, as well as high visual quality of the samples. Our method improves on the state of the art for image and video generation pipelines. For 3D generation, it achieves comparable quality while running orders of magnitude faster.

Running dog Running tiger
Formula 1 car taking a tight corner on a wet track Kayak taking a tight corner on a mountain river
Wide valley confrontation featuring a medieval knight Wide valley confrontation featuring a hairy Neanderthal
Ancient amphitheatre Modern football stadium
Space marine WW2 soldier
Duel of two knights Duel of two samurais
Atakebun wooden ship Modern yacht
Telescope Cannon
Gothic cathedral Hindu temple

Our algorithm enables structurally aligned outputs across different modalities (such as image, video, 3D) when integrated into generative Flow Matching pipelines (FLUX.1, Wan2.1, Trellis). Our training-free method modifies only the denoising stage of pre-trained models and yields generations that smoothly transition from one to another.

Comparison: Video

We evaluate our method using a set of prompt pairs. These pairs describe dynamic or static scenes that share structural similarity while still differing in content, e.g., "medieval market - cyberpunk market". We use these pairs to generate aligned videos with our method and with competing approaches.

A visual comparison is shown below. Each example shows the results produced by our method on the left and by one of the competing methods on the right. For each method, we show the generated pair of videos corresponding to the pair of prompts shown below. Our method generates structurally and geometrically aligned videos, whereas competing approaches struggle to preserve this structural consistency, resulting in larger pose differences between objects.

Ours
MatchDiffusion
Ancient Roman market cyberpunk market
Ours
VACE
cooking woman cooking man
Ours
MatchDiffusion
man climbing on a cliff man climbing on a building
Ours
LucyEdit
walking dinosaur walking bird
Ours
MatchDiffusion
running tiger running dog

Comparison: Images

In this section, we demonstrate the superiority of our method in achieving structural alignment in the image generation task. As in the video generation task, we use a set of prompt pairs (such as "ant - crab") to assess the ability of our method to generate aligned images.

Below we show a visual comparison of our method against competing methods. The first column shows the results produced by our method, the second column shows the results produced by Qwen-Image-Edit, and the third column shows the results produced by RF-Inversion. For each method, we show the generated pair of images corresponding to the pair of prompts shown below. Our method generates pairs of images with significantly better structural alignment.

Ours
Qwen-Image-Edit
RF-Inversion
Ant Crab
Horse Horse skeleton
Gopher Kangaroo

Blending: Videos

We demonstrate the structural similarity of pairs of videos generated using our method by showing a transition from one video to the other. To obtain this transition, we merge the two generated latents into one by taking their linear combination, with a blending coefficient that increases from 0 to 1 as the video progresses from the first frame to the last. As a result, the objects in the videos smoothly transform into one another, revealing their structural similarity.
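The frame-wise latent blend described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `blend_latents_over_time`, the use of NumPy, the assumption that axis 0 of the latent indexes frames, and the linear schedule of the coefficient are all assumptions made for the example. Decoding the blended latent back into a video is left to the video model's own decoder and is not shown.

```python
import numpy as np


def blend_latents_over_time(latent_a, latent_b):
    """Linearly combine two latents with a per-frame blending coefficient.

    Hypothetical helper: assumes both latents have the same shape and that
    axis 0 indexes (latent) frames. The real latent layout may differ.
    """
    latent_a = np.asarray(latent_a, dtype=np.float32)
    latent_b = np.asarray(latent_b, dtype=np.float32)
    if latent_a.shape != latent_b.shape:
        raise ValueError("latents must have identical shapes")

    num_frames = latent_a.shape[0]
    # Coefficient is 0 at the first frame and 1 at the last frame
    # (a linear schedule is assumed here).
    alpha = np.linspace(0.0, 1.0, num_frames, dtype=np.float32)
    # Reshape so that alpha broadcasts over all non-frame axes.
    alpha = alpha.reshape((num_frames,) + (1,) * (latent_a.ndim - 1))

    return (1.0 - alpha) * latent_a + alpha * latent_b
```

With this schedule, the first frame of the result equals the first latent, the last frame equals the second latent, and the frames in between are convex combinations of the two.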

Flying eagle Dragonfly
Neanderthal King
Human boxer and human karatist fighting Robot boxer and human karatist fighting
Running dog animal Running dog robot
Seven zebras running in a circle Seven tigers running in a circle
Man climbing a cliff Man climbing a building

Blending: Images

We also demonstrate the structural similarity of pairs of images generated using our method by showing a blend between the two images. To obtain it, we use an alpha-blending coefficient that depends on the column index. This coefficient increases from 0 to 1 from left to right.

Smooth transitions appear not only in object-centric generations such as "dog-robot" or "spider-octopus", but also in scene-centric generations where the entire scene is changed from one prompt to another (e.g., "ancient amphitheatre - modern stadium").
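The column-dependent alpha blend described above can be sketched as below. This is an illustrative sketch under stated assumptions rather than the authors' code: it assumes that "column index" refers to the pixel column of the output image, that both images are arrays of shape height x width x channels, and that the coefficient grows linearly. The function name `blend_images_by_column` is hypothetical.

```python
import numpy as np


def blend_images_by_column(image_a, image_b):
    """Alpha-blend two images with a weight that grows from left to right.

    Hypothetical helper: assumes H x W x C arrays of identical shape and a
    linear schedule of the blending coefficient over pixel columns.
    """
    image_a = np.asarray(image_a, dtype=np.float32)
    image_b = np.asarray(image_b, dtype=np.float32)
    if image_a.ndim != 3 or image_a.shape != image_b.shape:
        raise ValueError("expected two H x W x C images of identical shape")

    width = image_a.shape[1]
    # Coefficient is 0 at the leftmost column and 1 at the rightmost column.
    alpha = np.linspace(0.0, 1.0, width, dtype=np.float32).reshape(1, width, 1)

    return (1.0 - alpha) * image_a + alpha * image_b
```

The leftmost column of the result is taken entirely from the first image and the rightmost column entirely from the second, so any structural misalignment between the two images would show up as ghosting in the middle columns.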

Running dog animal Running dog robot
Ancient amphitheatre Modern stadium
Ant Crab
Bird Dinosaur
Spider Octopus
Steampunk Cyberpunk
Horse Horse skeleton
Snake Rope
Guinea pig Pig

More video results

Here we show more pairs of videos generated using our method combined with the Wan2.1 video generation pipeline. They illustrate how our method generalizes to different prompts.

High-speed macro capture of a hummingbird High-speed macro capture of a dragonfly
Savanna river ford with a herd of wildebeest Arctic river ford with a herd of caribou
Leeward desert dune with sand avalanche Leeward alpine ridge with snow slough cascade
Forest trail sprint with a red fox dashing Forest trail sprint with a hare dashing
Wide desert ridge with sandboarders carving Wide alpine ridge with skiers carving smoothly
Countryside causeway with a train slicing Countryside causeway with a train slicing
Cliff road with a vintage car drifting Cliff road with a rally bike sliding
Snowfield with two sled dogs pulling a load Snowfield with two reindeer pulling a light sled
City marathon avenue with thousands of runners City marathon avenue with thousands of cyclists
Harbor festival with a sky full of kites Harbor festival with a sky full of prayer flags

More image results

Here we show more pairs of images generated using our method combined with the FLUX.1 image generation pipeline.

A zebra is drinking A tiger is drinking
Comb jelly undulating in bioluminescent threads larval octopus undulating in bioluminescent threads, pigeons startled
Concert stage with a pianist and a cellist performing in counterpoint Concert stage with a harpist and a flutist performing in counterpoint
Farm lane with two tractors rolling past hay bales Farm lane with two horses pulling wagons past hay bales
Limestone cave chamber Glacial ice cave chamber
Open steppe with a sweeping herd of antelope crossing a river braid Open steppe with a sweeping herd of wild horses crossing a river braid
Rainy gutter stream with paper boats drifting past curb leaves Rainy gutter stream with wooden toy boats drifting past curb leaves
Art tabletop with ink plumes blooming in a paper marbling bath Art tabletop with paint plumes blooming in a paper marbling bath
Tundra thermals with a snowy owl hovering above frost grass Tundra thermals with a frost wyvern hovering above frost grass

More 3D generation results

Here we show more pairs of 3D objects generated using our method combined with the Trellis 3D generation pipeline.

Spider with eight articulated legs, rounded segmented body, detailed joints, poised stance Octopus with rounded central body, eight flexible tentacles, smooth mantle, dynamic spread pose
Jellyfish with dome-shaped bell, trailing tentacles, smooth translucent surface, gentle flowing pose Parachute with dome-shaped canopy, cords hanging down, smooth fabric surface, inflated structure
Snail with coiled spiral shell, soft body extended, small tentacles, glossy shell surface Hermit crab carrying coiled shell, segmented legs and claws extended, textured exoskeleton
Brass-and-wood telescope on simple tripod, polished tube, clean joints, subtle metallic reflections, smooth wooden struts Dark metal cannon on wooden carriage, smooth barrel, simple wheels, subtle surface wear, matching clean metallic reflections
Triceratops with large frilled head, three facial horns, bulky muscular body, four sturdy legs, textured skin Rhino with thick muscular body, prominent horn, sturdy legs, textured skin, grounded stance
Helicopter with central fuselage, elongated body, main rotor on top, tail rotor, landing skids Dragonfly with slender elongated body, four translucent wings spread, segmented tail, large compound eyes