SAGE: Structure-Aware Generative Video Transitions between Diverse Clips

Mia Kan1, Yilin Liu1,2, Niloy Mitra1,3
1University College London 2Autodesk Research 3Adobe Research


Videos on this page may take a moment to load.



Motivating examples



Abstract

Video transitions aim to synthesize intermediate frames between two clips, but naïve approaches such as linear blending introduce artifacts and break temporal coherence, limiting professional use. Traditional techniques (cross-fades, morphing, frame interpolation) and recent generative inbetweening methods can produce high-quality, plausible intermediates, but they struggle to bridge diverse clips with large temporal gaps or significant semantic differences, leaving a gap for content-aware, visually coherent transitions. We address this challenge by drawing on artistic workflows, distilling strategies such as aligning silhouettes and interpolating salient features to preserve structure and perceptual continuity. Building on this, we propose SAGE (Structure-Aware Generative vidEo transitions), a zero-shot approach that combines structural guidance, provided via line maps and motion flow, with generative synthesis, enabling smooth, semantically consistent transitions without fine-tuning. Extensive experiments and comparisons with current alternatives, namely FILM, TVG, DiffMorpher, VACE, and GI, demonstrate that SAGE outperforms both classical and generative baselines on quantitative metrics and in user studies for producing transitions between diverse clips. Code will be released upon acceptance.



SAGE: Structure-Aware Generative vidEo transitions. Given two diverse clips (top), Clip A and Clip B, prior approaches (bottom) (e.g., crossfade, DiffMorpher, Generative Inbetweening) often suffer from ghosting, structural collapse, or flicker. SAGE extracts structural lines and motion cues and propagates them via B-spline trajectories to produce structural guidance; this guidance then conditions a pretrained diffusion model to synthesize temporally smooth, motion-coherent transitions (middle) in a zero-shot setting.
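For readers who want the method at a glance, below is a minimal sketch of the pipeline described in the caption. Every helper here (detect_lines, estimate_flow, match_lines, fit_bspline, render_guidance, diffusion_synthesize) is a hypothetical placeholder for the corresponding stage, not the released implementation.

```python
# Minimal sketch of the SAGE pipeline described above. All helpers are
# hypothetical placeholders, not the released implementation.

def sage_transition(clip_a, clip_b, num_frames):
    # 1. Detect structural lines on the boundary frames of each clip.
    lines_a = detect_lines(clip_a[-1])
    lines_b = detect_lines(clip_b[0])

    # 2. Estimate motion cues (optical flow) inside each source clip.
    flow_a = estimate_flow(clip_a[-2], clip_a[-1])
    flow_b = estimate_flow(clip_b[0], clip_b[1])

    # 3. Match lines across the clips and fit B-spline trajectories that
    #    respect the outgoing motion of A and the incoming motion of B.
    pairs = match_lines(lines_a, lines_b)
    trajectories = [fit_bspline(la, lb, flow_a, flow_b) for la, lb in pairs]

    # 4. Rasterize the interpolated lines into per-frame guidance maps.
    guidance = [render_guidance(trajectories, t / (num_frames - 1))
                for t in range(num_frames)]

    # 5. Condition a pretrained diffusion model on the guidance to
    #    synthesize the transition frames (zero-shot, no fine-tuning).
    return diffusion_synthesize(clip_a, clip_b, guidance)
```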


Results

Scroll left/right to browse the examples.

  • Horse-dog

  • Helicopter-boat

  • Cab-train

  • Handkerchief-cruise

  • Surfer-speedboat

  • Castle-palace

  • Candy-cloud

  • Fish-giraffe

  • Skatepark-biker

Comparisons

Hover over a video to play all videos in that row; move the mouse away to pause.

Input – Clip A | Input – Clip B | Ours | FILM ['22] | TVG ['24] | DiffMorpher ['24] | VACE ['25] | GI ['25]
Cab-train
Helicopter-boat
Fish-giraffe
Skatepark-biker
Candy-cloud
Handkerchief-cruise


SAGE (ours) vs. SEINE

We selected two challenging examples from the SEINE project page in which the start and end frames differ significantly. To create comparable inputs, we generated short video clips around these frames using an image-to-video model and then applied SAGE to the resulting clip pairs. The results below show that SAGE produces smoother, structurally consistent transitions, whereas SEINE often struggles with motion coherence and structural stability in such diverse settings.



Line interpolation: applying motion to detected structures with optical flow

Hover over a video to play it; move the mouse away to pause. A code sketch of the flow-based line advection follows the table.

Input – Clip A | Flow | Interpolation | Flow | Input – Clip B

Cab-train
Helicopter-boat
Surfer-speedboat
Horse-dog
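To make the flow-based advection step concrete, here is a minimal runnable sketch that moves detected line endpoints along a dense flow field. It assumes lines are stored as an (N, 2, 2) array of (x, y) endpoints and the flow as an (H, W, 2) array; the bilinear sampler and the data layout are our assumptions, not the paper's released code.

```python
import numpy as np

def sample_flow(flow, pts):
    """Bilinearly sample a dense flow field (H, W, 2) at float (x, y) points."""
    h, w = flow.shape[:2]
    x = np.clip(pts[:, 0], 0, w - 1.001)
    y = np.clip(pts[:, 1], 0, h - 1.001)
    x0, y0 = np.floor(x).astype(int), np.floor(y).astype(int)
    fx, fy = (x - x0)[:, None], (y - y0)[:, None]
    # Blend the four neighbouring flow vectors.
    return (flow[y0, x0] * (1 - fx) * (1 - fy) +
            flow[y0, x0 + 1] * fx * (1 - fy) +
            flow[y0 + 1, x0] * (1 - fx) * fy +
            flow[y0 + 1, x0 + 1] * fx * fy)

def advect_lines(segments, flow):
    """Move line segments, an (N, 2, 2) array of (x, y) endpoints, along the flow."""
    pts = segments.reshape(-1, 2).astype(float)
    return (pts + sample_flow(flow, pts)).reshape(-1, 2, 2)
```

Applying advect_lines once per frame with that frame's flow propagates the structures over time, which is the kind of propagation the Interpolation column visualizes.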


Ablation study

Note that when the motions in the two source clips are already well aligned, simple linear blending of matched structures performs comparably to our B-spline approach. However, for more diverse clip pairs—where motion trajectories differ in direction, scale, or continuity—B-spline interpolation produces smoother global paths and more natural motion inbetweening, avoiding the abrupt shifts and trajectory crossings that arise with linear blending.
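As a toy illustration of the difference, the sketch below moves one matched point from clip A to clip B first by linear blending and then along a cubic B-spline whose interior control points are nudged by the clips' motion directions. The control-point construction is illustrative only, not the paper's exact formulation.

```python
import numpy as np
from scipy.interpolate import make_interp_spline

# Endpoint positions of one matched structure in clip A and clip B, plus
# illustrative motion cues: outgoing velocity of A, incoming velocity of B.
p_a, p_b = np.array([20.0, 80.0]), np.array([220.0, 40.0])
v_a, v_b = np.array([15.0, -5.0]), np.array([10.0, 8.0])

t = np.linspace(0.0, 1.0, 16)  # 16 transition frames

# Linear blending: a straight path whose direction jumps at both endpoints.
linear_path = (1 - t)[:, None] * p_a + t[:, None] * p_b

# Cubic B-spline through motion-aware control points: one smooth path that
# eases out of clip A's motion and into clip B's motion.
controls = np.stack([p_a, p_a + v_a, p_b - v_b, p_b])
spline = make_interp_spline([0.0, 1 / 3, 2 / 3, 1.0], controls, k=3)
spline_path = spline(t)
```

Sampling both paths at the same timestamps makes the contrast visible: the linear path changes direction abruptly where the clips' motions meet it, while the spline bends gradually between the two motion directions.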



* Note: In the top row, the detected lines are shown only during the transition frames.

  • w/o structural guidance

    w/o layer-aware matching

    w/o motion-aware trajectories

    ours

  • w/o structural guidance

    w/o layer-aware matching

    w/o motion-aware trajectories

    ours

  • w/o structural guidance

    w/o layer-aware matching

    w/o motion-aware trajectories

    ours

  • w/o structural guidance

    w/o layer-aware matching

    w/o motion-aware trajectories

    ours