MagicComp: Training-free Dual-Phase Refinement for Compositional Video Generation

1School of Electronic and Computer Engineering, Peking University, Shenzhen, China
2Peng Cheng Laboratory, Shenzhen, China
3Tsinghua University, Beijing, China
*Equal contribution
Corresponding author

Abstract

Text-to-video (T2V) generation has made significant strides with diffusion models. However, existing methods still struggle to bind attributes accurately, determine spatial relationships, and capture complex action interactions between multiple subjects. To address these limitations, we propose MagicComp, a training-free method that enhances compositional T2V generation through dual-phase refinement. Specifically, (1) during the conditioning stage, we introduce Semantic Anchor Disambiguation (SAD), which reinforces subject-specific semantics and resolves inter-subject ambiguity by progressively injecting the directional vectors of semantic anchors into the original text embedding; (2) during the denoising stage, we propose Dynamic Layout Fusion Attention (DLFA), which integrates grounding priors and model-adaptive spatial perception to flexibly bind subjects to their spatiotemporal regions through masked attention modulation. Furthermore, MagicComp is model-agnostic and versatile, and can be seamlessly integrated into existing T2V architectures. Extensive experiments on T2V-CompBench and VBench demonstrate that MagicComp outperforms state-of-the-art methods, highlighting its potential for applications such as complex prompt-based and trajectory-controllable video generation.
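The abstract describes SAD only at a high level, so the following is a minimal sketch of the idea rather than the authors' implementation. It assumes that each subject has a precomputed "anchor" embedding, that the directional vector is the difference between that anchor and the subject's tokens in the prompt embedding, and that "progressively" means a linear ramp over steps. The function name, its parameters, and the schedule are all hypothetical.

```python
import numpy as np

def inject_semantic_anchors(text_emb, subject_spans, anchors,
                            step, total_steps, max_strength=0.5):
    """Shift each subject's prompt tokens toward its semantic anchor.

    text_emb:      (num_tokens, dim) prompt embedding
    subject_spans: dict name -> (start, end) token range of that subject
    anchors:       dict name -> (dim,) anchor embedding (assumed to be given)
    """
    out = text_emb.copy()
    # Assumed schedule: injection strength grows linearly with the step index.
    strength = max_strength * (step + 1) / total_steps
    for name, (start, end) in subject_spans.items():
        # Directional vector from the current subject tokens to the anchor;
        # the (dim,) anchor broadcasts over every token in the span.
        direction = anchors[name] - text_emb[start:end]
        out[start:end] = text_emb[start:end] + strength * direction
    return out
```

With `max_strength=1.0` the subject tokens reach the anchor exactly at the last step; smaller values keep part of the original prompt context. Tokens outside the subject spans are left unchanged.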

Key Characteristics

Figure: key characteristics of MagicComp.

Motivation

CogVideoX
Semantic confusion
Misaligned spatial relationship
Missing entity
MagicComp

A sculpture displayed behind a candle

A cat sitting on the right of a fireplace

A lion sitting behind a chicken

Method

Figure: overview of the MagicComp method.
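As an illustration of the masked attention modulation that the abstract attributes to DLFA, here is a minimal sketch, not the authors' implementation. It assumes cross-attention between flattened video tokens (frames × height × width) and text tokens, a boolean layout mask per subject standing in for the grounding prior, and a simple quantile threshold on the model's own attention standing in for model-adaptive spatial perception. The function names, the `bias` and `quantile` parameters, and the fusion rule (logical OR) are all hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layout_fused_attention(logits, prior_masks, subject_tokens,
                           bias=2.0, quantile=0.8):
    """Modulate cross-attention logits with a fused layout mask.

    logits:         (num_video_tokens, num_text_tokens) pre-softmax scores
    prior_masks:    dict name -> bool array (num_video_tokens,), a layout
                    flattened from (frames, height, width)
    subject_tokens: dict name -> list of text-token column indices
    """
    probs = softmax(logits, axis=-1)
    out = logits.copy()
    for name, cols in subject_tokens.items():
        # Model-side estimate: video tokens that already attend most
        # strongly to this subject's text tokens.
        own = probs[:, cols].mean(axis=1)
        adaptive = own > np.quantile(own, quantile)
        # Assumed fusion rule: union of the prior layout and the model's estimate.
        fused = prior_masks[name] | adaptive
        # Raise the subject's logits inside the fused region, lower them outside.
        signed = np.where(fused, bias, -bias)
        out[:, cols] += signed[:, None]
    return softmax(out, axis=-1)
```

Because the modulation is an additive bias on the logits, it needs no retraining and leaves the attention rows normalized, which is consistent with the training-free framing of the method.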

Qualitative Comparison

Methods compared for each prompt: Open-Sora-Plan, VideoTetris, Vico, CogVideoX-2B, MagicComp

Man in a black suit driving a red sports car
A book on the left of a bird
A balloon drifts right to left above a statue in a city square
A kid and a penguin watch a movie in the cinema
Bear journalist interviews a celebrity
Six children talking and three crystal balls on a table

More Qualitative Comparison

Methods compared for each prompt: Open-Sora-Plan, VideoTetris, Vico, CogVideoX-2B, MagicComp

Big hearts and small stars floating upwards
Oblong canoe gliding past a circular buoy
Star-shaped cookie resting on a round coaster
Green tractor plowing near a white farmhouse
A girl walking on the right of an elephant
A photographer setting up a tripod on the left of a tree
A sheep walks on the grass as a hot air balloon floats overhead
A boat sails to the right on the ocean

Ablation Experiments

w/o SAD
w/ SAD
A gray cat and a brown dog playing together in a sunny backyard
w/ 3D Region Attention
w/ DLFA
A sculpture displayed behind a candle

Complex Scene

VideoTetris
Vico
CogVideoX-2B
MagicComp
A skeleton pirate steering a ghost ship through a dark sea, with glowing lanterns lighting the way. A treasure map flutters in the wind.

Adapting MagicComp to VideoCrafter2

VideoCrafter2
MagicComp

A sculpture displayed behind a candle

A cat sitting on the right of a fireplace

A lion sitting behind a chicken

BibTeX

@misc{zhang2025magiccomptrainingfreedualphaserefinement,
  title={MagicComp: Training-free Dual-Phase Refinement for Compositional Video Generation},
  author={Hongyu Zhang and Yufan Deng and Shenghai Yuan and Peng Jin and Zesen Cheng and Yian Zhao and Chang Liu and Jie Chen},
  year={2025},
  eprint={2503.14428},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2503.14428},
}