MagicComp: Training-free Dual-Phase Refinement for Compositional Video Generation

Hongyu Zhang¹*, Yufan Deng¹*, Shenghai Yuan¹, Peng Jin¹, Zesen Cheng¹, Yian Zhao¹, Chang Liu³, Jie Chen¹,²†

¹School of Electronic and Computer Engineering, Peking University, Shenzhen, China
²Peng Cheng Laboratory, Shenzhen, China
³Tsinghua University, Beijing, China
*Equal contribution
†Corresponding author

Abstract

Text-to-video (T2V) generation has made significant strides with diffusion models. However, existing methods still struggle to bind attributes accurately, determine spatial relationships, and capture complex action interactions between multiple subjects. To address these limitations, we propose MagicComp, a training-free method that enhances compositional T2V generation through dual-phase refinement. Specifically, (1) during the conditioning stage, we introduce Semantic Anchor Disambiguation (SAD) to reinforce subject-specific semantics and resolve inter-subject ambiguity by progressively injecting the directional vectors of semantic anchors into the original text embedding; (2) during the denoising stage, we propose Dynamic Layout Fusion Attention (DLFA), which integrates grounding priors and model-adaptive spatial perception to flexibly bind subjects to their spatiotemporal regions through masked attention modulation. Furthermore, MagicComp is a model-agnostic and versatile approach that can be seamlessly integrated into existing T2V architectures. Extensive experiments on T2V-CompBench and VBench demonstrate that MagicComp outperforms state-of-the-art methods, highlighting its potential for applications such as complex prompt-based and trajectory-controllable video generation.

Key Characteristics

Training-free: Requires no parameter updates or gradient optimization.
Model-agnostic: Adapts seamlessly to both DiT-based and UNet-based frameworks.
Fantastic Performance: Obtains a 15.2% performance gain with CogVideoX on T2V-CompBench.
Minimal Additional Inference Cost: Adds only 16% to inference time.

Motivation

The primary challenge in compositional T2V (e.g., semantic confusion, missing subjects) lies in addressing two interrelated objectives:
(1) Preventing semantic ambiguity and leakage among multiple subjects.
(2) Ensuring precise attribute–location binding for each subject.
Achieving both objectives is challenging, especially when the model is frozen. A natural approach, therefore, is to decompose these objectives into distinct phases during sampling, enabling progressive refinement at each stage.

[Video comparison: CogVideoX vs. MagicComp]
CogVideoX failure cases shown: semantic confusion, misaligned spatial relationship, missing entity.

Prompts:
A sculpture displayed behind a candle
A cat sitting on the right of a fireplace
A lion sitting behind a chicken

Method

MagicComp resolves inter-subject semantic ambiguity and achieves spatio-temporal binding between subjects and their textual prompts through sequential refinement.
During conditioning: We introduce Semantic Anchor Disambiguation (SAD) to reinforce subject-specific semantics and resolve inter-subject ambiguity by injecting the directional vectors of semantic anchors into the original text embedding.
During denoising: We propose Dynamic Layout Fusion Attention (DLFA), which integrates grounding priors and model-adaptive spatial perception to flexibly bind subjects to their spatiotemporal regions through masked attention modulation.
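The two phases above can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the function names (`inject_anchor_direction`, `fuse_masks`, `masked_attention_weights`), the parameters (`max_scale`, `rel_threshold`, `boost`), the linear injection schedule, the union-based mask fusion, and the additive attention bias are all assumptions chosen only to make the ideas concrete.

```python
import math


def inject_anchor_direction(token_emb, anchor_emb, step, total_steps, max_scale=0.5):
    """Shift one subject's token embedding toward its semantic anchor.

    Assumption: the injection strength grows linearly with the step index,
    as one possible reading of "progressively injecting".
    """
    direction = [a - t for a, t in zip(anchor_emb, token_emb)]
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0.0:
        return list(token_emb)  # already at the anchor, nothing to inject
    scale = max_scale * (step + 1) / total_steps
    return [t + scale * d / norm for t, d in zip(token_emb, direction)]


def fuse_masks(prior_mask, attn_map, rel_threshold=0.5):
    """Combine a grounding-prior mask with a mask derived from attention.

    Assumption: the adaptive mask keeps positions whose attention is at least
    rel_threshold times the peak value, and fusion is a simple union.
    """
    peak = max(max(row) for row in attn_map)
    return [
        [1.0 if p > 0 or a >= rel_threshold * peak else 0.0
         for p, a in zip(prior_row, attn_row)]
        for prior_row, attn_row in zip(prior_mask, attn_map)
    ]


def masked_attention_weights(logits, mask, boost=2.0):
    """Softmax over attention logits after adding a bias inside masked regions.

    logits[i][j] is the raw score between visual query i and text token j;
    mask[i][j] is 1.0 where query i lies in the region assigned to token j.
    """
    weights = []
    for logit_row, mask_row in zip(logits, mask):
        biased = [l + boost * m for l, m in zip(logit_row, mask_row)]
        top = max(biased)  # subtract the max for numerical stability
        exps = [math.exp(b - top) for b in biased]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Usage: nudge a subject token toward its anchor, then bias attention
# so that the first query favours the token bound to its region.
shifted = inject_anchor_direction([0.0, 0.0], [1.0, 0.0], step=4, total_steps=10)
mask = fuse_masks([[1.0, 0.0], [0.0, 0.0]], [[0.1, 0.9], [0.2, 0.3]])
weights = masked_attention_weights([[0.0, 0.0], [0.0, 0.0]], mask)
```

In a real T2V model the embeddings and attention maps would be tensors over frames, spatial positions, and tokens; the nested lists here only keep the sketch readable and runnable without extra libraries.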

Qualitative Comparison

Methods compared in each example, in order: Open-Sora-Plan, VideoTetris, Vico, CogVideoX-2B, MagicComp

Prompts:
Man in a black suit driving a red sports car
A book on the left of a bird
A balloon drifts right to left above a statue in a city square
A kid and a penguin watch a movie in the cinema
Bear journalist interviews a celebrity
Six children talking and three crystal balls on a table

More Qualitative Comparison

Methods compared in each example, in order: Open-Sora-Plan, VideoTetris, Vico, CogVideoX-2B, MagicComp

Prompts:
Big hearts and small stars floating upwards
Oblong canoe gliding past a circular buoy
Star-shaped cookie resting on a round coaster
Green tractor plowing near a white farmhouse
A girl walking on the right of an elephant
A photographer setting up a tripod on the left of a tree
A sheep walks on the grass as a hot air balloon floats overhead
A boat sails to the right on the ocean

Ablation Experiments

SAD ablation (w/o SAD vs. w/ SAD), prompt: A gray cat and a brown dog playing together in a sunny backyard

DLFA ablation (w/ 3D Region Attention vs. w/ DLFA), prompt: A sculpture displayed behind a candle

Complex Scene

Methods compared, in order: VideoTetris, Vico, CogVideoX-2B, MagicComp

A skeleton pirate steering a ghost ship through a dark sea, with glowing lanterns lighting the way. A treasure map flutters in the wind.

Adapting MagicComp to VideoCrafter2

Methods compared, in order: VideoCrafter2, MagicComp

Prompts:
A sculpture displayed behind a candle
A cat sitting on the right of a fireplace
A lion sitting behind a chicken

BibTeX

@misc{zhang2025magiccomptrainingfreedualphaserefinement,
  title={MagicComp: Training-free Dual-Phase Refinement for Compositional Video Generation},
  author={Hongyu Zhang and Yufan Deng and Shenghai Yuan and Peng Jin and Zesen Cheng and Yian Zhao and Chang Liu and Jie Chen},
  year={2025},
  eprint={2503.14428},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2503.14428},
}