PoCo

¹Wuhan University, ²University of Science and Technology of China,
³Hong Kong University of Science and Technology, ⁴Tsinghua University
† Corresponding author
CVPR 2026 · arXiv · GitHub Code

🎯 Abstract

Do you want to generate a short film with multi-reference and multi-shot control?

Recent proprietary models such as Sora‑2 demonstrate promising progress in generating multi‑shot videos conditioned on multiple reference characters. However, academic research on this problem remains limited. We study this task and identify a core challenge: when reference images exhibit highly similar appearances, the model often suffers from reference confusion, where semantically similar tokens degrade the model’s ability to retrieve the correct context. To address this, we introduce PoCo (Position Embedding as a Context Controller), which incorporates position encoding as additional context control beyond semantic retrieval. By employing side information of tokens, PoCo enables precise token‑level matching while preserving implicit semantic consistency modeling. Building on PoCo, we develop a multi‑reference and multi‑shot video generation model capable of accurately controlling characters with extremely similar visual traits. Extensive experiments demonstrate that PoCo improves cross‑shot consistency and reference fidelity compared with various baselines.

Multi-Reference Multi-Shot Video Generation Demos
Demo 1 Reference 1 Demo 1 Reference 2
Demo 2 Reference 1 Demo 2 Reference 2
Demo 3 Reference 1 Demo 3 Reference 2
Demo 4 Reference 1 Demo 4 Reference 2

💡Motivation

Why does reference confusion still occur even when the identities are visually distinguishable? The core issue is that text prompts are often too coarse to encode fine-grained identity cues under strong semantic overlap, causing attention to retrieve the wrong reference context.

Motivation

Comparison of different strategies for multi-reference, multi-shot video generation. (A) Independent single-shot reference-to-video generation produces each shot separately, leading to inconsistent backgrounds and appearance details across shots. (B) Joint multi-shot reference-to-video generation improves global coherence, but without explicit side information, the model may associate a shot with the wrong reference, causing identity confusion. (C) Our PoCo with SideInfo-RoPE enables accurate shot-reference association, yielding consistent identity and background across shots. Right: spatially averaged self-attention over concatenated reference and shot tokens. \(\mathrm{AttnScore}(\mathrm{Shot}_i \rightarrow \mathrm{Ref}_j)\) denotes the mean attention from \(\mathrm{Shot}_i\) to \(\mathrm{Ref}_j\).

🧠 SideInfo-RoPE

Method Highlight

From RoPE and 3D-RoPE to SideInfo-RoPE

A cleaner view of the method: the left column shows the original RoPE and 3D-RoPE setup, and the right column highlights the single conceptual change that matters for PoCo: adding the side-information axis \(s\).

Original

RoPE / 3D-RoPE

Preliminaries first, then the extension from 1D RoPE to the 3D spatial-temporal case.

RoPE Preliminary

RoPE injects relative position directly into the attention score via rotation.

\[ q_m^\top R_{\Delta_{m,n}}(\theta) k_n \]
\[ R_{\Delta_{m,n}} = \bigoplus_{i=1}^{D/2} R^{(i)}_{\Delta_{m,n}} \]
\[ R^{(i)}_{\Delta_{m,n}}(\theta)= \begin{bmatrix} \cos(\omega_i \cdot \Delta_{m,n}) & -\sin(\omega_i \cdot \Delta_{m,n}) \\ \sin(\omega_i \cdot \Delta_{m,n}) & \cos(\omega_i \cdot \Delta_{m,n}) \end{bmatrix} \]
\[ \omega_i = \theta^{-2i/D} \]
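The rotation above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation; the vector size and variable names are ours. The key property it demonstrates is that the rotated inner product depends only on the relative offset \(m-n\).

```python
import numpy as np

def rope_rotate(x: np.ndarray, pos: int, theta: float = 10000.0) -> np.ndarray:
    """Rotate a D-dim vector blockwise by position-dependent angles (1D RoPE)."""
    D = x.shape[0]
    out = x.copy()
    for i in range(D // 2):
        omega = theta ** (-2 * i / D)        # omega_i = theta^{-2i/D}
        angle = omega * pos
        c, s = np.cos(angle), np.sin(angle)
        a, b = x[2 * i], x[2 * i + 1]
        out[2 * i] = c * a - s * b           # 2x2 rotation of each channel pair
        out[2 * i + 1] = s * a + c * b
    return out

# Relative-position property: <R_m q, R_n k> depends only on m - n.
rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)
s1 = rope_rotate(q, 5) @ rope_rotate(k, 2)   # offset 3
s2 = rope_rotate(q, 9) @ rope_rotate(k, 6)   # offset 3 again
assert np.isclose(s1, s2)
```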
3D-RoPE

First define 3D relative displacement over temporal and spatial axes.

\[ \Delta^{p}_{m,n} = \left(\Delta^{t}_{m,n}, \Delta^{h}_{m,n}, \Delta^{w}_{m,n}\right) \]

Then apply blockwise rotations in the allocated t, h, and w subspaces.

\[ \begin{aligned} R^{p}_{\Delta_{m,n}} &= \Bigl\{\bigoplus_{i=1}^{D_t/2} R^{(i)}_{\Delta^t_{m,n}}\Bigr\} \\ &\quad \oplus \Bigl\{\bigoplus_{i=1+\frac{D_t}{2}}^{\frac{D_t + D_h}{2}} R^{(i)}_{\Delta^h_{m,n}}\Bigr\} \\ &\quad \oplus \Bigl\{\bigoplus_{i=\frac{D_t + D_h}{2}+1}^{D/2} R^{(i)}_{\Delta^w_{m,n}}\Bigr\} \end{aligned} \]

Only the t, h, and w axes are modeled here, so reference identity is still resolved mainly through semantic similarity.
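A hedged sketch of the 3D extension above: channel pairs are partitioned into \(t\), \(h\), and \(w\) blocks and each block is rotated by its own axis position. The per-axis frequency schedule and the channel split `dims` are illustrative assumptions, not the authors' exact allocation.

```python
import numpy as np

def rope_3d(x, pos_t, pos_h, pos_w, dims=(4, 2, 2), theta=10000.0):
    """Blockwise 3D RoPE: the first dims[0] channels rotate with t, then h, then w."""
    assert sum(dims) == x.shape[0]
    out = x.copy()
    offset = 0
    for axis_dim, pos in zip(dims, (pos_t, pos_h, pos_w)):
        for i in range(axis_dim // 2):
            omega = theta ** (-2 * i / axis_dim)   # assumed per-axis frequency schedule
            angle = omega * pos
            c, s = np.cos(angle), np.sin(angle)
            j = offset + 2 * i
            a, b = x[j], x[j + 1]
            out[j] = c * a - s * b
            out[j + 1] = s * a + c * b
        offset += axis_dim
    return out

# Scores depend only on the per-axis offsets (Delta_t, Delta_h, Delta_w).
rng = np.random.default_rng(1)
q, k = rng.normal(size=8), rng.normal(size=8)
s1 = rope_3d(q, 3, 1, 0) @ rope_3d(k, 1, 0, 0)   # offsets (2, 1, 0)
s2 = rope_3d(q, 5, 2, 4) @ rope_3d(k, 3, 1, 4)   # same offsets (2, 1, 0)
assert np.isclose(s1, s2)
```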

Ours

SideInfo-RoPE

Sec. 3.2: augment the original 3D-RoPE axes with a side-information coordinate s.

Side Information Distance

Define side-information mismatch between tokens as a binary vector distance.

\[ \Delta^{s}_{m,n} = \left| s(x_m) - s(x_n) \right| \in \{0,1\}^{K} \]

Here \(s(x) \in \{0,1\}^K\) is a K-dimensional side-information vector; \(s_i(x)=1\) means reference \(i\) (for example, @Character1 or @Character2) is active for the current token/shot context.
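Concretely, the side-information distance is just an elementwise absolute difference of binary vectors. A tiny sketch with \(K=2\) references (the vectors and shot assignments below are illustrative):

```python
import numpy as np

# Binary side-information vectors, K = 2 references (@Character1, @Character2).
s_ref1  = np.array([1, 0])   # reference tokens of @Character1
s_shot1 = np.array([1, 0])   # shot tokens tagged "Shot1 -> @Character1"
s_shot2 = np.array([0, 1])   # shot tokens tagged "Shot2 -> @Character2"

delta_match    = np.abs(s_shot1 - s_ref1)   # [0, 0]: same side info, no phase offset
delta_mismatch = np.abs(s_shot2 - s_ref1)   # [1, 1]: mismatch on both coordinates
```

A zero distance leaves the side-information rotation blocks as the identity, while a nonzero distance introduces a phase offset between the shot and the wrong reference.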

SideInfo-RoPE Rotation

Extend p=(t,h,w) to p*=(t,h,w,s) and add the side-information rotation blocks.

\[ \begin{aligned} R^{p^\ast}_{\Delta_{m,n}} &= \Bigl\{\bigoplus_{i=1}^{D_t/2} R^{(i)}_{\Delta^t_{m,n}}\Bigr\} \\ &\quad \oplus \Bigl\{\bigoplus_{i=1}^{D_s/2} \hat{R}^{(i)}_{\Delta^s_{m,n}}\Bigr\} \\ &\quad \oplus \Bigl\{\bigoplus_{i=1+\frac{D_t + D_s}{2}}^{\frac{D_t + D_s + D_h}{2}} R^{(i)}_{\Delta^h_{m,n}}\Bigr\} \\ &\quad \oplus \Bigl\{\bigoplus_{i=\frac{D_t + D_s + D_h}{2}+1}^{D/2} R^{(i)}_{\Delta^w_{m,n}}\Bigr\} \end{aligned} \]
\[ \hat{R}^{(i)}_{\Delta^s_{m,n}} = \begin{bmatrix} \cos\!\bigl(\phi_i \cdot \Delta^{s}_{m,n}(i)\bigr) & -\sin\!\bigl(\phi_i \cdot \Delta^{s}_{m,n}(i)\bigr) \\ \sin\!\bigl(\phi_i \cdot \Delta^{s}_{m,n}(i)\bigr) & \cos\!\bigl(\phi_i \cdot \Delta^{s}_{m,n}(i)\bigr) \end{bmatrix} \]
\[ \phi_i = \frac{2\pi i - \pi}{K} \]
The visual point is simple: we do not replace RoPE, we extend it with one more axis. Tokens sharing the same side information remain phase-aligned, while mismatched references receive phase offsets that suppress cross-reference confusion.
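Putting the pieces together, here is a hedged end-to-end sketch of a SideInfo-RoPE rotation over the \((t, s, h, w)\) channel blocks. The channel split `dims`, the per-axis frequency schedule, and the one-pair-per-reference mapping of \(\phi_i\) to side-info coordinates are our assumptions for illustration; only the \(\phi_i\) formula is taken from the equations above. Because each 2x2 rotation is orthogonal, tokens sharing the same side information keep their original attention score, while mismatched side information injects a phase offset.

```python
import numpy as np

def rotate_pair(out, j, angle, x):
    """Apply a 2x2 rotation by `angle` to channel pair (j, j+1), reading from x."""
    c, s = np.cos(angle), np.sin(angle)
    a, b = x[j], x[j + 1]
    out[j] = c * a - s * b
    out[j + 1] = s * a + c * b

def sideinfo_rope(x, pos, s_vec, dims=(4, 4, 2, 2), theta=10000.0):
    """SideInfo-RoPE sketch over blocks (t | s | h | w); pos = (t, h, w)."""
    D_t, D_s, D_h, D_w = dims
    K = len(s_vec)
    assert D_s // 2 == K            # assumption: one channel pair per reference
    out = x.copy()
    for i in range(D_t // 2):       # temporal block
        rotate_pair(out, 2 * i, theta ** (-2 * i / D_t) * pos[0], x)
    for i in range(D_s // 2):       # side-info block: phi_i = (2*pi*i - pi) / K
        phi = (2 * np.pi * (i + 1) - np.pi) / K
        rotate_pair(out, D_t + 2 * i, phi * s_vec[i], x)
    for i in range(D_h // 2):       # height block
        rotate_pair(out, D_t + D_s + 2 * i, theta ** (-2 * i / D_h) * pos[1], x)
    for i in range(D_w // 2):       # width block
        rotate_pair(out, D_t + D_s + D_h + 2 * i, theta ** (-2 * i / D_w) * pos[2], x)
    return out

rng = np.random.default_rng(2)
q, k = rng.normal(size=12), rng.normal(size=12)
# Matching side info: rotations cancel, the plain score q @ k is preserved.
same = sideinfo_rope(q, (0, 0, 0), np.array([1, 0])) @ sideinfo_rope(k, (0, 0, 0), np.array([1, 0]))
# Mismatched side info: the s-block phase offset perturbs the score.
diff = sideinfo_rope(q, (0, 0, 0), np.array([1, 0])) @ sideinfo_rope(k, (0, 0, 0), np.array([0, 1]))
assert np.isclose(same, q @ k)
```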

✨Pipeline

Pipeline

We propose a multi-reference, multi-shot video generation model conditioned on reference images and per-shot captions.
(a) The overall architecture integrates reference images, shot captions, and latent video features through VAE and MultiShot-DiT blocks. Each block contains Hierarchical Cross-Attention (b) and Self-Attention with SideInfo-RoPE (c).
(b) The hierarchical mask allows reference tokens to attend to all captions, while video tokens in each shot attend only to their corresponding text segment.
(c) SideInfo-RoPE assigns reference-specific phase codes in the rotary embedding space, so that temporally aligned shots inherit the corresponding phase patterns. Colored planes denote active rotations, while gray planes denote unrotated ones.
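The hierarchical mask in (b) can be sketched as a boolean query-by-key matrix. This is an illustrative reconstruction from the figure description, not the released code; token counts and the row/column layout are assumptions.

```python
import numpy as np

def hierarchical_mask(n_ref, shot_video_lens, shot_text_lens):
    """Boolean cross-attention mask (query rows x caption-token columns).

    Rows: n_ref reference tokens, then video tokens shot by shot.
    Cols: caption tokens, concatenated shot by shot.
    Reference tokens attend to all captions; shot-i video tokens
    attend only to the caption segment of shot i.
    """
    n_text = sum(shot_text_lens)
    n_query = n_ref + sum(shot_video_lens)
    mask = np.zeros((n_query, n_text), dtype=bool)
    mask[:n_ref, :] = True                       # references see every caption
    row, col = n_ref, 0
    for v_len, t_len in zip(shot_video_lens, shot_text_lens):
        mask[row:row + v_len, col:col + t_len] = True
        row += v_len
        col += t_len
    return mask

# Toy sizes: 2 reference tokens, two shots of 3 video / 2 text tokens each.
m = hierarchical_mask(n_ref=2, shot_video_lens=[3, 3], shot_text_lens=[2, 2])
```

A mask like this can be passed to a standard masked-attention implementation so that each shot's video tokens are conditioned only on their own caption.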

🧩 Hard Cases: Identity Exchange Control
Under Semantic Overlap

Gallery

When references share highly overlapping semantic attributes, text prompts alone are often too coarse to encode fine-grained identity cues. Here we explicitly specify shot-level control: @Character1 denotes the top reference, and @Character2 denotes the bottom reference.

@Character1 Demo 5 Reference 1
@Character2 Demo 5 Reference 2
A
Shot1 → @Character1
Shot2 → @Character2
Shot3 → @Character1
B
Shot1 → @Character2
Shot2 → @Character1
Shot3 → @Character2
@Character1 Demo 6 Reference 1
@Character2 Demo 6 Reference 2
A
Shot1 → @Character1
Shot2 → @Character2
Shot3 → @Character1
B
Shot1 → @Character2
Shot2 → @Character1
Shot3 → @Character2
Ethics Concerns
The videos in these demos are generated by models and are intended solely to showcase the capabilities of this research. If you have any concerns, please contact us at luyuningx@gmail.com, and we will promptly remove them.