OneStory: Coherent Multi-Shot Video Generation with Adaptive Memory
Abstract
OneStory generates coherent multi-shot videos by modeling global cross-shot context through a Frame Selection module and an Adaptive Conditioner, leveraging pretrained image-to-video models and a curated dataset.
Storytelling in real-world videos often unfolds through multiple shots -- discontinuous yet semantically connected clips that together convey a coherent narrative. However, existing multi-shot video generation (MSV) methods struggle to effectively model long-range cross-shot context, as they rely on limited temporal windows or single-keyframe conditioning, leading to degraded performance under complex narratives. In this work, we propose OneStory, enabling global yet compact cross-shot context modeling for consistent and scalable narrative generation. OneStory reformulates MSV as a next-shot generation task, enabling autoregressive shot synthesis while leveraging pretrained image-to-video (I2V) models for strong visual conditioning. We introduce two key modules: a Frame Selection module that constructs a semantically relevant global memory from informative frames of prior shots, and an Adaptive Conditioner that performs importance-guided patchification to generate compact context for direct conditioning. We further curate a high-quality multi-shot dataset with referential captions to mirror real-world storytelling patterns, and design effective training strategies under the next-shot paradigm. Finetuned from a pretrained I2V model on our curated 60K dataset, OneStory achieves state-of-the-art narrative coherence across diverse and complex scenes in both text- and image-conditioned settings, enabling controllable and immersive long-form video storytelling.
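The abstract describes the next-shot generation loop only at a high level; the sketch below is one possible reading of it, not the authors' implementation. Everything in it is an assumption for illustration: the function names (select_frames, adaptive_condition, generate_story), the cosine-similarity frame scoring, the patch keep ratio, and the placeholder callables (i2v_model, encode_text, encode_frames, score_patches) that stand in for the pretrained I2V backbone and its encoders.

```python
import torch
import torch.nn.functional as F


def select_frames(frame_feats, caption_feat, k=8):
    """Frame Selection (sketch): keep the k prior-shot frames most similar to the
    next-shot caption embedding, forming a semantically relevant global memory."""
    # frame_feats: (N, D) features of all frames generated so far; caption_feat: (D,)
    scores = F.cosine_similarity(frame_feats, caption_feat[None, :], dim=-1)
    return frame_feats[scores.topk(min(k, len(frame_feats))).indices]


def adaptive_condition(patch_tokens, importance, keep_ratio=0.25):
    """Adaptive Conditioner (sketch): importance-guided patchification that keeps
    only the highest-scoring patch tokens as a compact conditioning context."""
    keep = max(1, int(keep_ratio * len(patch_tokens)))
    return patch_tokens[importance.topk(keep).indices]


def generate_story(captions, i2v_model, encode_text, encode_frames, score_patches):
    """Autoregressive next-shot generation: each new shot is conditioned on a
    compact context distilled from all previously generated shots."""
    memory, shots = [], []
    for caption in captions:
        cap_feat = encode_text(caption)                        # (D,) caption embedding
        if memory:
            frames = select_frames(torch.cat(memory), cap_feat)
            # Stand-in: treat selected frame features as patch tokens; a real system
            # would patchify the frames and score patches with a learned module.
            context = adaptive_condition(frames, score_patches(frames, cap_feat))
        else:
            context = None                                      # first shot: plain I2V/T2V call
        shot = i2v_model(caption=caption, context=context)      # synthesize the next shot
        shots.append(shot)
        memory.append(encode_frames(shot))                      # grow the global frame memory
    return shots
```

Given stub implementations of the four callables, this loop runs as written; its point is only to show how a per-shot frame memory and an importance-pruned context could plug into an autoregressive next-shot pipeline.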
Community
Related papers recommended by the Semantic Scholar API (via Librarian Bot):
- MultiShotMaster: A Controllable Multi-Shot Video Generation Framework (2025)
- HoloCine: Holistic Generation of Cinematic Multi-Shot Long Video Narratives (2025)
- MoGA: Mixture-of-Groups Attention for End-to-End Long Video Generation (2025)
- ContextAnyone: Context-Aware Diffusion for Character-Consistent Text-to-Video Generation (2025)
- TGT: Text-Grounded Trajectories for Locally Controlled Video Generation (2025)
- Plan-X: Instruct Video Generation via Semantic Planning (2025)
- Scaling Zero-Shot Reference-to-Video Generation (2025)