OmniDirector masters extreme, high-difficulty camera maneuvers that go far beyond conventional pans and tilts. It can soar into the sky and dive underground, executing breathtaking aerial fly-throughs and dramatic descents while faithfully preserving scene geometry and visual fidelity. Left is the reference, and right is the generated result.
Multi Shot
OmniDirector supports multi-shot generation with coherent transitions and consistent content, while faithfully replicating camera relationships and shot compositions to preserve the original visual language across shots. Left is the reference, and right is the generated result.
Scene Generalization
OmniDirector demonstrates strong scene generalization, extending its capabilities across diverse domains including portraits, animals, wildlife, architecture, and AIGC content, ensuring robust performance without domain-specific constraints. Left is the reference, and right is the generated result.
Special Camera Movement
OmniDirector supports special camera techniques, including the Hitchcock zoom (dolly zoom), bullet time, and lens distortion effects. Left is the reference, and right is the generated result.
More Cases Exploration
Dive deeper into the generated multi-clip results with no bounding constraints.
More: Dynamic Motion
Left is the reference, and right is the generated result.
More: Multi Shot
Left is the reference, and right is the generated result.
More: Scene Generalization
Left is the reference, and right is the generated result.
More: Special Camera Movement
Left is the reference, and right is the generated result.
More: Comparisons
Side-by-side comparison against other state-of-the-art camera control methods. The leftmost column shows the reference images. OmniDirector exhibits superior control stability and minimal object distortion.
Pipeline
Overview of OmniDirector. Top: OmniDirector represents camera motion via a camera grid G, which is obtained by rendering the camera poses of a reference video V as movement within an empty 3D space. Middle: During training, the camera grid is injected into the MMDiT alongside other control signals via token concatenation. Bottom: At inference, a PE Agent harmoniously integrates various signals into the text prompt, achieving unified multi-signal control.
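The token-concatenation step described in the overview can be illustrated with a small NumPy sketch. It is not the OmniDirector implementation: the helper names `patchify` and `concat_control_tokens`, the tensor sizes, and the random linear projections are all assumptions chosen for illustration, and the MMDiT itself is not modeled. The sketch only shows the general idea of turning a rendered camera grid into patch tokens and appending them to the video tokens along the sequence axis.

```python
import numpy as np


def patchify(frames: np.ndarray, patch: int) -> np.ndarray:
    """Split (T, H, W, C) frames into flat patch tokens.

    Returns an array of shape (T * H/patch * W/patch, patch * patch * C).
    """
    t, h, w, c = frames.shape
    if h % patch or w % patch:
        raise ValueError("H and W must be divisible by the patch size")
    gh, gw = h // patch, w // patch
    x = frames.reshape(t, gh, patch, gw, patch, c)
    x = x.transpose(0, 1, 3, 2, 4, 5)  # (T, gh, gw, patch, patch, C)
    return x.reshape(t * gh * gw, patch * patch * c)


def concat_control_tokens(video_tokens: np.ndarray, *control_tokens: np.ndarray) -> np.ndarray:
    """Append control-signal tokens to the video tokens along the sequence axis."""
    return np.concatenate([video_tokens, *control_tokens], axis=0)


# Hypothetical sizes: 4 frames of 16x16 RGB, 8x8 patches, token width 32.
T, H, W, C, P, D = 4, 16, 16, 3, 8, 32
rng = np.random.default_rng(0)

video = rng.standard_normal((T, H, W, C))        # stand-in for the target video
camera_grid = rng.standard_normal((T, H, W, C))  # stand-in for the rendered camera grid G

# Stand-ins for learned patch embeddings (random here, learned in a real model).
w_video = rng.standard_normal((P * P * C, D)) * 0.02
w_grid = rng.standard_normal((P * P * C, D)) * 0.02

video_tokens = patchify(video, P) @ w_video      # (16, 32)
grid_tokens = patchify(camera_grid, P) @ w_grid  # (16, 32)

# One joint sequence that a transformer could attend over.
sequence = concat_control_tokens(video_tokens, grid_tokens)
print(sequence.shape)
```

Concatenating along the sequence axis, rather than adding the control signal to the video features, lets attention decide how each video token uses the camera tokens; further control signals would be passed as additional arguments to `concat_control_tokens`.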
Acknowledgement
We sincerely thank Mingyang Shan, Fanqi Meng, Wanqi Shi, and Jiaxin Hu for their contributions to the evaluation.
Responsible Use Statement
The images and audio presented in these demos are either sourced from public domains or generated by our models. They are intended solely to showcase the capabilities of our research framework, in particular how it produces corresponding expressions and motions in response to diverse inputs, highlighting the framework’s technical strengths and academic value. If you have any concerns regarding the content, please contact us at mengzijie03@kuaishou.com, and we will promptly remove the material if necessary.
BibTeX
@article{liu2025omnidirector,
title={OmniDirector: General Multi-Shot Camera Cloning without Cross-Paired Data},
author={Liu, Jiwen and Li, Shujuan and Fang, Zhixue and Li, Xiaohan and Zhou, Yan and Meng, Zijie and Zhang, Zhimin and Luo, Yawen and Zhang, Guoxin and Liu, Yu-Shen and Wan, Pengfei},
journal={arXiv preprint},
year={2026}
}