Reasoning-Driven · Controllable Video Generation

CogOmniControl
Reasoning-Driven Controllable Video Generation
via Creative Intent Cognition

¹SKL-IOTSC, CIS, University of Macau · ²Online-Video BU, Tencent
*Equal contribution. Corresponding author.
Figure 1. CogOmniControl factorizes controllable video generation into creative-intent cognition and generation. A specialized CogVLM transforms sparse / abstract conditions into dense reasoning, which guides CogOmniDiT to synthesize videos that faithfully match user intent.

Abstract

Recent diffusion models achieve strong photorealism and fluency in video generation, yet they remain fragile under abstract, sparse, or complex conditions, and therefore perform poorly in professional production workflows that rely on inputs such as storyboard sketches and clay renders. Existing video-generation models either inject conditions through adapters or couple a generic vision-language model (VLM) with a diffusion backbone; both approaches leave a capability gap and fail to produce videos that align with the user's creative intent. We present CogOmniControl, a reasoning-driven framework that factorizes controllable video generation into creative-intent cognition and generation. For cognition, we train a specialized CogVLM on authentic anime-production data; compared with generic VLMs, it produces clearer, more professional outputs and accurately recognizes user intent from sparse and abstract conditions. For generation, CogOmniDiT unifies controls from heterogeneous conditions through in-context generation and is aligned with CogVLM's reasoning outputs via reinforcement learning. Furthermore, building on CogVLM's ability to guide video generation, we use it to plan input-specific evaluators and enable Best-of-N selection, which turns the entire framework into a closed-loop, "harness-like" architecture. We also introduce two benchmarks, CogReasonBench and CogControlBench, built from professional workflow data that carries genuine creative intent. Experiments on both benchmarks show that CogOmniControl surpasses existing open-source models.

Method

CogOmniControl is a closed-loop harness: a reasoning VLM cognizes the user's creative intent from heterogeneous conditions, a unified DiT generates videos in-context, and an evaluator harness selects the best output among candidates.

01

CogVLM
Creative-Intent Cognition

A specialized vision-language model fine-tuned with supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT) on real anime-production data. It maps sparse conditions (storyboards, clay renders, reference images, prompts) to dense reasoning that covers visual consistency, motion smoothness, special effects, static-to-dynamic transitions and creative intent.

02

CogOmniDiT
Unified Video DiT

A single transformer backbone consumes heterogeneous conditions in-context: Concat(Z_t, Z_ref, Z_ctrl, Emb_VLM). Self-attention over the concatenated sequence models cross-condition interactions, enabling precise control under abstract or partially missing inputs.
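As a minimal sketch of the in-context formulation above, the snippet below concatenates token sequences along the sequence axis and runs one self-attention pass over the joint sequence. It is an illustration under stated assumptions, not the CogOmniDiT implementation: the token counts, the hidden width, the identity query/key/value projections, and the helper names `concat_in_context`, `self_attention` and `random_tokens` are all hypothetical. Dropping a missing condition from the sequence is one plausible way to handle partially missing inputs; the page does not say how this is actually done.

```python
import math
import random

def concat_in_context(z_t, z_ref=None, z_ctrl=None, emb_vlm=None):
    """Concatenate the available token sequences along the sequence axis.

    Each argument is a list of token vectors, or None when that condition
    is missing. The noisy video latents z_t are always required.
    """
    parts = [p for p in (z_t, z_ref, z_ctrl, emb_vlm) if p is not None]
    return [token for part in parts for token in part]

def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    Assumption: identity Q/K/V projections keep the sketch short; a real
    DiT block would use learned projections and multiple heads.
    """
    scale = math.sqrt(len(tokens[0]))
    outputs = []
    for query in tokens:
        scores = [sum(a * b for a, b in zip(query, key)) / scale for key in tokens]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w * value[d] for w, value in zip(weights, tokens)) / total
            for d in range(len(query))
        ])
    return outputs

def random_tokens(rng, count, dim):
    """Stand-in for encoder outputs: `count` random vectors of width `dim`."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]

rng = random.Random(0)
dim = 16                               # hypothetical hidden width
z_t = random_tokens(rng, 8, dim)       # noisy video latent tokens
z_ref = random_tokens(rng, 4, dim)     # reference-image tokens
z_ctrl = random_tokens(rng, 8, dim)    # control-video tokens
emb_vlm = random_tokens(rng, 6, dim)   # reasoning embeddings from the VLM

full = concat_in_context(z_t, z_ref, z_ctrl, emb_vlm)
out = self_attention(full)
video_tokens = out[: len(z_t)]         # only the video positions are kept

# The same code path works when the reference image is absent.
partial = concat_in_context(z_t, z_ctrl=z_ctrl, emb_vlm=emb_vlm)
```

Because every token attends to every other token, the video positions can read from the reference, control and reasoning tokens without a separate adapter per condition type.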

03

RL Alignment
Holistic + Accuracy Rewards

We align CogVLM's reasoning with downstream generation through reinforcement fine-tuning. Holistic rewards cover creative intent, physical plausibility, information integrity and motion description, and an additional atomic-fact accuracy reward grounds the outputs and suppresses hallucination.
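One way to read this reward design is as a holistic score plus an atomic-fact accuracy term. The sketch below is purely illustrative: the page does not specify how the rewards are computed or combined, so the equal weighting of the four dimensions, the `accuracy_weight` parameter, the [0, 1] score range and all function names are assumptions.

```python
HOLISTIC_DIMENSIONS = (
    "creative_intent",
    "physical_plausibility",
    "information_integrity",
    "motion_description",
)

def holistic_reward(scores):
    """Mean of the four holistic dimension scores (each assumed in [0, 1])."""
    return sum(scores[d] for d in HOLISTIC_DIMENSIONS) / len(HOLISTIC_DIMENSIONS)

def accuracy_reward(fact_checks):
    """Fraction of atomic facts in the reasoning that were judged correct.

    fact_checks is a list of booleans, one per extracted atomic fact.
    An empty list earns no accuracy reward.
    """
    if not fact_checks:
        return 0.0
    return sum(fact_checks) / len(fact_checks)

def total_reward(scores, fact_checks, accuracy_weight=1.0):
    """Hypothetical combination: holistic score plus weighted accuracy."""
    return holistic_reward(scores) + accuracy_weight * accuracy_reward(fact_checks)

example_scores = {
    "creative_intent": 0.5,
    "physical_plausibility": 0.75,
    "information_integrity": 1.0,
    "motion_description": 0.75,
}
example_facts = [True, True, False, True]  # three of four facts verified

reward = total_reward(example_scores, example_facts)
```

Under this reading, a reasoning trace that scores well holistically but states unverifiable facts is penalized through the accuracy term, which matches the stated goal of suppressing hallucination.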

04

Evaluator Harness
Closed-Loop Best-of-N

CogVLM additionally plans an adaptive, per-input evaluator set from a tool library. The planned evaluators score N candidate videos, and the candidate that best fulfills the inferred intent is selected, turning the whole pipeline into a self-verifying closed loop.
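The selection step can be sketched as follows. This is a simplified illustration, not the released evaluator harness: `plan_evaluators` stands in for the planning that CogVLM performs, the tool library entries are toy lambdas over precomputed scores, and averaging evaluator scores is an assumed aggregation rule.

```python
def plan_evaluators(intent_tags, tool_library):
    """Pick the evaluators relevant to this input.

    Stand-in for VLM planning: here the "plan" is just a list of tags
    that index into the tool library.
    """
    return [tool_library[tag] for tag in intent_tags if tag in tool_library]

def best_of_n(candidates, evaluators):
    """Return the candidate with the highest mean evaluator score."""
    if not candidates:
        raise ValueError("need at least one candidate")
    if not evaluators:
        raise ValueError("need at least one evaluator")

    def mean_score(candidate):
        return sum(ev(candidate) for ev in evaluators) / len(evaluators)

    return max(candidates, key=mean_score)

# Hypothetical tool library: each evaluator maps a candidate to a score in [0, 1].
tool_library = {
    "motion": lambda video: video["motion_smoothness"],
    "consistency": lambda video: video["visual_consistency"],
    "effects": lambda video: video["effects_quality"],
}

# Toy candidates with precomputed scores; real evaluators would inspect video frames.
candidates = [
    {"name": "seed_0", "motion_smoothness": 0.9, "visual_consistency": 0.5, "effects_quality": 0.9},
    {"name": "seed_1", "motion_smoothness": 0.7, "visual_consistency": 0.8, "effects_quality": 0.2},
    {"name": "seed_2", "motion_smoothness": 0.6, "visual_consistency": 0.6, "effects_quality": 0.7},
]

# For this input only motion and consistency are planned, so effects are ignored.
evaluators = plan_evaluators(["motion", "consistency"], tool_library)
best = best_of_n(candidates, evaluators)
```

Here `seed_1` wins even though `seed_0` has the best effects score, because the planned evaluator set for this input does not include the effects evaluator. That is the sense in which the evaluator set adapts to the inferred intent.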

Figure 2. Overall pipeline of CogOmniControl.

Qualitative Results

Each example below shows the optional Control Video and Reference Image conditions, the Text Prompt, and the Generated Video.

More Results

Example 04
Control: example5/control.mp4
Ref: (image)
Prompt:

High-quality dark fantasy game cinematic in a top-down isometric perspective featuring a group of burly, muscular Orc or Barbarian warriors with bronze skin and primitive armor trekking along a rugged wilderness path; one warrior pushes a wooden wheelbarrow filled with hay while the terrain of yellow soil and moss gives way to mist-shrouded hills, captured with a slow camera pan to the right.

Generated (Ours): example5/output.mp4
Example 05
Control: example6/control.mp4
Ref: (image)
Prompt:

In a misty, forested area by a serene lake, a wooden dock extends into the water, adorned with several small boats tied to it. To the right, two round tents are nestled among the trees and vegetation. A group of people on horseback moves along a dirt path leading towards the dock, gradually approaching it. Meanwhile, a separate group of individuals walks on the dock, heading in the same direction. As the scene progresses, the group on horseback continues to advance, getting closer to the dock, while the walkers on the dock become more prominent. The mist adds a mystical ambiance to the tranquil setting, with the gentle ripples on the lake reflecting the soft light filtering through the trees.

Generated (Ours): example6/output.mp4
Example 06
Control: example7/control.mp4
Ref: (image)
Prompt:

A woman stands in front of a building, surrounded by lush green plants and trees. She is wearing a light-colored shirt with a floral pattern and has long, dark hair. The woman gestures with her hands while speaking, moving her right hand up and down and occasionally touching her hair. She continues to gesture and speak, occasionally pausing and then resuming her gestures. Finally, she smiles while still standing in the same location. The camera remains stationary throughout the sequence, capturing the woman's expressive movements and the serene outdoor setting. The background features a well-maintained garden with various plants and trees, adding a natural and tranquil ambiance to the scene. The lighting suggests it is daytime, with sunlight filtering through the foliage, casting soft shadows on the ground. The overall atmosphere is calm and inviting, highlighting the woman's engaging presence and the picturesque surroundings.

Generated (Ours): example7/output.mp4
Example 07
Control: example8/control.mp4
Ref: (image)
Prompt:

In a vibrant garden setting, an orange tabby cat stands on a brick structure adorned with red flowers and green foliage. The cat, with its sleek fur and expressive green eyes, initially looks down with a serious expression before raising its head and beginning to speak. It wipes its mouth with its paw while continuing its dialogue, occasionally closing its eyes briefly before opening them again. The cat then turns its head to the side and turns its body around, walking away from the brick structure and moving out of the frame. The background features a mix of red flowers and green leaves, creating a lively and colorful backdrop. The camera remains stationary throughout, capturing the cat's movements and expressions in detail.

Generated (Ours): example8/output.mp4

BibTeX

@article{cogomnicontrol2026,
  title   = {CogOmniControl: Reasoning-Driven Controllable Video Generation via Creative Intent Cognition},
  author  = {Yang, Hongji and Li, Songlian and Zhou, Yucheng and Zhao, Xiaotong
             and Zhao, Alan and Xu, Chengzhong and Shen, Jianbing},
  year    = {2026}
}