CogOmniControl: Reasoning-Driven Controllable Video Generation

Abstract

Recent diffusion models achieve strong photorealism and fluency in video generation, yet they remain fragile under abstract, sparse, or complex conditions. As a result, they perform poorly in professional production workflows, where the inputs are often storyboard sketches or clay renders. Existing video-generation models either inject conditions through adapters or couple a generic vision-language model (VLM) with a diffusion backbone; both approaches leave a capability gap and fail to produce videos that align with the user's creative intent. We present CogOmniControl, a reasoning-driven framework that factorizes controllable video generation into two stages: creative-intent cognition and generation. For cognition, we train a specialized CogVLM on authentic anime-production data; compared with generic VLMs, it produces clearer, more professional outputs and accurately infers user intent from sparse and abstract conditions. For generation, CogOmniDiT unifies control over heterogeneous conditions through in-context generation and is aligned with CogVLM's reasoning outputs via reinforcement learning. Furthermore, building on CogVLM's ability to guide video generation, we use it to plan input-specific evaluators and enable Best-of-N selection, turning the entire framework into a closed-loop, "harness-like" architecture. We also introduce CogReasonBench and CogControlBench, two benchmarks built from professional workflow data that carries genuine creative intent. Experiments on both benchmarks show that CogOmniControl surpasses existing open-source models.

Method

CogOmniControl is a closed-loop harness: a reasoning VLM cognizes the user's creative intent from heterogeneous conditions, a unified DiT generates videos in-context, and an evaluator harness selects the best output among candidates.

01

CogVLM
Creative-Intent Cognition

A specialized vision-language model fine-tuned on real anime-production data with supervised fine-tuning (SFT) followed by reinforcement fine-tuning (RFT). It maps sparse conditions (storyboards, clay renders, reference images, prompts) to dense reasoning that covers visual consistency, motion smoothness, special effects, static-to-dynamic transitions, and creative intent.

02

CogOmniDiT
Unified Video DiT

A single transformer backbone consumes heterogeneous conditions in-context: Concat(Z_t, Z_ref, Z_ctrl, Emb_VLM). Self-attention models cross-condition interactions, enabling precise control under abstract or partially-missing inputs.
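The in-context conditioning described above can be illustrated with a minimal, dependency-free sketch. It is not the CogOmniDiT implementation: it assumes that Z_t, Z_ref, Z_ctrl and Emb_VLM are token sequences of equal width, that an absent condition is simply left out of the sequence, and that attention uses identity query/key/value projections. The function names `concat_in_context` and `self_attention` are hypothetical.

```python
import math

def concat_in_context(z_t, z_ref=None, z_ctrl=None, emb_vlm=None):
    """Join token streams along the sequence axis.

    Assumption: a missing condition (None) is skipped, so the same
    backbone can run with any subset of conditions.
    """
    sequence = []
    for stream in (z_t, z_ref, z_ctrl, emb_vlm):
        if stream is not None:
            sequence.extend(stream)
    return sequence

def self_attention(tokens):
    """Single-head scaled dot-product attention over all tokens.

    Simplification: queries, keys and values are the tokens themselves
    (identity projections); a real DiT block learns these projections.
    """
    dim = len(tokens[0])
    outputs, all_weights = [], []
    for query in tokens:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in tokens]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        mixed = [sum(w * value[j] for w, value in zip(weights, tokens))
                 for j in range(dim)]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Toy 2-dimensional tokens; the control stream is absent in this call.
z_t = [[0.1, 0.2], [0.3, 0.4]]      # video tokens being generated
z_ref = [[1.0, 0.0]]                # reference-image tokens
emb_vlm = [[0.0, 1.0], [0.5, 0.5]]  # reasoning embeddings from the VLM

tokens = concat_in_context(z_t, z_ref=z_ref, z_ctrl=None, emb_vlm=emb_vlm)
outputs, weights = self_attention(tokens)

# Only the positions that belong to the video tokens are kept as output.
video_tokens = outputs[:len(z_t)]
```

Because every token attends to every other token, each video token can draw on the reference and reasoning tokens in the same attention pass, which is the sense in which cross-condition interactions are modeled without a separate adapter per condition.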

03

RL Alignment
Holistic + Accuracy Rewards

We align CogVLM's reasoning with downstream generation through reinforcement fine-tuning. The holistic reward scores creative intent, physical plausibility, information integrity, and motion description; an additional atomic-fact accuracy reward grounds the outputs and suppresses hallucination.
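One way to picture how the two reward terms could interact is the sketch below. It is an illustration under stated assumptions, not the reward used in the paper: the per-axis scores are assumed to lie in [0, 1] and to come from some external judge, the accuracy term is assumed to be the fraction of claimed atomic facts that are verified, and the linear mix with `accuracy_weight` is a hypothetical choice.

```python
HOLISTIC_AXES = (
    "creative_intent",
    "physical_plausibility",
    "information_integrity",
    "motion_description",
)

def holistic_reward(axis_scores):
    """Mean of the four holistic axis scores (each assumed to be in [0, 1])."""
    return sum(axis_scores[axis] for axis in HOLISTIC_AXES) / len(HOLISTIC_AXES)

def atomic_fact_accuracy(claimed_facts, verified_facts):
    """Fraction of claimed atomic facts that are verified.

    Assumption: an output with no claims scores 0.0, so that empty
    reasoning is not rewarded as perfectly accurate.
    """
    claimed = set(claimed_facts)
    if not claimed:
        return 0.0
    return len(claimed & set(verified_facts)) / len(claimed)

def total_reward(axis_scores, claimed_facts, verified_facts, accuracy_weight=0.5):
    """Hypothetical linear mix of the holistic and accuracy terms."""
    holistic = holistic_reward(axis_scores)
    accuracy = atomic_fact_accuracy(claimed_facts, verified_facts)
    return (1.0 - accuracy_weight) * holistic + accuracy_weight * accuracy

axis_scores = {
    "creative_intent": 0.8,
    "physical_plausibility": 0.6,
    "information_integrity": 1.0,
    "motion_description": 0.6,
}
claimed = ["girl turns left", "hair sways", "camera pans right", "rain falls"]
verified = ["girl turns left", "hair sways", "camera pans right"]

grounded = total_reward(axis_scores, claimed, verified)
hallucinated = total_reward(axis_scores, claimed, verified_facts=[])
```

With identical holistic scores, the output whose facts cannot be verified receives a lower total reward, which is the intended effect of an accuracy term that penalizes hallucinated detail.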

04

Evaluator Harness
Closed-Loop Best-of-N

For each input, CogVLM additionally plans an adaptive set of evaluators drawn from a tool library. These evaluators score N candidate videos, and the candidate that best fulfills the inferred intent is selected, turning the whole pipeline into a self-verifying closed loop.
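The selection step can be sketched as follows. This is a toy illustration, not the released harness: candidates are represented as dictionaries of precomputed metrics, the tool library is a plain dictionary of scoring functions, the planning step is reduced to a tag lookup standing in for CogVLM, and the unweighted mean is an assumed aggregation rule. All names and metric keys are hypothetical.

```python
def plan_evaluators(intent_tags, tool_library):
    """Pick the evaluators relevant to this input.

    Stand-in for the VLM planning step: here the inferred intent is
    just a list of tags that index into the tool library.
    """
    return [tool_library[tag] for tag in intent_tags if tag in tool_library]

def best_of_n(candidates, evaluators):
    """Return the candidate with the highest mean evaluator score."""
    if not evaluators:
        raise ValueError("at least one evaluator is required")

    def mean_score(candidate):
        return sum(evaluate(candidate) for evaluate in evaluators) / len(evaluators)

    return max(candidates, key=mean_score)

# Hypothetical tool library: each evaluator maps a candidate to a score in [0, 1].
tool_library = {
    "motion": lambda video: video["motion_smoothness"],
    "identity": lambda video: video["ref_consistency"],
    "effects": lambda video: video["effects_quality"],
}

# N = 3 candidate videos, represented by precomputed toy metrics.
candidates = [
    {"id": 0, "motion_smoothness": 0.9, "ref_consistency": 0.4, "effects_quality": 0.9},
    {"id": 1, "motion_smoothness": 0.7, "ref_consistency": 0.8, "effects_quality": 0.1},
    {"id": 2, "motion_smoothness": 0.5, "ref_consistency": 0.6, "effects_quality": 1.0},
]

# An input whose intent is about motion and character identity.
best_a = best_of_n(candidates, plan_evaluators(["motion", "identity"], tool_library))

# The same candidates under an intent that also stresses special effects.
best_b = best_of_n(candidates, plan_evaluators(["motion", "identity", "effects"], tool_library))
```

The two calls select different candidates from the same pool, which shows why the evaluator set is planned per input: what counts as the best video depends on the intent inferred for that input.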

Figure 2. Overall pipeline of CogOmniControl.

Qualitative Results

Each example below shows the optional Control Video and Reference Image conditions, the Text Prompt, and the Generated Video.

Example 01

Reference-Driven Generation

Reference Image

ref image
assets/examples/example2/ref.jpg

Text Prompt

A fluffy cat is walking down a cobblestone street, heading towards the camera. The cat's fur is soft and well-groomed, with long whiskers and bright yellow eyes. It appears to be in mid-stride, with one paw slightly raised. The background features traditional buildings with hanging signs and lanterns, suggesting an Asian setting. There are people in the distance, but they are blurred, drawing attention to the cat. The camera is positioned at ground level, capturing the cat from the front and emphasizing its movement. The lighting is natural and warm, creating a cozy and inviting atmosphere. The scene is in real-life.

Generated Video (Ours)

final video
assets/examples/example2/output.mp4

Example 02

Clay Render → Video

Control Video

control video
assets/examples/example3/control.mp4

Reference Image

ref image
assets/examples/example3/ref.jpg

Text Prompt

Highly cinematic realistic image, depicting a busy East Asian city street in the rain, with a group of pedestrians with varying expressions distributed in the center and foreground. In the foreground is a woman in a blue translucent hooded raincoat, her expression focused. In the middle, a man in a long, fully transparent raincoat is holding up a smartphone to take a picture, and next to him, another man in a purple jacket is also using his phone to record. Among the crowd are an efficient-looking woman in a black leather jacket, a man in a yellow raincoat, and a unique elderly person leaning forward. The background is a typical Chinese street scene, hanging with red and blue neon signs for "Noodle Shop" and "Supermarket", with slightly aged architectural styles. The ground is paved with gray stone bricks, wet and coldly shimmering due to the rain, clearly reflecting the neon lights and the figures of the pedestrians. The sky is drizzling, and the dense rain streaks are clearly visible under the lights. The overall tone is cold, full of story and a suppressed urban atmosphere, with extremely detailed image quality, realistic light and shadow representation, presenting a visual effect similar to movie stills. The camera pans slowly from right to left.

Generated Video (Ours)

final video
assets/examples/example3/output.mp4

Example 03

Featured Generation

Control Video

control video
assets/examples/example4/control.mp4

Reference Image

ref image
assets/examples/example4/ref.jpg

Text Prompt

A group of fishing boats sails across the vast ocean, their wakes trailing behind them. The camera gradually moves forward, capturing the serene yet mysterious atmosphere created by the dark clouds and the rising plumes of black smoke in the distance. The boats, scattered across the water, appear small against the expansive horizon, emphasizing the scale and isolation of the scene. The muted tones and subtle details in the image evoke a sense of quiet tension, hinting at an underlying narrative or event unfolding beyond the frame.

Generated Video (Ours)

final video
assets/examples/example4/output.mp4

More Results

Example 04

Control

control
example5/control.mp4

Ref

ref
example5/ref.jpg

Prompt

High-quality dark fantasy game cinematic in a top-down isometric perspective featuring a group of burly, muscular Orc or Barbarian warriors with bronze skin and primitive armor trekking along a rugged wilderness path; one warrior pushes a wooden wheelbarrow filled with hay while the terrain of yellow soil and moss gives way to mist-shrouded hills, captured with a slow camera pan to the right.

Generated (Ours)

final
example5/output.mp4

Example 05

Control

control
example6/control.mp4

Ref

ref
example6/ref.jpg

Prompt

In a misty, forested area by a serene lake, a wooden dock extends into the water, adorned with several small boats tied to it. To the right, two round tents are nestled among the trees and vegetation. A group of people on horseback moves along a dirt path leading towards the dock, gradually approaching it. Meanwhile, a separate group of individuals walks on the dock, heading in the same direction. As the scene progresses, the group on horseback continues to advance, getting closer to the dock, while the walkers on the dock become more prominent. The mist adds a mystical ambiance to the tranquil setting, with the gentle ripples on the lake reflecting the soft light filtering through the trees.

Generated (Ours)

final
example6/output.mp4

Example 06

Control

control
example7/control.mp4

Ref

ref
example7/ref.jpg

Prompt

A woman stands in front of a building, surrounded by lush green plants and trees. She is wearing a light-colored shirt with a floral pattern and has long, dark hair. The woman gestures with her hands while speaking, moving her right hand up and down and occasionally touching her hair. She continues to gesture and speak, occasionally pausing and then resuming her gestures. Finally, she smiles while still standing in the same location. The camera remains stationary throughout the sequence, capturing the woman's expressive movements and the serene outdoor setting. The background features a well-maintained garden with various plants and trees, adding a natural and tranquil ambiance to the scene. The lighting suggests it is daytime, with sunlight filtering through the foliage, casting soft shadows on the ground. The overall atmosphere is calm and inviting, highlighting the woman's engaging presence and the picturesque surroundings.

Generated (Ours)

final
example7/output.mp4

Example 07

Control

control
example8/control.mp4

Ref

ref
example8/ref.jpg

Prompt

In a vibrant garden setting, an orange tabby cat stands on a brick structure adorned with red flowers and green foliage. The cat, with its sleek fur and expressive green eyes, initially looks down with a serious expression before raising its head and beginning to speak. It wipes its mouth with its paw while continuing its dialogue, occasionally closing its eyes briefly before opening them again. The cat then turns its head to the side and turns its body around, walking away from the brick structure and moving out of the frame. The background features a mix of red flowers and green leaves, creating a lively and colorful backdrop. The camera remains stationary throughout, capturing the cat's movements and expressions in detail.

Generated (Ours)

final
example8/output.mp4

BibTeX

@article{cogomnicontrol2026,
  title   = {CogOmniControl: Reasoning-Driven Controllable Video Generation via Creative Intent Cognition},
  author  = {Yang, Hongji and Li, Songlian and Zhou, Yucheng and Zhao, Xiaotong
             and Zhao, Alan and Xu, Chengzhong and Shen, Jianbing},
  year    = {2026}
}
