🎬 OmniScript: A Small AI That Writes Full Scripts From Long Videos
Imagine watching a two-hour movie and having to document every scene: who said what, every facial expression, every background sound, down to the exact second.
That's the reality for editors, subtitle creators, and studio analysts every single day.
Now a research team has unveiled OmniScript, an 8-billion-parameter AI model that watches long cinematic videos and generates complete, scene-by-scene scripts with precise timestamps.
What makes it remarkable: OmniScript processes both audio and visuals simultaneously. It captures dialogue, character actions, facial expressions, and sound cues, then organizes everything into a hierarchical script structure.
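To make "hierarchical script structure" concrete, here is a minimal sketch of what such an output could look like as data. The class names (`Script`, `Scene`, `Event`), the field names, and the `render` helper are all hypothetical and chosen for illustration; the post does not describe OmniScript's actual output schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Event:
    # Hypothetical leaf node: one timestamped thing that happens in a scene.
    start: float          # seconds from the start of the video
    end: float
    kind: str             # e.g. "dialogue", "action", "expression", "sound"
    text: str
    character: str = ""   # empty for events with no speaker, such as sound cues


@dataclass
class Scene:
    # Hypothetical middle level: a scene groups the events inside its time span.
    start: float
    end: float
    summary: str
    events: List[Event] = field(default_factory=list)


@dataclass
class Script:
    # Hypothetical top level: the whole video as an ordered list of scenes.
    title: str
    scenes: List[Scene] = field(default_factory=list)


def format_timestamp(seconds: float) -> str:
    """Format a number of seconds as HH:MM:SS, truncating fractions."""
    total = int(seconds)
    hours, rem = divmod(total, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def render(script: Script) -> str:
    """Render the hierarchy as indented plain text, one line per scene or event."""
    lines = [script.title]
    for i, scene in enumerate(script.scenes, start=1):
        span = f"{format_timestamp(scene.start)}-{format_timestamp(scene.end)}"
        lines.append(f"Scene {i} [{span}] {scene.summary}")
        for ev in scene.events:
            who = f"{ev.character}: " if ev.character else ""
            lines.append(f"  [{format_timestamp(ev.start)}] ({ev.kind}) {who}{ev.text}")
    return "\n".join(lines)
```

Nesting events under scenes keeps every line of dialogue or sound cue tied to both its own timestamp and the scene it belongs to, which is what makes uses like scene indexing straightforward.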
The model uses a progressive training pipeline with chain-of-thought reasoning and reinforcement learning with temporally segmented rewards, meaning it learns to pay attention to *when* things happen, not just *what* happens.
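The post does not spell out how the temporally segmented reward is computed, so the following is only an illustrative sketch of the general idea, not the authors' method. It assumes the timeline is cut into fixed windows and that each window is scored by how well predicted time spans overlap the reference spans starting in it (temporal intersection-over-union). The function names, the fixed `window` parameter, and the averaging scheme are all assumptions.

```python
from typing import List, Tuple

Span = Tuple[float, float]  # (start_seconds, end_seconds)


def temporal_iou(a: Span, b: Span) -> float:
    """Intersection-over-union of two time spans; 0.0 if they do not overlap."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def segmented_reward(pred: List[Span], ref: List[Span],
                     duration: float, window: float) -> float:
    """Assumed scheme: score each time window separately, then average.

    For every window that contains the start of at least one reference span,
    each of those reference spans is matched to its best-overlapping predicted
    span, and the window score is the mean of those best overlaps.
    Simplification: spurious predictions are not penalized here.
    """
    window_scores = []
    t = 0.0
    while t < duration:
        refs_here = [r for r in ref if t <= r[0] < t + window]
        if refs_here:
            best = [max((temporal_iou(r, p) for p in pred), default=0.0)
                    for r in refs_here]
            window_scores.append(sum(best) / len(best))
        t += window
    return sum(window_scores) / len(window_scores) if window_scores else 0.0
```

Scoring per window rather than over the whole video means a prediction that describes the right event at the wrong time earns little reward, which is one plausible way to push a model toward getting the *when* right.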
Despite being relatively small, OmniScript significantly outperforms larger open-source models and matches the performance of Gemini 3-Pro in temporal accuracy and script quality.
The implications are huge: automated scene indexing, instant script reconstruction, competitive film analysis, and accessible video understanding at scale, all from a model small enough to run without massive infrastructure.
The era of AI that truly "watches" movies is here.
Source
huggingface-papers