news 2026-04-21 · huggingface-papers

🎬 OmniScript — A Small AI That Writes Full Scripts From Long Videos

Imagine watching a two-hour movie and having to document every scene — who said what, every facial expression, every background sound, down to the exact second.

That's the reality for editors, subtitle creators, and studio analysts every single day.

Now a research team has unveiled OmniScript, an 8-billion-parameter AI model that watches long cinematic videos and generates complete, scene-by-scene scripts — with precise timestamps.

What makes it remarkable: OmniScript processes both audio and visuals simultaneously. It captures dialogue, character actions, facial expressions, and sound cues, then organizes everything into a hierarchical script structure.
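To make "hierarchical script structure" concrete, here is a minimal Python sketch of how such a script could be represented as data. The class and field names (`Script`, `Scene`, `Event`, `kind`) are hypothetical and chosen for illustration; the article does not describe OmniScript's actual output schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Event:
    """One timestamped item inside a scene (hypothetical schema)."""
    start: float  # seconds from the start of the video
    end: float
    kind: str     # e.g. "dialogue", "action", "expression", "sound"
    text: str


@dataclass
class Scene:
    """A scene groups events under one time span."""
    start: float
    end: float
    summary: str
    events: List[Event] = field(default_factory=list)


@dataclass
class Script:
    """Top level: an ordered list of scenes."""
    title: str
    scenes: List[Scene] = field(default_factory=list)


def format_timestamp(seconds: float) -> str:
    """Render a second count as HH:MM:SS (fractions are truncated)."""
    total = int(seconds)
    hours, rem = divmod(total, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def render(script: Script) -> str:
    """Flatten the hierarchy into indented, timestamped text lines."""
    lines = [script.title]
    for i, scene in enumerate(script.scenes, start=1):
        span = f"{format_timestamp(scene.start)}-{format_timestamp(scene.end)}"
        lines.append(f"Scene {i} [{span}] {scene.summary}")
        for ev in scene.events:
            lines.append(f"  [{format_timestamp(ev.start)}] {ev.kind}: {ev.text}")
    return "\n".join(lines)
```

Nesting events under scenes keeps every line of dialogue or sound cue tied to both its own timestamp and the scene it belongs to, which is what makes scene indexing possible downstream.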

The model uses a progressive training pipeline with chain-of-thought reasoning and reinforcement learning with temporally segmented rewards — meaning it learns to pay attention to *when* things happen, not just *what* happens.
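One way to picture a temporally segmented reward is to score each slice of the timeline separately instead of giving a single score for the whole video. The sketch below is an assumption made for illustration, not the paper's actual reward: it uses temporal intersection-over-union (IoU) between predicted and reference time spans, and the function names and the fixed `segment_len` are hypothetical.

```python
from typing import List, Tuple

Span = Tuple[float, float]  # (start_sec, end_sec)


def temporal_iou(a: Span, b: Span) -> float:
    """Overlap of two time spans divided by their combined extent."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def segmented_reward(
    pred: List[Span], ref: List[Span], segment_len: float, duration: float
) -> List[float]:
    """Return one reward per fixed-length time segment (illustrative only).

    For each segment, every reference span that starts inside it is matched
    to its best-overlapping predicted span, and the IoU scores are averaged.
    """
    rewards = []
    t = 0.0
    while t < duration:
        seg_end = min(t + segment_len, duration)
        refs_here = [r for r in ref if t <= r[0] < seg_end]
        if not refs_here:
            # Nothing to find here: reward the model only if it also
            # predicted nothing starting in this segment.
            preds_here = [p for p in pred if t <= p[0] < seg_end]
            rewards.append(1.0 if not preds_here else 0.0)
        else:
            scores = [
                max((temporal_iou(r, p) for p in pred), default=0.0)
                for r in refs_here
            ]
            rewards.append(sum(scores) / len(scores))
        t = seg_end
    return rewards
```

With a per-segment score, a timing error in minute 90 lowers only that segment's reward, so the training signal points at where in the video the model went wrong.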

Despite its relatively small size, the authors report that OmniScript significantly outperforms larger open-source models and performs on par with Gemini 3-Pro in temporal accuracy and script quality.

The implications are huge: automated scene indexing, instant script reconstruction, competitive film analysis, and accessible video understanding at scale — all from a model small enough to run without massive infrastructure.

The era of AI that truly "watches" movies is here.

📄 Source

huggingface-papers
