news 2026-04-22 · HuggingFace Daily Papers

WorldMark: The First Fair Fight Between AI World Simulators

Imagine pressing WASD on your keyboard and watching an AI generate a 3D world you can explore in real time. That's what interactive video world models do, but until now nobody could tell which one does it best.

Every major AI world model, including Google's Genie 3, YUME 1.5, HY-World 1.5, and Matrix-Game 2.0, has been evaluated on its own private benchmark with its own scenes and metrics. It's like athletes competing on different tracks and all claiming to be champion.

Researchers from Alaya Studio and the University of Tokyo have built WorldMark, the first standardized benchmark designed for fair, apples-to-apples comparison of interactive image-to-video world models. It features a unified action-mapping layer that translates standard WASD keyboard inputs into each model's native control format, a test suite of 500 evaluation cases spanning multiple viewpoints and difficulty levels, and an eight-metric evaluation toolkit covering visual quality, control alignment, and world consistency.
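To make the idea of a unified action-mapping layer concrete, here is a minimal sketch in Python. It is not WorldMark's actual code: the two model names, their control formats (discrete text labels and 2D velocity vectors), and every function name are hypothetical, chosen only to show how one WASD sequence could be translated for models that expect different inputs.

```python
from typing import Callable, Dict, List, Tuple

# Canonical WASD vocabulary. The (x, z) directions are an assumption for
# illustration: x is right/left, z is forward/backward.
KEY_TO_DIRECTION: Dict[str, Tuple[int, int]] = {
    "W": (0, 1),
    "S": (0, -1),
    "A": (-1, 0),
    "D": (1, 0),
}


def to_discrete_labels(keys: List[str]) -> List[str]:
    """Adapter for a hypothetical model that takes text action labels."""
    names = {"W": "forward", "S": "backward", "A": "left", "D": "right"}
    return [names[k] for k in keys]


def to_velocity_vectors(keys: List[str]) -> List[Tuple[float, float]]:
    """Adapter for a hypothetical model that takes per-step (x, z) velocities."""
    return [tuple(float(v) for v in KEY_TO_DIRECTION[k]) for k in keys]


# Registry of per-model adapters. Both model names are made up.
ADAPTERS: Dict[str, Callable[[List[str]], list]] = {
    "hypothetical_discrete_model": to_discrete_labels,
    "hypothetical_velocity_model": to_velocity_vectors,
}


def map_actions(model_name: str, keys: List[str]) -> list:
    """Translate one canonical WASD sequence into a model's native format."""
    normalized = [k.upper() for k in keys]
    unknown = [k for k in normalized if k not in KEY_TO_DIRECTION]
    if unknown:
        raise ValueError(f"unsupported keys: {unknown}")
    if model_name not in ADAPTERS:
        raise KeyError(f"no adapter registered for {model_name!r}")
    return ADAPTERS[model_name](normalized)


if __name__ == "__main__":
    sequence = ["W", "W", "A", "D"]
    for name in ADAPTERS:
        print(name, map_actions(name, sequence))
```

The point of the design is that every model receives the same canonical input sequence, so differences in the output come from the model rather than from the test script.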

The findings are striking. YUME 1.5 produces the most visually stunning frames but ranks poorly on long-term world coherence. Google's Genie 3 maintains the most consistent worlds but with only moderate visual quality. Visual beauty and world consistency, it turns out, are largely uncorrelated.

Perhaps most revealing: when switching from first-person to third-person view, Matrix-Game 2.0's rotation error increases roughly 20-fold, showing that camera control around a visible character remains an unsolved challenge.
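For readers wondering what a rotation error measures, the sketch below shows one simple way such a number could be computed. It assumes the error is the mean absolute difference between the commanded and the estimated camera heading (yaw) in degrees. WorldMark's exact metric definition is not given here, so the function names and the yaw-only simplification are illustrative assumptions.

```python
from typing import Sequence


def wrapped_angle_diff(a_deg: float, b_deg: float) -> float:
    """Signed difference a - b in degrees, wrapped into [-180, 180)."""
    return (a_deg - b_deg + 180.0) % 360.0 - 180.0


def mean_rotation_error(
    commanded_yaw: Sequence[float], estimated_yaw: Sequence[float]
) -> float:
    """Mean absolute yaw error in degrees over a sequence of frames.

    Assumption: headings are per-frame yaw angles in degrees, with the
    estimated values recovered from the generated video by some pose estimator.
    """
    if len(commanded_yaw) != len(estimated_yaw) or not commanded_yaw:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(
        abs(wrapped_angle_diff(c, e))
        for c, e in zip(commanded_yaw, estimated_yaw)
    )
    return total / len(commanded_yaw)
```

Wrapping matters because headings of 350 and 10 degrees are only 20 degrees apart, not 340. Under this definition, a 20-fold increase would mean the generated camera heading drifts about twenty times further from the commanded heading on average.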

The team also launched World Model Arena, an online platform where anyone can pit world models against each other in side-by-side battles with a live leaderboard. All data, code, and model outputs will be publicly released.
