news 2026-04-23 · huggingface-papers

ReImagine — Generate Each Frame Beautifully First, Then Make It Move

Controllable human video generation has always forced a trade-off: you can control the pose, the camera angle, or the appearance — but rarely all three at once, because multi-view video training data barely exists.

ReImagine flips the conventional approach. Instead of training a video model end-to-end, it splits the problem into two stages:

**Stage 1: Image-first synthesis.** Given front and back reference photos plus target pose and camera angle, a fine-tuned FLUX Kontext model generates each frame independently as a high-quality still image. This leverages billions of training images — far more data than any video dataset.
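
To make the frame-by-frame idea concrete, here is a minimal sketch using the stock `FluxKontextPipeline` from diffusers. It is an illustration only: the paper fine-tunes Kontext and conditions on reference photos, target pose, and camera directly, none of which the stock pipeline exposes, so this sketch folds the pose and camera into the text prompt as a stand-in, and the `targets` list is hypothetical.

```python
# Illustrative sketch of Stage 1: synthesize each frame independently as a
# still image. The stock diffusers FluxKontextPipeline stands in for the
# paper's fine-tuned model; pose/camera conditioning is approximated via
# the prompt here, which the real system does not do.
import torch
from diffusers import FluxKontextPipeline
from diffusers.utils import load_image

pipe = FluxKontextPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-Kontext-dev", torch_dtype=torch.bfloat16
).to("cuda")

reference = load_image("front_reference.jpg")  # one of the two reference photos

# Hypothetical per-frame targets: (pose description, camera description).
targets = [
    ("standing with arms raised", "frontal view"),
    ("mid-stride walking", "three-quarter left view"),
]

frames = []
for pose_desc, cam_desc in targets:
    frame = pipe(
        image=reference,
        prompt=f"the same person, {pose_desc}, seen from a {cam_desc}",
        guidance_scale=2.5,  # Kontext's recommended guidance setting
    ).images[0]
    frames.append(frame)
```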

**Stage 2: Training-free temporal refinement.** The generated frames are smoothed using 3D FFT spectral filtering and low-noise re-denoising through Wan 2.1 I2V-14B. No additional training required.
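
The spectral step is easy to sketch. Below is a minimal NumPy version of the smoothing idea, filtering along the time axis only for clarity (the paper filters in 3D, and the function name, hard cutoff, and `keep` parameter here are all assumptions); the low-noise re-denoising pass through Wan 2.1 I2V-14B would follow this step.

```python
# Sketch of training-free temporal smoothing: low-pass the frame stack in
# the frequency domain so high-frequency flicker between independently
# generated frames is attenuated. Temporal-axis-only simplification of the
# paper's 3D FFT spectral filtering.
import numpy as np

def spectral_smooth(frames: np.ndarray, keep: float = 0.5) -> np.ndarray:
    """Low-pass a (T, H, W, C) float frame stack along time.

    keep: fraction of the temporal frequency band retained (1.0 = no filtering).
    """
    spec = np.fft.fft(frames, axis=0)          # FFT along the time axis
    freqs = np.fft.fftfreq(frames.shape[0])    # temporal frequencies in [-0.5, 0.5)
    mask = np.abs(freqs) <= keep * 0.5         # hard low-pass mask (Nyquist = 0.5)
    spec *= mask[:, None, None, None]          # zero out high temporal frequencies
    return np.real(np.fft.ifft(spec, axis=0))  # back to the pixel domain

# Usage sketch: smoothed = spectral_smooth(np.stack(raw_frames) / 255.0, keep=0.6)
```

A hard cutoff like this can introduce ringing near fast motion; a softer falloff is a common alternative, and the follow-up re-denoising pass gives a video prior a chance to clean up whatever artifacts the filter leaves behind.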

The results are striking: ReImagine roughly halves FVD, the temporal-consistency metric (0.275 vs 0.614, lower is better), and substantially improves visual quality as measured by FID (36.23 vs 55.61) over video-first baselines. In user studies, it was preferred 42% of the time for view consistency.

What makes it practical: you can mix and match face, clothing, and shoe assets to compose entirely new characters. Two reference photos replace an entire motion capture studio. Code, weights, and dataset are all open-source.

For creators, fashion brands, and small studios, this is a path to controllable human video without expensive data or hardware.

📄 Source

huggingface-papers