news 2026-04-24 · sd-reddit

LLaDA2.0-Uni — The First Diffusion Language Model That Both Understands and Creates Images

What if a single AI model could look at your photo, understand it, and then create a new image based on that understanding — without switching between tools?

InclusionAI (Ant Group) just released LLaDA2.0-Uni, an open-source model that unifies multimodal understanding and generation through a novel masked diffusion approach.

Instead of generating tokens one by one like conventional language models, LLaDA2.0-Uni fills in many masked positions simultaneously across both text and images. This parallel decoding makes it significantly faster than token-by-token generation while maintaining output quality.
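To make the decoding idea concrete, here is a minimal, illustrative sketch of MaskGIT-style parallel unmasking, the general family that masked diffusion decoding belongs to. Everything in it (the toy_model stand-in, the schedule, the constants) is hypothetical and not the actual LLaDA2.0-Uni code.

```python
import numpy as np

# Toy parallel-unmasking loop (MaskGIT-style; hypothetical, not LLaDA's code):
# start fully masked, predict every masked slot at once, keep the most
# confident guesses, and repeat until nothing is masked.
VOCAB, MASK, LENGTH, STEPS = 32, -1, 16, 4
rng = np.random.default_rng(0)

def toy_model(tokens):
    """Stand-in denoiser: returns per-position logits over the vocabulary."""
    return rng.normal(size=(len(tokens), VOCAB))

tokens = np.full(LENGTH, MASK)
for step in range(STEPS):
    logits = toy_model(tokens)
    probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
    guesses = probs.argmax(axis=-1)        # best token per position
    confidence = probs.max(axis=-1)        # how sure the model is
    confidence[tokens != MASK] = -np.inf   # never re-decide filled slots
    # Unmask a growing share of positions each step (a simple linear schedule).
    n_unmask = int(np.ceil(LENGTH * (step + 1) / STEPS)) - int((tokens != MASK).sum())
    for pos in np.argsort(-confidence)[:max(n_unmask, 0)]:
        tokens[pos] = guesses[pos]
    print(f"step {step}: {int((tokens != MASK).sum())}/{LENGTH} tokens fixed")
```

The speedup comes from the loop running a fixed, small number of steps regardless of sequence length, whereas autoregressive decoding needs one forward pass per token.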

What it can do:

- Understand images: analyze a photo and answer questions about it.
- Generate images: create new images from a text prompt.
- Combine the two: produce a new image based on its understanding of an input image, with no switching between tools.

The architecture combines a semantic discrete tokenizer (SigLIP-VQ), a Mixture-of-Experts backbone, and a distilled diffusion decoder. It matches specialized vision-language models on understanding benchmarks while delivering competitive image generation.
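As a rough mental model of how those three pieces fit together, here is a hypothetical Python sketch. Every class and method name (SigLIPVQTokenizerStub, MoEBackboneStub, DiffusionDecoderStub, UnifiedModel) is an illustrative stand-in, not the real LLaDA2.0-Uni API.

```python
class SigLIPVQTokenizerStub:
    """Stand-in for the semantic discrete tokenizer: image -> token ids."""
    def encode(self, image):
        return [hash(image) % 100, 42]   # placeholder token ids

class MoEBackboneStub:
    """Stand-in for the MoE masked-diffusion backbone over mixed tokens."""
    def denoise(self, tokens):
        return [t + 1 for t in tokens]   # pretend to fill masked slots

class DiffusionDecoderStub:
    """Stand-in for the distilled diffusion decoder: tokens -> pixels."""
    def render(self, tokens):
        return f"<image rendered from {len(tokens)} tokens>"

class UnifiedModel:
    """One backbone serves both tasks; only the pipeline's ends differ."""
    def __init__(self):
        self.tokenizer = SigLIPVQTokenizerStub()
        self.backbone = MoEBackboneStub()
        self.decoder = DiffusionDecoderStub()

    def understand(self, image, question_tokens):
        # Image and text tokens share one sequence; masked answer positions
        # are filled in parallel, as in the decoding sketch above.
        return self.backbone.denoise(self.tokenizer.encode(image) + question_tokens)

    def generate(self, prompt_tokens):
        # Predicted image tokens are rendered to pixels by the decoder.
        return self.decoder.render(self.backbone.denoise(prompt_tokens))

model = UnifiedModel()
print(model.understand("photo.jpg", [7, 8, 9]))
print(model.generate([1, 2, 3]))
```

The design point this illustrates is that understand() and generate() call the same backbone; the tokenizer and decoder just convert between pixels and the shared discrete token space.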

This represents a significant shift: instead of stitching together separate models for understanding and generation, LLaDA2.0-Uni handles both natively in a single unified framework — making it especially interesting for developers building multimodal applications.

The model and code are available on GitHub and Hugging Face.
