news 2026-04-07 · github-trending

🎙️ VibeVoice — Microsoft Open-Sources Voice AI That Handles 1-Hour Audio

What if AI could listen to your entire hour-long meeting, identify every speaker, and transcribe everything — for free?

Microsoft just dropped VibeVoice, a family of open-source frontier voice AI models under the MIT license that handles both speech-to-text and text-to-speech at a level that rivals proprietary solutions.

🎯 What it does:

- **ASR (7B)**: Transcribes 60 minutes of audio in a single pass with speaker identification, timestamps, and 50+ language support
- **TTS (1.5B)**: Generates up to 90 minutes of natural speech with 4 distinct speakers, perfect for podcasts and dialogues
- **Streaming (0.5B)**: Real-time TTS with ~300ms latency, ideal for voice assistants and chatbots

The secret sauce? Ultra-low frame rate speech tokenizers running at 7.5 Hz combined with a next-token diffusion framework. Translation: an LLM tracks the text and dialogue context, while a diffusion head fills in the acoustic detail, so the model gets LLM-style context handling and diffusion-style audio quality.
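To see why a 7.5 Hz frame rate matters for hour-long audio, here is a rough back-of-the-envelope sketch. It only multiplies duration by frame rate. The helper `frames_for` and the 50 Hz comparison rate are illustrative assumptions, not figures from Microsoft, and the real context budget also includes text tokens that this ignores.

```python
FRAME_RATE_HZ = 7.5  # tokenizer frame rate quoted in the article


def frames_for(minutes: float, frame_rate_hz: float = FRAME_RATE_HZ) -> int:
    """Number of tokenizer frames needed to cover `minutes` of audio."""
    return round(minutes * 60 * frame_rate_hz)


if __name__ == "__main__":
    # 50 Hz is a hypothetical "conventional" rate, used only for comparison.
    comparison_hz = 50.0
    for label, minutes in [("ASR, 60 min", 60), ("TTS, 90 min", 90)]:
        low = frames_for(minutes)
        high = frames_for(minutes, frame_rate_hz=comparison_hz)
        print(f"{label}: {low} frames at 7.5 Hz vs {high} at {comparison_hz:g} Hz")
```

Under these assumptions, an hour of audio comes to 27,000 frames at 7.5 Hz, a sequence length that a long-context model can plausibly take in one pass, while the same hour at the hypothetical 50 Hz rate would need 180,000.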

Imagine recording a team meeting and getting a perfectly formatted transcript — who said what, when, with key points highlighted. Or creating a multi-voice AI podcast that sounds genuinely human.

All models are available on Hugging Face. MIT licensed. No strings attached.

📄 Source

github-trending
