🎙️ VibeVoice: Microsoft Open-Sources Voice AI That Handles 1-Hour Audio
What if AI could listen to your entire hour-long meeting, identify every speaker, and transcribe everything, all for free?
Microsoft just dropped VibeVoice, a family of open-source frontier voice AI models under the MIT license that handles both speech-to-text and text-to-speech at a level that rivals proprietary solutions.
🎯 What it does:
- **ASR (7B)**: Transcribes 60 minutes of audio in a single pass, with speaker identification, timestamps, and support for 50+ languages
- **TTS (1.5B)**: Generates up to 90 minutes of natural speech with up to 4 distinct speakers, a good fit for podcasts and dialogues
- **Streaming (0.5B)**: Real-time TTS with ~300 ms latency, suited to voice assistants and chatbots
The secret sauce? Ultra-low frame rate speech tokenizers running at 7.5 Hz combined with a next-token diffusion framework. Translation: it understands context like an LLM but produces audio quality like a diffusion model.
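To see why the 7.5 Hz frame rate matters for hour-long audio, here is a rough back-of-the-envelope sketch. It assumes one tokenizer frame per sequence position, which is a simplification, and the 50 Hz comparison rate is a hypothetical figure chosen for illustration, not a number from the post.

```python
FRAME_RATE_HZ = 7.5  # tokenizer frame rate quoted in the post


def frames_for(minutes: float, frame_rate_hz: float = FRAME_RATE_HZ) -> int:
    """Number of tokenizer frames needed to cover `minutes` of audio."""
    return round(minutes * 60 * frame_rate_hz)


# Sequence lengths for the durations mentioned in the post.
asr_frames = frames_for(60)  # 60-minute ASR input
tts_frames = frames_for(90)  # 90-minute TTS output

# Hypothetical comparison: the same hour at an assumed 50 Hz frame rate.
comparison_frames = frames_for(60, frame_rate_hz=50.0)

print(asr_frames, tts_frames, comparison_frames)
```

Under these assumptions an hour of audio is about 27,000 frames, short enough to fit in a long-context language model, while the assumed 50 Hz rate would need 180,000.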
Imagine recording a team meeting and getting a perfectly formatted transcript: who said what, when, with key points highlighted. Or creating a multi-voice AI podcast that sounds genuinely human.
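As a rough illustration of the "who said what, when" transcript, the sketch below formats speaker-attributed segments into readable lines. The `Segment` structure and its field names are hypothetical, invented for this example; they are not VibeVoice's actual output format.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """Hypothetical speaker-attributed chunk of a transcript."""
    speaker: str
    start: float  # seconds from the beginning of the recording
    end: float    # seconds from the beginning of the recording
    text: str


def fmt_time(seconds: float) -> str:
    """Render a time offset as HH:MM:SS, truncating fractional seconds."""
    hours, rem = divmod(int(seconds), 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def format_transcript(segments: list[Segment]) -> str:
    """Sort segments by start time and emit one line per segment."""
    ordered = sorted(segments, key=lambda seg: seg.start)
    return "\n".join(
        f"[{fmt_time(seg.start)} - {fmt_time(seg.end)}] {seg.speaker}: {seg.text}"
        for seg in ordered
    )


meeting = [
    Segment("Speaker 2", 65.0, 70.5, "Agreed."),
    Segment("Speaker 1", 0.0, 4.2, "Let's start."),
]
print(format_transcript(meeting))
```

Sorting by start time keeps the output chronological even if segments arrive out of order.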
All models are available on Hugging Face. MIT licensed. No strings attached.
🔗 Source
github-trending