VALL-E
Microsoft · January 2023
● active · Closed · audio
Why It Matters
Reframed speech synthesis as a language modeling problem, showing that the same autoregressive approach powering LLMs could generate natural-sounding speech in a target speaker's voice from just a 3-second sample.
Description
Microsoft's neural codec language model for text-to-speech, which can clone a previously unseen speaker's voice from a 3-second audio sample. It treats speech synthesis as a language modeling problem: given text and a brief voice prompt, it generates discrete audio codec codes, which a codec decoder then converts back into a waveform. This enables zero-shot voice cloning with high speaker similarity.
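The generation loop described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not Microsoft's implementation: `toy_next_token_logits` is a hypothetical stand-in for a trained decoder and returns random scores, the vocabulary size and `EOS` id are illustrative, and greedy selection is used only for simplicity. The point is the shape of the computation: text tokens condition the model, the voice prompt's codec tokens act as a prefix, and new codec tokens are appended one at a time.

```python
import random

CODEBOOK_SIZE = 1024  # assumed codec vocabulary size (illustrative)
EOS = CODEBOOK_SIZE   # hypothetical end-of-sequence token id


def toy_next_token_logits(text_tokens, audio_tokens, rng):
    """Stand-in for a trained decoder: one score per codec token, plus EOS."""
    # A real model would attend over text_tokens and audio_tokens; these
    # scores are random, so the output is not meaningful audio.
    return [rng.random() for _ in range(CODEBOOK_SIZE + 1)]


def generate_codec_tokens(text_tokens, prompt_audio_tokens,
                          max_new_tokens=50, seed=0):
    """Autoregressively extend the prompt's codec tokens, conditioned on text."""
    rng = random.Random(seed)
    audio_tokens = list(prompt_audio_tokens)  # voice prompt acts as a prefix
    generated = []
    for _ in range(max_new_tokens):
        logits = toy_next_token_logits(text_tokens, audio_tokens, rng)
        next_token = max(range(len(logits)), key=logits.__getitem__)  # greedy
        if next_token == EOS:
            break
        audio_tokens.append(next_token)
        generated.append(next_token)
    return generated


# Hypothetical token ids standing in for phonemes and a 3-second voice prompt.
text = [12, 7, 33]
prompt = [5, 900, 41]
codes = generate_codec_tokens(text, prompt, max_new_tokens=10)
```

In a real system the returned codes would be passed to the codec's decoder to produce a waveform; that step is omitted here.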
Key Innovations
Text-to-Audio: Generating speech, music, or sound effects from text descriptions.
Zero-Shot: Performing tasks without any examples; the model generalizes from its training alone.
speech-synthesis