VALL-E

Microsoft · January 2023

● Active · Closed · Audio

Why It Matters

Reframed speech synthesis as a language modeling problem, showing that the same autoregressive approach powering LLMs could generate natural-sounding speech from text in a speaker's voice, given just a 3-second sample of that voice.

Description

VALL-E is Microsoft's neural codec language model for text-to-speech. It treats speech synthesis as a language modeling problem: given the input text and a 3-second recording of the target speaker as a voice prompt, it generates discrete audio codec codes that are then decoded into a waveform. This enables zero-shot voice cloning, meaning the model can imitate a speaker it never saw during training.
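The generation loop described above can be illustrated with a small toy sketch. This is not VALL-E's implementation: the function names, the tiny codebook size, the end-of-sequence id, and the stand-in scoring function are all assumptions made up for illustration, and greedy decoding is used only to keep the example deterministic. It shows only the general shape of the idea: the voice prompt's codec codes act as a prefix, and new codes are predicted one at a time, conditioned on the text.

```python
# Toy sketch of autoregressive generation over discrete codec codes.
# All names and constants here are hypothetical; a real system would use a
# trained neural decoder and a neural audio codec instead of these stand-ins.

CODEBOOK_SIZE = 8      # assumption: tiny vocabulary so the example is readable
EOS = CODEBOOK_SIZE    # hypothetical end-of-sequence id, one past the last code


def toy_next_code_scores(text_tokens, codes_so_far):
    """Stand-in for a trained model: return one score per code, plus EOS."""
    scores = [0.0] * (CODEBOOK_SIZE + 1)
    if len(codes_so_far) >= 8:
        # Arbitrary stopping rule so that the toy sequence terminates.
        scores[EOS] = 1.0
    else:
        # Arbitrary deterministic rule that depends on the text and history.
        target = (codes_so_far[-1] + sum(text_tokens)) % CODEBOOK_SIZE
        scores[target] = 1.0
    return scores


def generate_codes(text_tokens, prompt_codes, next_code_scores, max_new=50):
    """Continue the voice prompt's codes one token at a time (greedy)."""
    codes = list(prompt_codes)  # the voice prompt is the prefix
    for _ in range(max_new):
        scores = next_code_scores(text_tokens, codes)
        next_code = max(range(len(scores)), key=scores.__getitem__)
        if next_code == EOS:
            break
        codes.append(next_code)
    # Return only the newly generated codes, not the prompt itself.
    return codes[len(prompt_codes):]


if __name__ == "__main__":
    text_tokens = [2, 5]      # hypothetical ids for the text to be spoken
    prompt_codes = [1, 4, 7]  # hypothetical codes from the 3-second sample
    print(generate_codes(text_tokens, prompt_codes, toy_next_code_scores))
```

In a real pipeline the returned codes would be passed to the codec's decoder to produce audio; that step is omitted here because it requires a trained model.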

Key Innovations

Text-to-Audio: Generating speech, music, or sound effects from text descriptions.

Zero-Shot: Performing tasks without any examples; the model generalizes from its training alone.

speech-synthesis

External Links

Research Paper
