Megatron-Turing NLG

NVIDIA / Microsoft · October 2021

activeCloseddecoder onlytext
Parameters530B

Why It Matters

First 530B parameter model, proving massive scale was viable and laying groundwork for NVIDIA's AI model ambitions.

Description

A joint project between NVIDIA and Microsoft that produced the largest dense transformer model at the time of its release — 530 billion parameters. Dense means every parameter is used for every computation, unlike later mixture-of-experts models that only activate a subset. It demonstrated that massive scale was achievable with the right hardware and software infrastructure.

Key Innovations

Autoregressive
AutoregressiveGenerates text one token at a time, each prediction based on all previous tokens. The foundation of modern language models.
Scaling Laws
Scaling LawsMathematical relationships showing how model performance improves predictably with more data, compute, and parameters.

Family Tree

Successors (1)

Related Research (1)

2019 · NVIDIA

Pioneered efficient model parallelism techniques enabling training of multi-billion parameter Transformers across GPUs.

External Links