DistilBERT

Hugging Face · October 2019

Active · Open Source · Encoder only · Text
Parameters: 66M

Description

Hugging Face's distilled version of BERT. It retains 97% of BERT's language understanding capability while being 40% smaller and 60% faster. It was one of the first successful applications of knowledge distillation to large language models.

Key Innovations

Distillation: Training a smaller 'student' model to mimic a larger 'teacher' model, preserving most of the teacher's capability at lower cost.
Masked LM: Training by randomly hiding words and having the model predict them from the surrounding context. This is the pretraining objective introduced by BERT.
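
The two ideas above can be illustrated with a minimal sketch in plain Python. It is not DistilBERT's actual training code: the function names (mask_tokens, softmax, distillation_loss), the toy logits, and the temperature of 2.0 are assumptions made for illustration. The 15% masking rate follows BERT's setup, but the sketch always substitutes [MASK], whereas BERT sometimes keeps or randomly replaces the selected token.

```python
import math
import random

MASK = "[MASK]"


def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """Hide a random subset of tokens.

    Returns the masked sequence and a {position: original token} dict
    holding the words the model would be asked to predict.
    """
    rng = rng or random.Random()
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets[i] = tok
        else:
            masked.append(tok)
    return masked, targets


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; a higher temperature flattens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy of the student's softened distribution against the teacher's."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(p_teacher, p_student))


# Toy usage: mask a sentence, then compare two hypothetical students.
masked, targets = mask_tokens("the cat sat on the mat".split(), rng=random.Random(0))
print(masked, targets)

teacher = [4.0, 1.0, 0.5]  # made-up teacher logits over three candidate words
print(distillation_loss([3.5, 1.2, 0.4], teacher))  # student close to the teacher
print(distillation_loss([0.5, 1.0, 4.0], teacher))  # student that disagrees
```

A student whose predictions resemble the teacher's gets the lower loss, which is the signal that pushes the smaller model to mimic the larger one.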

Family Tree

Built On

Lineage

BERT → DistilBERT

External Links