Phi-2
Microsoft Research · December 2023
● active · Open Source · decoder only · text
Parameters: 2.7B
Context Window: 2K tokens
Why It Matters
Microsoft's demonstration that a small model trained on high-quality data could outperform models up to 25x its size on some benchmarks. It challenged the assumption that bigger always means better.
Description
A 2.7 billion parameter model that matched or outperformed models 5-10x its size on reasoning and language benchmarks. It follows the same philosophy as Phi-1: carefully selected, high-quality training data instead of brute-force scale. It showed that small models can rival much larger ones when the training data is chosen well.
Notable Milestones
- Outperformed Llama 2 70B on some benchmarks despite being about 25x smaller
- Helped establish the small language model category
Key Innovations
Distillation: Training a smaller 'student' model to mimic a larger 'teacher' model, preserving capability at lower cost.
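To make the student/teacher idea concrete, here is a minimal sketch of a generic distillation loss: the KL divergence between temperature-softened teacher and student output distributions. This is the textbook formulation, not a description of how Phi-2 was actually trained; the function names, the temperature value, and the example logits are all illustrative assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn raw logits into probabilities; a higher temperature gives a softer distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by temperature squared.

    Hypothetical helper for illustration only. The loss is zero when the
    student reproduces the teacher's distribution and grows as they diverge.
    """
    p = softmax(teacher_logits, temperature)  # teacher: the target distribution
    q = softmax(student_logits, temperature)  # student: the distribution being trained
    kl = sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


# Made-up logits over a three-token vocabulary.
teacher = [4.0, 1.0, 0.5]
close_student = [3.8, 1.1, 0.4]  # roughly agrees with the teacher
far_student = [0.5, 1.0, 4.0]    # prefers a different token

print(distillation_loss(close_student, teacher))
print(distillation_loss(far_student, teacher))
```

In a real training loop this term would typically be minimized by gradient descent on the student's weights, often combined with an ordinary cross-entropy loss on ground-truth labels.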