Mixtral 8x7B
Mistral AI · December 2023
Why It Matters
Showed that a sparse Mixture-of-Experts architecture could match GPT-3.5 quality on most standard benchmarks while spending only about 13B parameters' worth of compute per token, making GPT-3.5-class models practical to self-host.
Description
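An open-weight sparse Mixture-of-Experts (MoE) model. Each Transformer layer contains 8 feed-forward sub-networks ('experts'), and a router selects 2 of them for every token. The model holds 46.7B parameters in total but uses only about 12.9B of them per token, so its per-token compute is close to that of a 13B dense model while its quality is in the GPT-3.5 range. Memory requirements still reflect the full 46.7B parameters, because every expert must be loaded. The result showed that sparse architecture can substitute for dense model size.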
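The routing described above can be sketched in a few lines of plain Python. This is a minimal illustration, not Mistral's implementation: it assumes a linear router that produces one logit per expert, keeps the two largest, and renormalizes them with a softmax. The names `top2_moe`, `make_scaling_expert`, and the toy "scaling" experts are hypothetical and exist only for this example.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def top2_moe(token, router_weights, experts):
    """Send one token vector through the 2 highest-scoring experts.

    token:          list of floats (the token's hidden state)
    router_weights: one row of floats per expert (assumed linear router)
    experts:        list of callables, each mapping a vector to a vector
    """
    # One router logit per expert: dot product of the token with that expert's row.
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    # Keep the two largest logits; the remaining experts are never evaluated.
    top2 = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:2]
    # Renormalize over the selected experts only, so the two gates sum to 1.
    gates = softmax([logits[i] for i in top2])
    out = [0.0] * len(token)
    for gate, idx in zip(gates, top2):
        expert_out = experts[idx](token)
        out = [o + gate * e for o, e in zip(out, expert_out)]
    return out, top2, gates


def make_scaling_expert(scale):
    """Toy stand-in for an expert FFN: multiplies the input by a constant."""
    return lambda v: [scale * x for x in v]


random.seed(0)
experts = [make_scaling_expert(i + 1) for i in range(8)]
router_weights = [[random.gauss(0, 1) for _ in range(4)] for _ in range(8)]
token = [0.5, -1.0, 0.25, 2.0]

output, chosen, gates = top2_moe(token, router_weights, experts)
print("experts used:", chosen)
print("gate weights:", gates)
```

Only two of the eight expert callables run for each token, which is where the compute saving comes from. In a real model this selection happens independently at every layer and for every token.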
Notable Milestones
- Matched or exceeded GPT-3.5 Turbo on most standard benchmarks as an openly licensed model
- Demonstrated MoE efficiency: the per-token compute of a 13B dense model with the capacity of a 47B model
- Widely adopted for self-hosted enterprise deployments
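The 13B-versus-47B figures can be roughly reproduced with back-of-the-envelope arithmetic. The sketch below assumes the hyperparameters published for Mixtral 8x7B (hidden size 4096, 32 layers, expert hidden size 14336, 32 query heads and 8 key/value heads of dimension 128, 32,000-token vocabulary), none of which are stated on this page, and it ignores the small normalization weights. The function name `count_params` is hypothetical.

```python
# Hyperparameters as published for Mixtral 8x7B (assumed here; not stated on this page).
DIM = 4096          # model hidden size
N_LAYERS = 32
HIDDEN_DIM = 14336  # inner size of each expert feed-forward block
N_HEADS = 32
N_KV_HEADS = 8      # grouped-query attention: fewer key/value heads than query heads
HEAD_DIM = 128
VOCAB = 32000
N_EXPERTS = 8
TOP_K = 2           # experts used per token


def count_params(experts_used):
    """Approximate parameter count when `experts_used` experts per layer are counted."""
    expert = 3 * DIM * HIDDEN_DIM                      # gated FFN: three weight matrices
    attention = (2 * DIM * N_HEADS * HEAD_DIM          # query and output projections
                 + 2 * DIM * N_KV_HEADS * HEAD_DIM)    # key and value projections
    router = DIM * N_EXPERTS                           # one logit per expert
    per_layer = experts_used * expert + attention + router
    embeddings = 2 * VOCAB * DIM                       # input embedding + output head
    return N_LAYERS * per_layer + embeddings


total = count_params(N_EXPERTS)
active = count_params(TOP_K)
print(f"total: {total / 1e9:.1f}B, active per token: {active / 1e9:.1f}B")
```

Under these assumptions the script reports about 46.7B total and 12.9B active parameters. The experts account for nearly all of the total, which is why skipping six of the eight per token cuts compute so sharply while leaving memory use unchanged.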
Related Research (7)
Introduced sparsely-gated Mixture-of-Experts layers for scaling model capacity without proportional compute increase.
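Simplified MoE routing by sending each token to a single expert, allowing models to scale to trillions of parameters efficiently. Influenced Mixtral; its reported influence on GPT-4/5 is unconfirmed, since those architectures have not been published.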
Scaled Mixture-of-Experts to 600 billion parameters with automatic model parallelism across thousands of TPUs, showing how to train models far beyond …
Introduced rotary position embeddings that encode position via rotation matrices, enabling better length generalization. Used by virtually every moder…
Introduced grouped-query attention as a middle ground between multi-head and multi-query attention, reducing KV cache memory while maintaining quality…
Showed that SwiGLU activation (Swish + Gated Linear Unit) significantly improves Transformer FFN quality with minimal compute overhead.
Applied sliding window attention in a 7B model and demonstrated that it could outperform LLaMA 2 13B on all evaluated benchmarks.
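As an illustration of the sliding window attention mentioned in the last entry, the sketch below builds a causal attention mask in which each position sees only itself and a fixed number of preceding positions. The helper name `sliding_window_mask` and the exact window convention (the window counts the current token) are assumptions for this example, not details taken from the paper.

```python
def sliding_window_mask(seq_len, window):
    """Return mask[i][j], True when query position i may attend to key position j.

    Assumed convention: position i sees itself and the (window - 1) positions before it.
    """
    return [
        [(j <= i) and (i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]


mask = sliding_window_mask(seq_len=6, window=3)
for row in mask:
    # "x" marks an allowed query/key pair, "." a masked one.
    print("".join("x" if allowed else "." for allowed in row))
```

Each row has at most `window` allowed positions, so attention cost and key/value cache size grow with the window rather than with the full sequence length.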