Falcon 40B

TII (UAE) · May 2023

legacy · Open Source · decoder only · text
Parameters: 40B
Context Window: 2K tokens

Why It Matters

The first model from outside the US and China to top the Hugging Face Open LLM Leaderboard, suggesting that high-quality training data (RefinedWeb) can matter more than sheer model size.

Description

A 40-billion-parameter model from the UAE's Technology Innovation Institute (TII), trained on 1 trillion tokens drawn mainly from RefinedWeb, a large dataset of web text that was automatically filtered for quality. Released under the permissive Apache 2.0 license, it topped the Hugging Face Open LLM Leaderboard upon release, making it the highest-ranked open model on that leaderboard at the time.
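To make "automatically filtered for quality" concrete, here is a minimal Python sketch of heuristic web-text filtering with exact deduplication. It is not the RefinedWeb pipeline: the function names, the thresholds (`min_words`, `max_symbol_ratio`, `max_dup_line_ratio`), and the hash-based deduplication are all illustrative assumptions.

```python
import hashlib


def looks_like_quality_text(
    doc: str,
    min_words: int = 50,              # assumed threshold, for illustration only
    max_symbol_ratio: float = 0.1,    # assumed threshold, for illustration only
    max_dup_line_ratio: float = 0.3,  # assumed threshold, for illustration only
) -> bool:
    """Return True if a document passes three simple quality heuristics."""
    words = doc.split()
    if len(words) < min_words:
        return False  # too short to be useful training text

    # Share of characters that are neither alphanumeric nor whitespace.
    symbols = sum(1 for ch in doc if not (ch.isalnum() or ch.isspace()))
    if symbols / len(doc) > max_symbol_ratio:
        return False  # likely markup debris or spam

    # Share of non-empty lines that repeat an earlier line.
    lines = [ln.strip() for ln in doc.splitlines() if ln.strip()]
    dup_ratio = 1 - len(set(lines)) / len(lines)
    return dup_ratio <= max_dup_line_ratio


def filter_and_dedupe(docs):
    """Keep quality documents, dropping exact duplicates after normalization."""
    seen = set()
    kept = []
    for doc in docs:
        # Normalize case and whitespace before hashing.
        normalized = " ".join(doc.lower().split())
        key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if key in seen:
            continue
        seen.add(key)
        if looks_like_quality_text(doc):
            kept.append(doc)
    return kept
```

Calling `filter_and_dedupe` on a list of raw page texts returns the documents that pass every heuristic, in their original order. Production pipelines typically add language identification and fuzzy deduplication on top of checks like these.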

Notable Milestones

  • Topped Hugging Face Open LLM Leaderboard on release
  • Pioneered the RefinedWeb dataset approach to web data curation

Key Innovations

Open Weight
Model weights are publicly released, but the training data and code may not be. This enables fine-tuning but not full reproduction.
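Anyone planning to load or fine-tune the released weights first needs a rough idea of how much memory they occupy. The Python sketch below estimates that from the nominal 40B parameter count. The bytes-per-parameter table and the helper name `weight_memory_gb` are assumptions for illustration, and the figures cover weights only, not activations, optimizer state, or the KV cache.

```python
# Approximate storage cost per parameter for common numeric formats.
BYTES_PER_PARAM = {
    "float32": 4,
    "bfloat16": 2,
    "int8": 1,
    "int4": 0.5,
}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Estimate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    nominal_params = 40e9  # nominal count from the listing above
    for dtype in BYTES_PER_PARAM:
        print(f"{dtype:>8}: {weight_memory_gb(nominal_params, dtype):.0f} GB")
```

At 16-bit precision the weights alone come to roughly 80 GB, so they are usually sharded across several accelerators or quantized before fine-tuning.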
