Tülu 3

Allen Institute for AI · November 2024

Active · Open Source · Decoder-only · Text
Parameters: 8B - 70B
Context Window: 128K tokens
Variants: 8B, 70B

Why It Matters

Allen AI's instruction-tuned model family, which showed that a transparent post-training recipe (DPO + PPO) could match the quality of proprietary RLHF pipelines.

Description

Tülu 3 is Allen AI's instruction-tuned model family, available in 8B and 70B parameter sizes with a 128K token context window. It is fine-tuned using transparent post-training techniques, including DPO (Direct Preference Optimization, a method that teaches the model human preferences without needing a separate reward model) and PPO (Proximal Policy Optimization, a reinforcement learning method that gradually improves the model's responses based on human feedback).
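To make the "no separate reward model" point concrete, here is a minimal sketch of the standard DPO loss for a single preference pair. It is not taken from the Tülu 3 training code. The function name `dpo_loss`, its argument names, and the default `beta=0.1` are illustrative assumptions. The inputs are the summed log-probabilities that the model being trained (the policy) and a frozen reference model assign to the preferred ("chosen") and dispreferred ("rejected") responses.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair (illustrative sketch, not Tülu 3 code)."""
    # How much more likely each response is under the policy than under
    # the frozen reference model, in log space.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp

    # The implicit reward difference; beta (assumed default) scales it.
    logit = beta * (chosen_margin - rejected_margin)

    # Binary classification loss: -log(sigmoid(logit)), computed stably.
    if logit >= 0:
        return math.log1p(math.exp(-logit))
    return -logit + math.log1p(math.exp(logit))
```

When the policy and the reference agree on both responses, the logit is zero and the loss equals log 2 (about 0.693). The loss falls below that value as the policy shifts probability toward the chosen response relative to the reference, and rises above it when the policy favors the rejected one.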

Key Innovations

Instruction Tuning
Fine-tuning a model on instruction-response pairs so it follows user commands more reliably.
RLHF
Reinforcement Learning from Human Feedback: training models to align with human preferences by having humans rank outputs.

Family Tree

Built On

Lineage

OLMo → OLMo 2 → Tülu 3

Related Research (1)

DPO · Alignment
2023 · Stanford University

Showed that preference learning could be formulated as a simple classification problem on pairs of outputs, eliminating the need for a separate reward…