Academic Paper Archive

Research Papers

The science behind AI — from backpropagation to transformers, scaling laws to reasoning. Every breakthrough paper that shaped the models we use today.

Pre-Transformer

1986 – 2016

The foundations — neural networks, memory, and representation

📄Backpropagation

Learning representations by back-propagating errors

Pre-Transformer

Rumelhart, Hinton, Williams·Various (UC San Diego, CMU)·1986

Popularized backpropagation for training multi-layer neural networks, showing that propagating error gradients backward through the layers lets hidden units learn useful internal representations.
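
As a minimal sketch of the idea, the snippet below backpropagates the error through a hypothetical two-weight network (one tanh hidden unit, one linear output) and checks the result against finite differences. The network shape, the squared-error loss, and all names are assumptions chosen for illustration, not the paper's setup.

```python
import math

def forward(w1, w2, x):
    h = math.tanh(w1 * x)   # hidden activation
    y = w2 * h              # linear output
    return h, y

def loss(w1, w2, x, t):
    _, y = forward(w1, w2, x)
    return 0.5 * (y - t) ** 2

def backprop(w1, w2, x, t):
    h, y = forward(w1, w2, x)
    dy = y - t                   # dL/dy
    dw2 = dy * h                 # dL/dw2
    dh = dy * w2                 # error propagated back to the hidden unit
    dw1 = dh * (1 - h * h) * x   # tanh'(a) = 1 - tanh(a)^2
    return dw1, dw2

def numeric_grad(f, w, eps=1e-6):
    # Central finite difference, used only as a sanity check.
    return (f(w + eps) - f(w - eps)) / (2 * eps)

w1, w2, x, t = 0.5, -1.2, 0.8, 0.3
dw1, dw2 = backprop(w1, w2, x, t)
num_dw1 = numeric_grad(lambda w: loss(w, w2, x, t), w1)
num_dw2 = numeric_grad(lambda w: loss(w1, w, x, t), w2)
```

The analytic and numeric gradients should agree to several decimal places, which is the usual way to sanity-check a hand-written backward pass.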

📄LSTM

Long Short-Term Memory

Pre-Transformer

Hochreiter, Schmidhuber·TU Munich / IDSIA·1997

Proposed memory cells with multiplicative gates (input and output gates in the original paper; the forget gate was added in follow-up work) to address the vanishing gradient problem in RNNs, enabling learning over long sequences.
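
The gating idea can be sketched as one scalar LSTM step. This assumes the now-standard formulation with input, forget, and output gates; the parameter dictionary and its key names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    # Each gate is a sigmoid of the input and the previous hidden state.
    i = sigmoid(p["wi"] * x + p["ui"] * h_prev + p["bi"])    # input gate
    f = sigmoid(p["wf"] * x + p["uf"] * h_prev + p["bf"])    # forget gate
    o = sigmoid(p["wo"] * x + p["uo"] * h_prev + p["bo"])    # output gate
    g = math.tanh(p["wg"] * x + p["ug"] * h_prev + p["bg"])  # candidate value
    c = f * c_prev + i * g      # additive cell update keeps gradients flowing
    h = o * math.tanh(c)
    return h, c

# Gates set to "remember everything, write nothing" (illustrative values).
params = {k: 0.0 for k in
          ["wi", "ui", "wf", "uf", "wo", "uo", "wg", "ug", "bo", "bg"]}
params["bi"] = -20.0
params["bf"] = 20.0

h, c = 0.0, 1.5
for _ in range(50):
    h, c = lstm_step(1.0, h, c, params)
```

With the forget gate near 1 and the input gate near 0, the cell state survives 50 steps almost unchanged, which is the behavior a plain RNN struggles to learn.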

Builds on:

📄Word2Vec

Efficient Estimation of Word Representations in Vector Space

Pre-Transformer

Mikolov et al.·Google·2013

Developed efficient word embeddings via skip-gram and CBOW architectures, capturing semantic relationships in dense vector representations.
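
A minimal sketch of how skip-gram training pairs are formed: each word is paired with its neighbours inside a context window. The function name and the fixed window size are assumptions for illustration; the paper additionally samples the window size and trains embeddings on these pairs.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every token within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

pairs = skipgram_pairs(["the", "cat", "sat"], window=1)
```

CBOW uses the same windows in the opposite direction, predicting the center word from its context words.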

📄Seq2Seq

Sequence to Sequence Learning with Neural Networks

Pre-Transformer

Sutskever, Vinyals, Le·Google·2014

Introduced encoder-decoder RNN architectures for machine translation, establishing the pattern for generative sequence-to-sequence tasks.

Builds on:

📄Attention Mechanism

Neural Machine Translation by Jointly Learning to Align and Translate

Pre-Transformer

Bahdanau, Cho, Bengio·University of Montreal·2014

Added attention to seq2seq models, allowing dynamic focus on relevant input parts rather than compressing everything into a fixed vector.

Attention

Builds on:

📄GANs

Generative Adversarial Networks

Pre-Transformer

Ian Goodfellow et al.·University of Montreal·2014-06

Introduced generative adversarial networks: two neural networks trained against each other (a generator creates fake data, a discriminator tries to detect fakes), which pushes the generator toward increasingly realistic outputs.

Why It Matters

Invented the adversarial training framework that powered the first wave of AI image generation and influenced much subsequent work on generative models.

Autoregressive

Transformer

2017 – 2022

The architecture that changed everything

📄Transformer

Attention Is All You Need

Transformer

Vaswani et al.·Google Brain·2017-06

Introduced the Transformer architecture using self-attention mechanisms, replacing RNNs entirely. Enabled parallel training and superior long-range dependency modeling.
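
A minimal sketch of scaled dot-product attention, the core operation of self-attention, written with plain Python lists. It assumes a single head with no masking and no learned projections; the function names are illustrative.

```python
import math

def softmax(xs):
    m = max(xs)                          # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d)) V for lists of row vectors."""
    d = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        w = softmax(scores)
        out.append([sum(wj * v[c] for wj, v in zip(w, V))
                    for c in range(len(V[0]))])
    return out

# Identical keys give uniform weights, so the output is the mean of the values.
out = attention(Q=[[1.0, 0.0]],
                K=[[1.0, 1.0], [1.0, 1.0]],
                V=[[0.0, 2.0], [4.0, 6.0]])
```

Every query row is computed independently of the others, which is what allows training to be parallelized across positions.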

Transformer·Attention

Related Models:

BERT GPT-1 GPT-2 GPT-3 LLaMA PaLM

Builds on:

📄GPT-1

Improving Language Understanding by Generative Pre-Training

Transformer

Radford et al.·OpenAI·2018-06

An early decoder-only Transformer pretrained generatively on BooksCorpus. Demonstrated that generative pre-training followed by task-specific fine-tuning transfers well to downstream language understanding tasks.

Autoregressive·Transformer

Related Models:

GPT-1

Builds on:

📄BERT

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Transformer

Devlin et al.·Google·2018-10

Encoder-only bidirectional pretraining with masked language modeling (MLM) and next-sentence prediction. Set SOTA on GLUE benchmarks.
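
A minimal sketch of the masked language modeling corruption step, assuming the commonly cited recipe: select about 15% of tokens, then replace 80% of those with [MASK], 10% with a random token, and leave 10% unchanged. The function name, toy vocabulary, and seed are hypothetical.

```python
import random

def mask_tokens(tokens, vocab, rng, mask_prob=0.15):
    """Return (corrupted inputs, labels); a label is None where no prediction is needed."""
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)                    # model must predict the original
            r = rng.random()
            if r < 0.8:
                inputs.append("[MASK]")
            elif r < 0.9:
                inputs.append(rng.choice(vocab))  # random replacement
            else:
                inputs.append(tok)                # kept as-is but still predicted
        else:
            labels.append(None)
            inputs.append(tok)
    return inputs, labels

rng = random.Random(0)
vocab = ["a", "b", "c", "d"]
tokens = [vocab[i % 4] for i in range(10000)]
inputs, labels = mask_tokens(tokens, vocab, rng)
```

Because the model sees context on both sides of each selected position, the objective is bidirectional, unlike left-to-right generative pretraining.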

Masked LM·Transformer

Related Models:

BERT

Builds on:

📄T5

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

Transformer

Raffel et al.·Google·2019

Framed all NLP tasks as text-to-text problems. Scaled pretraining and fine-tuning systematically across tasks.

Transformer

Builds on:

📄GPT-2

Language Models are Unsupervised Multitask Learners

Transformer

Radford et al.·OpenAI·2019-02

Scaled GPT to 1.5B parameters on WebText. Demonstrated emergent unsupervised multitask learning without task-specific fine-tuning.

Autoregressive·Zero-Shot

Related Models:

GPT-2

Builds on:

📄RoBERTa

RoBERTa: A Robustly Optimized BERT Pretraining Approach

Transformer

Liu et al.·Meta AI·2019-07

Showed BERT was significantly undertrained; by optimizing training procedure (more data, longer training, no NSP), achieved much better results.

Why It Matters

Demonstrated that training methodology matters as much as architecture. Influenced much of the subsequent research on pretraining optimization.

Masked LM·Scaling Laws

Builds on:

📄GPT-3

Language Models are Few-Shot Learners

Transformer

Brown et al.·OpenAI·2020-05

175B-parameter GPT. Pioneered few-shot and in-context learning, dramatically reducing the need for fine-tuning.

Few-Shot·Autoregressive·Scaling Laws

Builds on:

📄Vision Transformer (ViT)

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Transformer

Alexey Dosovitskiy et al.·Google Brain·2020-10

Applied the Transformer architecture directly to images by splitting them into patches (16x16 pixel squares) and treating each patch like a word token, eliminating the need for convolutional neural networks.
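
A minimal sketch of the patch step, assuming a single-channel 224x224 image stored as nested lists. The function name and the synthetic image are illustrative; a real model would also apply a learned linear projection and add position embeddings to each flattened patch.

```python
def patchify(image, patch=16):
    """Split an H x W image into flattened, non-overlapping patch x patch tokens."""
    h, w = len(image), len(image[0])
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    patches = []
    for r in range(0, h, patch):
        for c in range(0, w, patch):
            flat = [image[r + dr][c + dc]
                    for dr in range(patch) for dc in range(patch)]
            patches.append(flat)
    return patches

# Synthetic grayscale image, used only to exercise the function.
image = [[(r * 224 + c) % 256 for c in range(224)] for r in range(224)]
tokens = patchify(image)
```

A 224x224 image yields 14 x 14 = 196 patch tokens of 256 values each, a sequence length a standard Transformer handles comfortably.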

Why It Matters

Brought Transformers to computer vision. The patch-as-token idea underlies the vision encoders used in many modern multimodal models.

Multimodal·Attention

Builds on:

📄CLIP

Learning Transferable Visual Models From Natural Language Supervision

Transformer

Alec Radford et al.·OpenAI·2021-01

Trained a model to understand both images and text by learning which image-text pairs go together from 400 million internet examples. This created a shared 'embedding space' where images and text can be directly compared.
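
A minimal sketch of a symmetric contrastive objective over a batch of image-text pairs: each image should score highest against its own caption, and vice versa. The embeddings here are toy unit vectors, and the fixed temperature of 0.07 is an assumption for illustration (in practice it is a learned parameter).

```python
import math

def log_softmax(row):
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def clip_loss(image_emb, text_emb, temperature=0.07):
    """Average of image-to-text and text-to-image cross-entropy over the batch."""
    n = len(image_emb)
    sim = [[sum(a * b for a, b in zip(i, t)) / temperature for t in text_emb]
           for i in image_emb]
    loss_i = -sum(log_softmax(sim[k])[k] for k in range(n)) / n
    cols = [[sim[r][c] for r in range(n)] for c in range(n)]
    loss_t = -sum(log_softmax(cols[k])[k] for k in range(n)) / n
    return 0.5 * (loss_i + loss_t)

aligned = clip_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = clip_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

The loss is near zero when matching pairs share a direction in the embedding space and large when captions are swapped.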

Why It Matters

Connected the world of images and text. Powers image search, DALL·E's text understanding, Stable Diffusion's guidance, and most multimodal AI systems.

Multimodal·Zero-Shot

Related Models:

DALL·E Stable Diffusion 1.5

Builds on:

📄Whisper

Robust Speech Recognition via Large-Scale Weak Supervision

Transformer

Alec Radford et al.·OpenAI·2022-09

Trained a speech recognition model on 680,000 hours of weakly supervised multilingual and multitask audio from the internet, approaching human-level accuracy and robustness on English and covering 97 languages in a zero-shot setting, without any dataset-specific fine-tuning.

Why It Matters

Made accurate speech-to-text accessible to everyone as an open model. Used in podcasting, accessibility, real-time translation, and meeting transcription worldwide.

Speech Recognition

Related Models:

Whisper

Builds on:

Scaling

2017 – 2023

How big should models be? The laws of scale

📄Sparse MoE

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

Scaling

Shazeer et al.·Google·2017

Introduced sparsely-gated Mixture-of-Experts layers for scaling model capacity without proportional compute increase.
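
A minimal sketch of sparse top-k gating: only the k highest-scoring experts run, and their outputs are mixed with softmax weights computed over those k logits. The noise term and load-balancing losses used in the paper are omitted, and the toy experts are hypothetical.

```python
import math

def top_k_gate(logits, k=2):
    """Softmax restricted to the k largest gate logits; other experts get weight 0."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in top)
    exps = {i: math.exp(logits[i] - m) for i in top}
    z = sum(exps.values())
    return {i: e / z for i, e in exps.items()}

def moe_layer(x, experts, gate_logits, k=2):
    weights = top_k_gate(gate_logits, k)
    # Only the selected experts are evaluated, so compute grows with k,
    # not with the total number of experts.
    return sum(w * experts[i](x) for i, w in weights.items())

experts = [lambda x: x, lambda x: 2 * x, lambda x: 3 * x, lambda x: 100 * x]
weights = top_k_gate([0.0, 1.0, 1.0, -5.0], k=2)
out = moe_layer(2.0, experts, [0.0, 1.0, 1.0, -5.0], k=2)
```

Adding more experts increases parameter count while the per-token cost stays fixed by k.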

MoE

Related Models:

Mixtral 8x7B GPT-4

Builds on:

📄Megatron-LM

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

Scaling

Shoeybi et al.·NVIDIA·2019-09

Pioneered efficient model parallelism techniques enabling training of multi-billion parameter Transformers across GPUs.

Why It Matters

Established the blueprint for distributed training at scale. Megatron's parallelism strategies underpin most large-model training frameworks today.

Scaling Laws

Builds on:

📄Scaling Laws (Kaplan)

Scaling Laws for Neural Language Models

Scaling

Kaplan et al.·OpenAI·2020

Found that model performance follows power laws in compute, parameters, and data. Provided the mathematical framework for scaling decisions.

Scaling Laws

Related Models:

GPT-3 GPT-4 LLaMA LLaMA 2

Builds on:

📄Switch Transformers

Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

Scaling

Fedus et al.·Google·2021

Simplified MoE routing by sending each token to a single expert (top-1), scaling efficiently to trillion-parameter models. Influenced later MoE models such as Mixtral.

MoE

Related Models:

Mixtral 8x7B GPT-4 DeepSeek V2

Builds on:

📄LoRA

LoRA: Low-Rank Adaptation of Large Language Models

Scaling

Edward J. Hu et al.·Microsoft Research·2021-06

Proposed freezing the original model weights and injecting small trainable low-rank matrices, reducing the number of trainable parameters by up to 10,000x and GPU memory requirements by about 3x (figures reported for GPT-3 175B) while maintaining quality.
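
A minimal sketch of the low-rank update: the frozen weight W is left untouched and a trainable product B·A of rank r is added alongside it. The layer size (4096 x 4096), rank (8), and scaling factor are illustrative assumptions, and the function names are hypothetical.

```python
def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def lora_forward(W, A, B, x, alpha, r):
    """h = W x + (alpha / r) * B (A x), with W frozen and only A, B trained."""
    base = matvec(W, x)              # frozen pretrained path
    delta = matvec(B, matvec(A, x))  # low-rank update
    return [b + (alpha / r) * d for b, d in zip(base, delta)]

def trainable_params(d, k, r):
    return r * (d + k)               # A is r x k, B is d x r

full = 4096 * 4096                            # parameters in one dense layer
lora = trainable_params(4096, 4096, r=8)      # parameters LoRA trains instead
ratio = full // lora

# B starts at zero, so the adapted layer initially matches the frozen one.
h0 = lora_forward(W=[[1, 2], [3, 4]], A=[[1, 1]], B=[[0], [0]],
                  x=[1, 1], alpha=16, r=1)
```

For this one layer the adapter trains 256x fewer parameters; the reduction grows with layer width and shrinks with rank.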

Why It Matters

Democratized AI fine-tuning — made it possible to customize billion-parameter models on consumer GPUs. Used in virtually every open-source model adaptation today.

Distillation

Builds on:

📄Chinchilla

Training Compute-Optimal Large Language Models

Scaling

Hoffmann et al.·DeepMind·2022

Challenged Kaplan's scaling laws by showing that, for a fixed compute budget, training tokens should scale in equal proportion with parameters. 70B Chinchilla outperformed 280B Gopher.
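
A minimal sketch of the resulting rule of thumb, under two common approximations that are assumptions here rather than exact results: training compute C ≈ 6·N·D FLOPs for N parameters and D tokens, and roughly 20 training tokens per parameter at the compute-optimal point.

```python
import math

TOKENS_PER_PARAM = 20        # approximate compute-optimal ratio (assumption)
FLOPS_PER_PARAM_TOKEN = 6    # C ≈ 6 * N * D approximation (assumption)

def compute_optimal(compute_flops):
    """Split a FLOP budget into (parameters, tokens) under the two approximations."""
    n_params = math.sqrt(
        compute_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    n_tokens = TOKENS_PER_PARAM * n_params
    return n_params, n_tokens

# Budget corresponding to a 70B-parameter model trained on 1.4T tokens.
budget = 6 * 70e9 * 1.4e12
n_params, n_tokens = compute_optimal(budget)
```

Because parameters and tokens both scale with the square root of compute, quadrupling the budget roughly doubles each.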

Scaling Laws

Related Models:

Chinchilla LLaMA LLaMA 2 Mistral 7B

Builds on:

📄LLaMA

LLaMA: Open and Efficient Foundation Language Models

Scaling

Hugo Touvron et al.·Meta AI·2023-02

Showed that smaller models trained on significantly more data (following Chinchilla scaling laws) could match or exceed the performance of much larger models, and released the weights openly.

Why It Matters

Kicked off the open-source LLM revolution. LLaMA's leak and subsequent open release spawned Alpaca, Vicuna, and hundreds of community models — democratizing access to frontier AI.

Open Weight·Scaling Laws

Related Models:

LLaMA LLaMA 2 LLaMA 3

Builds on:

📄GPT-4

GPT-4 Technical Report

Scaling

OpenAI·OpenAI·2023-03

Described GPT-4's multimodal capabilities and performance across professional/academic benchmarks, setting new SOTA on bar exam, MMLU, and many others.

Why It Matters

First model to convincingly pass professional exams and demonstrate broad multimodal reasoning, catalyzing widespread enterprise AI adoption.

Multimodal·Scaling Laws·RLHF

Related Models:

GPT-4 GPT-4 Turbo GPT-4o GPT-4o Mini

Builds on:

📄LLaMA 2

Llama 2: Open Foundation and Fine-Tuned Chat Models

Scaling

Hugo Touvron et al.·Meta AI·2023-07

Provided the most detailed public documentation of how to train, fine-tune, and safety-align a large language model, including their full RLHF methodology.

Why It Matters

The 'recipe book' for the open-source LLM community. Its detailed training methodology was copied by virtually every open model that followed.

Open Weight·RLHF·Instruction Tuning

Related Models:

LLaMA 2 CodeLlama

Builds on:

📄Mistral 7B

Mistral 7B

Scaling

Jiang et al.·Mistral AI·2023-10

Combined sliding window attention with grouped-query attention and demonstrated that a 7B model could outperform LLaMA 2 13B on all evaluated benchmarks.

Why It Matters

Showed that careful architecture and training choices can let a small model beat a much larger one. Kickstarted Mistral AI as a major open-weight competitor.

Long Context·Attention

Related Models:

Mistral 7B Mixtral 8x7B Mistral Large 2

Builds on:

📄Gemini

Gemini: A Family of Highly Capable Multimodal Models

Scaling

Gemini Team, Google DeepMind·Google DeepMind·2023-12

Introduced the Gemini family with native multimodal training from the ground up, reporting state-of-the-art results on 30 of the 32 benchmarks evaluated.

Why It Matters

Among the first model families natively trained on interleaved text, image, audio, and video from inception, rather than bolting on vision post hoc.

Multimodal·Scaling Laws·MoE

Builds on:

Architecture

2020 – 2024

New ways to build: efficiency and alternatives

📄SwiGLU

GLU Variants Improve Transformer

Architecture

Shazeer·Google·2020-02

Showed that SwiGLU activation (Swish + Gated Linear Unit) significantly improves Transformer FFN quality with minimal compute overhead.
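
A minimal sketch of a SwiGLU feed-forward block, FFN(x) = W2 · (Swish(W x) ⊙ (V x)), without bias terms. Matrices are stored as lists of rows, and the tiny 1x1 weights are illustrative assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def swish(z):
    return z * sigmoid(z)

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def swiglu_ffn(x, W, V, W2):
    gate = [swish(a) for a in matvec(W, x)]     # gating branch
    value = matvec(V, x)                        # linear branch
    hidden = [g * v for g, v in zip(gate, value)]  # element-wise product
    return matvec(W2, hidden)                   # project back to model width

out = swiglu_ffn([2.0], W=[[1.0]], V=[[3.0]], W2=[[1.0]])
```

The block uses three weight matrices instead of two, so implementations typically shrink the hidden width to keep the parameter count comparable to a standard FFN.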

Why It Matters

SwiGLU became the default FFN activation in PaLM, LLaMA, Mistral, and most modern Transformers, replacing ReLU and GELU.

Transformer

Builds on:

📄GShard

GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding

Architecture

Dmitry Lepikhin et al.·Google·2020-06

Scaled Mixture-of-Experts to 600 billion parameters with automatic model parallelism across thousands of TPUs, showing how to train models far beyond single-device memory limits.

Why It Matters

Blueprint for scaling MoE to production. Influenced Mixtral, DeepSeek V2/V3, and many modern MoE models.

MoE

Related Models:

Mixtral 8x7B DeepSeek V2

Builds on:

📄RoPE

RoFormer: Enhanced Transformer with Rotary Position Embedding

Architecture

Su et al.·Zhuiyi Technology·2021-04

Introduced rotary position embeddings that encode position via rotation matrices, enabling better length generalization. Used by virtually every modern LLM.
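
A minimal sketch of the rotation: consecutive pairs of dimensions are rotated by an angle proportional to the token position, with per-pair frequency base^(-2i/d). The base of 10000 follows common practice; the vectors and positions are arbitrary illustrative values.

```python
import math

def rope(x, pos, base=10000.0):
    """Rotate each (x[i], x[i+1]) pair by pos * base**(-i/d), for even i."""
    d = len(x)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        out += [x[i] * c - x[i + 1] * s, x[i] * s + x[i + 1] * c]
    return out

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]
score_a = dot(rope(q, 3), rope(k, 7))     # positions 3 and 7, offset 4
score_b = dot(rope(q, 13), rope(k, 17))   # positions 13 and 17, same offset
```

The two scores match because the rotated query-key dot product depends only on the relative offset between positions, not on their absolute values.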

Why It Matters

RoPE replaced absolute and learned position embeddings as the standard. LLaMA, Qwen, Mistral, and most modern models use it.

Attention·Long Context

Builds on:

📄Flash Attention

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

Architecture

Tri Dao et al.·Stanford University·2022-05

Restructured how attention computation accesses GPU memory (tiling and recomputation), achieving 2-4x speedup and enabling much longer context windows without approximation.
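
A minimal sketch of the tiling arithmetic only, for a single query on the CPU: keys and values are processed in blocks while a running maximum and running normalizer are kept, so the full score row is never materialized. The GPU memory hierarchy, the 1/sqrt(d) scaling, and the backward-pass recomputation are omitted, and the function names are hypothetical.

```python
import math

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def attention_row(q, K, V):
    """Reference: softmax over all scores at once."""
    scores = [dot(q, k) for k in K]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    return [sum(wi * v[c] for wi, v in zip(w, V)) / z for c in range(len(V[0]))]

def attention_row_tiled(q, K, V, block=2):
    """Same result, computed block by block with an online softmax."""
    m, l = float("-inf"), 0.0          # running max and running normalizer
    acc = [0.0] * len(V[0])            # running unnormalized output
    for start in range(0, len(K), block):
        scores = [dot(q, k) for k in K[start:start + block]]
        m_new = max(m, max(scores))
        scale = math.exp(m - m_new)    # rescale earlier partial sums
        p = [math.exp(s - m_new) for s in scores]
        l = l * scale + sum(p)
        acc = [a * scale + sum(pi * v[c] for pi, v in zip(p, V[start:start + block]))
               for c, a in enumerate(acc)]
        m = m_new
    return [a / l for a in acc]

q = [0.3, -0.2]
K = [[1.0, 0.0], [0.5, 0.5], [-1.0, 2.0], [0.0, 3.0], [2.0, -1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 0.5], [0.5, -0.5]]
full = attention_row(q, K, V)
tiled = attention_row_tiled(q, K, V, block=2)
```

The tiled result is exact rather than approximate, which is the property that distinguishes this approach from sparse or low-rank attention variants.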

Why It Matters

Helped make very long context windows practical. FlashAttention and its successors are used across most modern LLM training and inference stacks, where attention would otherwise be a much larger speed and memory bottleneck.

Attention·Long Context

Builds on:

📄GPTQ

GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

Architecture

Frantar et al.·IST Austria·2022-10

Pioneered post-training quantization to 4-bit for large language models with minimal quality loss, enabling consumer GPU inference.

Why It Matters

Democratized LLM inference by making it possible to run 30B+ parameter models on a single consumer GPU. Widely adopted by the open-source community.

Distillation

Builds on:

📄Grouped-Query Attention

GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints

Architecture

Ainslie et al.·Google Research·2023-05

Introduced grouped-query attention as a middle ground between multi-head and multi-query attention, reducing KV cache memory while maintaining quality.
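
A minimal sketch of the head grouping and its effect on KV cache size. The model dimensions (32 layers, 32 query heads, head size 128, 4096-token context, 2 bytes per value) are hypothetical figures chosen for illustration.

```python
def kv_head_for(query_head, n_query_heads, n_kv_heads):
    """Map a query head to the key/value head its group shares."""
    group_size = n_query_heads // n_kv_heads
    return query_head // group_size

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    # Factor 2 covers keys and values.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

mha = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=4096)
gqa = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=4096)
mqa = kv_cache_bytes(n_layers=32, n_kv_heads=1, head_dim=128, seq_len=4096)
```

Setting the number of KV heads equal to the number of query heads recovers multi-head attention, and setting it to 1 recovers multi-query attention, so GQA interpolates between the two.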

Why It Matters

GQA is now standard in LLaMA 2+, Mistral, Gemma, and nearly all efficient modern LLMs.

Attention

Builds on:

📄Mamba

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

Architecture

Albert Gu, Tri Dao·Carnegie Mellon University / Princeton·2023-12

Introduced selective state space models that process sequences in linear time (vs. quadratic for Transformers), with a data-dependent selection mechanism that lets the model focus on relevant parts of the input.

Why It Matters

First serious architectural challenger to the Transformer. Inspired hybrid Mamba-Transformer models like Jamba and NVIDIA Nemotron 3.

Attention

Builds on:

📄DeepSeek-V2 / MLA

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

Architecture

DeepSeek-AI·DeepSeek·2024-05

Introduced Multi-head Latent Attention (MLA), which compresses the key-value cache into a low-rank latent space, dramatically reducing the memory needed to serve long-context models.

Why It Matters

MLA is one of the most notable attention efficiency innovations since Flash Attention. Adopted by DeepSeek-V3/R1, it has influenced how the wider industry approaches efficient inference.

MoE·Attention

Related Models:

DeepSeek V2 DeepSeek V3 DeepSeek R1

Builds on:

Diffusion

2020 – 2023

From noise to art: the generative revolution

📄DDPM / Diffusion

Denoising Diffusion Probabilistic Models

Diffusion

Jonathan Ho, Ajay Jain, Pieter Abbeel·UC Berkeley·2020-06

Showed that gradually adding noise to data and then learning to reverse the process could generate images rivaling GANs, with more stable training and better diversity.
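
A minimal sketch of the forward (noising) process in closed form, x_t = sqrt(ᾱ_t)·x_0 + sqrt(1 − ᾱ_t)·ε, assuming a linear noise schedule from 1e-4 to 0.02 over 1000 steps. The function names are illustrative, and the learned reverse (denoising) network is not shown.

```python
import math

def linear_betas(T=1000, start=1e-4, end=0.02):
    return [start + (end - start) * t / (T - 1) for t in range(T)]

def alpha_bars(betas):
    """Cumulative product of (1 - beta_t): the fraction of signal left at step t."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out

def q_sample(x0, t, eps, abar):
    """Jump straight to step t given clean data x0 and a noise vector eps."""
    a = math.sqrt(abar[t])
    s = math.sqrt(1.0 - abar[t])
    return [a * x + s * e for x, e in zip(x0, eps)]

abar = alpha_bars(linear_betas())
early = q_sample([1.0], 0, [0.0], abar)     # almost all signal
late = q_sample([1.0], 999, [0.0], abar)    # almost no signal left
```

Training then amounts to sampling a random step t, noising the data this way, and teaching a network to predict the noise that was added.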

Why It Matters

Foundation of Stable Diffusion, DALL·E 2, and Imagen — replaced GANs as the dominant image generation paradigm.

Diffusion

Builds on:

📄DALL·E

Zero-Shot Text-to-Image Generation

Diffusion

Aditya Ramesh et al.·OpenAI·2021-01

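Demonstrated that a single autoregressive Transformer, trained over text tokens and discrete image tokens, could generate diverse, creative images from arbitrary text descriptions, combining language understanding with image generation.

Why It Matters

First major text-to-image model. Showed that generative models could compose novel images from language rather than only classify or analyze existing ones.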

Text-to-Image·Zero-Shot

Related Models:

DALL·E DALL·E 2 DALL·E 3

Builds on:

📄Imagen

Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding

Diffusion

Saharia et al.·Google Brain·2022-05

Demonstrated that large frozen text encoders (T5-XXL) with cascaded diffusion models produce photorealistic images, outperforming DALL·E 2.

Why It Matters

Showed that language understanding is a key bottleneck for text-to-image quality: scaling the text encoder helped more than scaling the diffusion model. This shifted the field toward larger text encoders.

Diffusion·Text-to-Image

Related Models:

Imagen 2 Imagen 3

Builds on:

Alignment

2017 – 2023

Teaching AI to be helpful, harmless, and honest

📄RLHF (Christiano)

Deep Reinforcement Learning from Human Preferences

Alignment

Christiano et al.·OpenAI / DeepMind·2017

Pioneered the RLHF paradigm — training a reward model from human preferences, then using it to fine-tune policies via reinforcement learning.

RLHF

📄Constitutional AI

Constitutional AI: Harmlessness from AI Feedback

Alignment

Bai et al.·Anthropic·2022

Introduced RL from AI Feedback using "constitutions" (rule sets) for self-supervision, reducing reliance on human labels for harmlessness training.

Constitutional AI·RLHF

Builds on:

📄InstructGPT

Training Language Models to Follow Instructions with Human Feedback

Alignment

Ouyang et al.·OpenAI·2022-01

Applied RLHF to GPT-3: supervised fine-tuning → reward modeling → PPO optimization. Made models safer, more helpful, and more aligned.

RLHF·Instruction Tuning

Builds on:

📄DPO

Direct Preference Optimization: Your Language Model is Secretly a Reward Model

Alignment

Rafael Rafailov et al.·Stanford University·2023-05

Showed that preference learning could be formulated as a simple classification problem on pairs of outputs, eliminating the need for a separate reward model and the instabilities of PPO training.
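
A minimal sketch of the per-pair DPO loss, −log σ(β·[(log π(y_w) − log π_ref(y_w)) − (log π(y_l) − log π_ref(y_l))]), where y_w is the preferred response and y_l the rejected one. The log-probabilities and β = 0.1 are illustrative values, not taken from the paper's experiments.

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Binary classification loss on a preference pair; inputs are sequence log-probs."""
    margin = beta * ((policy_chosen - ref_chosen)
                     - (policy_rejected - ref_rejected))
    return math.log1p(math.exp(-margin))   # equals -log(sigmoid(margin))

# Policy identical to the reference: no preference expressed yet.
start = dpo_loss(-5.0, -7.0, -5.0, -7.0)
# Policy has raised the chosen response's log-prob relative to the reference.
improved = dpo_loss(-4.0, -7.0, -5.0, -7.0)
```

The loss starts at log 2 and falls as the policy favors the chosen response more strongly than the frozen reference does; no reward model or sampling loop is involved.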

Why It Matters

Simplified RLHF from a complex multi-stage pipeline to a single supervised-style training stage on preference data. Adopted by LLaMA 3, Mixtral, Tülu, and many modern open models.

RLHF

Related Models:

Tülu 3

Builds on:

Reasoning

2020 – 2025

Making AI think step by step

📄RAG

Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

Reasoning

Patrick Lewis et al.·Facebook AI Research / UCL·2020-05

Combined a neural retriever (that searches a knowledge base) with a sequence-to-sequence generator, allowing the model to 'look up' relevant documents before answering — reducing hallucinations and enabling knowledge updates without retraining.

Why It Matters

Invented the RAG pattern now used by Perplexity, enterprise search, and virtually every production LLM system that needs accurate, up-to-date information.

Tool Use

Related Models:

Perplexity SearchGPT

Builds on:

📄Chain-of-Thought

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

Reasoning

Wei et al.·Google·2022

Showed that including worked, step-by-step reasoning examples in few-shot prompts unlocks arithmetic, logic, and commonsense reasoning in large models like PaLM.

Chain-of-Thought·Reasoning

Related Models:

PaLM PaLM 2 o1 o3

Builds on:

📄ReAct

ReAct: Synergizing Reasoning and Acting in Language Models

Reasoning

Yao et al.·Princeton / Google·2022

Combined chain-of-thought reasoning with external tool use (APIs, search), improving QA and decision-making through interleaved reasoning and action.

Chain-of-Thought·Tool Use·Agentic

Related Models:

GPT-4 Claude 3 Gemini 1.0

Builds on:

📄DeepSeek-R1

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Reasoning

DeepSeek-AI·DeepSeek AI·2025-01

Demonstrated with DeepSeek-R1-Zero that pure RL training (without supervised fine-tuning on reasoning traces) can produce chain-of-thought reasoning. The full DeepSeek-R1 adds a small cold-start supervised stage before RL and achieves performance comparable to OpenAI o1.

Why It Matters

Showed an alternative path to reasoning capabilities without relying on proprietary training data. The distilled models outperformed many larger models.

Reasoning·Chain-of-Thought·Test-Time Compute

Builds on: