Research Papers
The science behind AI — from backpropagation to transformers, scaling laws to reasoning. Every breakthrough paper that shaped the models we use today.
Pre-Transformer
1986 – 2016
The foundations: neural networks, memory, and representation
Learning representations by back-propagating errors
Popularized backpropagation for training multi-layer neural networks: error gradients are propagated backward through the layers so that hidden units learn useful internal representations.
Long Short-Term Memory
Proposed a gated memory cell to mitigate the vanishing gradient problem in RNNs, enabling learning over long sequences. The original design used input and output gates; the forget gate was added in follow-up work.
Efficient Estimation of Word Representations in Vector Space
Developed efficient word embeddings via skip-gram and CBOW architectures, capturing semantic relationships in dense vector representations.
Sequence to Sequence Learning with Neural Networks
Introduced encoder-decoder RNN architectures for machine translation, establishing the pattern for generative sequence-to-sequence tasks.
Neural Machine Translation by Jointly Learning to Align and Translate
Added attention to seq2seq models, allowing dynamic focus on relevant input parts rather than compressing everything into a fixed vector.
Generative Adversarial Networks
Introduced generative adversarial networks: a generator creates fake data while a discriminator tries to detect the fakes, and training the two against each other pushes the generator toward increasingly realistic outputs.
Established the adversarial framework that powered the first wave of AI image generation and influenced much subsequent work on generative models.
Transformer
2017 – 2022
The architecture that changed everything
Attention Is All You Need
Introduced the Transformer architecture using self-attention mechanisms, replacing RNNs entirely. Enabled parallel training and superior long-range dependency modeling.
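The self-attention step at the heart of the Transformer can be sketched in a few lines. This is a minimal, single-head illustration in plain Python, assuming the queries, keys, and values have already been computed; the learned projections, multiple heads, and masking of the full architecture are omitted, and the function names are illustrative.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(keys[0])
    dim = len(values[0])
    outputs = []
    for query in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k)
            for key in keys
        ]
        weights = softmax(scores)
        # Each output is a weighted average of the value vectors.
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
        )
    return outputs
```

Every query is handled independently of the others, which is what allows all positions to be processed in parallel rather than one step at a time as in an RNN.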
Improving Language Understanding by Generative Pre-Training
First decoder-only Transformer pretrained generatively on BooksCorpus. Demonstrated that generative pre-training followed by task-specific fine-tuning transfers well across language understanding tasks.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Encoder-only bidirectional pretraining with masked language modeling (MLM) and next-sentence prediction. Set SOTA on GLUE benchmarks.
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
Framed all NLP tasks as text-to-text problems. Scaled pretraining and fine-tuning systematically across tasks.
Language Models are Unsupervised Multitask Learners
Scaled GPT to 1.5B parameters on WebText. Demonstrated emergent unsupervised multitask learning without task-specific fine-tuning.
RoBERTa: A Robustly Optimized BERT Pretraining Approach
Showed BERT was significantly undertrained; by optimizing training procedure (more data, longer training, no NSP), achieved much better results.
Demonstrated that training methodology matters as much as architecture. Influenced all subsequent pretraining optimization research.
Language Models are Few-Shot Learners
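Scaled GPT to 175B parameters (GPT-3). Showed that few-shot, in-context learning from examples placed in the prompt can approach fine-tuned performance on many tasks, dramatically reducing the need for fine-tuning.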
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Applied the Transformer architecture directly to images by splitting them into patches (16x16 pixel squares) and treating each patch like a word token, with no convolutional backbone required.
Brought Transformers to computer vision. The vision encoders in many modern multimodal models build on this patch-token idea.
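The patch-splitting step can be illustrated as follows. This is a minimal sketch assuming a single-channel image stored as a list of rows; the 224x224 input size is an assumption chosen for the example, and the learned linear projection and position embeddings that follow in a real model are left out.

```python
def image_to_patches(image, patch_size=16):
    """Split an H x W image into flattened patch_size x patch_size patches."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be multiples of patch_size")
    patches = []
    # Walk the image in row-major order, one patch at a time.
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                image[row][col]
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches

# A 224 x 224 image yields 14 x 14 = 196 patches of 256 pixels each.
image = [[row * 224 + col for col in range(224)] for row in range(224)]
patches = image_to_patches(image)
print(len(patches), len(patches[0]))  # 196 256
```

Each flattened patch then plays the role of one token in the sequence fed to the Transformer.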
Learning Transferable Visual Models From Natural Language Supervision
Trained a model to understand both images and text by learning which image-text pairs go together from 400 million internet examples. This created a shared 'embedding space' where images and text can be directly compared.
Connected the world of images and text. Powers image search, DALL·E's text understanding, Stable Diffusion's guidance, and most multimodal AI systems.
Robust Speech Recognition via Large-Scale Weak Supervision
Trained a speech recognition model on 680,000 hours of weakly supervised multilingual audio from the internet spanning 97 languages, achieving robust transcription without any dataset-specific fine-tuning and approaching human-level accuracy on English.
Made accurate speech-to-text accessible to everyone as an open model. Used in podcasting, accessibility, real-time translation, and meeting transcription worldwide.
Scaling
2017 – 2023
How big should models be? The laws of scale
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
Introduced sparsely-gated Mixture-of-Experts layers for scaling model capacity without proportional compute increase.
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
Pioneered efficient model parallelism techniques enabling training of multi-billion parameter Transformers across GPUs.
Established the blueprint for distributed training at scale. Megatron's parallelism strategies underpin most large-model training frameworks today.
Scaling Laws for Neural Language Models
Found that model performance follows power laws in compute, parameters, and data. Provided the mathematical framework for scaling decisions.
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
Simplified MoE routing by sending each token to a single expert (top-1 routing), scaling to trillion-parameter models efficiently. Influenced later MoE models such as Mixtral.
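Top-1 routing, the simplification used in Switch Transformers, can be sketched as below. This is an illustration under stated assumptions: the router logits are passed in directly rather than computed from a learned router matrix, the experts are hypothetical Python callables, and the capacity limits and load-balancing loss from the paper are omitted.

```python
import math

def route_top1(router_logits):
    """Return (expert_index, gate_probability) for the highest-scoring expert."""
    top = max(router_logits)
    exps = [math.exp(x - top) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    expert_index = max(range(len(probs)), key=probs.__getitem__)
    return expert_index, probs[expert_index]

def switch_layer(token, router_logits, experts):
    """Send the token to one expert and scale its output by the gate value."""
    expert_index, gate = route_top1(router_logits)
    return [gate * y for y in experts[expert_index](token)]

# Hypothetical experts: each is just a function from vector to vector.
experts = [
    lambda x: [2.0 * v for v in x],
    lambda x: [v + 1.0 for v in x],
    lambda x: [-v for v in x],
]
output = switch_layer([1.0, 2.0], [0.0, 2.0, 1.0], experts)
```

Only the selected expert runs for a given token, so adding experts grows the parameter count without growing the per-token compute.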
LoRA: Low-Rank Adaptation of Large Language Models
Proposed freezing the original model weights and injecting small trainable low-rank matrices, reducing the number of trainable parameters by up to 10,000x and GPU memory requirements by about 3x (as reported for GPT-3 175B) while maintaining quality.
Democratized AI fine-tuning — made it possible to customize billion-parameter models on consumer GPUs. Used in virtually every open-source model adaptation today.
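The low-rank update can be written out directly. The sketch below is a plain-Python illustration, assuming the common formulation W + (alpha / r) * B A with B of shape d_out x r and A of shape r x d_in; the function names and the 4096-wide layer in the example are illustrative choices.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def lora_effective_weight(w_frozen, lora_b, lora_a, alpha):
    """Return W + (alpha / r) * B @ A; only B and A would be trained."""
    rank = len(lora_a)
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(w_frozen, delta)
    ]

def trainable_params(d_out, d_in, rank):
    """Parameters in B (d_out x r) plus A (r x d_in)."""
    return rank * (d_out + d_in)

# For a hypothetical 4096 x 4096 layer with rank 8:
full = 4096 * 4096                      # 16,777,216 weights in W
lora = trainable_params(4096, 4096, 8)  # 65,536 weights in B and A
print(full // lora)  # 256
```

Because B is typically initialized to zero, the adapted layer starts out identical to the frozen one and drifts only as B and A are trained.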
Training Compute-Optimal Large Language Models
Challenged the Kaplan et al. scaling laws by showing that training tokens should scale in equal proportion to parameters. The 70B Chinchilla, trained on more data, outperformed the 280B Gopher.
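A back-of-the-envelope version of the compute-optimal trade-off is sketched below. It rests on two assumptions that are rules of thumb rather than exact results: training compute is approximated as C = 6 * N * D FLOPs, and the ratio of about 20 tokens per parameter is inferred from Chinchilla's reported 70B parameters and 1.4T training tokens.

```python
import math

# Assumption: rough ratio implied by Chinchilla (70B params, 1.4T tokens).
TOKENS_PER_PARAM = 20

def compute_optimal_split(flops_budget, tokens_per_param=TOKENS_PER_PARAM):
    """Solve C = 6 * N * D with D = tokens_per_param * N for (N, D)."""
    params = math.sqrt(flops_budget / (6 * tokens_per_param))
    tokens = tokens_per_param * params
    return params, tokens

# Recover a Chinchilla-sized model from its approximate training budget.
budget = 6 * 70e9 * 1.4e12
params, tokens = compute_optimal_split(budget)
print(f"{params:.3g} parameters, {tokens:.3g} tokens")
```

Under these assumptions both the parameter count and the token count grow with the square root of the compute budget, which is the sense in which they scale equally.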
LLaMA: Open and Efficient Foundation Language Models
Showed that smaller models trained on significantly more data (following Chinchilla scaling laws) could match or exceed the performance of much larger models, and released the weights openly.
Kicked off the open-source LLM revolution. LLaMA's leak and subsequent open release spawned Alpaca, Vicuna, and hundreds of community models — democratizing access to frontier AI.
GPT-4 Technical Report
Described GPT-4's multimodal capabilities (image and text inputs) and its performance across professional and academic benchmarks, including a simulated bar exam score around the top 10% of test takers and strong MMLU results.
First model to convincingly pass professional exams and demonstrate broad multimodal reasoning, catalyzing widespread enterprise AI adoption.
Llama 2: Open Foundation and Fine-Tuned Chat Models
Provided the most detailed public documentation of how to train, fine-tune, and safety-align a large language model, including their full RLHF methodology.
The 'recipe book' for the open-source LLM community. Its detailed training methodology was copied by virtually every open model that followed.
Mistral 7B
Combined sliding window attention with grouped-query attention and reported that a 7B model outperformed LLaMA 2 13B across all evaluated benchmarks.
Showed that careful architecture and training choices can offset raw scale. Established Mistral AI as a major open-weight competitor.
Gemini: A Family of Highly Capable Multimodal Models
Introduced the Gemini family with native multimodal training from the ground up, achieving SOTA on 30+ benchmarks.
First model family natively trained on interleaved text, image, audio, and video from inception, rather than bolting on vision post hoc.
Architecture
2020 – 2024
New ways to build: efficiency and alternatives
GLU Variants Improve Transformer
Showed that SwiGLU activation (Swish + Gated Linear Unit) significantly improves Transformer FFN quality with minimal compute overhead.
SwiGLU became the default FFN activation in PaLM, LLaMA, Mistral, and most modern Transformers, replacing ReLU and GELU.
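The SwiGLU feed-forward block can be sketched as follows. This is a minimal plain-Python illustration for a single input vector, assuming the bias-free form FFN(x) = (Swish(x W_gate) * (x W_up)) W_down; the weight names are illustrative.

```python
import math

def swish(x):
    """Swish (SiLU) activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

def vec_mat(x, w):
    """Multiply a row vector x by a matrix w given as a list of rows."""
    return [sum(xi * w[i][j] for i, xi in enumerate(x)) for j in range(len(w[0]))]

def swiglu_ffn(x, w_gate, w_up, w_down):
    """Gated feed-forward block: the Swish branch gates the linear branch."""
    gate = [swish(g) for g in vec_mat(x, w_gate)]
    up = vec_mat(x, w_up)
    hidden = [g * u for g, u in zip(gate, up)]
    return vec_mat(hidden, w_down)
```

Compared with a ReLU or GELU block, the gated form uses three weight matrices instead of two, so implementations usually shrink the hidden width to keep the parameter count comparable.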
GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
Scaled Mixture-of-Experts to 600 billion parameters with automatic model parallelism across thousands of TPUs, showing how to train models far beyond single-device memory limits.
Blueprint for scaling MoE to production. Directly influenced Mixtral, DeepSeek V2/V3, and all modern MoE models.
RoFormer: Enhanced Transformer with Rotary Position Embedding
Introduced rotary position embeddings that encode position via rotation matrices, enabling better length generalization. Used by virtually every modern LLM.
RoPE replaced absolute and learned position embeddings as the standard. LLaMA, Qwen, Mistral, and most modern models use it.
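The rotation idea can be demonstrated numerically. The sketch below assumes the interleaved pairing of dimensions and a frequency base of 10000; real implementations differ in how they pair dimensions, so treat the layout as an assumption.

```python
import math

def rope(vec, position, base=10000.0):
    """Rotate each consecutive pair of dimensions by a position-dependent angle."""
    dim = len(vec)
    rotated = []
    for i in range(0, dim, 2):
        # Lower-index pairs rotate faster than higher-index pairs.
        theta = position * base ** (-i / dim)
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        rotated.extend([x * cos_t - y * sin_t, x * sin_t + y * cos_t])
    return rotated

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

query = [1.0, 2.0, 3.0, 4.0]
key = [0.5, -1.0, 2.0, 0.0]
# Both pairs of positions are 2 apart, so the attention scores match.
near = dot(rope(query, 5), rope(key, 3))
far = dot(rope(query, 12), rope(key, 10))
print(abs(near - far) < 1e-9)  # True
```

The attention score between a rotated query and a rotated key depends only on their relative offset, which is the property that makes the scheme attractive for length generalization.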
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
Restructured how attention computation accesses GPU memory (tiling and recomputation), achieving 2-4x speedup and enabling much longer context windows without approximation.
Helped make very long context windows practical. Widely adopted in modern LLM training and inference stacks, where it substantially reduces attention's memory traffic and run time.
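The numerical trick that lets attention be computed tile by tile, without ever materializing the full score matrix, is a running (online) softmax. The sketch below shows only that arithmetic for a single query in plain Python; it is not the fused GPU kernel, and the function names and block size are illustrative.

```python
import math

def scaled_scores(query, keys):
    scale = 1.0 / math.sqrt(len(query))
    return [scale * sum(q * k for q, k in zip(query, key)) for key in keys]

def attention_full(query, keys, values):
    """Reference: softmax over all scores at once."""
    scores = scaled_scores(query, keys)
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(dim)]

def attention_tiled(query, keys, values, block_size=2):
    """Same result, visiting keys and values one block at a time."""
    running_max = float("-inf")
    running_sum = 0.0
    accumulator = [0.0] * len(values[0])
    for start in range(0, len(keys), block_size):
        block_values = values[start:start + block_size]
        scores = scaled_scores(query, keys[start:start + block_size])
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new running maximum.
        rescale = math.exp(running_max - new_max)
        weights = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * rescale + sum(weights)
        accumulator = [
            acc * rescale + sum(w * v[j] for w, v in zip(weights, block_values))
            for j, acc in enumerate(accumulator)
        ]
        running_max = new_max
    return [acc / running_sum for acc in accumulator]
```

Because the tiled version is exact rather than approximate, it can replace standard attention without changing model outputs beyond floating-point rounding.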
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
Pioneered post-training quantization to 4-bit for large language models with minimal quality loss, enabling consumer GPU inference.
Democratized LLM inference by making it possible to run 30B+ parameter models on a single consumer GPU. Widely adopted by the open-source community.
GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
Introduced grouped-query attention as a middle ground between multi-head and multi-query attention, reducing KV cache memory while maintaining quality.
GQA is now standard in LLaMA 2+, Mistral, Gemma, and nearly all efficient modern LLMs.
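The memory saving is easy to quantify. The sketch below assumes a hypothetical model configuration (32 layers, 32 query heads, head dimension 128, a 4096-token sequence, 16-bit values) chosen only for illustration; it does not describe any specific released model.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value

def kv_head_for_query_head(query_head, num_query_heads, num_kv_heads):
    """Each group of consecutive query heads shares one key/value head."""
    group_size = num_query_heads // num_kv_heads
    return query_head // group_size

# Hypothetical configuration, 16-bit cache entries.
layers, query_heads, head_dim, seq_len = 32, 32, 128, 4096
for name, kv_heads in [("multi-head", 32), ("grouped-query", 8), ("multi-query", 1)]:
    size_mib = kv_cache_bytes(layers, kv_heads, head_dim, seq_len) // 2**20
    print(f"{name}: {size_mib} MiB")
# multi-head: 2048 MiB
# grouped-query: 512 MiB
# multi-query: 64 MiB
```

The cache shrinks in direct proportion to the number of key/value heads, while the number of query heads, and hence most of the model's attention capacity, stays unchanged.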
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
Introduced selective state space models that process sequences in linear time (vs. quadratic for Transformers), with a data-dependent selection mechanism that lets the model focus on relevant parts of the input.
First serious architectural challenger to the Transformer. Inspired hybrid Mamba-Transformer models like Jamba and NVIDIA Nemotron 3.
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
Introduced Multi-head Latent Attention (MLA), which compresses the key-value cache into a low-rank latent space, dramatically reducing the memory needed to serve long-context models.
One of the most notable attention-efficiency techniques since FlashAttention. Carried forward into DeepSeek-V3 and R1, and influential in how the field approaches efficient inference.
Diffusion
2020 – 2023
From noise to art: the generative revolution
Denoising Diffusion Probabilistic Models
Showed that gradually adding noise to data and then learning to reverse the process could generate images rivaling GANs, with more stable training and better diversity.
Foundation of Stable Diffusion, DALL·E 2, and Imagen — replaced GANs as the dominant image generation paradigm.
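The noising half of the process has a closed form, which the sketch below illustrates. It assumes the linear noise schedule from the paper (1000 steps, beta from 1e-4 to 0.02) and works on a plain list of floats; the learned denoising network that reverses the process is not shown, and the function names are illustrative.

```python
import math
import random

def linear_beta_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances, increasing linearly."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def cumulative_alpha_bar(betas):
    """alpha_bar[t] = product of (1 - beta[s]) for s <= t."""
    result, product = [], 1.0
    for beta in betas:
        product *= 1.0 - beta
        result.append(product)
    return result

def noise_sample(x0, t, alpha_bar, rng):
    """Sample x_t = sqrt(alpha_bar[t]) * x0 + sqrt(1 - alpha_bar[t]) * noise."""
    signal = math.sqrt(alpha_bar[t])
    sigma = math.sqrt(1.0 - alpha_bar[t])
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [signal * x + sigma * e for x, e in zip(x0, noise)]
    return x_t, noise

alpha_bar = cumulative_alpha_bar(linear_beta_schedule())
rng = random.Random(0)
x_t, noise = noise_sample([1.0, -2.0, 0.5], 500, alpha_bar, rng)
```

During training the network is shown `x_t` and `t` and asked to predict `noise`; by the final step almost none of the original signal remains, so generation can start from pure Gaussian noise.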
Zero-Shot Text-to-Image Generation
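Demonstrated that a single model could generate diverse, creative images from arbitrary text descriptions, combining language understanding with image generation. Unlike the diffusion papers in this section, it used an autoregressive Transformer over discrete image tokens.
First major text-to-image model (DALL·E). Showed that generative models could compose novel visual concepts from text rather than only classify or analyze images.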
Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding
Demonstrated that large frozen text encoders (T5-XXL) with cascaded diffusion models produce photorealistic images, outperforming DALL·E 2.
Proved that language understanding is the key bottleneck for text-to-image quality, shifting the field toward larger text encoders.
Alignment
2017 – 2023
Teaching AI to be helpful, harmless, and honest
Deep Reinforcement Learning from Human Preferences
Pioneered the RLHF paradigm — training a reward model from human preferences, then using it to fine-tune policies via reinforcement learning.
Constitutional AI: Harmlessness from AI Feedback
Introduced RL from AI Feedback using "constitutions" (rule sets) for self-supervision, reducing reliance on human labels for harmlessness training.
Training Language Models to Follow Instructions with Human Feedback
Applied RLHF to GPT-3: supervised fine-tuning → reward modeling → PPO optimization. Made models safer, more helpful, and more aligned.
Direct Preference Optimization: Your Language Model is Secretly a Reward Model
Showed that preference learning could be formulated as a simple classification problem on pairs of outputs, eliminating the need for a separate reward model and the instabilities of PPO training.
Collapsed the multi-stage RLHF pipeline into a single supervised-style training stage. Adopted by LLaMA 3, Mixtral, Tülu, and many other open models.
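The classification-style objective can be written for a single preference pair as below. This is a minimal sketch assuming the summed log-probabilities of each response under the policy and under the frozen reference model are already available; the default beta of 0.1 is an illustrative choice, and the argument names are hypothetical.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Negative log-sigmoid of the scaled log-ratio margin for one pair."""
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(z)) = log(1 + exp(-z)), computed in a stable form.
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))

# A policy identical to the reference gives log(2), the chance-level loss.
baseline = dpo_loss(-5.0, -7.0, -5.0, -7.0)
# Shifting probability toward the chosen response lowers the loss.
improved = dpo_loss(-4.0, -8.0, -5.0, -7.0)
```

No reward model or sampling loop appears anywhere: the loss is computed from log-probabilities on a fixed dataset of preference pairs, which is what removes the separate reward-modeling and PPO stages.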
Reasoning
2020 – 2025
Making AI think step by step
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
Combined a neural retriever (that searches a knowledge base) with a sequence-to-sequence generator, allowing the model to 'look up' relevant documents before answering — reducing hallucinations and enabling knowledge updates without retraining.
Invented the RAG pattern now used by Perplexity, enterprise search, and virtually every production LLM system that needs accurate, up-to-date information.
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Showed that including worked, step-by-step reasoning examples in the prompt unlocks arithmetic, logic, and commonsense reasoning in large models like PaLM.
ReAct: Synergizing Reasoning and Acting in Language Models
Combined chain-of-thought reasoning with external tool use (APIs, search), improving QA and decision-making through interleaved reasoning and action.
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Demonstrated that pure RL training, without supervised fine-tuning on reasoning traces, can produce chain-of-thought reasoning (DeepSeek-R1-Zero). The full R1 model, which adds a small cold-start fine-tuning stage before RL, achieved performance comparable to OpenAI o1.
Showed an alternative path to reasoning capabilities without relying on proprietary training data. The distilled models outperformed many larger models.