If you want to understand how Large Language Models are built — not just how to call an API — you need a map.
There is a lot of ground to cover. Tokenization. Attention mechanisms. Distributed training. Chinchilla scaling laws. Data curation pipelines. RLHF. The field moves fast and the terminology can feel deliberately intimidating.
But here is the thing: every serious LLM — from GPT-4 to Llama to Gemini — is built on the same five foundational pillars. Once you understand the shape of those pillars, the rest of the field starts to make sense.
This post is that map. We will walk through each of the five pillars, explain what sits inside each one, and why it matters. Everything we write about on this blog will connect back to these five areas.
The Five Pillars
The table below, adapted from a Stanford lecture on LLM design decisions, organizes every major decision a team makes when building a large language model into five categories:
| Pillar | What it covers |
|---|---|
| Basics | Tokenization, Architecture, Loss function, Optimizer, Learning rate |
| Systems | Kernels, Parallelism, Quantization, Activation checkpointing, CPU offloading, Inference |
| Scaling Laws | Scaling sequence, Model complexity, Loss metric, Parametric form |
| Data | Evaluation, Curation, Transformation, Filtering, Deduplication, Mixing |
| Alignment | Supervised fine-tuning, Reinforcement learning, Preference data, Synthetic data, Verifiers |
Let’s go through each one.
Pillar 1: Basics
The Basics pillar covers the core design decisions you make before you write a single line of training code. Get these wrong and nothing else can save you.
Tokenization
Before a model can learn anything, you need to convert raw text into numbers. A tokenizer splits text into tokens — subword units that the model operates on — and maps each token to an integer ID.
The choice of tokenizer has downstream effects on everything. BPE (Byte Pair Encoding), used by GPT-4 and Llama, and SentencePiece, used by many multilingual models, make different tradeoffs around vocabulary size, how they handle rare words, and how many tokens a given piece of text consumes. A model trained with one tokenizer cannot be swapped to another without retraining from scratch.
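To make BPE concrete, here is a toy sketch of its training loop on a three-word corpus (the function names are ours, not from any library): each iteration finds the most frequent adjacent symbol pair and merges it into a new vocabulary entry.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words (each word is a tuple of symbols)."""
    pairs = Counter()
    for word, freq in words.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = pair[0] + pair[1]
    out = {}
    for word, freq in words.items():
        new_word, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
                new_word.append(merged)
                i += 2
            else:
                new_word.append(word[i])
                i += 1
        out[tuple(new_word)] = out.get(tuple(new_word), 0) + freq
    return out

# Toy corpus: word -> frequency, each word split into characters.
words = {tuple("lower"): 2, tuple("lowest"): 1, tuple("low"): 3}
merges = []
for _ in range(3):                      # learn 3 merges
    pair = most_frequent_pair(words)
    merges.append(pair)
    words = merge_pair(words, pair)
```

After three merges the learned sequence is `l+o`, `lo+w`, `low+e` — frequent fragments become single tokens, which is exactly why common words cost fewer tokens than rare ones.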
Architecture
The transformer architecture, introduced in the 2017 “Attention Is All You Need” paper, is the backbone of virtually every modern LLM. But there is significant variation within it.
Decisions here include:
- Number of layers (depth of the network)
- Attention heads and their dimensionality
- Context window length — how many tokens the model can “see” at once
- Positional encoding — RoPE, ALiBi, or learned positions
- Normalization — Pre-LayerNorm or Post-LayerNorm, which affects training stability
- Activation functions — SiLU and SwiGLU have largely replaced ReLU in modern models
These choices compound. A model with 32 layers and 8 attention heads behaves very differently from one with 80 layers and 64 heads, even at the same parameter count.
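These knobs are often collected into a single config object. The field names below are hypothetical (every codebase names them differently), with values loosely in the range of a Llama-7B-class model:

```python
from dataclasses import dataclass

@dataclass
class ModelConfig:
    """Illustrative architecture hyperparameters; names and defaults are
    our own, roughly in the range of a Llama-7B-class model."""
    n_layers: int = 32
    n_heads: int = 32
    d_model: int = 4096              # hidden size
    context_length: int = 4096
    positional_encoding: str = "rope"
    norm: str = "pre_rmsnorm"        # Pre-LayerNorm-style placement
    activation: str = "swiglu"

cfg = ModelConfig()
head_dim = cfg.d_model // cfg.n_heads   # dimensionality of each attention head
```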
Loss Function
Language models are trained to minimize cross-entropy loss over next-token prediction. Given a sequence of tokens, the model tries to predict the next one, and the loss measures how wrong it was.
This sounds simple, but the loss function is the signal that shapes everything the model learns. Every behavior, every fact, every reasoning pattern the model exhibits comes from this signal applied to enough data.
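A minimal sketch of the objective, using a toy four-token vocabulary: the loss at each position is the negative log-probability the model assigned to the token that actually came next.

```python
import math

def softmax(logits):
    m = max(logits)                       # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_loss(logits_per_step, target_ids):
    """Average cross-entropy: -log p(target) at each position."""
    losses = []
    for logits, target in zip(logits_per_step, target_ids):
        probs = softmax(logits)
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)

# Toy vocab of 4 tokens; two prediction steps.
logits = [[2.0, 0.5, 0.1, -1.0],   # model strongly favors token 0
          [0.0, 0.0, 0.0, 0.0]]    # uniform: this step contributes log(4)
targets = [0, 2]
loss = next_token_loss(logits, targets)
```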
Optimizer
Most large models are trained with AdamW, a variant of Adam with weight decay. The optimizer determines how gradients are used to update model weights.
For large models, optimizer state can consume as much memory as the model itself — Adam stores a first and second moment for every parameter. This is a key constraint when planning training infrastructure.
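A back-of-the-envelope sketch of that constraint (our own accounting; gradients and activations are deliberately excluded): with BF16 weights, an FP32 master copy, and FP32 Adam moments, the optimizer state alone is four times the size of the weights.

```python
def adam_training_memory_gb(n_params, bf16_weights=True):
    """Rough per-component memory in GB for mixed-precision AdamW training.

    Assumes BF16 weights (2 bytes each), an FP32 master copy (4 bytes),
    and FP32 first and second moments (4 + 4 bytes). Gradients and
    activations are not counted here.
    """
    gb = 1024 ** 3
    weights = (2 if bf16_weights else 4) * n_params / gb
    master = 4 * n_params / gb
    moments = 8 * n_params / gb       # Adam's m and v, FP32 each
    return {"weights": weights, "master": master, "optimizer": moments}

mem = adam_training_memory_gb(7_000_000_000)   # a 7B-parameter model
```

For a 7B model this works out to roughly 13 GB of weights against more than 50 GB of optimizer state — which is why techniques like ZeRO partitioning exist.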
Learning Rate
The learning rate controls how aggressively the optimizer updates weights. Too high and training diverges; too low and you waste compute.
Modern training runs use a learning rate schedule: a warmup phase where the rate ramps up gradually, followed by a decay (often cosine decay) over the rest of training. Getting this schedule wrong is one of the most common causes of unstable or suboptimal training runs.
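A minimal sketch of that schedule — linear warmup followed by cosine decay (the function name and defaults are ours):

```python
import math

def lr_at_step(step, max_lr, warmup_steps, total_steps, min_lr=0.0):
    """Linear warmup to max_lr, then cosine decay toward min_lr."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (max_lr - min_lr) * cosine

schedule = [lr_at_step(s, max_lr=3e-4, warmup_steps=100, total_steps=1000)
            for s in range(1000)]
```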
Pillar 2: Systems
Knowing the architecture is one thing. Actually training a 70-billion-parameter model on thousands of GPUs is an entirely different engineering challenge. That is what the Systems pillar is about.
Kernels
A kernel is the low-level GPU computation that executes a specific operation. The bottleneck in transformer training is often not the algorithm but the implementation. FlashAttention, for example, is a rewritten attention kernel that computes the same result as standard attention but in a way that is dramatically faster and more memory-efficient by tiling computations to fit in fast SRAM.
Parallelism
A single GPU cannot hold a large model or process enough data fast enough. Training requires distributing work across hundreds or thousands of GPUs using multiple parallelism strategies:
- Data parallelism: Each GPU holds a full copy of the model and processes a different batch. Gradients are synchronized across GPUs after each step.
- Tensor parallelism: Individual weight matrices are split across GPUs. Used for very large models that do not fit on one device.
- Pipeline parallelism: Different layers of the model are distributed across different GPUs, with data flowing through them like an assembly line.
Production training runs often combine all three — this is called 3D parallelism.
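The data-parallel step can be sketched with plain lists standing in for tensors (a toy simulation of what all-reduce does, not a distributed implementation):

```python
def all_reduce_mean(per_gpu_grads):
    """Average gradients element-wise across replicas (what all-reduce computes)."""
    n = len(per_gpu_grads)
    return [sum(g[i] for g in per_gpu_grads) / n
            for i in range(len(per_gpu_grads[0]))]

def data_parallel_step(weights, per_gpu_grads, lr):
    """Every replica applies the same averaged gradient, so copies stay in sync."""
    avg = all_reduce_mean(per_gpu_grads)
    return [w - lr * g for w, g in zip(weights, avg)]

weights = [1.0, -2.0]
grads = [[0.2, 0.4], [0.6, 0.0]]      # two "GPUs", different microbatches
new_weights = data_parallel_step(weights, grads, lr=0.1)
```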
Quantization
Quantization reduces the numerical precision of model weights from 32-bit or 16-bit floats to 8-bit integers or lower. This shrinks memory footprint dramatically and speeds up computation.
INT8 and INT4 quantization are standard for inference. Mixed-precision training (BF16 weights and activations, with FP32 master weights and optimizer state) is standard for training. The tradeoff is a small loss in precision — getting the quantization scheme right without degrading model quality is a genuine engineering problem.
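A minimal sketch of symmetric per-tensor INT8 quantization, one of several common schemes (production systems often quantize per-channel or per-group instead):

```python
def quantize_int8(values):
    """Symmetric per-tensor INT8 quantization: scale so the largest
    magnitude maps to 127."""
    scale = max(abs(v) for v in values) / 127.0
    q = [max(-128, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

weights = [0.02, -0.51, 0.37, 1.27]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)       # close to, but not exactly, the input
```

The round trip introduces an error of at most half a scale step per value — the "small loss in precision" the tradeoff refers to.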
Activation Checkpointing
During training, intermediate activations (the outputs of each layer) are normally kept in memory so they can be used during the backward pass to compute gradients. For large models, this consumes enormous amounts of GPU memory.
Activation checkpointing (also called gradient checkpointing) trades compute for memory: instead of storing all activations, you recompute them during the backward pass. This allows training much larger models on the same hardware at the cost of roughly 33% more compute.
CPU Offloading
Another memory optimization: move optimizer state or model weights to CPU RAM when not in use, and pull them back to GPU when needed. Libraries like DeepSpeed ZeRO implement this aggressively, enabling training of models that are far too large to fit entirely on GPU memory.
Inference
Training and inference have different engineering constraints. During inference, you need low latency for interactive applications and high throughput for batch jobs.
Key inference optimizations include KV cache (caching key-value pairs from the attention mechanism to avoid recomputing them on each token generation step), speculative decoding (using a small draft model to generate candidate tokens, then verifying them with the large model in parallel), and continuous batching (dynamically batching multiple requests to maximize GPU utilization).
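To make the KV cache concrete, here is a toy single-head generation loop (plain lists stand in for tensors, and the vectors are invented): each step appends one new key and value to the cache instead of recomputing them for the whole prefix.

```python
import math

def attend(q, keys, values):
    """Single-head scaled dot-product attention for one query over all
    cached keys/values."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(len(q))
              for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]

# Generation loop: each new token appends its K and V to the cache once,
# rather than recomputing K/V for the entire prefix at every step.
kv_cache = {"keys": [], "values": []}
steps = [([1.0, 0.0], [1.0, 0.0], [0.5, 0.5]),
         ([0.0, 1.0], [0.0, 1.0], [1.0, -1.0])]
for q, k, v in steps:
    kv_cache["keys"].append(k)
    kv_cache["values"].append(v)
    out = attend(q, kv_cache["keys"], kv_cache["values"])
```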
Pillar 3: Scaling Laws
One of the most powerful — and most misunderstood — ideas in modern AI is that model behavior follows predictable mathematical laws as you scale compute, data, and parameters.
Scaling Sequence
Training a state-of-the-art model is expensive. Before committing tens of millions of dollars to a full training run, labs run smaller experiments to understand how performance will scale. The scaling sequence is the practice of running a series of smaller models to extrapolate behavior at larger scale.
Model Complexity
The key variable in scaling laws is the number of parameters — the learnable weights in the model. But parameter count alone is not enough. You also need to account for how many tokens the model is trained on.
The landmark Chinchilla paper (Hoffmann et al., 2022) showed that most models at the time were significantly undertrained relative to their size. Given a fixed compute budget, you should train a smaller model on more data rather than a larger model on less data. The Chinchilla-optimal ratio is roughly 20 tokens per parameter.
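Combined with the common approximation that training costs about 6·N·D FLOPs, the 20-tokens-per-parameter rule pins down both N and D for a given budget. A sketch (the function name is ours):

```python
import math

def chinchilla_allocation(flops_budget, tokens_per_param=20.0):
    """Split a compute budget between parameters and tokens.

    Uses the common approximation C ≈ 6·N·D training FLOPs together with
    the Chinchilla rule of thumb D ≈ 20·N, giving N = sqrt(C / 120).
    """
    n_params = math.sqrt(flops_budget / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Roughly the Chinchilla-scale budget of ~5.76e23 FLOPs:
n, d = chinchilla_allocation(5.76e23)
```

Plugging in that budget recovers roughly 70B parameters and 1.4T tokens — the shape of the Chinchilla model itself.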
Loss Metric
The primary metric for pretraining is perplexity — a measure of how well the model predicts held-out text. Perplexity decreases predictably as you scale compute, and this decrease follows a power law.
This predictability is what makes scaling laws useful: you can extrapolate from small experiments to estimate how a much larger model will perform before you train it.
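Concretely, perplexity is the exponential of the average per-token negative log-likelihood: a perplexity of 4 means the model is, on average, as uncertain as a uniform choice among 4 tokens.

```python
import math

def perplexity(token_probs):
    """exp of the average negative log-likelihood the model assigned
    to the actual next tokens in held-out text."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

uniform_over_4 = perplexity([0.25, 0.25, 0.25])   # clueless 4-token model
confident = perplexity([0.9, 0.8, 0.95])          # a much better model
```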
Parametric Form
The mathematical form of the scaling law is:
L(N, D) = A / N^α + B / D^β + L_∞
Where L is loss, N is parameters, D is training tokens, and A, B, α, β, and L_∞ are fitted constants (L_∞ is the irreducible loss floor that no amount of scaling removes). This formula lets you compute the optimal allocation of parameters and data given a fixed compute budget (measured in FLOPs).
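The formula is easy to evaluate directly. The default constants below are approximately those from the Chinchilla paper's fit (quoted from memory, so treat them as illustrative); the point is just that loss falls as either N or D grows:

```python
def scaling_loss(n_params, n_tokens,
                 A=406.4, alpha=0.34, B=410.7, beta=0.28, L_inf=1.69):
    """Evaluate the parametric scaling law L(N, D) = A/N^α + B/D^β + L_∞."""
    return A / n_params ** alpha + B / n_tokens ** beta + L_inf

base = scaling_loss(1e9, 2e10)        # 1B params, 20B tokens
more_data = scaling_loss(1e9, 2e11)   # same model, 10x the data
bigger = scaling_loss(1e10, 2e10)     # 10x the params, same data
```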
Pillar 4: Data
The Systems pillar gets most of the engineering attention. The Data pillar deserves more.
A model is a compressed representation of its training data. Everything it knows, every capability it has, every bias it exhibits comes from what it was trained on. Arguably, data is the single most important variable in model quality — and also the hardest to get right.
Evaluation
Before you curate data, you need to know what “better” means. Evaluation — defining benchmark tasks and metrics — drives every data decision. Common pretraining evaluations include MMLU (knowledge), HellaSwag (commonsense reasoning), and HumanEval (code). What you measure shapes what you optimize for.
Curation
Curation is the process of selecting which sources to include in your training corpus. Common web-crawl datasets like Common Crawl contain a huge amount of low-quality content — spam, boilerplate, nonsensical machine-translated text.
High-quality curated data (books, academic papers, code, carefully filtered web text) is disproportionately valuable. The FineWeb dataset from Hugging Face, for example, applies aggressive quality filtering to Common Crawl and achieves significantly better downstream performance than using the raw crawl.
Transformation
Raw text from the web is messy. Transformation includes HTML stripping, encoding normalization, language identification, and format standardization. For code data, transformation might include removing auto-generated files or normalizing whitespace.
These transformations sound mundane but have significant impact. Poorly extracted HTML that leaves nav bars, cookie banners, and ads in the training text teaches the model those patterns too.
Filtering
Filtering removes low-quality documents using heuristics and model-based quality classifiers. Common filters include:
- Minimum and maximum document length
- Fraction of text that is alphanumeric
- Number of lines ending with punctuation
- Perplexity scored by a small reference model (low perplexity = high quality)
Aggressive filtering dramatically improves model quality but also reduces dataset size, which brings its own tradeoffs.
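A sketch of how such heuristics combine into a single pass/fail filter (the thresholds are illustrative, not from any production pipeline):

```python
def passes_quality_filters(doc, min_len=200, max_len=100_000):
    """Apply simple heuristic filters of the kind listed above."""
    if not (min_len <= len(doc) <= max_len):
        return False
    # Reject documents dominated by symbols, markup, or punctuation soup.
    alnum_fraction = sum(c.isalnum() for c in doc) / len(doc)
    if alnum_fraction < 0.6:
        return False
    # Prose tends to end its lines with punctuation; boilerplate does not.
    lines = [ln for ln in doc.splitlines() if ln.strip()]
    punct_ended = sum(ln.rstrip().endswith((".", "!", "?", '"')) for ln in lines)
    return punct_ended / len(lines) >= 0.5

good = "This is a well formed paragraph of prose. " * 10
spam = "click here || buy now || $$$ || " * 20
keep_good, keep_spam = passes_quality_filters(good), passes_quality_filters(spam)
```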
Deduplication
The web contains enormous amounts of duplicated content — the same article scraped from dozens of mirrors, boilerplate legal text repeated across millions of pages. Training on duplicates is wasteful at best and harmful at worst: it causes the model to memorize and over-weight repeated content.
Deduplication using MinHash LSH or similar algorithms is now standard practice. The Llama and Mistral training pipelines both apply aggressive deduplication.
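A compact sketch of the MinHash idea (a toy version — production systems add the LSH banding step to avoid comparing all pairs): matching signature slots estimate the Jaccard similarity of two documents' shingle sets.

```python
import hashlib

def shingles(text, k=5):
    """Character k-grams (shingles) of a document."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def minhash_signature(text, num_hashes=64):
    """For each of num_hashes salted hash functions, keep the minimum
    shingle hash. The fraction of matching slots between two signatures
    estimates the Jaccard similarity of the underlying shingle sets."""
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int.from_bytes(hashlib.md5(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingles(text)))
    return sig

def estimated_similarity(a, b):
    return sum(x == y for x, y in zip(a, b)) / len(a)

doc = "the quick brown fox jumps over the lazy dog " * 3
near_dup = doc + "extra tail"
sim = estimated_similarity(minhash_signature(doc), minhash_signature(near_dup))
```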
Mixing
A model trained only on web text will be good at web text. A model trained on a carefully mixed corpus of web text, books, code, and scientific papers will be more capable across domains.
Data mixing — deciding what fraction of training tokens come from each source — is one of the highest-leverage decisions in model development. The exact mixture is typically proprietary, but the principle is well understood: domain-specific data matters, and the mixture should reflect what you want the model to be good at.
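A sketch of mixture sampling — drawing each training document's source in proportion to its weight (the weights below are invented for illustration; as noted, real ratios are proprietary):

```python
import random

def sample_source(mixture, rng):
    """Pick a data source with probability proportional to its mixture weight."""
    r = rng.random() * sum(mixture.values())
    for source, weight in mixture.items():
        r -= weight
        if r <= 0:
            return source
    return source  # guard against float rounding on the last source

# Hypothetical mixture weights, for illustration only.
mixture = {"web": 0.67, "code": 0.15, "books": 0.10, "papers": 0.08}
rng = random.Random(0)
counts = {s: 0 for s in mixture}
for _ in range(100_000):
    counts[sample_source(mixture, rng)] += 1
```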
Pillar 5: Alignment
Pretraining gives you a model that is very good at one thing: predicting the next token. A pretrained base model is not a product. It will happily complete a prompt asking for instructions on how to do something harmful. It will generate plausible-sounding nonsense with no awareness that it should be accurate.
Alignment is the set of techniques that takes a pretrained base model and shapes it into something useful and safe.
Supervised Fine-Tuning (SFT)
The first step is supervised fine-tuning. You collect a dataset of (prompt, ideal response) pairs, authored or reviewed by humans, and fine-tune the pretrained model to generate those ideal responses.
SFT teaches the model the format and style of helpful responses. After SFT, the model knows to answer questions rather than just continuing them, to follow instructions, and to stay on topic. But SFT alone is not enough to reliably align complex preferences.
Reinforcement Learning from Human Feedback (RLHF)
RLHF is the technique that made ChatGPT feel qualitatively different from GPT-3. The process has three steps:
- Collect comparisons: Show human raters pairs of model outputs and ask which is better.
- Train a reward model: Fit a model that assigns each output a scalar score, trained so that the outputs humans preferred score higher.
- Fine-tune with RL: Use the reward model as a signal to update the language model via reinforcement learning (typically PPO or similar algorithms), encouraging it to generate outputs that score highly.
RLHF is powerful but expensive, brittle, and prone to reward hacking. A lot of active research is focused on making this process more stable and efficient.
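The reward-model step typically fits a Bradley-Terry pairwise objective; a minimal sketch of the loss (function name ours):

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """Bradley-Terry pairwise loss: -log sigmoid(r_chosen - r_rejected).
    Minimized when the reward model scores the preferred output higher."""
    margin = reward_chosen - reward_rejected
    return math.log(1.0 + math.exp(-margin))

good_margin = preference_loss(2.0, -1.0)   # model already agrees with the label
bad_margin = preference_loss(-1.0, 2.0)    # model disagrees: large loss
```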
Preference Data
The quality and source of human preference data is critical. Who labeled it? What instructions did they have? What biases do they bring?
Constitutional AI (Anthropic) reduces reliance on expensive human labeling by using AI-generated critiques and preferences, scored against an explicit set of principles. Direct Preference Optimization (DPO) attacks the cost from another angle: it trains directly on preference pairs, eliminating the separate reward model and RL loop entirely.
Synthetic Data
As the supply of high-quality human-authored data becomes a constraint, synthetic data — data generated by a capable model — is increasingly used for both pretraining and alignment.
Meta’s Llama models were significantly improved using synthetic data generated by earlier Llama models, and OpenAI’s o1 is reported to have been trained on synthetic chain-of-thought traces. This creates a feedback loop: better models generate better training data, which trains better models.
Verifiers
Verifiers are a critical piece of aligning models in domains where correctness can be checked automatically — math, code, logic. Instead of relying on human preferences, you train the model using a verifier (a separate program or model) that checks whether the output is actually correct.
This is the key insight behind reinforcement learning from verifiable rewards (RLVR): in domains with ground truth, you do not need humans in the loop. The model can explore the solution space, and correct answers provide the training signal.
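A toy verifier for a math task makes the idea tangible (real verifiers parse structured answers or run unit tests; this one just checks the final token of the output):

```python
def math_verifier(ground_truth, model_output):
    """Toy verifier: reward 1.0 if the model's final token matches the
    ground-truth answer numerically, else 0.0."""
    final = model_output.strip().split()[-1]
    try:
        return 1.0 if float(final) == float(ground_truth) else 0.0
    except ValueError:
        return 0.0   # non-numeric final answer: no reward

rewards = [math_verifier("42", "The answer is 42"),
           math_verifier("42", "The answer is 41"),
           math_verifier("42", "I am not sure")]
```

In an RLVR loop, these binary rewards replace the learned reward model: only verifiably correct completions reinforce the policy.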
Why These Five Pillars
The reason we organized this blog around these five pillars is that they reflect the actual structure of the field. Papers, engineering blog posts, and research advances all slot neatly into one of these categories. When you see a new paper about FlashAttention 3, that is a Systems paper. When you see a paper about a new RLHF variant, that is Alignment. When you see a new approach to data filtering, that is Data.
More importantly: if you want to build real AI systems — not just call APIs — you need fluency across all five. An application engineer needs the Basics to understand what is happening inside the model. A production ML engineer needs Systems to deploy efficiently. A team training their own model needs Scaling Laws to spend their compute budget wisely, Data to build a quality corpus, and Alignment to make the result useful.
What’s Next
Each of these pillars deserves a deep dive. Here is where we are headed:
- Basics: A ground-up walkthrough of the transformer — attention, positional encoding, how training actually works
- Systems: FlashAttention, parallelism strategies, how to serve a 70B model for less than you think
- Scaling Laws: The Chinchilla results explained, and what they imply for how you train
- Data: Building a data pipeline for LLM pretraining — from raw web crawl to clean, mixed, deduplicated corpus
- Alignment: SFT datasets, RLHF in practice, and why synthetic data changes the calculus
If you want to follow along as we go deep on each pillar, subscribe below.
References
- Vaswani, A. et al. “Attention Is All You Need.” NeurIPS, 2017.
- Hoffmann, J. et al. “Training Compute-Optimal Large Language Models (Chinchilla).” DeepMind, 2022.
- Ouyang, L. et al. “Training language models to follow instructions with human feedback (InstructGPT).” OpenAI, 2022.
- Bai, Y. et al. “Constitutional AI: Harmlessness from AI Feedback.” Anthropic, 2022.
- Rafailov, R. et al. “Direct Preference Optimization: Your Language Model is Secretly a Reward Model.” Stanford, 2023.
- Dao, T. et al. “FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness.” 2022.
- Penedo, G. et al. “The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale.” Hugging Face, 2024.