A Plain-Language Manual · 40 Pages
AI Models.
How language, image, and video models work, taken apart one mechanism at a time.
Page 02/Intro
Two stacks. One substrate.
Seven familiar products. Two ways to decode. One shared operation doing the work in the middle.
   LANGUAGE              GENERATION
   ────────              ──────────
    tokens             latent patches
      ▼                      ▼
    embed                VAE encode
      ▼                      ▼
┌───────────┐          ┌───────────┐
│transformer│          │transformer│
│decoder×80 │          │ DiT × 28  │
│causal     │          │ full      │
└───────────┘          └───────────┘
      ▼                      ▼
   unembed               VAE decode
      ▼                      ▼
  next token             image/video
sample · repeat          final frame

──── attention(Q,K,V) ────
identical · every layer · every block
Two stacks · one engine · the substrate is attention
Start with seven products you have probably used. On the language side: ChatGPT, Gemini, Claude. On the image and video side: Midjourney, Veo, Sora, Flux. They feel like different kinds of software. Underneath, they run the same core operation. It is called attention, and the next four pages take it apart properly. For now one sentence is enough: attention lets every element in a sequence look at every other element at the same time, and update itself based on what it finds.
The two families differ in what goes in and what comes out, not in the engine. A language model takes in fragments of words and builds its answer one fragment at a time. An image or video model takes in random noise and removes that noise, in steps, until a picture is left. Different inputs and different outputs, with the same attention operation doing the work between them.
This manual is organized around that shared middle: thirteen pages on the language stack, thirteen on image and video, five on what they share. Read in order, or jump via the dock.
Text and pixels need different decoders. The operation between them is the same.
Page 03/Intro
Why · what. Now · how.
Every concept page follows the same four-beat shape, in the same order.
¶1 · why
The hook.
The problem this concept exists to solve.
"Pixel-space diffusion is prohibitively expensive."
¶2 · what
The mechanism.
Stated in one paragraph.
"A VAE compresses 1024² to 64²; diffusion runs there."
¶3 · now
The consequence.
Where it shows up in products you've used.
"Stable Diffusion, Flux, SD3, Veo, Sora — all use this."
─ · how
The takeaway.
Layperson model + the common wrong claim.
"Sculpt the storyboard. The VAE isn't lossless."
Four beats · the shape every concept page follows
Every concept page is built on the same four beats, in the same order. First, why the idea exists: the problem it was invented to solve. Second, what it actually does, stated as a mechanism in one paragraph. Third, where it already shows up in products you have used. Fourth, the short version most people get wrong, named and then corrected.
The layout supports those beats. The figure on the left is the anchor: an equation, a worked example, a comparison table, or an annotated flow. The prose on the right narrates what the figure shows, one step at a time. The single line at the bottom is the one sentence to keep if you keep nothing else from the page.
Navigation is keyboard-first. The arrow keys or the space bar move forward and back. The number keys 1 through 9 jump to the first nine pages. Home and End jump to the first page and the last. The dock along the bottom groups every page by section.
One concept per page. One line to keep.
Page 04/Intro
The transformer.
One block — attention plus a small network — stacked dozens deep. The engine inside every model here.
INPUT   sequence of tokens / patches
        [t0] [t1] [t2] [t3] ... [tN]
              [embeddings]
        ┌──────────────────────┐
        │ BLOCK × N            │
        │ ↓ self-attention     │
        │ ↓ + residual         │
        │ ↓ MLP (per token)    │
        │ ↓ + residual + norm  │
        │ N = 32 to 120        │
        └──────────┬───────────┘
              [output head]
OUTPUT  same-shape sequence
        → next-token  (LLM)
        → noise       (diffusion)
        → fingerprint (encoder)

SAME ENGINE, DIFFERENT TOKENS
LLMs            text tokens
CLIP / SigLIP   image patches
DiT / MM-DiT    latent patches
Sora            spacetime patches
One engine · different tokens · the middle of every model in this deck
A transformer is a single repeating part, stacked. The part is called a block, and every block has the same structure. Inside one block, two things happen in order. First a self-attention step, where each element in the sequence looks at every other element. Then a feed-forward step, called an MLP (a small network applied to each position on its own). Both steps are wrapped in residual connections, which means a step's input is added back onto its output. Because of that addition, each block edits the running representation rather than overwriting it. Stack 32 to 120 of these blocks and you have a transformer.
What flows through the stack is a sequence of vectors — lists of numbers, one per element, where an element can be a word fragment, an image patch, or a video patch. The output is a sequence of the same shape, which a decoder then reads in whatever way the product needs: as a next-token distribution, as noise to remove, or as a single summary vector.
So what changes from product to product is the tokenization that feeds the engine and the decoder that reads its output, not the engine itself. The middle stays the same. The next three pages open that middle up: first the attention operation itself, then the block it lives in, then the three architectural shapes the family splits into.
Change what you feed in and what reads the output, and the same engine becomes a different product.
Section 01 · The Substrate · 4 pages
attention.
The operation under everything. Language and generation share this mechanism before they split into different decoders.
Page 06/Attention
Q · K · V.
Query, Key, Value. Every element asks every other element a question, all at the same time.
SEQUENCE
[the] [cat] [sat] [on] [the] [mat]
        ↓ ATTENTION PASS ↓

SCORES FOR "cat"      SOFTMAX
cat → mat = 2.1       0.57 ██████
cat → sat = 0.9       0.17 ██
cat → on  = 0.4       0.10 █
cat → the = 0.2       0.08 █
cat → cat = 0.1       0.08 █

UPDATE (weighted sum of Values)
new(cat) = 0.57 · Vmat + 0.17 · Vsat + 0.10 · Von + 0.08 · Vthe + 0.08 · Vcat

all 6 tokens update in parallel
One attention head · the all-pairs operation in one slide
The popular picture — the model reads left to right and predicts the next word — only describes the output. Inside, at every layer, every position updates itself by looking at every other position at once: one parallel operation across the whole sequence.
Each token is turned into three vectors (lists of numbers), each with a job:
Query — the kind of token I am looking for. cat might look for verbs and locations.
Key — what kind of token I am, advertised so others can find me. mat advertises "location."
Value — what I will contribute to your update if you attend to me.
Each Query is compared against every Key by a dot product (a similarity score). Softmax turns those scores into percentages summing to 100 — the attention weights. The token then rewrites itself as that weighted blend of every Value. All tokens do this at once; one such pass is a single head, and a block runs several heads in parallel, each specializing in a different relation. "The model pays attention to the important words" is not wrong, just vague: the real content is the weight table in the figure.
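The steps above can be sketched in plain Python. This is a minimal illustration, not a production implementation: the two-number vectors for three tokens are made up, and the division by the square root of the vector length is the standard scaling from the original Transformer paper, which the prose leaves out.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtracting the max keeps exp() from overflowing
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_head(queries, keys, values):
    """One head: every position blends all Values, weighted by Query-Key similarity."""
    dim = len(keys[0])
    outputs, weight_table = [], []
    for q in queries:
        # Dot product of this Query with every Key, scaled by sqrt(dim).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        # The token's new vector is the weighted sum of all Value vectors.
        blended = [sum(w * v[j] for w, v in zip(weights, values))
                   for j in range(len(values[0]))]
        outputs.append(blended)
        weight_table.append(weights)
    return outputs, weight_table

# Hypothetical vectors for three tokens (illustration only).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]

new_vectors, weights = attention_head(Q, K, V)
for row in weights:
    print([round(w, 2) for w in row])
```

Each printed row is one token's weight table: how much of every other token's Value it absorbs. A real block runs several such heads side by side, each with its own learned Q, K, V projections.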
Every token rewrites itself from a weighted blend of every other token — at every layer, all at once.
Page 07/Attention
Stacked.
The same block, repeated in series. Depth times heads is where composition comes from.
TRANSFORMER BLOCK
  input ────┐
            │
            ▼
     ┌─────────────┐
     │  attention  │  ← all pairs · multi-head
     └──────┬──────┘
            │ + residual
            ▼
     ┌─────────────┐
     │     MLP     │  ← per-position non-linearity
     └──────┬──────┘
            │ + residual + layernorm
            ▼
  output ────┘  (same shape as input)

REPEAT × N
N = 32 (small) → 80 (Llama-70B) → 120+ (frontier)
One block · the same two steps · stacked N times
A single block does two things, both introduced on Page 04. First an attention step, where every position reads every other position. Then a feed-forward MLP step, applied to each position on its own. A residual connection wraps each step, which means the step's input is added back to its output, so the block adjusts the running representation instead of replacing it. Small models stack about 32 of these blocks; frontier models stack 120 or more.
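A rough sketch of that structure in plain Python. The `toy_attention` and `toy_mlp` functions are hypothetical stand-ins (a uniform average and a scaled ReLU), not trained layers, and layer normalization is left out to keep the residual pattern visible.

```python
def add(a, b):
    """Element-wise sum of two sequences of vectors: the residual addition."""
    return [[x + y for x, y in zip(u, v)] for u, v in zip(a, b)]

def block(seq, attention, mlp):
    """One block: attention, then MLP, each wrapped in a residual connection."""
    seq = add(seq, attention(seq))             # every position reads every other
    seq = add(seq, [mlp(vec) for vec in seq])  # each position reshaped on its own
    return seq

def transformer(seq, blocks):
    """Run the sequence up the stack, one block after another."""
    for attention, mlp in blocks:
        seq = block(seq, attention, mlp)
    return seq

def toy_attention(seq):
    """Stand-in for attention: every position receives a small share of the average."""
    n = len(seq)
    average = [sum(vec[j] for vec in seq) / n for j in range(len(seq[0]))]
    return [[0.1 * a for a in average] for _ in seq]

def toy_mlp(vec):
    """Stand-in for the MLP: a scaled ReLU applied to one position."""
    return [0.1 * max(0.0, x) for x in vec]

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = transformer(tokens, [(toy_attention, toy_mlp)] * 4)
print(out)
```

Because each step is added to its input, a step that outputs zeros leaves the sequence untouched. That is the sense in which a block edits the running representation rather than overwriting it.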
Follow one token's representation up the stack to see why depth matters. The first block runs attention, so the token absorbs a weighted blend of every other token. Then the MLP reshapes that token on its own. The result is handed to the second block, which runs the same two steps again — but now on representations that already carry context from the first pass. Repeat for every block. By the top of a 100-block stack, each token has been informed by every other token dozens of times over, each pass working at a slightly more abstract level than the last.
What ends up living at each depth is not designed; it emerges from training. In practice, early blocks tend to resolve local grammar, middle blocks build meaning and relationships, and late blocks assemble the final prediction. Nobody assigns those roles. In a language model they all fall out of the same next-token objective.
Each block reworks the sequence using everything the block below it already worked out.
Page 08/Attention
Three shapes.
Encoder · Decoder · Encoder–Decoder. All modern chat LLMs are decoder-only.
Encoder-only
BERT · CLIP text · SigLIP. Reads the whole sequence at once, bidirectional. Outputs one fingerprint. Used for classification and embeddings.
Decoder-only
GPT · LLaMA · Gemini · Claude. Reads left-to-right, predicting the next token; a causal mask blocks it from seeing ahead. Every modern chat LLM lives here.
Encoder–Decoder
T5 · original Transformer · NMT models. Reads input, then writes output. Used historically for translation. Mostly legacy.
READ PATTERN
encoder-only [tok][tok][tok][tok] ⟷ all pairs, bidirectional
decoder-only [tok][tok][tok][___] ← causal mask, left-only
enc-dec [tok][tok][tok] → [out][out][out] cross-attend
The family tree · three branches · the chat models you use are all decoders
All three architectures use the same attention operation. What separates them is what each one is allowed to read and what it produces. An encoder reads the whole sequence at once, with every position free to look both left and right, and outputs a single summary vector, a fingerprint of the input. A decoder reads strictly left to right and writes one token at a time, each new token predicted only from the tokens before it. An encoder-decoder does both in sequence: it encodes the input, then writes a fresh output from that encoding.
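The difference in what each shape may read comes down to a mask over the attention table. A small sketch, assuming a four-token sequence; the function names are illustrative only.

```python
def read_mask(n, causal):
    """mask[i][j] is True when position i is allowed to read position j."""
    return [[(j <= i) if causal else True for j in range(n)] for i in range(n)]

def show(mask):
    """Render the mask as a grid: x = may read, . = blocked."""
    return "\n".join(" ".join("x" if ok else "." for ok in row) for row in mask)

print("encoder (bidirectional):")
print(show(read_mask(4, causal=False)))
print("decoder (causal):")
print(show(read_mask(4, causal=True)))
# decoder grid:
# x . . .
# x x . .
# x x x .
# x x x x
```

In the decoder grid each row can see only itself and the positions to its left. In a real model the blocked cells are set to a very negative score before the softmax, so they receive zero weight.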
Each shape maps onto products you know. Every chat model in 2026 — GPT, Llama, Gemini, Claude — is decoder-only, because holding a conversation is exactly that left-to-right writing task. Every image retriever, such as CLIP or SigLIP, is encoder-only, because retrieval needs one fingerprint per item to compare.
One misconception is worth naming. People often call GPT an encoder-decoder model. It is not — a decoder-only model has no separate reading stage; it simply continues the sequence it is handed. The encoder-decoder shape survives mostly in older machine-translation systems.
Encoders read the whole input into one fingerprint. Decoders write one token at a time. Every chat model is a decoder.
Section 02 · Part A · 13 pages
language.
Decoder-only transformers, trained to predict the next token. Pretraining is simple and does most of the work. Post-training is where the personality comes from.
Page 10/Language
Tokens.
The model's alphabet. Subword chunks chosen by frequency.
PHRASE             TOKEN SPLIT                    VOCAB IDS
"unbelievable"     [un] [believ] [able]           359 · 40471 · 481
"ChatGPT works"    [Chat] [G] [PT] [ works]       16047 · 38 · 2898 · 4375
"résumé.pdf"       [r] [és] [um] [é] [.pdf]       81 · 7206 · 372 · 978 · 14329
"New York-based"   [New] [ York] [-] [based]      3648 · 4356 · 12 · 3100
"你好"             [UTF-8 byte] [UTF-8 byte]      224 · 121 · 224 · 165
"🎬"               [UTF-8 byte] × 4               240 · 159 · 142 · 172

PRICING SHAPE
1 English word       ≈ 1.3 tokens
1 Chinese character  ≈ 2 tokens (byte-level fallback)
1 emoji              ≈ 4 tokens
1 long URL           ≈ 1 token per chunk between / · - _

Same text · different tokenizer · different split, IDs, and cost.
Real tokenization · 6 examples · the bill is in tokens, not words
A model never sees letters or whole words. Before any text reaches the network, a tokenizer chops it into tokens. A token is a chunk of text: often a common word, often a fragment of a rarer one. The tools that do the chopping have names you will meet in code. BPE is the most common recipe for deciding which chunks are frequent enough to deserve their own slot; SentencePiece and tiktoken are libraries that implement it or a close relative. Common words usually become one token. Rare words break into several pieces. Each token is then converted into an integer ID, which is a position in a fixed vocabulary of roughly 30,000 to 200,000 entries.
The token is also the unit of money and memory. You are billed per token. The context window — the amount of text a model can hold in view at once — is measured in tokens. The cost of running the model scales with tokens. So every question about "how much fits in the context window" is really a question about token counts, not word counts.
The trap is to treat one token as one word. They are not the same. The string "GPT" is one token in one tokenizer, two in another (the figure splits it as [G] [PT]), and three if the tokenizer falls all the way back to raw bytes, one per letter. When the exact count matters, for a cost estimate or for fitting a long document, run the real tokenizer instead of guessing from the word count.
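A toy illustration of why the same text costs different token counts. The two vocabularies are hypothetical, and the longest-match rule is a simplification: real BPE tokenizers apply learned merge rules instead. The byte counts are exact UTF-8 lengths.

```python
def byte_fallback_count(text):
    """Tokens needed if every UTF-8 byte becomes its own token."""
    return len(text.encode("utf-8"))

def greedy_tokenize(text, vocab):
    """Longest-match-first split; characters with no match fall back to raw bytes."""
    tokens, i = [], 0
    longest = max(len(piece) for piece in vocab)
    while i < len(text):
        for size in range(min(longest, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in vocab:
                tokens.append(piece)
                i += size
                break
        else:
            # Nothing in the vocabulary matched: emit one token per byte.
            tokens.extend(f"<0x{b:02X}>" for b in text[i].encode("utf-8"))
            i += 1
    return tokens

vocab_a = {"Chat", "GPT", " works"}       # hypothetical vocabulary A
vocab_b = {"Chat", "G", "PT", " works"}   # hypothetical vocabulary B

print(greedy_tokenize("ChatGPT works", vocab_a))  # ['Chat', 'GPT', ' works']
print(greedy_tokenize("ChatGPT works", vocab_b))  # ['Chat', 'G', 'PT', ' works']
print(byte_fallback_count("GPT"), byte_fallback_count("🎬"))  # 3 4
```

The same thirteen characters cost three tokens under one vocabulary and four under the other, which is why cost estimates should come from the tokenizer the model actually uses.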
The model's alphabet is tokens, not words. Your bill is counted in them too.
Page 11/Language
Embeddings.
A map of meaning, learned from co-occurrence.
2D PROJECTION OF A HIGH-DIM EMBEDDING SPACE
y
4 │  queen(-0.2, 3.8)    king(0.4, 3.7)
3 │  woman(-1.4, 2.9)    man(1.0, 2.8)
2 │
1 │  nurse(-2.0, 1.1)    doctor(1.7, 1.2)
0 ┼────────────────────────────────────────────────── x
    -3    -2    -1     0     1     2

COSINE SIMILARITY (the standard metric)
cos(king, queen)  = 0.82 ██████████████████ near
cos(king, man)    = 0.79 █████████████████ near
cos(king, doctor) = 0.41 █████████ middling
cos(king, banana) = 0.07 █ unrelated

VECTOR ARITHMETIC
king − man + woman ≈ queen       (the famous toy)
Paris − France + Italy ≈ Rome    (works for many relations)

Neighbors are geometric, not dictionary lookups.
Embedding geometry · learned from co-occurrence · queried via cosine
Recall that each token is an integer ID. The embedding table turns that ID into a vector. The table holds one row per token in the vocabulary, and each row is a list of numbers: 768 in small models, 8,192 or more in large ones. Looking a token up means reading off its row. That row of numbers is what actually enters the transformer; the integer ID was only an address pointing at it.
Those rows begin as random numbers. During pretraining they are nudged, again and again, so that tokens used in similar contexts end up close together in this space of numbers. Treat each row as coordinates: after training, king and queen land near each other, and far from banana. Nothing in a vector "knows" what a king is; the closeness is only the residue of training. You read it with cosine similarity, the cosine of the angle between two vectors: close to 1 when the angle is small (similar), close to 0 when the angle is near a right angle (unrelated).
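A small sketch of the lookup, the cosine measure, and the vector-arithmetic toy, using the six 2D coordinates from the figure as a stand-in embedding table. Because these are 2D projections, the cosines they produce will not match the high-dimensional values quoted in the figure. Leaving the three input words out of the nearest-neighbor search is the usual convention for this toy and is assumed here.

```python
import math

# 2D coordinates read off the figure; real rows hold hundreds or thousands of numbers.
table = {
    "queen": (-0.2, 3.8), "king": (0.4, 3.7),
    "woman": (-1.4, 2.9), "man": (1.0, 2.8),
    "nurse": (-2.0, 1.1), "doctor": (1.7, 1.2),
}

def cosine(a, b):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = right angle."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def nearest(target, exclude=()):
    """Word whose vector points most nearly the same way as the target."""
    candidates = [w for w in table if w not in exclude]
    return max(candidates, key=lambda w: cosine(table[w], target))

king, man, woman = table["king"], table["man"], table["woman"]
target = tuple(k - m + w for k, m, w in zip(king, man, woman))  # king - man + woman
print(nearest(target, exclude=("king", "man", "woman")))  # queen
```

The lookup `table["king"]` is the whole embedding step: an ID in, a row of numbers out. Everything after that is geometry.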
This single geometry does a great deal of downstream work. It powers retrieval (the R in RAG, Page 20), clustering (grouping similar items, as in the taste-graph), and cross-modal alignment: CLIP and SigLIP are trained so that an image and its caption land near each other in one shared space. Firth put it more plainly in 1957: you shall know a word by the company it keeps.
Meaning here is a location in space: learned from what a word co-occurs with, measured by the angle between vectors.
Page 12/Language
The stack.
Decoder-only, causal mask, residual. One block, repeated.
INPUT TOKENS (each one can only see tokens to its left)
[The] [cat] [sat] [on] [the] [___]
        [embed]
        [block 01]   attn + MLP
        [block 02]   attn + MLP
           ···
        [block 48]   attn + MLP
        [unembed]

NEXT-TOKEN DISTRIBUTION · over ~50,000 vocab entries
[mat]    █████████████ 0.41
[floor]  █████ 0.16
[couch]  ███ 0.10
[table]  ██ 0.06
[.]      █ 0.04
[grass]  ▏ 0.03
... 49,994 more, near zero

→ sample → append → repeat
One token in, one distribution out · sample, append, repeat
A decoder-only language model is the block from Page 07 with exactly one rule added. When attention runs, each token may look only at tokens that came before it, never at tokens ahead. That restriction is the causal mask. It is what forces the model to work left to right, and it is the only structural difference between this stack and the generic transformer.
A single forward pass runs end to end. The input tokens are embedded into vectors. Those vectors travel up through every block. At the top, the representation of the last position is projected back out into vocabulary space — turned into one raw score for every possible next token. A softmax converts those scores into a probability distribution. The model samples one token from that distribution, appends it to the sequence, and runs the whole pass again. Embed, climb the stack, project, sample, append, repeat.
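The loop at the end of that paragraph can be sketched as follows. `TOY_MODEL` is a hypothetical stand-in for a trained network: a real forward pass embeds the whole prefix and climbs every block, while this lookup table only consults the last token. The `<end>` marker is also an assumption of the sketch.

```python
import random

# Hypothetical stand-in for the model: last token -> next-token distribution.
TOY_MODEL = {
    "the": {"cat": 0.6, "mat": 0.4},
    "cat": {"sat": 1.0},
    "sat": {"on": 1.0},
    "on": {"the": 1.0},
    "mat": {"<end>": 1.0},
}

def next_token_distribution(tokens):
    """Stands in for: embed, climb the stack, project, softmax."""
    return TOY_MODEL[tokens[-1]]

def generate(prompt, max_new_tokens, rng):
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        dist = next_token_distribution(tokens)  # one forward pass
        choice = rng.choices(list(dist), weights=list(dist.values()))[0]  # sample
        if choice == "<end>":
            break
        tokens.append(choice)  # append, then run the whole pass again
    return tokens

print(generate(["the"], max_new_tokens=8, rng=random.Random(0)))
```

Nothing in the loop looks ahead. Each pass sees only the tokens already committed, which is the causal rule from the previous paragraph expressed as control flow.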
The model never plans the whole sentence in advance. It commits to one token at a time, then reconsiders everything to choose the next. Reasoning models (o1, Gemini Thinking, Claude's extended thinking) appear to deliberate, and they do generate long chains of intermediate reasoning before the final answer. But that reasoning is itself produced one left-to-right token at a time. The mechanism underneath does not change.
The same block, dozens of times, with one rule added: each token may read only what came before it.
Page 13/Language
Predict the next.
The entire pretraining objective. The document is the label.
TRAINING ROWS: one document, every position a label
prefix                        correct next   P(model)   loss
─────────────────────────────────────────────────────────────────────
"The cat sat on the"          "mat"          0.41       0.89
"Paris is the capital of"     "France"       0.72       0.33
"for i in range("             "10"           0.18       1.71
"She opened the door and"     "saw"          0.09       2.41
"def fibonacci(n):"           "\n"           0.81       0.21
"The patient presents with"   "shortness"    0.04       3.22

loss = −log P(correct next token)

SCALE
~15 trillion tokens                  one shard of training data
~1 epoch                             frontier models barely see data twice
no human "this is correct" labels    the document IS the label
loss falls predictably               see Page 14 (scaling laws)
Cross-entropy on the next token · no human labels · trillion-token scale
The whole of pretraining is one task, repeated across trillions of tokens. Show the model a prefix of text, ask it for a probability distribution over the next token, then compare its answer against the token that actually came next. The penalty for being wrong is the cross-entropy loss: the less probability it placed on the correct token, the larger the penalty. No human writes an answer key — the document is its own, because the next word is whatever the author wrote.
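The penalty is a one-line formula. This sketch recomputes the loss column of the figure from its probability column, using the natural logarithm; the averaging at the end is how a batch of positions is usually summarized and is an assumption here.

```python
import math

def next_token_loss(p_correct):
    """Cross-entropy at one position: minus the log of the probability
    the model placed on the token that actually came next."""
    return -math.log(p_correct)

# (correct next token, probability the model gave it), taken from the figure.
rows = [("mat", 0.41), ("France", 0.72), ("10", 0.18),
        ("saw", 0.09), ("\\n", 0.81), ("shortness", 0.04)]

losses = [next_token_loss(p) for _, p in rows]
for (token, p), loss in zip(rows, losses):
    print(f"{token:<10} P={p:.2f}  loss={loss:.2f}")
print(f"mean loss = {sum(losses) / len(losses):.2f}")
```

A confident correct guess (0.81) costs almost nothing; a near miss on a rare continuation (0.04) costs more than three. Training nudges the weights to shrink that average.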
A narrow task, but it forces broad ability. To predict the next token well, the model has to absorb whatever makes text predictable: grammar, meaning, facts, the rules of code, the steps of arithmetic, the shape of a dialogue. None of it is taught directly; each emerges because it is the cheapest way to lower the loss.
This same objective is where hallucination begins. The model is rewarded for a plausible continuation, not a true one — and plausible and true often diverge. A confident, well-formed, false statement still scores well in training, as long as it resembles something the data might contain. The model did not memorize the internet; it learned the statistical structure of it, and structure can be convincingly faked.
Broad knowledge is a side effect of one narrow drill: predict the next token well enough, often enough.
Page 14/Language
Scaling laws.
Loss falls predictably with compute, params, and data.
LOG LOSS vs LOG COMPUTE (Kaplan 2020 · Hoffmann 2022)
loss
3.2 │ * GPT-2 (2019)
2.8 │       * GPT-3 (2020, undertrained)
2.4 │             * Chinchilla 70B (compute-optimal)
2.0 │                   * LLaMA-2 70B
1.6 │                         * GPT-4 class
1.2 │                               * ?
    └─────────────────────────────────────────────
      10²¹     10²²     10²³     10²⁴     10²⁵   FLOPs

CHINCHILLA RULE
compute-optimal training: ~20 tokens per parameter
70B model → 1.4T tokens; not 300B (which is what GPT-3 did)

THE FORECAST
loss ∝ (compute)^(−α)     α ≈ 0.07 for language
smooth · predictable · not automatic AGI
Power law · forecast with error bars · the curve frontier labs spend hundreds of millions to ride
Two papers established the result. Kaplan et al. 2020 (arXiv:2001.08361) and Hoffmann et al. 2022 (arXiv:2203.15556, the Chinchilla paper) showed that training loss falls predictably as you increase three things: compute (the arithmetic spent training, in FLOPs), parameters (the model's adjustable weights), and training data. The relationship is a power law: plot loss against compute on logarithmic axes and the points fall along a straight line. Chinchilla added the correction in the figure — for a fixed compute budget, a smaller model trained on more data wins (~20 tokens per parameter).
The practical payoff is foresight. You do not have to guess whether a model ten times larger will be better. You fit the curve on smaller runs and extrapolate it. This is why a lab will commit a hundred million dollars to a single training run with confidence: the outcome is a forecast with error bars, not a gamble.
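Fitting and extrapolating the curve is a straight-line fit on logarithmic axes. A sketch under loud assumptions: the four "small runs" are synthetic points generated from a power law with the figure's exponent (0.07) and a made-up scale constant of 60, not real measurements.

```python
import math

def fit_power_law(compute, loss):
    """Least-squares line through (log compute, log loss).
    Returns (alpha, scale) for the model loss = scale * compute ** -alpha."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(l) for l in loss]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return -slope, math.exp(intercept)

# Synthetic small runs (FLOPs, loss); generated, not measured.
small_runs = [1e18, 1e19, 1e20, 1e21]
observed = [60.0 * c ** -0.07 for c in small_runs]

alpha, scale = fit_power_law(small_runs, observed)
forecast = scale * 1e24 ** -alpha  # extrapolate three orders of magnitude up
print(f"alpha = {alpha:.3f}, forecast loss at 1e24 FLOPs = {forecast:.2f}")

# Chinchilla rule of thumb: about 20 training tokens per parameter.
chinchilla_tokens = 20 * 70e9
print(f"70B parameters -> {chinchilla_tokens:.1e} tokens")
```

With real runs the points scatter around the line, and that scatter is where the error bars on the forecast come from.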
One caution keeps this honest. The laws predict loss, not capability. Loss falls smoothly, but specific abilities sometimes appear in jumps the smooth curve never foreshadowed. So the claim that "scaling laws mean AGI by 2027" reads more into the curve than it actually says. The curve forecasts a number going down, not a particular skill arriving on a date.
Spend more compute and the loss falls along a curve you can predict before you spend it.
Page 15/Language
Sampling.
Temperature, top-p, top-k. How a distribution becomes a word.
NEXT-TOKEN DISTRIBUTION AT THREE TEMPERATURES
prompt: "The cat sat on the ___"

T = 0.0   deterministic · always pick the argmax
  mat    ██████████████████████████████ 1.00
  floor  .
  couch  .

T = 0.7   sharper · the chat default
  mat    ██████████████████ 0.58
  floor  ██████ 0.22
  couch  ████ 0.12
  rug    ██ 0.08

T = 1.5   flatter · "creative"
  mat    █████████ 0.31
  floor  ███████ 0.24
  couch  █████ 0.18
  rug    ████ 0.14

top-p (nucleus): keep smallest set where cumulative P ≥ 0.9
top-k: keep top k candidates · cruder than top-p
Three temperatures · same model · three different collaborators
Page 12 left the model holding a probability distribution over the entire vocabulary at each step. Sampling is the rule that turns that distribution into one chosen token. Three knobs shape the choice.
Temperature divides the raw scores (the logits) by T before the softmax turns them into probabilities. At temperature zero the model is deterministic: it always takes the single highest-probability token, the argmax. Between zero and one the distribution sharpens toward that top token. At temperature one it samples honestly from the distribution as the model reported it. Above one, the distribution flattens toward uniform, so unlikely tokens get a real chance. Top-p (nucleus sampling; Holtzman et al. 2019, arXiv:1904.09751) instead keeps the smallest set of top tokens whose probabilities sum to at least p, then samples within it. Top-k is the blunter cousin: keep the k most likely.
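The three knobs fit in a few short functions. A minimal sketch in plain Python: the logits are made up, and applying top-k before top-p is one common ordering, assumed here rather than taken from any particular product.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Divide logits by T, then softmax. T < 1 sharpens, T > 1 flattens."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def renormalize(probs, keep):
    """Zero out everything outside `keep` and rescale the rest to sum to 1."""
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

def top_k_filter(probs, k):
    """Keep only the k most likely tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return renormalize(probs, order[:k])

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose probabilities reach p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return renormalize(probs, keep)

def sample(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Return the index of the chosen token."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])  # argmax
    probs = softmax_with_temperature(logits, temperature)
    if top_k is not None:
        probs = top_k_filter(probs, top_k)
    if top_p is not None:
        probs = top_p_filter(probs, top_p)
    return rng.choices(range(len(probs)), weights=probs)[0]

logits = [2.0, 1.0, 0.5, 0.1]  # hypothetical scores for four candidate tokens
for t in (0.7, 1.0, 1.5):
    print(t, [round(p, 2) for p in softmax_with_temperature(logits, t)])
```

Temperature zero is handled as a special case because dividing by zero is undefined; in practice it simply means "take the argmax".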
Sampling is the largest behavioral lever you have without retraining anything. The same model at temperature 0.2 and at temperature 1.2 can feel like two different collaborators. CoWriter's peer-pushback voice is part system prompt, part temperature. The figure makes one warning visible: higher temperature is more random, which is not the same as more creative. As temperature rises, so does the rate of hallucination, because you are deliberately giving low-probability, often wrong, tokens more room to win.
Temperature zero always picks the most likely next token. Temperature one rolls dice weighted by the model's probabilities.
Page 16/Language·Keystone
Hallucination.
What an ungrounded next-token predictor does by default.
OBJECTIVE MISMATCH
input claim type          plausible?   true?      outcome
──────────────────────────────────────────────────────────────────
real fact                 yes          yes        useful answer
urban myth · stereotype   yes          no         hallucination
fabricated citation       yes          no         fake authority
plausible URL             yes          no         dead link
"I don't know"            no           true       often under-rewarded
retrieved fact + cite     yes          grounded   safer answer

WHY IT HAPPENS
training objective  → rewards plausibility
production needs    → reward truth
the gap             → hallucination

MITIGATIONS (architectural · not training)
retrieval (RAG)         → ground in fetched context
tool use                → ground in execution results
citations / structured  → ground in source spans
verifier models         → filter the worst cases
The framing keystone · everything downstream of here is grounding
Put Page 13's lesson in one line and keep it in view: the model is doing exactly what it was trained to do. Training rewards a plausible next token. The person reading the output wants a true one. Hallucination is not a malfunction; it is the name for the distance between those two targets. A model with no access to a source of truth has no way to prefer the true completion over the merely plausible one.
Grounding is the general fix. It means giving the model an external source of truth to lean on at the moment it answers: text retrieved from a database, the result of a tool it ran, a document quoted with citations, a structured record. Without grounding, the model is a fluent autocompleter running past the edge of what it reliably knows. With grounding, its job changes — from inventing a plausible answer to composing one over context it has actually been handed.
This is why every shipping AI product has a grounding layer of some kind. Mechanically that layer is usually RAG (Page 20); in the interface it usually appears as citations. A bigger model lowers the hallucination rate but never removes it: the training objective underneath has not changed. The durable fix is architectural, not a question of scale.
Past the edge of what it knows, the model still writes fluent sentences. Fluency was never knowledge.
Page 17/Language
SFT.
Supervised fine-tuning. Thousands of hand-written (instruction → response) pairs.
PRETRAIN
raw web text → predict next token
no instructions · no answer key besides "what comes next"
model can continue text · cannot follow requests

SFT DATASET
┌─ INSTRUCTION ─────────────────────────────────────────┐
│ Explain backprop to a high-schooler in 3 sentences.   │
├─ IDEAL RESPONSE (hand-written, expert-reviewed) ──────┤
│ Start with a guess. Measure how wrong it is.          │
│ Push every knob a tiny bit in the direction           │
│ that would reduce the error next time, then repeat.   │
└───────────────────────────────────────────────────────┘
× ~50,000 to ~1M pairs (lawyers · doctors · coders · domain experts)

same next-token loss · different data distribution
→ model can now answer the asked question
Direct human labor at scale · the first step of post-training
A freshly pretrained model is fluent but not yet helpful. Hand it a question and it is as likely to continue or rephrase it as to answer — continuing text is all it has ever been trained to do. Supervised fine-tuning, or SFT, closes that gap with examples: people write thousands of paired items, an instruction and its ideal response, and the model trains on those pairs until answering, rather than continuing, becomes natural.
Who writes the pairs depends on the domain. General instructions can come from contractors. Specialized ones need experts — lawyers drafting legal questions and answers, doctors writing clinical examples. The training itself uses the very same next-token loss from pretraining; only the data has changed, from raw web text to this curated set of instruction-and-response pairs. (Fine-tuning simply means continuing to train an already-trained model on new data.)
SFT is step one of every commercial model's post-training stack, and the personality you experience starts to form here. A product like CoWriter does not perform SFT — that happens inside a frontier lab on thousands of GPUs. CoWriter works at a different layer: prompt-level steering on top of an already fine-tuned model.
Pretraining teaches the model to speak. SFT teaches it to answer.
Page 18/Language
RLHF · DPO.
The taste layer. Humans rank pairs of outputs; the model learns what got picked.
PROMPT
"Explain quantum tunneling to a curious adult."

TWO MODEL OUTPUTS
A  concise · accurate · grounded analogy               human picks ✓
B  verbose · shaky metaphor · drifts into politics     human rejects

RLHF PATH (Christiano 2017)
(A > B) pairs → reward model      → PPO update       → KL penalty
                learns to predict   policy gradient    stay near SFT

DPO PATH (Rafailov 2023)
(A > B) pairs ─────────────→ direct preference loss → policy update
                             skip the reward model

same signal · same data · two ways to wire it up
chosen response should become more likely than rejected response
Two paths from human preferences to model behavior · DPO is increasingly the default
Christiano et al. 2017 (arXiv:1706.03741) first scaled this idea: show human raters two outputs and ask which is better. In production today that means thousands of raters and millions of comparisons. A separate network, the reward model, is trained to predict which output a human would prefer. The language model is then adjusted to produce outputs the reward model scores highly. That adjustment uses reinforcement learning (typically PPO), held in check by a KL penalty that punishes drifting too far from the SFT starting point, so it improves on preferences without forgetting how to write.
Rafailov et al. 2023 (arXiv:2305.18290) showed you can reach the same place more directly. Their method, DPO, uses the identical preference data — the same pairs of chosen-over-rejected outputs — but skips building a separate reward model, optimizing the language model straight from the comparisons. Cheaper, simpler, and increasingly the default at smaller labs.
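Both paths reduce to the same pairwise comparison, applied to different numbers. A sketch of the two loss functions for a single (chosen, rejected) pair: the inputs are assumed to be a scalar reward per response (RLHF's reward model) or the total log-probability of each response under the model being trained and under a frozen reference copy (DPO). The `beta` value of 0.1 is an illustrative choice, not a recommendation.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(1 / (1 + exp(-x)))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def reward_model_loss(reward_chosen, reward_rejected):
    """RLHF, step one: train the reward model to score the chosen output higher."""
    return -log_sigmoid(reward_chosen - reward_rejected)

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO: the same comparison on log-probabilities, measured relative to a
    frozen reference model, with no separate reward model."""
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    return -log_sigmoid(beta * margin)

# Hypothetical log-probabilities for one preference pair.
before = dpo_loss(-12.0, -11.0, ref_chosen=-12.0, ref_rejected=-11.0)
after = dpo_loss(-10.0, -13.0, ref_chosen=-12.0, ref_rejected=-11.0)
print(round(before, 4), round(after, 4))
```

The loss falls as the chosen response becomes more likely than the rejected one relative to the reference. The reference term plays the role the KL penalty plays in the RLHF path: it keeps the tuned model near its starting point.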
This stage, not pretraining, is where "Claude's personality" or "GPT's voice" lives. The base model speaks in the averaged voice of its training text; post-training is what gives the assistant a consistent, recognizable character. It also carries a bias by construction: the model is tuned toward whichever rater pool produced the rankings — whose preferences count is itself an alignment decision.
Humans rank pairs of answers. The model is tuned toward whichever one they keep choosing.
Page 19/Language
Constitutional.
The model critiques itself against written principles.
DRAFT
"Here is a direct, risky answer to the user's question..."

CONSTITUTION CHECK (the model self-grades)
[helpful]   yes
[honest]    partly · overstates certainty
[harmless]  fail · enables misuse of the technique

SELF-CRITIQUE
"The response should refuse the harmful operational detail and
redirect toward the safety concept the user is curious about."

REVISED RESPONSE
"I can't help with that specific technique. The underlying concept is X:
here's how the safety community thinks about it..."

(revised > draft) → becomes a preference pair → trains the model
constitution is human-written · critique loop is model-driven
Anthropic · Bai et al. 2022 · arXiv:2212.08073
Constitutional AI (Bai et al. 2022, arXiv:2212.08073) replaces some of that human ranking labor with model self-critique. The fixed reference is a constitution: a short written list of principles the answer should satisfy, such as being helpful, honest, and harmless. The loop runs like this. The model drafts a response. It then grades that draft against the constitution and writes a revision that scores better. The pair — weaker draft, stronger revision — becomes a new piece of the preference training data Page 18 needed, but generated without a human in the loop. RLAIF carries the same substitution into the ranking step, letting the model stand in for the human rater.
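The loop has a simple shape, sketched below. All three model calls are passed in as functions, and the stubs used to exercise them are placeholders invented for this illustration; in a real pipeline each would be a call to the language model itself, and the paper's full recipe has more stages than this.

```python
def constitutional_pair(prompt, principles, draft_fn, critique_fn, revise_fn):
    """Draft, self-critique against each principle, revise, and package the
    result as a (chosen, rejected) preference pair."""
    draft = draft_fn(prompt)
    critiques = [critique_fn(draft, principle) for principle in principles]
    revision = revise_fn(draft, critiques)
    return {"prompt": prompt, "chosen": revision, "rejected": draft}

# Placeholder stubs standing in for model calls (hypothetical).
def stub_draft(prompt):
    return f"DRAFT answer to: {prompt}"

def stub_critique(draft, principle):
    return f"Check against '{principle}': needs work."

def stub_revise(draft, critiques):
    return draft.replace("DRAFT", "REVISED") + f" ({len(critiques)} critiques applied)"

principles = ["helpful", "honest", "harmless"]
pair = constitutional_pair("How does X work?", principles,
                           stub_draft, stub_critique, stub_revise)
print(pair["rejected"])
print(pair["chosen"])
```

The returned dictionary has the same chosen-over-rejected shape as the human-ranked pairs on Page 18, which is why it can feed the same training step.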
Humans write the constitution once; the repeated critique-and-revise work is done by the model. That shifts the cost of alignment from "thousands of raters" toward "compute," and compute scales on a very different curve than hiring does.
Claude is the most visible product built this way. The constitution is not a safety guarantee but a steering document, and a model can follow it imperfectly. Steering also has side effects: it cuts some failure modes, such as overt harmful output, and creates others, such as over-refusing harmless requests or echoing the constitution's tone too eagerly.
Humans write the principles once. The model applies them to its own drafts, at scale.
Page 20/Language
RAG.
Retrieval-Augmented Generation — the architecture under most production AI.
USER QUERY: "How does CoWriter handle pushback?" → [embed query] → 768-dim vector → vector DB (Chroma), ~10K+ chunks → cosine search
TOP-K RETRIEVED CHUNKS: 0.84 voice.md "pushback..." · 0.77 rules.md "cite princ..." · 0.63 reviews.md "direct..." · 0.41 voice.md "no openers..." · 0.38 setup.md "system pro..."
ASSEMBLED PROMPT: system + query + chunks + citations → LLM answers over fetched context, not from memory
Notion AI · Glean · Perplexity · CoWriter · Mosaic all live here
There are three ways to give a model knowledge it did not learn in training. Prompting pastes the relevant document straight into the chat: cheap, but it lasts only as long as that conversation and competes for the limited context window. Fine-tuning trains the model further on your domain data: permanent and expensive, and it teaches style reliably but specific facts unreliably. RAG — Retrieval-Augmented Generation — takes a third route. Ahead of time it splits your corpus into chunks and embeds each as a vector (the geometry from Page 11); at query time it embeds the question, finds the closest chunks, and hands them to the model alongside it.
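The embed, search, assemble sequence can be sketched with NumPy alone. Everything here is an assumption made for illustration: the bag-of-words `embed` stands in for a learned embedding model, the in-memory array stands in for a vector database, and the chunk texts and prompt wording are invented.

```python
import numpy as np

def build_vocab(texts):
    """Map every distinct word in the corpus to a vector index."""
    words = sorted({w for t in texts for w in t.lower().split()})
    return {w: i for i, w in enumerate(words)}

def embed(text, vocab):
    """Toy unit-length bag-of-words vector; a stand-in for a learned embedding model."""
    vec = np.zeros(len(vocab))
    for w in text.lower().split():
        if w in vocab:
            vec[vocab[w]] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

def retrieve(query, chunks, chunk_vecs, vocab, k=2):
    """Return the k chunks with the highest cosine similarity to the query."""
    scores = chunk_vecs @ embed(query, vocab)  # unit vectors, so dot = cosine
    top = np.argsort(-scores)[:k]
    return [(float(scores[i]), chunks[i]) for i in top]

def assemble_prompt(query, retrieved):
    """Place the retrieved chunks ahead of the question."""
    context = "\n".join(f"[{score:.2f}] {chunk}" for score, chunk in retrieved)
    return f"Answer using only the context below.\n{context}\nQuestion: {query}"

# Ahead of time: chunk the corpus and embed each chunk once.
chunks = [
    "handle pushback by restating the shared goal",
    "cite the principle behind each note",
    "install the system prompt before first use",
]
vocab = build_vocab(chunks)
chunk_vecs = np.stack([embed(c, vocab) for c in chunks])

# At query time: embed the question, search, assemble.
query = "how to handle pushback"
retrieved = retrieve(query, chunks, chunk_vecs, vocab)
prompt = assemble_prompt(query, retrieved)
```

The model only ever sees `prompt`, so anything retrieval fails to surface is invisible to it.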
This is what most shipping enterprise AI looks like underneath. CoWriter is RAG over screenwriting principles plus Bradley's own writing samples; Mosaic is RAG for images, using SigLIP embeddings plus tag vectors in place of text.
RAG cuts hallucination sharply on grounded questions, but does not erase it. The model can still misquote a chunk, or claim with confidence that a chunk says something it does not. And the answer is only as good as what retrieval surfaced: wrong chunks, and the model has nothing true to work from. Retrieval quality matters as much as model quality.
Retrieval finds the relevant text. The model answers only over what it was handed.
Page 21/Language·Summary
LLM, in one breath.
Pretrain · post-train · sample · ground. Four stages, four levers.
WHO OWNS WHAT
lab (frontier): pretrain · post-train
you (shipping): sampling settings · grounding pipeline · system prompts · tools / retrieval / verifiers
Hallucination is default. Personality is post-training. Knowledge is retrieval. Capability is scale.
Four stages · own them in order · the half you build on is the half you control
The whole language section reduces to four stages, in order. Pretraining gives the model fluency, by predicting the next token across the internet at scale. Post-training gives it judgment, by ranking pairs of its own outputs and tuning toward the better one. Sampling is the runtime knob that makes one fixed model behave like a cautious clerk at low temperature or a loose brainstormer at high. Grounding is what makes the output trustworthy, by tying it to a real source.
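The sampling knob is small enough to show directly. A minimal sketch, assuming the usual convention of dividing the logits by the temperature before the softmax; the three logits are invented for illustration.

```python
import numpy as np

def token_probs(logits, temperature):
    """Turn raw next-token scores into probabilities at a given temperature."""
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled = scaled - scaled.max()      # subtract the max for numerical stability
    probs = np.exp(scaled)
    return probs / probs.sum()

logits = [2.0, 1.0, 0.0]                # made-up scores for three candidate tokens
cold = token_probs(logits, 0.2)         # low temperature: mass piles onto the top token
hot = token_probs(logits, 2.0)          # high temperature: mass spreads across tokens
```

The weights never change between the two calls; only the shape of the distribution the sampler draws from does.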
The split that matters for a builder runs down the middle of those four. The first two, pretraining and post-training, happen inside a lab on thousands of GPUs and are effectively fixed for you. The last two, sampling and grounding, are yours. When you ship a product on top of someone else's model, the surface you actually control is the sampling settings you choose and the grounding pipeline you build around them.
Four things to carry out of this section. Hallucination is the default behavior, not a defect to patch later. Personality is installed in post-training; the base model has none of its own. Trustworthy knowledge comes from retrieval, since the weights will otherwise improvise. And capability tracks scale, the one lever a builder does not hold.
The lab owns the first half. You own the second.
Section 03 · Part B · 13 pages
generation.
Noise becomes signal, in steps. A network learns to undo noise; compressing the image first made it cheap enough to run on a laptop.
Page 23/Generation
Diffusion.
Noise becomes signal, in steps. Train a network to predict noise; iteratively subtract.
FORWARD (training, fixed math): add a small amount of noise each step · t=0 clean → t=300 noisy → t=600 noisier → t=1000 pure noise
REVERSE (inference, learned): predict the noise, subtract it · t=1000 noise → t=600 less noise → t=300 almost there → t=0 image
network's only job: predict noise at each step
Two processes · only the reverse is learned · noise is the starting point
Diffusion is trained in two directions, and only one is learned. The forward direction is pure bookkeeping: take a clean image and add a little Gaussian noise, again and again, until nothing is left but static. That direction is fixed math. The reverse direction is the network's job: at each step it is shown a noisier image and trained to predict the exact noise that was added. Once it can do that, you run it the other way — start from static, predict the noise, subtract, repeat — and a coherent image emerges.
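The fixed forward direction and the training target can be written out in a few lines. This sketch assumes a DDPM-style linear noise schedule over 1000 steps and uses the closed-form shortcut that jumps straight to step t instead of adding noise one step at a time; the 8×8 array stands in for an image, and no network is trained here.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)       # assumed linear schedule
alpha_bar = np.cumprod(1.0 - betas)      # fraction of signal surviving at each step

def add_noise(x0, t):
    """Forward process: blend the clean image with Gaussian noise at step t."""
    noise = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise
    return xt, noise

def recover_x0(xt, predicted_noise, t):
    """Invert the blend, given a noise prediction for step t."""
    return (xt - np.sqrt(1.0 - alpha_bar[t]) * predicted_noise) / np.sqrt(alpha_bar[t])

x0 = rng.standard_normal((8, 8))         # stand-in for a clean image
xt, noise = add_noise(x0, 300)

# Training would minimise mean((network(xt, t) - noise) ** 2).
# With a perfect prediction, the clean image comes back exactly:
x0_hat = recover_x0(xt, noise, 300)
```

By the last step almost no signal survives (`alpha_bar[-1]` is tiny), which is why sampling can start from pure static.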
It replaced an older approach. Until 2021 the leading image generators were GANs — a generator against a judge, sharp but notoriously unstable. Diffusion won out because predicting noise is a stable regression problem, not an adversarial game. Dhariwal & Nichol 2021 (arXiv:2105.05233) showed it beating GANs on FID, the standard image-quality score.
Diffusion is not cleaning up a damaged photo — there is no hidden original being recovered. The model produces an image consistent with the noise-removal process it learned, and the starting static is raw material, not corruption to be fixed.
The network's only job is to predict the noise. Run it in reverse, step by step, and an image appears.
Page 24/Generation
Latent space.
Compress first. Then diffuse. The single optimization that put image gen on a laptop.
PIXEL SPACE (humans see): 1024 × 1024 × 3 = 3,145,728 numbers
→ VAE ENCODE →
LATENT SPACE (diffusion runs here): 64 × 64 × 4 = 16,384 numbers · diffusion does all the work here
→ VAE DECODE →
PIXEL SPACE: 1024 × 1024 × 3 (back to humans)
COMPRESSION: 3,145,728 / 16,384 = 192× smaller
SAME FOR SD · SDXL · SD3 · Flux · Veo · Sora
Rombach 2022 · arXiv:2112.10752 · 192× is why this fits on consumer hardware
Running diffusion directly on pixels is brutally expensive. A 1024×1024 color image is 1024 × 1024 × 3 ≈ three million numbers. The network has to process all of them, and it has to do so at every one of hundreds of denoising steps. That is a datacenter-scale job, not a laptop one.
Rombach et al. 2022 (arXiv:2112.10752, the Stable Diffusion paper) solved it by adding a compression stage. They trained a Variational Autoencoder — a VAE, meaning a network whose encoder squeezes an image down to a small representation and whose decoder expands it back. The squeezed version is the latent, typically 64×64×4 or 128×128×16 numbers (the last figure counts channels, the latent's equivalent of color planes). All of the diffusion now happens in this small latent space. Only at the very end does the VAE's decoder turn the finished latent back into full-resolution pixels.
Compare the two sizes the figure shows: 3,145,728 pixel numbers against 16,384 latent numbers, a 192× reduction. That single compression is the reason image generation runs on consumer hardware at all, and every modern image and video generator uses it. The VAE is also not lossless: when a generated image has strange small-scale artifacts, the culprit is usually the VAE failing to reconstruct fine detail, not the diffusion model getting the picture wrong.
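The arithmetic is easy to check, and it also shows that the ratio depends on which latent shape is paired with the image. The sketch below uses only the shapes quoted on this page; the helper name `count` is illustrative.

```python
def count(shape):
    """Number of values in a tensor of the given shape."""
    total = 1
    for dim in shape:
        total *= dim
    return total

pixels = count((1024, 1024, 3))          # full-resolution colour image
small_latent = count((64, 64, 4))        # the latent shape in the figure
large_latent = count((128, 128, 16))     # the larger latent shape quoted in the text

ratio_small = pixels / small_latent      # 192.0
ratio_large = pixels / large_latent      # 12.0
```

The same 1024×1024 image shrinks 192× against the 64×64×4 latent but only 12× against the 128×128×16 one, so the headline figure is best read as specific to the shapes in the diagram.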
Run the slow part in a space 192× smaller, then let the VAE expand the result back to pixels.
Page 25/Generation
Text encoders.
The prompt becomes a vector sequence. SD1 used one encoder. SD3 uses three; Flux uses two.
CLIP-L · CLIP-G
Trained on image-text pairs. Strong on visual concepts. Weak on counting and grammar.
T5-XXL
Trained on language alone. Strong on composition, counting, text-in-image. Slower.
Combined
SD3 ships all three; Flux pairs CLIP-L with T5-XXL. Stronger prompt adherence than any single encoder.
PROMPT: "A red wine bottle next to two gold candlesticks on velvet"
→ three encoders in parallel: CLIP-L (768-d) · CLIP-G (1280-d) · T5-XXL (4096-d)
→ concatenated vector sequence → fed to diffusion model
The conditioning stack in 2026 · three frozen encoders · concatenated
Before the diffusion model sees anything, a separate text encoder reads your prompt. It is an encoder in the sense of Page 08: it takes in the whole prompt and outputs a sequence of vectors, one per text token. The encoder is frozen, meaning its weights were trained earlier and are held fixed while the diffusion model trains around it. Those output vectors are the form the diffusion model can actually use, because attention operates on vectors, not on letters.
SD3 ships three encoders at once, and Flux ships two (CLIP-L and T5-XXL), because each was trained differently and is strong at something different. The CLIP encoders were trained on image-and-caption pairs, so they are strong on visual concepts but weak on grammar and counting. T5 was trained on language alone, so it is strong on composition, counting, and rendering text inside the image, at the cost of speed. Combined, with their output vectors stacked into one sequence, they follow a complex prompt more faithfully than any single encoder can.
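Stacking outputs of different widths into one sequence needs a convention for lining them up. The sketch below shows one plausible arrangement for the three-encoder case, loosely following the layout described for SD3: join the two CLIP outputs feature-wise, zero-pad to the T5 width, then append the T5 tokens. The token count of 77 per encoder and the random arrays are assumptions standing in for real encoder outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens = 77                                   # assumed token count per encoder

# Random stand-ins for the frozen encoders' per-token outputs.
clip_l = rng.standard_normal((n_tokens, 768))
clip_g = rng.standard_normal((n_tokens, 1280))
t5 = rng.standard_normal((n_tokens, 4096))

# Join the two CLIP outputs along the feature axis: (77, 2048).
clip = np.concatenate([clip_l, clip_g], axis=1)

# Zero-pad the CLIP features up to the T5 width so both halves line up: (77, 4096).
clip_padded = np.pad(clip, ((0, 0), (0, t5.shape[1] - clip.shape[1])))

# Stack along the sequence axis: one conditioning sequence of 154 vectors.
conditioning = np.concatenate([clip_padded, t5], axis=0)
```

Whatever the exact layout, the diffusion model receives only `conditioning`, an array of vectors, and never the prompt text itself.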
This reframes a common shorthand. "The model reads the prompt" hides where the language understanding lives: the text encoder reads the prompt and turns it into vectors; the diffusion model only ever attends to those vectors and has no grasp of English on its own. So when a prompt is misunderstood, the encoder is often where it went wrong.
The diffusion model never reads your prompt. It reads the vectors a text encoder made from it.
Page 26/Generation
Two-way talk.
Cross-attention vs MM-DiT. Old: one-way. New: bidirectional, every layer.
CLASSIC CROSS-ATTENTION (U-Net + text condition): image queries → read text keys/values → one-way per layer
                  "cat"   "red"   "chair"
image patch p1     .7      .2      .1
image patch p2     .1      .6      .3
image patch p3     .2      .1      .7
(text never updates; image attends to it)
MM-DiT JOINT SELF-ATTENTION (SD3, Flux): text and image in one sequence ⟷ every token attends to every other
           "cat"   "red"   "chair"   p1   p2   p3
"cat"        *       *        *       *    *    *
"red"        *       *        *       *    *    *
"chair"      *       *        *       *    *    *
p1           *       *        *       *    *    *
p2           *       *        *       *    *    *
p3           *       *        *       *    *    *
text refines under image · image refines under text · both, every layer
Esser et al. 2024 · arXiv:2403.03206 · attention matrix shape is the architecture
The older design connects text to image with cross-attention. The Query/Key/Value roles from Page 06 are split across the two modalities: the image patches supply the Queries, the text tokens supply the Keys and Values. But the flow runs one way only — the image reads the text, and the text never updates in return.
MM-DiT, the design in SD3 and Flux, removes that asymmetry. It places text tokens and image patches in a single sequence and runs ordinary self-attention over the whole thing. Each modality keeps its own Q/K/V weights, but they share one attention pass. In the SD3 paper's words, it "enables a bidirectional flow of information between image and text tokens."
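The difference between the two designs shows up in the shape of the attention matrix. A minimal single-head sketch with random weights, assuming three text tokens, three image patches, and a model width of 16; none of these values come from a real model.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16                                          # assumed model width

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention(q, k, v):
    """Scaled dot-product attention; returns outputs and the weight matrix."""
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return weights @ v, weights

def proj():
    """A random projection matrix standing in for learned weights."""
    return rng.standard_normal((d, d)) / np.sqrt(d)

text = rng.standard_normal((3, d))              # "cat", "red", "chair"
image = rng.standard_normal((3, d))             # p1, p2, p3

# Cross-attention: image supplies Queries, text supplies Keys and Values.
# Only the image side gets an output; the text is never updated.
img_out, cross_w = attention(image @ proj(), text @ proj(), text @ proj())

# Joint self-attention: each modality has its own Q/K/V weights,
# but the projected tokens share one sequence and one attention pass.
tq, tk, tv = proj(), proj(), proj()
iq, ik, iv = proj(), proj(), proj()
q = np.concatenate([text @ tq, image @ iq])
k = np.concatenate([text @ tk, image @ ik])
v = np.concatenate([text @ tv, image @ iv])
joint_out, joint_w = attention(q, k, v)
text_out, image_out = joint_out[:3], joint_out[3:]   # both sides are updated
```

`cross_w` is 3×3 (image rows, text columns), while `joint_w` is 6×6, matching the two grids in the figure.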
Because information moves both ways at every layer, text and image sharpen under each other. The payoff is closer prompt adherence, more legible text inside images, and steadier multi-subject handling. These gains are structural: they come from the architecture, not just from a larger training run.
Old: the image reads the prompt once per layer. New: text and image revise each other, every layer.
Page 27/Generation
U-Net.
The original diffusion backbone. Convolutional encoder–decoder with skip connections.
INPUTS: noisy latent + t + text
DOWN PATH: 64×64 conv → x-attn · 32×32 down → x-attn · 16×16 down → x-attn · 8×8 down → x-attn
BOTTLENECK: global structure lives here
UP PATH: 8×8 → 16×16 → 32×32 → 64×64 · each stage receives a skip connection from the matching down stage
OUTPUT: predicted noise tensor
RESOLUTION halves each downsample · SKIPS preserve detail across squash · X-ATTN text conditioning at every level
DDPM · Stable Diffusion 1/2 · the convolutional pyramid that powered the 2022 image-gen wave
The first diffusion models (DDPM, Stable Diffusion 1 and 2) used a backbone called a U-Net. It is convolutional: it processes an image with small filters that slide across it looking for local patterns, rather than with attention. The latent enters at full resolution and is downsampled stage by stage, halving in width and height each time, until it reaches a small bottleneck in the middle; it is then upsampled through mirror-image stages back to full resolution. Drawn out, the path down and back traces a U.
Two features make the U work. Skip connections run straight across it, handing each upsampling stage the detailed map from the matching downsampling stage, so fine detail survives the squeeze. And cross-attention layers (the text conditioning from Page 26) sit at every resolution level, so the prompt informs both the fine and the coarse stages. The effect is multi-scale: local texture near the top of the U, global composition at the bottleneck. The entire 2022 image-gen wave ran on this.
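The role of the skip connections can be seen in a stripped-down U shape with no learned layers at all. This is only a structural sketch: average pooling and nearest-neighbour upsampling stand in for the convolutional stages, and there is no cross-attention, timestep input, or training.

```python
import numpy as np

def down(x):
    """Halve resolution with 2x2 average pooling."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def up(x):
    """Double resolution by repeating each value in a 2x2 block."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def tiny_unet(x, depth=3):
    """Go down `depth` stages, then back up, adding the matching skip at each stage."""
    skips = []
    for _ in range(depth):
        skips.append(x)              # remember the detailed map before squeezing
        x = down(x)
    bottleneck_shape = x.shape
    for skip in reversed(skips):
        x = up(x) + skip             # the skip connection restores fine detail
    return x, bottleneck_shape

latent = np.arange(64 * 64, dtype=float).reshape(64, 64)
out, bottleneck_shape = tiny_unet(latent)      # 64 -> 32 -> 16 -> 8 -> ... -> 64

# Fine detail does not survive the squeeze on its own:
checker = (np.indices((4, 4)).sum(axis=0) % 2).astype(float)
blurred = up(down(checker))          # every value collapses to 0.5
```

The checkerboard is the worst case for pooling: one down-and-up round trip erases it completely, which is exactly the detail the skips exist to carry across.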
What ended its run was scaling. A U-Net does not grow as cleanly as a pure transformer — its custom shape resists the simple "make it bigger" recipe that worked for language. Removing that limit is exactly what the next page, DiT, set out to do.
Shrink the image to capture overall structure, then enlarge it back, while skip connections carry the detail across.
Page 28/Generation
DiT.
Diffusion Transformers. Replace the U-Net with a pure transformer on latent patches.
LATENT IMAGE: 64 × 64 × 4 · split into a 16 × 16 grid of patches (p0 ... p255)
256 patches · 4×4 latent pixels each · + position embeddings
transformer × N · DiT-XL = 28 blocks
predicted noise per patch
scaling: more Gflops → lower FID
Peebles & Xie 2022 · patches are tokens · the language stack denoises images
Peebles & Xie 2022 (arXiv:2212.09748) made the swap. They kept diffusion as it was but replaced the U-Net denoiser with a plain transformer — the same block stack from the language section — operating on patches of the latent: cut the latent grid into square patches, treat each as a token, add position information, and run the transformer over that sequence. The paper describes it as "replacing the commonly-used U-Net backbone with a transformer that operates on latent patches," and it scales the way language models do — more compute (Gflops), lower FID (DiT-XL/2 reached a then-record 2.27 on ImageNet 256×256).
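The "cut into patches, treat each as a token" step is a reshape. A minimal sketch using the figure's shapes (a 64×64×4 latent and 4×4 patches); position embeddings and the linear projection to model width are left out, and the function name `patchify` is illustrative.

```python
import numpy as np

def patchify(latent, patch):
    """Cut an (H, W, C) latent into a sequence of flattened square patches."""
    h, w, c = latent.shape
    gh, gw = h // patch, w // patch
    x = latent.reshape(gh, patch, gw, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)               # (gh, gw, patch, patch, c)
    return x.reshape(gh * gw, patch * patch * c)  # one row per patch token

latent = np.arange(64 * 64 * 4, dtype=float).reshape(64, 64, 4)
tokens = patchify(latent, 4)                      # 256 tokens, 64 numbers each
```

From here on the transformer sees only a 256-long sequence of vectors, the same kind of input a language model sees after embedding its text tokens.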
The deeper consequence is generality. Once the denoiser is a transformer, it no longer cares what the patches represent or how they are arranged. Patches laid out across space give you an image. Add patches across time and you get video, from the very same architecture with a longer sequence — which is exactly the bridge to Page 35.
So DiT is the lineage step that made Sora possible. The U-Net was a custom shape built for images. DiT is the same engine that powers GPT, simply pointed at a different job: denoising patches instead of predicting the next word.
Once patches are treated as tokens, the image side runs on the very same transformer as language.
Page 29/Generation
MM-DiT.
Multimodal Diffusion Transformer. SD3 and Flux's backbone.
ONE JOINT SEQUENCE: TEXT TOKENS [a] [red] [chair] [on] [velvet] + IMAGE PATCH TOKENS [p00] [p01] [p02] · · · [p255]
SEPARATE WEIGHTS per modality: text Q,K,V · image Q,K,V
SHARED ATTENTION: one attention matrix · text & image mix
text tokens refined under image context · image patches refined under prompt context · both, every layer
Not prompt pasted onto image. One joint sequence.
Esser et al. 2024 · arXiv:2403.03206 · SD3 · Flux · the 2024-2026 backbone
MM-DiT is the multimodal version of the DiT from Page 28. DiT ran a transformer over image patches alone; MM-DiT puts the text tokens into the same sequence as the image patches and runs one transformer over both. Each modality keeps its own weights for producing Queries, Keys, and Values, but they share a single self-attention pass. The bidirectional text-and-image flow described on Page 26 is the direct consequence of that one shared sequence — this page is simply its proper name and home.
SD3, Flux, and a growing list of 2025-2026 models are built on MM-DiT. What you see from it is closer prompt adherence, markedly better text rendered inside images (logos, signs, captions), and steadier handling of prompts that name several subjects at once. Those gains trace to the architecture change, not to a larger training run.
This is why SD3 and Flux are not simply "a newer Stable Diffusion 1." Architecturally they belong to a different family: the DiT lineage that runs a transformer over patches, rather than the U-Net lineage they replaced. Same diffusion idea, different engine underneath.
Text and image share one attention pass, so each reshapes the other at every layer.
Page 30/Generation
Flow matching.
Regress vector fields, not noise. A cleaner parameterization; same destination.
LEARNED VELOCITY FIELD v(x, t): at every point in space and time, which direction does the data flow? Arrows run from the noise side (t = 0) toward the image on the data side (t = 1).
OLD OBJECTIVE (DDPM-style): model predicts the noise added at each step · requires noise schedule · hand-tuned · brittle
NEW OBJECTIVE (flow matching · Lipman 2022): model predicts the velocity at each point · no schedule · stable training · standard ODE solvers at sampling
"A simulation-free approach for training CNFs based on regressing vector fields of fixed conditional probability paths."
Lipman et al. 2022 · arXiv:2210.02747 · cousin to diffusion, not replacement
Lipman et al. 2022 (arXiv:2210.02747) reframed what the network is trained to predict. Classic diffusion predicts the noise to remove at each step. Flow matching instead trains the model to predict a velocity: at any point between pure noise and real data, it answers "which direction, and how fast, should this point move to get closer to the data?" Collect those answers across the whole space and you have a vector field, a set of direction arrows like the ones in the figure. The paper describes it as "a simulation-free approach for training CNFs based on regressing vector fields of fixed conditional probability paths," where CNFs are continuous normalizing flows, models that generate a sample by integrating a learned velocity field.
The practical wins follow from that choice. There is no hand-tuned noise schedule to get right, because the model learns the trajectory directly. Training is more stable. And because the result is a velocity field, generating a sample becomes a standard calculus problem — follow the arrows from noise to data with an off-the-shelf ODE solver, the same kind of routine used to integrate any system of motion over time.
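Both halves of that story, the regression target and the ODE-solver sampling, fit in a short sketch. It assumes the simplest conditional path, a straight line between a noise sample and a data sample, and it replaces the trained network with an `oracle` field that is exact only because the toy dataset is a single point.

```python
import numpy as np

rng = np.random.default_rng(0)

def training_example(x1):
    """Build one regression example: input (x_t, t) and its target velocity."""
    x0 = rng.standard_normal(x1.shape)    # a noise sample
    t = rng.uniform()                     # a random time between 0 and 1
    xt = (1.0 - t) * x0 + t * x1          # assumed straight-line path from noise to data
    target = x1 - x0                      # velocity along that path
    return xt, t, target

def euler_sample(velocity, x0, steps):
    """Follow the arrows from t=0 (noise) to t=1 (data) with Euler steps."""
    x, dt = np.array(x0, dtype=float), 1.0 / steps
    for i in range(steps):
        x = x + dt * velocity(x, i * dt)
    return x

data_point = np.array([2.0, -1.0])        # toy dataset with a single point

def oracle(x, t):
    """Stand-in for a trained network: the exact field for a one-point dataset."""
    return (data_point - x) / (1.0 - t)

xt, t, target = training_example(data_point)
sample = euler_sample(oracle, rng.standard_normal(2), steps=10)
```

Training would minimise the squared error between the network's output at `(xt, t)` and `target`; sampling never needs a noise schedule, only a solver loop like `euler_sample`.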
Flow matching and diffusion are not rival paradigms. They are close cousins that reach the same place by different math. The field now uses "diffusion" loosely to cover both, so when you read "a diffusion model" in 2026, it may well have been trained this way.
Diffusion predicts the noise to remove. Flow matching predicts the direction to move. Same destination.
Page 31/Generation
Rectified flow.
Straight-line transport. A straighter path from noise to image means fewer, bigger steps.
CLASSICAL DIFFUSION TRAJECTORY (DDPM): curved path from noise to image · many small steps required to integrate accurately · ~50 steps (DDIM) · ~1000 steps (DDPM)
RECTIFIED FLOW TRAJECTORY (SD3, Flux): straight path from noise to image · take big steps without losing fidelity · ~20 steps (Flux) · 2.5× fewer than SD1's ~50
THE TRAINING TRICK: the training target is the straight line from noise to data · straight trajectories are cheap to integrate · the math does more work per step
Liu et al. 2022 · arXiv:2209.03003 · adopted by SD3 and Flux · why 2025 image gen got fast
Rectified flow is a particular kind of flow matching. Flow matching (Page 30) learns a velocity field and samples by following its arrows. If those arrows trace a curved path from noise to image, you have to follow it in many small steps to stay accurate, the way you would trace a winding line with short pen strokes. Rectified flow builds straightness into the training target: each noise sample is paired with a data sample, and the model learns the constant velocity along the straight line that joins them, which pushes the sampling path toward a straight line. The SD3 abstract states it directly: "Rectified flow is a recent generative model formulation that connects data and noise in a straight line."
A straight path is cheap to follow. Each step of the ODE solver can be large without drifting off course, so reaching the same quality needs far fewer steps. Flux reaches a high-quality image in about 20 steps where SD1 needed roughly 50, a 2.5× cut. The straighter geometry lets the math do more work per step.
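Why straightness buys large steps can be shown with two toy paths between the same endpoints. Neither is a diffusion model: `straight` is a constant-velocity field and `curved` is a quarter-circle rotation, both chosen only to illustrate how Euler steps behave on each geometry.

```python
import numpy as np

def euler(velocity, x0, steps):
    """Integrate from t=0 to t=1 with equal Euler steps."""
    x, dt = np.array(x0, dtype=float), 1.0 / steps
    for i in range(steps):
        x = x + dt * velocity(x, i * dt)
    return x

start, end = np.array([1.0, 0.0]), np.array([0.0, 1.0])

def straight(x, t):
    """Constant velocity: the path from start to end is a straight line."""
    return end - start

def curved(x, t):
    """Rotation field: the true path from start to end is a quarter circle."""
    return (np.pi / 2) * np.array([-x[1], x[0]])

def error(velocity, steps):
    """Distance between where Euler lands and the true endpoint."""
    return float(np.linalg.norm(euler(velocity, start, steps) - end))

straight_err_4 = error(straight, 4)    # exact, even with only 4 steps
curved_err_4 = error(curved, 4)        # large error: 4 steps cut the corner badly
curved_err_50 = error(curved, 50)      # small error, but it took 50 steps
```

On the straight path the step count barely matters; on the curved one accuracy has to be bought with steps.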
SD3, Flux, and a lengthening list of 2025-2026 models train with rectified flow. The noticeably faster image generation of the past year is partly faster hardware, partly better samplers (Page 33), and partly this move to a straighter path.
Straighten the path from noise to image, and you can cross it in 20 steps instead of 50.
Page 32/Generation
CFG.
Classifier-Free Guidance. Run the model twice; amplify the difference.
pred = uncond + scale × (cond − uncond)
prompt: "A red wine bottle on velvet"
scale 1 · generic image → prompt barely pulls (no guidance)
scale 4 · balanced → prompt visible, natural texture
scale 7 · typical chat-UI default → clean adherence, good texture
scale 12 · strong adherence → common UI ceiling, getting harsh
scale 20 · overcooked → artifacts, oversaturation, literal-minded composition
COST: each denoise step runs both conditional + unconditional → 2× compute per step
Ho & Salimans 2022 · arXiv:2207.12598 · the slider in every image-gen UI
Classifier-Free Guidance (Ho & Salimans 2022, arXiv:2207.12598) is a trick for making the model follow the prompt more strongly. It begins in training: about 10% of the time the prompt is dropped, so the model learns to denoise both with conditioning (the prompt) and without it. At generation time, every denoising step is then run twice — once given the prompt, once given nothing. Call those two predictions cond and uncond. The model follows an exaggerated version of their difference, as the figure's formula shows: prediction = uncond + scale × (cond − uncond). The larger the scale, the harder the result is pushed in the direction the prompt added.
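The guidance formula and the training-time prompt dropout are each one line. In this sketch the two-element arrays stand in for the model's two noise predictions, and `maybe_drop_prompt` with its empty-string convention for "no prompt" is a hypothetical helper, not any library's API.

```python
import numpy as np

def guided_prediction(uncond, cond, scale):
    """Classifier-free guidance: push past cond in the direction the prompt added."""
    return uncond + scale * (cond - uncond)

def maybe_drop_prompt(prompt, rng, p_drop=0.1):
    """Training-time dropout: replace the prompt with an empty one about 10% of the time."""
    return "" if rng.random() < p_drop else prompt

uncond = np.array([0.0, 1.0])     # stand-in prediction without the prompt
cond = np.array([1.0, 1.0])       # stand-in prediction with the prompt

at_zero = guided_prediction(uncond, cond, 0.0)    # ignores the prompt entirely
at_one = guided_prediction(uncond, cond, 1.0)     # exactly the conditional prediction
at_seven = guided_prediction(uncond, cond, 7.0)   # the prompt's direction, amplified 7x

rng = np.random.default_rng(0)
dropped = sum(maybe_drop_prompt("a red wine bottle", rng) == "" for _ in range(10000))
```

Only the component where `cond` and `uncond` disagree is amplified; the component they share is left alone, which is why the dial reads as "how hard to push the prompt".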
That scale is the dial. At scale 1 there is no amplification, so the prompt barely pulls and you get a generic image. Scale 7 is the typical default in chat UIs: clear adherence with natural texture. By scale 12, a common UI ceiling, adherence is strong but the image is starting to look harsh, and by 20 it is overcooked: oversaturated, harsh, weirdly literal. This dial is exactly the "CFG scale" slider you see in image-generation interfaces.
CFG is the single largest prompt-adherence lever in image and video generation. The cost, noted at the bottom of the figure, is compute: running the model twice per step roughly doubles the work. Veo and other video models use the same trick.
One dial for how hard to push the prompt. Too low is generic; too high is overcooked.
Page 33/Generation
Samplers.
Different paths from noise to image. Quality vs. speed.
RULE OF THUMB
low step count → sampler choice matters a lot
high step count → samplers converge
model + budget → pick the sampler
"Best" sampler depends on architecture and your step budget.
The sampler dropdown · Karras et al. 2022 unified the framework · arXiv:2206.00364
The path a sample takes from noise to image is described by a differential equation, an equation that says how the point should change at each instant. A sampler is a numerical solver for that equation: a recipe for taking discrete steps along the path. Different samplers step differently, trading accuracy against speed in different ways. Karras et al. 2022 (arXiv:2206.00364) unified the many competing proposals into one framework, and modern interfaces expose a handful of the common ones in a dropdown.
How much the choice matters depends entirely on your step budget. At a low number of steps, samplers visibly disagree: same prompt, same seed, different result. At a high number of steps they converge toward the same image, because the path is being traced finely enough that the stepping rule stops mattering. So the "best" sampler is not absolute; it depends on the model and on how many steps you are willing to spend. Flux ships with a flow-matching ODE solver native to how it was trained; SD1 ships with DDIM.
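The convergence claim can be checked on any differential equation. The sketch below compares two standard stepping rules, Euler and Heun, on a toy decay equation; the equation and the step counts of 5 and 200 are arbitrary choices for illustration, not values from any image model.

```python
def euler_solve(f, x, steps):
    """First-order rule: one slope evaluation per step."""
    h = 1.0 / steps
    for i in range(steps):
        x = x + h * f(x, i * h)
    return x

def heun_solve(f, x, steps):
    """Second-order rule: average the slope at the start and at the predicted end."""
    h = 1.0 / steps
    for i in range(steps):
        t = i * h
        k1 = f(x, t)
        k2 = f(x + h * k1, t + h)
        x = x + h * (k1 + k2) / 2
    return x

def decay(x, t):
    """Toy ODE dx/dt = -2x, integrated from t=0 to t=1."""
    return -2.0 * x

gap_low = abs(euler_solve(decay, 1.0, 5) - heun_solve(decay, 1.0, 5))
gap_high = abs(euler_solve(decay, 1.0, 200) - heun_solve(decay, 1.0, 200))
```

At 5 steps the two rules land visibly apart; at 200 they nearly coincide, because the path is traced finely enough that the stepping rule stops mattering.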
Sampler choice does matter, precisely at the step counts people run in production, around 20 to 30, low enough that the samplers have not yet converged.
The path is fixed; the sampler is how you step along it. At low step counts, the choice shows.
Page 34/Generation·Summary
Gen, in one breath.
Compress · diffuse · denoise · condition · sample. Five components.
SAME SKELETON · DIFFERENT MUSCLES
SD1: VAE + U-Net + CLIP + cross-attn + DDIM
SDXL: VAE + U-Net + CLIP-L + CLIP-G + cross-attn + DDIM
SD3: VAE + MM-DiT + CLIP-L + CLIP-G + T5 + joint + rectified flow
Flux: VAE + MM-DiT + CLIP-L + T5 + joint + rectified flow ODE
The full image-gen stack · two architectural shifts defined 2024–2026
Every modern image generator is the same five pieces, with different choices plugged into each slot. The pieces: a VAE to compress and decompress (Page 24), diffusion to turn noise into a latent (Page 23), a backbone to do the denoising (U-Net or DiT), a conditioning stack to read the prompt (Page 25), and a sampler to walk the path (Page 33). SD1 fills those slots with VAE + U-Net + CLIP + cross-attention + DDIM. Flux fills the same slots with VAE + MM-DiT + CLIP-L + T5-XXL + joint attention + a rectified-flow ODE. The same five slots, different parts in each.
Two of those slots changed enough to define the 2024-2026 era. The backbone moved from U-Net to MM-DiT, joining the DiT lineage. The training objective moved from classical noise-prediction diffusion to rectified flow. Both shifts originated in papers, not products. The faster, sharper image and video tools you have used this year are the downstream consequence of those two changes.
Every image model is the same five pieces. Products differ only by what fills each slot.
Page 35/Bridge
Spacetime patches.
Video is a longer patch sequence run through the same diffusion transformer.
IMAGE: single frame · spatial only · 2 × 4 grid of patches p0 to p7 · 8 patches · sequence length 8
VIDEO: spatial × temporal · the same grid, stacked over time · t=0: p00 to p07 · t=1: p08 to p15 · t=2: p16 to p23
sequence = p00, p01, ... p23 · attention runs across space + time
p17 (later) attends to p05 (earlier) ↳ how coherence works
variable resolution, duration, aspect ratio · same model, all
Sora · Feb 2024 · "a still image is videos with a single frame"
Page 28 promised that video falls out of the same architecture. The Sora technical report puts it directly: "We turn videos into patches by first compressing videos into a lower-dimensional latent space, and subsequently decomposing the representation into spacetime patches." A spacetime patch is a small chunk of video — a few pixels wide and tall, a few frames long — and the Page 28 diffusion transformer treats each as one token, exactly as it treated flat image patches. A still image is simply "videos with a single frame."
Because the model only ever sees a sequence of patches, format stops mattering: a longer or larger video is just a longer sequence, and the transformer handles any length — vertical or widescreen, seconds or a minute, one model with no per-format code path.
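The "longer video, longer sequence" point is visible in a patchify function with one extra axis. The latent shapes and the 1×8×8 patch size below are invented for illustration; the Sora report does not publish its patch dimensions.

```python
import numpy as np

def spacetime_patchify(video, pt, ph, pw):
    """Cut a (T, H, W, C) latent into flattened patches spanning pt frames, ph rows, pw columns."""
    t, h, w, c = video.shape
    gt, gh, gw = t // pt, h // ph, w // pw
    x = video.reshape(gt, pt, gh, ph, gw, pw, c)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)          # (gt, gh, gw, pt, ph, pw, c)
    return x.reshape(gt * gh * gw, pt * ph * pw * c)

# Assumed latent shapes: a still image is a video with a single frame.
image = np.zeros((1, 32, 32, 4))
clip = np.arange(8 * 32 * 32 * 4, dtype=float).reshape(8, 32, 32, 4)
wide = np.zeros((8, 32, 64, 4))                   # same duration, wider frame

image_tokens = spacetime_patchify(image, 1, 8, 8)   # 16 tokens
clip_tokens = spacetime_patchify(clip, 1, 8, 8)     # 128 tokens
wide_tokens = spacetime_patchify(wide, 1, 8, 8)     # 256 tokens
```

Every token has the same width in all three cases; only the sequence length changes, which is the sense in which one transformer covers stills, clips, and different aspect ratios.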
Sora does not generate frames one after another. It denoises the entire grid of spacetime patches at once — frame N+1 is not predicted from frame N. Because attention runs across the whole sequence, every patch can attend to every other, earlier or later. That all-pairs view is why long-range coherence — a character staying the same person across a shot — holds at all.
A still image is one frame. Video is the same patches, with time as one more axis.
Page 36/Bridge
Emergent physics.
3D, object permanence, world simulation. Phenomena of scale.
NO EXPLICIT MODULES [no 3D engine] [no object database] [no rigid-body sim] [no physics priors] YET, AT SCALE, THESE EMERGE camera move → objects stay coherent in 3D occlusion → objects persist after leaving frame paint stroke → marks stay on the canvas character turn → identity mostly holds liquid pour → plausible flow direction multiple shots, same person → wardrobe and face hold across cuts FAILURES THAT REMAIN glass shatter → wrong fragments, wrong sound hands holding tools → intermittent contact, mis-grip cause / effect chains → plausible but not reliable long videos (>30s) → drift, identity loss "These properties emerge without any explicit inductive biases for 3D, objects, etc. — they are purely phenomena of scale." — Sora technical report, Feb 2024
Emergent ≠ correct · the most direct visual demonstration of the bitter lesson
Sora shows behaviors nobody wrote into it: move the camera and objects stay consistent in 3D; let one pass behind another and it survives the occlusion; cut between shots and a character mostly keeps its face and clothes. None of this was coded as a rule. In the report's words, "these properties emerge without any explicit inductive biases for 3D, objects, etc. — they are purely phenomena of scale." An inductive bias is a built-in assumption a designer bakes in; Sora got none of these, and developed them anyway.
The mechanism is the striking part. The team did not build a 3D engine, an object tracker, or a physics simulator. They built one thing: a model that predicts spacetime patches, trained at very large scale. Something that behaves like a physics engine emerged inside it — not because anyone asked for physics, but because crudely tracking how the world moves is the cheapest way to predict what the next patches will look like.
This is also where to stay careful. Saying "Sora understands physics" claims too much. It models physics well enough to predict patches most of the time, and the team is explicit about where that breaks: "it does not accurately model the physics of many basic interactions, like glass shattering." The figure lists more of these failures. Emergent is not the same as correct: the behavior appeared because of scale, but appearing is not the same as being reliable.
They built a patch predictor. A physics engine emerged.
Page 37/Bridge
The bitter lesson.
Same answer, three times. Language, image, video — one curve.
LANGUAGE (Kaplan/Chinchilla): loss falls as compute grows
IMAGE (DiT): FID falls as Gflops grow
VIDEO (Sora compute scan): sample quality rises from base compute to 4× to 32×
THE TURN: general method + scale > hand-built priors
SUTTON 2019: "We have to learn the bitter lesson that building in how we think we think does not work in the long run."
THE NUANCE: loss curves don't break · specific benchmarks saturate · the open question is whether data is the binding constraint
Three domains · one curve · same conclusion · Sutton's bitter lesson, applied three times
Rich Sutton named the pattern in 2019: across the history of AI, general methods that scale with compute have repeatedly beaten clever, hand-crafted approaches built on human insight (his blunt version is in the figure). It is called bitter because the hand-crafted approaches are the ones researchers are proudest of, and scale keeps winning anyway. What makes the present moment notable is that the past decade ran the same experiment in three domains (language, image, and video) and got the same result in each.
The figure puts three curves side by side. Language models follow the Kaplan and Chinchilla scaling laws from Page 14. DiT, the image backbone, follows the same curve: more compute, lower FID. Sora's published comparison shows the same shape for video — scale compute from a base run to 4× to 32×, and sample quality climbs the whole way. Three independent fields, one relationship.
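The shared relationship can be made concrete with a small sketch. It assumes the simplest possible form, a loss that falls as a power law in compute, loss = a · compute^(-b), which shows up as a straight line on log-log axes. The data points are synthetic and invented for illustration; they are not taken from Kaplan, Chinchilla, DiT, or Sora, and `fit_power_law` is a hypothetical helper, not a function from any published codebase. Published scaling laws are stated in more detail than this.

```python
import math

def fit_power_law(compute, loss):
    """Least-squares fit of loss = a * compute**(-b), done in log-log space."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(l) for l in loss]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (a, b)

# Synthetic points generated from a = 4.0, b = 0.1 (illustrative only).
compute = [1.0, 4.0, 32.0, 256.0]
loss = [4.0 * c ** -0.1 for c in compute]

a, b = fit_power_law(compute, loss)

# Extrapolate one step further along the same line.
predicted = a * 1024.0 ** -b
```

The fit recovers the exponent used to generate the points, and the extrapolated loss at larger compute is lower than any observed value. That is all "the curve keeps falling" means in practice: the straight line on log-log axes has not yet bent.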
Both extreme readings are wrong. "Scaling is hitting a wall" overstates it: specific benchmarks saturate, but the underlying loss curves keep falling. The opposite reading, that scale alone settles everything, overstates it too, because the curves need data as well as compute. The open question is whether data is the binding constraint: whether we run out of high-quality training data before the curves give out.
Three domains, one curve, same answer: general method plus scale beats hand-built priors.
Page 38/Bridge
Same engine.
Self-attention runs everywhere. Different tokens go in; the operation itself is identical.
UPSTREAM OF ATTENTION · how to tokenize · how to condition · how to compose
DOWNSTREAM OF ATTENTION · how to decode → text · pixels · motion
the middle is the same
One primitive · five surfaces · the architectural creativity is at the edges
Line up every system in this deck and one thing repeats. A language model runs self-attention over text tokens. CLIP and SigLIP run it over image patches. DiT and MM-DiT run it over latent image patches. Sora runs it over spacetime patches. The multimodal models, Gemini and GPT-4o, run it over a single mixed sequence of text tokens and image tokens projected into the same space. The operation from Page 06 — every element attending to every other — sits inside all of them.
What changes from one system to the next is never that operation. It is the two things on either side of it. Upstream sits the tokenization and conditioning: how the world is cut into a sequence, and how a prompt is fed in. Downstream sits the decoder: how the output sequence is turned back into text, pixels, or motion. All the architectural creativity lives at those two edges. The middle stays the same.
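A minimal sketch of that claim, under loud simplifications: the attention function below sets queries, keys, and values equal to the input (real models apply learned projections first), and the two input sequences are hand-written stand-ins rather than the output of any real tokenizer or image encoder. The point is only that one unchanged function accepts both.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Scaled dot-product self-attention with Q = K = V = tokens.

    Learned projection matrices are omitted to keep the sketch short.
    """
    dim = len(tokens[0])
    out = []
    for query in tokens:
        # Every element scores itself against every element in the sequence.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in tokens]
        weights = softmax(scores)
        # Each output is a weighted mix of all the value vectors.
        out.append([sum(w * value[i] for w, value in zip(weights, tokens))
                    for i in range(dim)])
    return out

# Upstream edge: two different ways of producing a sequence (hypothetical values).
text_tokens = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
image_patches = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.5, 0.5, 0.0]]

# The middle: the same function, unchanged, for both sequences.
text_out = self_attention(text_tokens)
patch_out = self_attention(image_patches)
```

Nothing in `self_attention` knows whether its input came from words or pixels. Everything that distinguishes a language model from an image model would sit in the code that builds the input lists and the code that consumes the output lists.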
One attention operation. Different tokens in, different decoders out.
Page 39/Bridge·For builders
Work with the grain.
The model sits in the middle. What you build is the scaffolding around it, and that scaffolding is the work.
USER INTENT
├─ retrieval · embeddings → top-K context ← Page 20
├─ tools · APIs · files · databases · code ← grounding via execution
├─ steering · system prompt · examples · policy ← Pages 17-19
├─ sampling · temperature · top-p · structured ← Page 15
└─ evaluation · citations · checks · human review ← trust UX
▼
┌─────────┐
│  MODEL  │ ← shared infrastructure · not yours to redesign
└────┬────┘
▼
grounded answer / image / action
THE MODEL'S JOB → YOUR JOB
fluent autocomplete → give it good context
fingerprint matcher → give it good queries
denoiser → give it good conditioning
scaffold the model · do not redesign it
CoWriter · Mosaic · every serious AI product · the architecture that ages well
If you build products on these models (CoWriter on LLMs, Mosaic on SigLIP plus Gemini Vision), the choices that age well respect how the model actually works. Use embeddings for retrieval (the geometry from Page 11). Use the post-training surface (system prompts, constitutions where available) for steering, because that is the layer where personality lives. Use grounding (RAG, tools, citations) for truth, because the model cannot supply it on its own. The model has a grain; these choices run along it, not against it.
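The retrieval and grounding choices can be sketched in a few lines. This is an illustration under stated assumptions, not how CoWriter or Mosaic is built: the embedding vectors are hand-written stand-ins for what an embedding model would return, and `top_k` and `build_prompt` are hypothetical helper names.

```python
import math

def cosine(a, b):
    """Cosine similarity: how closely two embedding vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_k(query_vec, corpus, k):
    """Return the k passages whose embeddings are closest to the query."""
    ranked = sorted(corpus, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

def build_prompt(question, passages):
    """Ground the question by placing numbered sources ahead of it."""
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return ("Answer using only the sources below and cite them by number.\n"
            f"{context}\n\nQuestion: {question}")

# Hypothetical vectors; a real system would get these from an embedding model.
corpus = [
    ("Refunds are issued within 14 days.", [0.9, 0.1, 0.0]),
    ("Our office is closed on Sundays.", [0.0, 0.2, 0.9]),
    ("Refund requests need an order number.", [0.8, 0.3, 0.1]),
]
query_vec = [1.0, 0.2, 0.0]  # stands in for the embedding of the question

passages = top_k(query_vec, corpus, k=2)
prompt = build_prompt("How do refunds work?", passages)
```

The model never appears in this sketch, and that is the design choice being illustrated: the scaffolding picks the two refund passages, leaves out the irrelevant one, and hands the model a prompt with numbered sources it can cite.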
Each model comes down to one job, and your task is to feed it well. The figure's two columns say it: the fluent autocompleter wants good context, the fingerprint matcher wants good queries, and the denoiser wants good conditioning. Supply the input the model is built to use, and stay out of its way.
It is a division of labor. The model is shared infrastructure: you did not train it and will not redesign it. What you own is everything around it — retrieval, prompts, tools, checks, interface. That scaffolding is where a product is won or lost, and it is the part that is yours to build.
The model is shared infrastructure. The scaffolding around it is the part you actually build.
End Plate · FIN · 40 / 40
One substrate. Two stacks.
The engine is attention · models turn fluent before they turn correct · scale finds the priors no one coded · the scaffolding around the model is your job.