A Plain-Language Manual · 40 Pages

AI Models.

How language, image, and video models work, taken apart one mechanism at a time.

Bradley Tangonan · 2026-05-28

Page 02/Intro

Two stacks. One substrate.

Seven familiar products. Two ways to decode. One shared operation doing the work in the middle.

LANGUAGE                GENERATION
────────                ──────────

tokens                  latent patches
   │                          │
   ▼                          ▼
embed                   VAE encode
   │                          │
   ▼                          ▼
┌───────────┐           ┌───────────┐
│transformer│           │transformer│
│decoder×80 │           │ DiT × 28  │
│causal     │           │ full      │
└───────────┘           └───────────┘
   │                          │
   ▼                          ▼
unembed                 VAE decode
   │                          │
   ▼                          ▼
next token              image/video
   │                          │
   ▼                          ▼
sample · repeat         final frame

──── attention(Q,K,V) ────
identical · every layer · every block

Two stacks · one engine · the substrate is attention

Start with seven products you have probably used. On the language side: ChatGPT, Gemini, Claude. On the image and video side: Midjourney, Veo, Sora, Flux. They feel like different kinds of software. Underneath, they run the same core operation. It is called attention, and Section 01 (Pages 05 to 08) takes it apart properly. For now one sentence is enough: attention lets every element in a sequence look at every other element at the same time, and update itself based on what it finds.

The two families differ in what goes in and what comes out, not in the engine. A language model takes in fragments of words and builds its answer one fragment at a time. An image or video model takes in random noise and removes that noise, in steps, until a picture is left. Different inputs and different outputs, with the same attention operation doing the work between them.

This manual is organized around that shared middle. Thirteen pages walk through the language stack. Thirteen walk through image and video. Five trace what the two have in common. Each page covers one idea. Read in order, or use the dock along the bottom to jump.

Text and pixels need different decoders. The operation between them is the same.

Page 03/Intro

Why · what. Now · how.

Every concept page follows the same four-beat shape, in the same order.

¶1 · why

The hook.

The problem this concept exists to solve.

"Pixel-space diffusion is prohibitively expensive."

¶2 · what

The mechanism.

Stated in one paragraph.

"A VAE compresses 1024² to 64²; diffusion runs there."

¶3 · now

The consequence.

Where it shows up in products you've used.

"Stable Diffusion, Flux, SD3, Veo, Sora — all use this."

─ · how

The takeaway.

Layperson model + the common wrong claim.

"Sculpt the storyboard. The VAE isn't lossless."

Four beats · the shape every concept page follows

Every concept page is built on the same four beats, in the same order. First, why the idea exists: the problem it was invented to solve. Second, what it actually does, stated as a mechanism in one paragraph. Third, where it already shows up in products you have used. Fourth, the short version most people get wrong, named and then corrected.

The layout supports those beats. The figure on the left is the anchor: an equation, a worked example, a comparison table, or an annotated flow. The prose on the right narrates what the figure shows, one step at a time. The single line at the bottom is the one sentence to keep if you keep nothing else from the page.

Navigation is keyboard-first. The arrow keys or the space bar move forward and back. The number keys 1 through 9 jump to the first nine pages. Home and End jump to the first page and the last. The dock along the bottom groups every page by section.

One concept per page. One line to keep.

Page 04/Intro

The transformer.

One block — attention plus a small network — stacked dozens deep. The engine inside every model here.

INPUT  sequence of tokens / patches
[t0] [t1] [t2] [t3] ... [tN]
              ↓
         [embeddings]
              ↓
   ┌──────────────────────┐
   │    BLOCK × N         │
   │                      │
   │  ↓ self-attention    │
   │  ↓ + residual        │
   │  ↓ MLP (per token)   │
   │  ↓ + residual + norm │
   │                      │
   │  N = 32 to 120       │
   └──────────┬───────────┘
              ↓
       [output head]
              ↓
OUTPUT  same-shape sequence
        → next-token   (LLM)
        → noise        (diffusion)
        → fingerprint  (encoder)

SAME ENGINE, DIFFERENT TOKENS
 LLMs              text tokens
 CLIP / SigLIP     image patches
 DiT / MM-DiT      latent patches
 Sora              spacetime patches

One engine · different tokens · the middle of every model in this deck

A transformer is a single repeating part, stacked. The part is called a block, and every block has the same structure. Inside one block, two things happen in order. First a self-attention step, where each element in the sequence looks at every other element. Then a feed-forward step, called an MLP (a small network applied to each position on its own). Both steps are wrapped in residual connections, which means a step's input is added back onto its output. Because of that addition, each block edits the running representation rather than overwriting it. Stack 32 to 120 of these blocks and you have a transformer.

What flows through the stack is a sequence of vectors. A vector here is a list of numbers the network uses to represent one element. The elements can be fragments of words, patches of an image, or patches of video — anything that can be cut into a sequence. The output is a sequence of the same shape, which a decoder then reads in whatever way the product needs. A language model reads it as a probability distribution over the next token. A diffusion model reads it as an estimate of the noise to remove. An image embedder pools it down into a single summary vector.

So what changes from product to product is the tokenization that feeds the engine and the decoder that reads its output, not the engine itself. The middle stays the same. The next three pages open that middle up: first the attention operation itself, then the block it lives in, then the three architectural shapes the family splits into.

Change what you feed in and what reads the output, and the same engine becomes a different product.

Section 01 · The Substrate · 4 pages

attention.

The operation under everything. Language and generation share this mechanism before they split into different decoders.

Pages 05 → 08 Self-attention · transformer block · three shapes

Page 06/Attention

Q · K · V.

Query, Key, Value. Every element asks every other element a question, all at the same time.

SEQUENCE
[the] [cat] [sat] [on] [the] [mat]
              ↓
        ATTENTION PASS

SCORES FOR "cat"   SOFTMAX
cat → mat  = 3.1   0.62  ██████
cat → sat  = 1.9   0.19  ███
cat → on   = 1.3   0.10  ██
cat → the  = 0.8   0.06  █
cat → cat  = 0.1   0.03  ▏

UPDATE  (weighted sum of Values)

new(cat) =  0.62 · Vmat
         +  0.19 · Vsat
         +  0.10 · Von
         +  0.06 · Vthe
         +  0.03 · Vcat

all 6 tokens update in parallel

One attention head · the all-pairs operation in one slide

The most common mental model of language AI — that the model reads left to right and predicts the next word — describes only the output behavior. Inside, something different happens. At every layer of the network, every position in the input sequence updates itself by looking at every other position simultaneously. There is no "first one, then the next." It is one parallel operation across the whole sequence.

Each token in the sequence (each word fragment, in the language case) gets converted into three different vectors. A vector here means a list of numbers — say, 768 of them — that the network uses as that token's working representation. The three vectors serve three different purposes.

Query is "the kind of other token I am looking for in this sentence." If the token is cat, its Query might be tuned to look for verbs and locations.

Key is "what kind of token I am, advertised so others can find me." The Key for mat might advertise itself as a location.

Value is "if you decide to attend to me, this is the contribution I will pass along to your update."

Now the comparison. The Query for one token is compared against the Key of every token in the sequence, itself included, using a dot product. A dot product is a similarity score between two vectors: high when the two point in similar directions, low when they don't. So each pair gets a number. Softmax then takes those raw scores and turns them into percentages that sum to 100. Those percentages are the attention weights. Finally, the token rewrites itself as a weighted average of every Value in the sequence, using those percentages. If cat gave 62% of its weight to mat, the new cat representation is 62% "what mat contributes" plus 19% "what sat contributes" plus the rest. The whole sequence does this at the same time, and that single pass is one attention head in one layer.
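
The score, softmax, blend sequence above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not any model's actual code: the five raw scores are illustrative values chosen so that the softmax rounds to the weights in the figure, the Query, Key, and Value vectors are invented two-number lists, and the division by the square root of the vector length is the standard scaling that real attention applies before the softmax (the text does not mention it).

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    top = max(scores)                          # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Similarity score between two vectors."""
    return sum(x * y for x, y in zip(a, b))

def attend(query, keys, values):
    """One token's update: score every Key, softmax, blend the Values."""
    scale = math.sqrt(len(query))              # standard scaling, an addition to the text
    weights = softmax([dot(query, k) / scale for k in keys])
    blended = [sum(w * v[i] for w, v in zip(weights, values))
               for i in range(len(values[0]))]
    return weights, blended

# Illustrative scores for "cat" against mat, sat, on, the, cat (assumed values).
cat_weights = softmax([3.1, 1.9, 1.3, 0.8, 0.1])
print([round(w, 2) for w in cat_weights])

# Hypothetical Query, Keys, and Values for a three-token sequence.
weights, new_vector = attend(
    query=[3.0, 0.0],
    keys=[[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]],
    values=[[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]],
)
print(weights, new_vector)
```

The first print rounds to 0.62, 0.19, 0.10, 0.06, 0.03. In the second call the Query points the same way as the first Key, so the first Value dominates the blend.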

One more layer of structure. Running this operation once is a single attention head. A block runs several heads in parallel, and each head can specialize in a different kind of relationship — one tracking grammatical subjects, another tracking nearby modifiers. A common shorthand says "the model pays attention to the important words." That is not so much wrong as vague. The actual content is the table of weights in the figure, recomputed for every token at every layer.

Every token rewrites itself from a weighted blend of every other token — at every layer, all at once.

Page 07/Attention

Stacked.

The same block, repeated in series. Depth times heads is where composition comes from.

TRANSFORMER BLOCK

   input  ────┐
              │
              ▼
       ┌─────────────┐
       │  attention  │  ← all pairs · multi-head
       └──────┬──────┘
              │  + residual
              ▼
       ┌─────────────┐
       │     MLP     │  ← per-position non-linearity
       └──────┬──────┘
              │  + residual + layernorm
              ▼
   output ────┘   (same shape as input)

REPEAT × N     N = 32 (small) → 80 (Llama-70B) → 120+ (frontier)

One block · the same two steps · stacked N times

A single block does two things, both introduced on Page 04. First an attention step, where every position reads every other position. Then a feed-forward MLP step, applied to each position on its own. A residual connection wraps each step, which means the step's input is added back to its output, so the block adjusts the running representation instead of replacing it. Stacks run from about 32 of these blocks in small models to 120 or more in frontier ones.

Follow one token's representation up the stack to see why depth matters. The first block runs attention, so the token absorbs a weighted blend of every other token. Then the MLP reshapes that token on its own. The result is handed to the second block, which runs the same two steps again — but now on representations that already carry context from the first pass. Repeat for every block. By the top of a 100-block stack, each token has been informed by every other token dozens of times over, each pass working at a slightly more abstract level than the last.
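
The walk up the stack can be sketched as a loop. This is a toy under loud assumptions: there are no learned weights, attention uses each vector as its own Query, Key, and Value, the MLP is a fixed ReLU-and-shrink, and layer normalization is left out. The only things it is meant to show are the two residual additions and the fact that the output keeps the input's shape.

```python
import math

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def toy_attention(seq):
    """Every position reads every position. Assumption: Q = K = V = the vector itself."""
    scale = math.sqrt(len(seq[0]))
    out = []
    for q in seq:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in seq])
        out.append([sum(w * v[i] for w, v in zip(weights, seq))
                    for i in range(len(q))])
    return out

def toy_mlp(vec):
    """Per-position non-linearity. Assumption: fixed ReLU-and-shrink, nothing learned."""
    return [0.1 * max(0.0, x) for x in vec]

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def block(seq):
    # attention step, with its input added back (residual)
    seq = [add(x, a) for x, a in zip(seq, toy_attention(seq))]
    # MLP step on each position alone, with its input added back (residual)
    return [add(x, toy_mlp(x)) for x in seq]

def stack(seq, n_blocks):
    """Run the same block n_blocks times, each pass reading the previous pass."""
    for _ in range(n_blocks):
        seq = block(seq)
    return seq

tokens = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]   # hypothetical 2-number vectors
out = stack(tokens, n_blocks=4)
print(out)
```

Because each step adds to its input instead of replacing it, a stack of zero blocks returns the sequence untouched, and every extra block is an edit on top of the last.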

What ends up living at each depth is not designed. It emerges from training. In practice, early blocks tend to resolve local grammar, middle blocks tend to build meaning and relationships, and late blocks assemble the final prediction. Nobody assigns those roles. They fall out of optimizing the same next-token objective across the whole stack.

Each block reworks the sequence using everything the block below it already worked out.

Page 08/Attention

Three shapes.

Encoder · Decoder · Encoder–Decoder. All modern chat LLMs are decoder-only.

Encoder-only

BERT · CLIP text · SigLIP. Reads the whole sequence at once, bidirectional. Outputs one fingerprint. Used for classification and embeddings.

Decoder-only

GPT · LLaMA · Gemini · Claude. Reads left-to-right, predicting the next token; a causal mask blocks it from seeing ahead. Every modern chat LLM lives here.

Encoder–Decoder

T5 · original Transformer · NMT models. Reads input, then writes output. Used historically for translation. Mostly legacy.

READ PATTERN
encoder-only       [tok][tok][tok][tok]   ⟷ all pairs, bidirectional
decoder-only       [tok][tok][tok][___]   ← causal mask, left-only
enc-dec            [tok][tok][tok] → [out][out][out]   cross-attend

The family tree · three branches · the chat models you use are all decoders

All three architectures use the same attention operation. What separates them is what each one is allowed to read and what it produces. An encoder reads the whole sequence at once, with every position free to look both left and right, and outputs a single summary vector, a fingerprint of the input. A decoder reads strictly left to right and writes one token at a time, each new token predicted only from the tokens before it. An encoder-decoder does both in sequence: it encodes the input, then writes a fresh output from that encoding.
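
The three read patterns can be written down as a rule for which positions one position may look at. The function names below are invented for illustration, and the sketch covers only the visibility rule, not the attention arithmetic.

```python
def can_read(shape, reader, target):
    """True if position `reader` may attend to position `target`."""
    if shape == "encoder":
        return True                  # bidirectional: every pair is allowed
    if shape == "decoder":
        # causal: everything to the left, plus the position itself
        return target <= reader
    raise ValueError(f"unknown shape: {shape}")

def read_pattern(shape, length):
    """Grid of 1s and 0s: row = reader, column = target."""
    return [[int(can_read(shape, r, t)) for t in range(length)]
            for r in range(length)]

def encoder_decoder_view(output_position, input_length):
    """What one output position reads in an encoder-decoder."""
    return {
        "input": list(range(input_length)),           # cross-attention: the whole input
        "output": list(range(output_position + 1)),   # causal over what is written so far
    }

for row in read_pattern("decoder", 4):
    print(row)
print(encoder_decoder_view(output_position=1, input_length=3))
```

The decoder grid is a lower triangle: each row gains one more visible position than the row above it.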

Each shape maps onto products you know. Every chat model in 2026 — GPT, Llama, Gemini, Claude — is decoder-only, because holding a conversation is exactly that left-to-right writing task. Every image retriever, such as CLIP or SigLIP, is encoder-only, because retrieval needs one fingerprint per item to compare. Image generators sit outside this table; they are diffusion models that happen to use transformers inside, and they get their own section from Page 22 on.

One misconception is worth naming directly. People often call GPT an encoder-decoder model. It is not, and the difference matters: a decoder-only model has no separate reading stage at all, it simply continues the sequence it is handed. The encoder-decoder shape still exists, but today it lives mostly in older machine-translation systems.

Encoders read the whole input into one fingerprint. Decoders write one token at a time. Every chat model is a decoder.

Section 02 · Part A · 13 pages

language.

Decoder-only transformers, trained to predict the next token. Pretraining is simple and does most of the work. Post-training is where the personality comes from.

Pages 09 → 21 Tokens · embeddings · sampling · post-training · RAG

Page 10/Language

Tokens.

The model's alphabet. Subword chunks chosen by frequency.

PHRASE                   TOKEN SPLIT                       VOCAB IDS
"unbelievable"           [un] [believ] [able]                359 · 40471 · 481
"ChatGPT works"          [Chat] [G] [PT] [ works]            16047 · 38 · 2898 · 4375
"résumé.pdf"             [r] [és] [um] [é] [.pdf]            81 · 7206 · 372 · 978 · 14329
"New York-based"         [New] [ York] [-] [based]           3648 · 4356 · 12 · 3100
"你好"                    [UTF-8 byte] [UTF-8 byte]              224 · 121 · 224 · 165
"🎬"                     [UTF-8 byte] × 4                       240 · 159 · 142 · 172

PRICING SHAPE
1 English word        ≈ 1.3 tokens
1 Chinese character   ≈ 2 tokens (byte-level fallback)
1 emoji               ≈ 4 tokens
1 long URL            ≈ 1 token per chunk between / · - _

Same text · different tokenizer · different split, IDs, and cost.

Real tokenization · 6 examples · the bill is in tokens, not words

A model never sees letters or whole words. Before any text reaches the network, a tokenizer chops it into tokens. A token is a chunk of text: often a common word, often a fragment of a rarer one. The tools that do the chopping have names you will meet in code — BPE, SentencePiece, tiktoken — and each is a recipe for deciding which chunks are frequent enough to deserve their own slot. Common words usually become one token. Rare words break into several pieces. Each token is then converted into an integer ID, which is a position in a fixed vocabulary of roughly 30,000 to 200,000 entries.
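
A toy version of the chopping step makes the idea concrete. This sketch is a stand-in, not BPE: it greedily takes the longest chunk found in a tiny hand-made vocabulary and falls back to raw UTF-8 bytes for anything unknown. The vocabulary holds only the pieces and IDs from the first two rows of the figure, and using the byte value as the fallback ID is a simplification (a real vocabulary reserves separate IDs for bytes).

```python
# Pieces and IDs copied from the figure; everything else is unknown to this toy.
VOCAB = {"un": 359, "believ": 40471, "able": 481,
         "Chat": 16047, "G": 38, "PT": 2898, " works": 4375}
MAX_PIECE = max(len(piece) for piece in VOCAB)

def tokenize(text):
    """Return (piece, id) pairs: greedy longest match, bytes as fallback."""
    out, i = [], 0
    while i < len(text):
        for size in range(min(MAX_PIECE, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in VOCAB:
                out.append((piece, VOCAB[piece]))
                i += size
                break
        else:
            # no known chunk starts here: emit the character's raw UTF-8 bytes
            for b in text[i].encode("utf-8"):
                out.append((f"<0x{b:02X}>", b))
            i += 1
    return out

print(tokenize("unbelievable"))
print(tokenize("ChatGPT works"))
print(tokenize("🎬"))
```

The emoji has no entry in the toy vocabulary, so it comes out as four byte tokens, the same four values shown in the last row of the figure.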

The token is also the unit of money and memory. You are billed per token. The context window — the amount of text a model can hold in view at once — is measured in tokens. The cost of running the model scales with tokens. So every question about "how much fits in the context window" is really a question about token counts, not word counts.

The trap is to treat one token as one word. They are not the same. The string "GPT" is one token in one tokenizer, two in another, and three if the tokenizer falls all the way back to raw bytes, one per letter. When the exact count matters (for a cost estimate, or for fitting a long document), run the real tokenizer instead of guessing from the word count.
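
The gap between a guess and a count can be shown with the standard library alone. The helper names and the price below are hypothetical. The byte count is an upper bound for a byte-level tokenizer (merges can only shrink it), and the 1.3 tokens-per-word figure is the English rule of thumb from the figure, not a measurement.

```python
def byte_fallback_count(text):
    """Upper bound: one token per UTF-8 byte, before any merges."""
    return len(text.encode("utf-8"))

def rough_count_from_words(text, tokens_per_word=1.3):
    """English rule of thumb; a guess, not a measurement."""
    return round(len(text.split()) * tokens_per_word)

def estimate_cost_usd(n_tokens, usd_per_million_tokens):
    """Bill for n_tokens at a hypothetical per-million-token price."""
    return n_tokens * usd_per_million_tokens / 1_000_000

print(byte_fallback_count("cat"))    # 3 bytes
print(byte_fallback_count("你"))     # 3 bytes for one character
print(byte_fallback_count("🎬"))     # 4 bytes for one emoji
print(rough_count_from_words("the cat sat on the mat"))
print(estimate_cost_usd(1_000_000, usd_per_million_tokens=2.0))
```

For a real estimate, swap these helpers for the tokenizer that ships with the model you are calling.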

The model's alphabet is tokens, not words. Your bill is counted in them too.

Page 11/Language

Embeddings.

A map of meaning, learned from co-occurrence.

2D PROJECTION OF A HIGH-DIM EMBEDDING SPACE

 y
 4 │          queen(-0.2, 3.8)        king(0.4, 3.7)
 3 │     woman(-1.4, 2.9)             man(1.0, 2.8)
 2 │
 1 │  nurse(-2.0, 1.1)             doctor(1.7, 1.2)
 0 ┼──────────────────────────────────────────────────  x
   -3       -2        -1         0         1         2

COSINE SIMILARITY  (the standard metric)
cos(king, queen)   = 0.82   ██████████████████  near
cos(king, man)     = 0.79   █████████████████   near
cos(king, doctor)  = 0.41   █████████           middling
cos(king, banana)  = 0.07   █                   unrelated

VECTOR ARITHMETIC
king − man + woman ≈ queen     (the famous toy)
Paris − France + Italy ≈ Rome  (works for many relations)

Neighbors are geometric, not dictionary lookups.

Embedding geometry · learned from co-occurrence · queried via cosine

Recall that each token is an integer ID. The embedding table turns that ID into a vector. The table holds one row per token in the vocabulary, and each row is a list of 768 to 4096 numbers. Looking a token up means reading off its row. That row of numbers is what actually enters the transformer; the integer ID was only an address pointing at it.
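
The lookup itself is nothing more than indexing into a list of rows. In this sketch the table is tiny (1,000 rows of 8 numbers) and filled with small random values, both arbitrary choices for illustration.

```python
import random

def make_embedding_table(vocab_size, dim, seed=0):
    """One row of `dim` numbers per vocabulary entry, random before training."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(vocab_size)]

def embed(token_ids, table):
    """Each integer ID is only an address: read off the row it points at."""
    return [table[token_id] for token_id in token_ids]

table = make_embedding_table(vocab_size=1000, dim=8)
vectors = embed([359, 481, 359], table)   # hypothetical token IDs
print(len(vectors), len(vectors[0]))
```

The same ID always returns the same row, which is why the first and third vectors here are identical.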

Those rows begin as random numbers. During pretraining they are nudged, again and again, so that tokens used in similar contexts end up close together in this space of numbers. Treat each row as a set of coordinates: after training, king and queen land near each other, and far from banana. Nothing inside a vector "knows" what a king is. The closeness is only the accumulated residue of training. You read that closeness with cosine similarity, a measure of the angle between two vectors — a small angle means similar, a near-right angle means unrelated. The figure shows the actual scores.
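
Cosine similarity is short enough to write out. The three-number vectors below are hand-made for illustration (axis one loosely means "royalty", axis two "gender", axis three "fruit"); they are not taken from any trained model and will not reproduce the scores in the figure.

```python
import math

def cosine(a, b):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = right angle."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings.
EMB = {
    "king":   [1.0,  0.3, 0.0],
    "queen":  [1.0, -0.3, 0.0],
    "man":    [0.2,  0.3, 0.0],
    "woman":  [0.2, -0.3, 0.0],
    "banana": [0.0,  0.0, 1.0],
}

def nearest(vec, exclude=()):
    """Word whose embedding makes the smallest angle with `vec`."""
    candidates = [w for w in EMB if w not in exclude]
    return max(candidates, key=lambda w: cosine(vec, EMB[w]))

print(round(cosine(EMB["king"], EMB["queen"]), 2))
print(round(cosine(EMB["king"], EMB["banana"]), 2))

# king - man + woman, one coordinate at a time
target = [k - m + w for k, m, w in zip(EMB["king"], EMB["man"], EMB["woman"])]
print(nearest(target, exclude=("king", "man", "woman")))
```

In this toy space the arithmetic lands on queen by construction. In a trained model the result is only approximate, which is why the figure writes it with ≈.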

This single geometry does a great deal of downstream work. It powers retrieval (the R in RAG, Page 20), clustering (grouping similar items, as in the taste-graph), and cross-modal alignment: CLIP and SigLIP are trained so that an image and its caption land near each other in one shared space. Firth put it more plainly in 1957: you shall know a word by the company it keeps.

Meaning here is a location in space: learned from what a word co-occurs with, measured by the angle between vectors.

Page 12/Language

The stack.

Decoder-only, causal mask, residual. One block, repeated.

INPUT TOKENS  (each one can only see tokens to its left)
[The] [cat] [sat] [on] [the] [___]
                  ↓
              [embed]
                  ↓
           [block 01]   attn + MLP
                  ↓
           [block 02]   attn + MLP
                  ↓
                ···
                  ↓
           [block 48]   attn + MLP
                  ↓
             [unembed]
                  ↓
NEXT-TOKEN DISTRIBUTION
over ~50,000 vocab entries

  [mat]    █████████████  0.41
  [floor]  █████          0.16
  [couch]  ███            0.10
  [table]  ██             0.06
  [.]      █              0.04
  [grass]  ▏              0.03
  ...      49,994 more, near zero

  → sample → append → repeat

One token in, one distribution out · sample, append, repeat

A decoder-only language model is the block from Page 07 with exactly one rule added. When attention runs, each token may look only at tokens that came before it, never at tokens ahead. That restriction is the causal mask. It is what forces the model to work left to right, and it is the only structural difference between this stack and the generic transformer.
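
One common way to apply the mask is to set the blocked scores to negative infinity before the softmax, so that they come out with weight exactly zero. The sketch below assumes that approach and uses made-up scores. As in real implementations, a position is allowed to read itself as well as everything to its left.

```python
import math

def causal_weights(scores):
    """scores[i][j] is the raw score of position i looking at position j."""
    n = len(scores)
    weights = []
    for i in range(n):
        # positions to the right of i are masked out with -inf
        row = [scores[i][j] if j <= i else float("-inf") for j in range(n)]
        top = max(row)                               # finite: j == i is always allowed
        exps = [math.exp(s - top) for s in row]      # exp(-inf) is 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Hypothetical raw scores for a three-token sequence.
raw = [[0.5, 2.0, 1.0],
       [1.0, 0.2, 3.0],
       [0.3, 0.9, 0.1]]
for row in causal_weights(raw):
    print([round(w, 2) for w in row])
```

The first token can only read itself, so its row is all weight on position zero, no matter how high its scores for later tokens were.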

A single forward pass runs end to end. The input tokens are embedded into vectors. Those vectors travel up through every block. At the top, the representation of the last position is projected back out into vocabulary space — turned into one raw score for every possible next token. A softmax converts those scores into a probability distribution. The model samples one token from that distribution, appends it to the sequence, and runs the whole pass again. Embed, climb the stack, project, sample, append, repeat.
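
The sample, append, repeat loop can be sketched with a stand-in for the model. Everything named here is hypothetical: `toy_logits` replaces the whole embed, climb, project path with a lookup on the last token only, whereas a real model conditions on the entire prefix.

```python
import math
import random

VOCAB = ["The", "cat", "sat", "on", "the", "mat", "."]

# Hypothetical stand-in for embed -> blocks -> unembed.
FAVORED_NEXT = {"The": "cat", "cat": "sat", "sat": "on",
                "on": "the", "the": "mat", "mat": "."}

def toy_logits(tokens):
    """One raw score per vocabulary entry, given the sequence so far."""
    favored = FAVORED_NEXT.get(tokens[-1], ".")
    return [4.0 if tok == favored else 0.0 for tok in VOCAB]

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def generate(prompt, max_new_tokens, rng):
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        probs = softmax(toy_logits(tokens))                # one full forward pass
        next_token = rng.choices(VOCAB, weights=probs)[0]  # sample
        tokens.append(next_token)                          # append
        if next_token == ".":                              # stop at end of sentence
            break
    return tokens                                          # otherwise: repeat

result = generate(["The"], max_new_tokens=6, rng=random.Random(0))
print(" ".join(result))
```

Each trip through the loop is a complete pass over the whole sequence, which is why generation cost grows with every token written.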

The model never plans the whole sentence in advance. It commits to one token at a time, then reconsiders everything to choose the next. Reasoning models (o1, Gemini Thinking, Claude's extended thinking) appear to deliberate, and they do generate long chains of intermediate reasoning before the final answer. But that reasoning is itself produced one left-to-right token at a time. The mechanism underneath does not change.

The same block, dozens of times, with one rule added: each token may read only what came before it.

Page 13/Language

Predict the next.

The entire pretraining objective. The document is the label.

TRAINING ROWS — one document, every position a label

prefix                                  correct next   P(model)   loss
─────────────────────────────────────────────────────────────────────
"The cat sat on the"                    "mat"          0.41       0.89
"Paris is the capital of"               "France"       0.72       0.33
"for i in range("                       "10"           0.18       1.71
"She opened the door and"               "saw"          0.09       2.41
"def fibonacci(n):"                     "\n"           0.81       0.21
"The patient presents with"             "shortness"    0.04       3.22

   loss = −log P(correct next token)

SCALE
~15 trillion tokens                  one frontier-scale training corpus
~1 epoch                             frontier models barely see data twice
no human "this is correct" labels    the document IS the label
loss falls predictably               see Page 14 (scaling laws)

Cross-entropy on the next token · no human labels · trillion-token scale

The whole of pretraining is one task, repeated across trillions of tokens. Show the model a prefix of text. Ask it for a probability distribution over what the next token should be. Then compare its answer against the token that actually came next in the source document. The penalty for being wrong is the cross-entropy loss, which in plain terms measures how little probability the model placed on the correct next token: the less it bet on the right answer, the larger the penalty. No human writes an answer key. The document is its own answer key, because the next word is simply whatever the author wrote.
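
The penalty is a one-line formula, and the loss column in the figure can be recomputed from its probability column. The sketch assumes the natural logarithm, which is what the figure's numbers correspond to.

```python
import math

def next_token_loss(p_correct):
    """Cross-entropy for one position: -log of the probability given to the right token."""
    return -math.log(p_correct)

# (correct next token, probability the model gave it), copied from the figure
rows = [("mat", 0.41), ("France", 0.72), ("10", 0.18),
        ("saw", 0.09), ("\\n", 0.81), ("shortness", 0.04)]

losses = [round(next_token_loss(p), 2) for _, p in rows]
print(losses)   # [0.89, 0.33, 1.71, 2.41, 0.21, 3.22]

# Training minimizes the average of this number over every position.
mean_loss = sum(next_token_loss(p) for _, p in rows) / len(rows)
print(round(mean_loss, 2))
```

A probability of 1.0 on the correct token would give a loss of zero. Halving the probability always adds the same fixed amount, about 0.69.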

A narrow task, but it forces broad ability. To predict the next token well, the model has no choice but to absorb the things that make text predictable in the first place: grammar, meaning, facts about the world, the rules of code, the steps of arithmetic, the shape of a dialogue. None of these is taught directly. Each emerges because it is the cheapest way to lower the loss. A model that has quietly learned how a country relates to its capital will predict "France" after "Paris is the capital of" far more reliably than one that has not.

This same objective is where hallucination begins. The model is rewarded for producing a plausible continuation, not a true one. Sometimes plausible and true are the same sentence. Often they are not. A confident, well-formed, false statement still scores well during training, as long as it resembles something the data might contain. The model did not memorize the internet. It learned the statistical structure of it, and structure can be convincingly faked.

Broad knowledge is a side effect of one narrow drill: predict the next token well enough, often enough.

Page 14/Language

Scaling laws.

Loss falls predictably with compute, params, and data.

LOG LOSS vs LOG COMPUTE  (Kaplan 2020 · Hoffmann 2022)

  loss
   3.2 │ *  GPT-2 (2019)
   2.8 │       *  GPT-3 (2020, undertrained)
   2.4 │             *  Chinchilla 70B (compute-optimal)
   2.0 │                   *  LLaMA-2 70B
   1.6 │                         *  GPT-4 class
   1.2 │                               *  ?
       └─────────────────────────────────────────────
        10²¹      10²²      10²³      10²⁴      10²⁵  FLOPs

CHINCHILLA RULE
   compute-optimal training:  ~20 tokens per parameter
   70B model → 1.4T tokens; not 300B (which is what GPT-3 did)

THE FORECAST
   loss ∝ compute^(−α)    α ≈ 0.07 for language
   smooth · predictable · not automatic AGI

Power law · forecast with error bars · the curve frontier labs spend hundreds of millions to ride

Two papers established the result. Kaplan et al. 2020 (arXiv:2001.08361) and Hoffmann et al. 2022 (arXiv:2203.15556, the Chinchilla paper) showed that training loss falls in a predictable way as you increase three things: compute (the total arithmetic spent training, measured in FLOPs), the number of parameters (the model's adjustable weights), and the amount of training data. The relationship is a power law: plot loss against compute on logarithmic axes and the points fall along a straight line, as the figure shows. Chinchilla added a correction to the recipe. For a fixed compute budget, you do better with a smaller model trained on more data than the field assumed in 2020 — about 20 tokens of training data per parameter. By that rule, a 70-billion-parameter model wants roughly 1.4 trillion tokens, not the 300 billion that GPT-3 was trained on.
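
The Chinchilla arithmetic is simple enough to check by hand or in code. The 20-tokens-per-parameter figure is the rule of thumb from the text. The compute estimate uses the common approximation of 6 FLOPs per parameter per training token, which is an addition here and not something this page states.

```python
TOKENS_PER_PARAM = 20          # Chinchilla rule of thumb from the text

def compute_optimal_tokens(n_params):
    """Training tokens a model of this size 'wants' under the rule of thumb."""
    return TOKENS_PER_PARAM * n_params

def approx_training_flops(n_params, n_tokens):
    """Common approximation: about 6 FLOPs per parameter per token (an assumption)."""
    return 6 * n_params * n_tokens

params = 70_000_000_000
tokens = compute_optimal_tokens(params)
flops = approx_training_flops(params, tokens)

print(f"{tokens:.2e} tokens")                 # 1.40e+12
print(f"{flops:.2e} FLOPs")                   # 5.88e+23
print(round(tokens / 300_000_000_000, 1))     # multiple of GPT-3's 300B tokens
```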

The practical payoff is foresight. You do not have to guess whether a model ten times larger will be better. You fit the curve on smaller runs and extrapolate it. This is why a lab will commit a hundred million dollars to a single training run with confidence: the outcome is a forecast with error bars, not a gamble.
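
Fitting the curve and extrapolating it amounts to a straight-line fit on logarithmic axes. The data points below are synthetic, generated from a made-up power law with the exponent 0.07 from the figure, so the fit recovers it exactly. Real runs are noisy, and a real forecast would carry error bars that this sketch does not compute.

```python
import math

def fit_power_law(computes, losses):
    """Least-squares line through (log compute, log loss). Returns (a, alpha)
    for the model loss = a * compute ** (-alpha)."""
    xs = [math.log(c) for c in computes]
    ys = [math.log(l) for l in losses]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope

def predict(a, alpha, compute):
    return a * compute ** (-alpha)

# Synthetic "small runs" (made-up numbers, not measurements).
small_runs = [1e18, 1e19, 1e20, 1e21]
observed = [80.0 * c ** -0.07 for c in small_runs]

a, alpha = fit_power_law(small_runs, observed)
print(round(alpha, 3))
print(round(predict(a, alpha, 1e22), 3))   # forecast for a run 10x larger than any fitted
```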

One caution keeps this honest. The laws predict loss, not capability. Loss falls smoothly, but specific abilities sometimes appear in jumps the smooth curve never foreshadowed. So the claim that "scaling laws mean AGI by 2027" reads more into the curve than it actually says. The curve forecasts a number going down, not a particular skill arriving on a date.

Spend more compute and the loss falls along a curve you can predict before you spend it.

Page 15/Language

Sampling.

Temperature, top-p, top-k. How a distribution becomes a word.

NEXT-TOKEN DISTRIBUTION AT THREE TEMPERATURES
prompt: "The cat sat on the ___"

T = 0.0  deterministic · always pick the argmax
  mat      ██████████████████████████████  1.00
  floor    .
  couch    .

T = 0.7  sharper · the chat default
  mat      ██████████████████              0.58
  floor    ██████                          0.22
  couch    ████                            0.12
  rug      ██                              0.08

T = 1.5  flatter · "creative"
  mat      █████████                       0.31
  floor    ███████                         0.24
  couch    █████                           0.18
  rug      ████                            0.14
  table    ███                             0.13

top-p (nucleus): keep smallest set where cumulative P ≥ 0.9
top-k: keep top k candidates · cruder than top-p

Three temperatures · same model · three different collaborators

Page 12 left the model holding a probability distribution over the entire vocabulary at each step. Sampling is the rule that turns that distribution into one chosen token. Three knobs shape the choice.

Temperature rescales the raw scores (the logits) before the softmax turns them into probabilities: each score is divided by the temperature. At temperature zero the model is deterministic: it always takes the single highest-probability token, the argmax. Between zero and one the distribution sharpens toward that top token. At temperature one it samples honestly from the distribution as the model reported it. Above one, the distribution flattens toward uniform, so unlikely tokens get a real chance. Top-p, also called nucleus sampling (Holtzman et al. 2019, arXiv:1904.09751), takes a different cut: it keeps only the smallest set of top tokens whose probabilities add up to at least p, then samples within that set. Top-k is the blunter cousin: keep the k most likely tokens and discard the rest.
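
All three knobs fit in a short sketch. The logits are hypothetical, the function names are invented for illustration, and temperature zero is handled as a special case (pure argmax) because dividing by zero is undefined. Production samplers differ in details such as the order in which the filters are applied.

```python
import math
import random

def with_temperature(logits, temperature):
    """Logits -> probabilities. Temperature 0 is treated as pure argmax."""
    if temperature == 0:
        best = logits.index(max(logits))
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [l / temperature for l in logits]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def renormalize(probs):
    total = sum(probs)
    return [p / total for p in probs]

def top_k(probs, k):
    """Keep the k most likely tokens, zero out the rest."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return renormalize([p if i in keep else 0.0 for i, p in enumerate(probs)])

def top_p(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return renormalize([q if i in keep else 0.0 for i, q in enumerate(probs)])

def sample(probs, rng):
    return rng.choices(range(len(probs)), weights=probs)[0]

logits = [2.0, 1.0, 0.5, 0.1]                    # hypothetical raw scores
cold = with_temperature(logits, 0.7)
hot = with_temperature(logits, 1.5)
nucleus = top_p([0.58, 0.22, 0.12, 0.08], 0.9)   # the T = 0.7 row of the figure
choice = sample(top_k(hot, 2), random.Random(0))
print([round(x, 2) for x in cold], [round(x, 2) for x in hot])
print([round(x, 2) for x in nucleus], choice)
```

With the figure's T = 0.7 distribution and p = 0.9, the nucleus keeps mat, floor, and couch (cumulative 0.92) and drops rug.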

Sampling is the largest behavioral lever you have without retraining anything. The same model at temperature 0.2 and at temperature 1.2 can feel like two different collaborators. CoWriter's peer-pushback voice is part system prompt, part temperature. The figure makes one warning visible: higher temperature is more random, which is not the same as more creative. As temperature rises, so does the rate of hallucination, because you are deliberately giving low-probability, often wrong, tokens more room to win.

Temperature zero always picks the most likely next token. Temperature one rolls dice weighted by the model's probabilities.

Page 16/Language·Keystone

Hallucination.

What an ungrounded next-token predictor does by default.

OBJECTIVE MISMATCH

   input claim type           plausible?    true?       outcome
   ──────────────────────────────────────────────────────────────────
   real fact                  yes           yes         useful answer
   urban myth · stereotype    yes           no          hallucination
   fabricated citation        yes           no          fake authority
   plausible URL              yes           no          dead link
   "I don't know"             no            yes         often under-rewarded
   retrieved fact + cite      yes           grounded    safer answer

WHY IT HAPPENS
   training objective         →  rewards plausibility
   production needs           →  reward truth
   the gap                    →  hallucination

MITIGATIONS  (architectural · not training)
   retrieval (RAG)            →  ground in fetched context
   tool use                   →  ground in execution results
   citations / structured     →  ground in source spans
   verifier models            →  filter the worst cases

The framing keystone · everything downstream of here is grounding

Put Page 13's lesson in one line and keep it in view: the model is doing exactly what it was trained to do. Training rewards a plausible next token. The person reading the output wants a true one. Hallucination is not a malfunction; it is the name for the distance between those two targets. A model with no access to a source of truth has no way to prefer the true completion over the merely plausible one.

Grounding is the general fix. It means giving the model an external source of truth to lean on at the moment it answers: text retrieved from a database, the result of a tool it ran, a document quoted with citations, a structured record. Without grounding, the model is a fluent autocompleter running past the edge of what it reliably knows. With grounding, its job changes — from inventing a plausible answer to composing one over context it has actually been handed.

This is why every shipping AI product has a grounding layer of some kind. Mechanically that layer is usually RAG (Page 20); in the interface it usually appears as citations. A bigger model lowers the hallucination rate but never removes it: the training objective underneath has not changed. The durable fix is architectural, not a question of scale.

Past the edge of what it knows, the model still writes fluent sentences. Fluency was never knowledge.

Page 17/Language

SFT.

Supervised fine-tuning. Thousands of hand-written (instruction → response) pairs.

PRETRAIN
   raw web text → predict next token
   no instructions · no answer key besides "what comes next"

   model can continue text · cannot follow requests

SFT DATASET
   ┌─ INSTRUCTION ─────────────────────────────────────────┐
   │ Explain backprop to a high-schooler in 3 sentences.   │
   ├─ IDEAL RESPONSE  (hand-written, expert-reviewed) ─────┤
   │ Start with a guess. Measure how wrong it is.          │
   │ Push every knob a tiny bit in the direction           │
   │ that would reduce the error next time. Repeat.        │
   └───────────────────────────────────────────────────────┘

   × ~50,000 to ~1M pairs   (lawyers · doctors · coders · domain experts)
   same next-token loss · different data distribution

   → model can now answer the asked question

Direct human labor at scale · the first step of post-training

A freshly pretrained model is fluent but not yet helpful. Hand it a question and it is about as likely to continue the question, or list similar questions, as to answer it, because continuing text is the only thing it has ever been trained to do. Supervised fine-tuning, or SFT, closes that gap with examples. People write paired items by the tens of thousands (the figure's range runs from about 50,000 to about a million): an instruction, and the ideal response to it. The model is then trained on those pairs until answering, rather than merely continuing, becomes the natural thing for it to do.

Who writes the pairs depends on the domain. General instructions can come from contractors. Specialized ones need experts — lawyers drafting legal questions and answers, doctors writing clinical examples. The training itself uses the very same next-token loss from pretraining; only the data has changed, from raw web text to this curated set of instruction-and-response pairs. (Fine-tuning simply means continuing to train an already-trained model on new data.)
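
To make "same loss, different data" concrete, here is a minimal sketch. The function name `next_token_loss` is hypothetical and the probabilities are made up for illustration; a real training run would get them from the model.

```python
import math

def next_token_loss(probs_of_correct_token):
    """Average negative log-likelihood of the tokens that actually came next.

    Each entry is the probability the model gave to the correct next token.
    The same function scores pretraining text and SFT pairs alike.
    """
    total = sum(math.log(p) for p in probs_of_correct_token)
    return -total / len(probs_of_correct_token)

# Made-up probabilities, for illustration only.
web_text_probs = [0.50, 0.25, 0.80]       # tokens of a raw web-text continuation
sft_response_probs = [0.10, 0.05, 0.20]   # tokens of a hand-written ideal response

pretrain_loss = next_token_loss(web_text_probs)
sft_loss = next_token_loss(sft_response_probs)
```

In this toy case the loss on the ideal response starts out higher, because a base model does not yet find answering the likely continuation. SFT is the process of driving that number down.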

SFT is step one of every commercial model's post-training stack, and the personality you experience starts to form here. A product like CoWriter does not perform SFT. That happens inside a frontier lab on thousands of GPUs. CoWriter works at a different layer — prompt-level steering on top of an already fine-tuned model — which the next pages keep separate from training.

Pretraining teaches the model to speak. SFT teaches it to answer.

Page 18/Language

RLHF · DPO.

The taste layer. Humans rank pairs of outputs; the model learns what got picked.

PROMPT
   "Explain quantum tunneling to a curious adult."

TWO MODEL OUTPUTS
   A   concise · accurate · grounded analogy            human picks ✓
   B   verbose · shaky metaphor · drifts into politics  human rejects

RLHF PATH  (Christiano 2017)
   (A > B) pairs  →  reward model  →  PPO update  →  KL penalty
                     learns to predict     policy gradient   stay near SFT

DPO PATH  (Rafailov 2023)
   (A > B) pairs  ─────────────→  direct preference loss  →  policy update
                                  skip the reward model

   same signal · same data · two ways to wire it up
   chosen response should become more likely than rejected response

Two paths from human preferences to model behavior · DPO is increasingly the default

Christiano et al. 2017 (arXiv:1706.03741) was the first work to scale this idea. The setup: show human raters two of the model's outputs for the same prompt and ask which is better. Thousands of raters, millions of these comparisons. A separate network, the reward model, is trained to predict which output a human would prefer. The language model is then adjusted to produce outputs the reward model scores highly. That adjustment uses reinforcement learning (typically an algorithm called PPO), held in check by a KL penalty — a term that punishes the model for drifting too far from its SFT starting point, so it improves on preferences without forgetting how to write.
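
The two pieces of that pipeline can be sketched in a few lines. This assumes a common formulation (a pairwise logistic loss for the reward model, and a reward reduced by a scaled log-probability gap as the KL term); the function names and the `kl_coef` value are illustrative, not taken from the paper.

```python
import math

def reward_model_loss(score_chosen, score_rejected):
    """Pairwise loss for the reward model.

    Small when the output the human picked scores higher than the one
    they rejected; large when the scores are the wrong way round.
    """
    gap = score_chosen - score_rejected
    return math.log(1.0 + math.exp(-gap))   # equals -log(sigmoid(gap))

def kl_penalized_reward(reward, logp_policy, logp_sft, kl_coef=0.1):
    """Reward the policy is trained on: the reward model's score minus a
    penalty for assigning this output a different log-probability than
    the SFT model did (a per-sample estimate of the KL term)."""
    return reward - kl_coef * (logp_policy - logp_sft)
```

With equal scores the reward-model loss is log 2; it falls as the chosen output pulls ahead. The penalty term grows as the tuned model drifts from its SFT starting point, which is what keeps it from forgetting how to write.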

Rafailov et al. 2023 (arXiv:2305.18290) showed you can reach the same place more directly. Their method, DPO, uses the identical preference data — the same pairs of chosen-over-rejected outputs — but skips building a separate reward model, optimizing the language model straight from the comparisons. Cheaper, simpler, and increasingly the default at smaller labs.
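
The DPO objective for a single preference pair fits in one function. This is a sketch of the loss as the paper defines it; the argument names and the `beta` default are illustrative choices, and the log-probabilities would come from the model being tuned and a frozen reference copy (typically the SFT model).

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) pair.

    Each margin measures how much more likely the tuned model makes an
    output than the reference model does. The loss falls as the chosen
    output's margin grows past the rejected output's margin.
    """
    chosen_margin = logp_chosen - ref_logp_chosen
    rejected_margin = logp_rejected - ref_logp_rejected
    logits = beta * (chosen_margin - rejected_margin)
    return math.log(1.0 + math.exp(-logits))   # equals -log(sigmoid(logits))
```

No reward model appears anywhere: the comparison between the tuned model and the reference plays that role.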

This stage, not pretraining, is where "Claude's personality" or "GPT's voice" lives. The base model speaks in the averaged voice of its training text. Post-training is what gives the assistant a consistent, recognizable character. It also carries a bias by construction: the model is tuned toward the preferences of the particular pool of raters who produced the rankings. Whose preferences count is itself an alignment decision.

Humans rank pairs of answers. The model is tuned toward whichever one they keep choosing.

Page 19/Language

Constitutional.

The model critiques itself against written principles.

DRAFT
   "Here is a direct, risky answer to the user's question..."

CONSTITUTION CHECK  (the model self-grades)
   [helpful]    yes
   [honest]     partly · overstates certainty
   [harmless]   fail · enables misuse of the technique

SELF-CRITIQUE
   "The response should refuse the harmful operational detail
    and redirect toward the safety concept the user is curious about."

REVISED RESPONSE
   "I can't help with that specific technique. The underlying
    concept is X — here's how the safety community thinks about it..."

   (revised > draft)  →  becomes a preference pair  →  trains the model
   constitution is human-written · critique loop is model-driven

Anthropic · Bai et al. 2022 · arXiv:2212.08073

Constitutional AI (Bai et al. 2022, arXiv:2212.08073) replaces some of that human ranking labor with model self-critique. The fixed reference is a constitution: a short written list of principles the answer should satisfy, such as being helpful, honest, and harmless. The loop runs like this. The model drafts a response. It then grades that draft against the constitution and writes a revision that scores better. The pair — weaker draft, stronger revision — becomes a new piece of the preference training data Page 18 needed, but generated without a human in the loop. RLAIF carries the same substitution into the ranking step, letting the model stand in for the human rater.
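
The loop can be sketched as follows, following this page's simplified description. `draft_fn`, `critique_fn`, and `revise_fn` are hypothetical stand-ins for calls to the same language model, not a real API; the lambdas below only make the sketch runnable.

```python
def make_preference_pair(prompt, draft_fn, critique_fn, revise_fn, principles):
    """One pass of draft -> critique -> revise, returned as a preference pair."""
    draft = draft_fn(prompt)
    critique = critique_fn(draft, principles)   # grade the draft against the principles
    revised = revise_fn(draft, critique)        # rewrite the draft to address the critique
    # The revision is treated as preferred over the original draft.
    return {"prompt": prompt, "chosen": revised, "rejected": draft}

# Toy stand-ins so the sketch runs without a model.
principles = ["helpful", "honest", "harmless"]
pair = make_preference_pair(
    "How does X work?",
    draft_fn=lambda p: "risky draft",
    critique_fn=lambda d, ps: "fails: harmless",
    revise_fn=lambda d, c: "safer revision",
    principles=principles,
)
```

The returned pair has the same shape as a human-ranked comparison from Page 18, which is why it can feed the same training step.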

Humans write the constitution once; the repeated critique-and-revise work is done by the model. That shifts the cost of alignment from "thousands of raters" toward "compute," and compute scales on a very different curve than hiring does.

Claude is the most visible product built this way. The constitution is not a safety guarantee but a steering document, and a model can follow it imperfectly. Steering also has side effects: it cuts some failure modes, such as overt harmful output, and creates others, such as over-refusing harmless requests or echoing the constitution's tone too eagerly.

Humans write the principles once. The model applies them to its own drafts, at scale.

Page 20/Language

RAG.

Retrieval-Augmented Generation — the architecture under most production AI.

USER QUERY
"How does CoWriter handle pushback?"
              ↓
        [embed query]
        → 768-dim vector
              ↓
   ┌──────────────────────┐
   │  vector DB (Chroma)  │
   │  ~10K+ chunks        │
   └──────────┬───────────┘
              ↓  cosine search

TOP-K RETRIEVED CHUNKS
0.84  voice.md   "pushback..."
0.77  rules.md   "cite princ..."
0.63  reviews.md "direct..."
0.41  voice.md   "no openers..."
0.38  setup.md   "system pro..."
              ↓
ASSEMBLED PROMPT
system + query + chunks + citations
              ↓
   ┌──────────────────────┐
   │   LLM answers over   │
   │   fetched context    │
   │   not from memory    │
   └──────────────────────┘

Notion AI · Glean · Perplexity · CoWriter · Mosaic all live here

There are three ways to give a model knowledge it did not learn in training. Prompting pastes the relevant document straight into the chat: cheap, but it lasts only as long as that conversation and competes for the limited context window. Fine-tuning trains the model further on your domain data: permanent and expensive, and it teaches style reliably but specific facts unreliably. RAG — Retrieval-Augmented Generation — takes a third route. Ahead of time, it splits your corpus (your whole body of documents) into chunks and embeds each chunk as a vector, using the same geometry from Page 11. At query time it embeds the question too, finds the chunks whose vectors sit closest, and hands those chunks to the model alongside the question.

The figure traces it: the query becomes a vector, a cosine search against a vector database returns the few closest chunks, and those chunks go into the prompt with their citations before the model answers. This is what most shipping enterprise AI looks like underneath — Notion AI, Glean, Perplexity, nearly every internal documentation chatbot. CoWriter is RAG over screenwriting principles plus Bradley's own writing samples. Mosaic is RAG for images, using SigLIP embeddings plus tag vectors in place of text.
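
The retrieval step in the figure reduces to a cosine ranking plus prompt assembly. This is a toy sketch: the 3-dimensional vectors stand in for real 768-dimensional embeddings, the chunk texts are placeholders, and the function names are hypothetical. A production system would use an embedding model and a vector database instead of a Python list.

```python
import math

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_k(query_vec, chunks, k=2):
    """Return the k (score, chunk) pairs closest to the query."""
    scored = [(cosine(query_vec, c["vec"]), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

def assemble_prompt(system, query, hits):
    """Put the retrieved chunks, tagged with their source, ahead of the question."""
    context = "\n".join(f"[{c['source']}] {c['text']}" for _, c in hits)
    return f"{system}\n\nContext:\n{context}\n\nQuestion: {query}"

# Toy corpus with made-up vectors.
chunks = [
    {"source": "voice.md", "text": "pushback ...", "vec": [0.9, 0.1, 0.0]},
    {"source": "rules.md", "text": "cite principles ...", "vec": [0.6, 0.6, 0.1]},
    {"source": "setup.md", "text": "system prompt ...", "vec": [0.0, 0.2, 0.9]},
]
query_vec = [1.0, 0.2, 0.0]
hits = top_k(query_vec, chunks, k=2)
prompt = assemble_prompt("Answer only from the context.", "How is pushback handled?", hits)
```

The source tags in the assembled prompt are what later surface as citations. Anything that did not make the top k never reaches the model at all.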

RAG cuts hallucination sharply on grounded questions, but it does not erase it. The model can still misquote a retrieved chunk, or claim with confidence that a chunk says something it does not. And the answer can only be as good as what retrieval surfaced: if the search returns the wrong chunks, the model has nothing true to work from. Retrieval quality matters as much as model quality.

Retrieval finds the relevant text. The model answers only over what it was handed.

Page 21/Language·Summary

LLM, in one breath.

Pretrain · post-train · sample · ground. Four stages, four levers.

Pretrain     next-token · trillion tokens · lab
Post-train   SFT · RLHF/DPO · constitution · lab
Sample       temperature · top-p · top-k · yours
Ground       prompt · fine-tune · RAG · tools · yours

WHO OWNS WHAT
   lab (frontier)        you (shipping)
   ───────────────       ──────────────
   pretrain              sampling settings
   post-train            grounding pipeline
                         system prompts
                         tools / retrieval / verifiers

   Hallucination is default.  Personality is post-training.
   Knowledge is retrieval.     Capability is scale.

Four stages · own them in order · the half you build on is the half you control

The whole language section reduces to four stages, in order. Pretraining gives the model fluency, by predicting the next token across the internet at scale. Post-training gives it judgment, by ranking pairs of its own outputs and tuning toward the better one. Sampling is the runtime knob that makes one fixed model behave like a cautious clerk at low temperature or a loose brainstormer at high. Grounding is what makes the output trustworthy, by tying it to a real source.
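
The temperature knob is small enough to show whole. A minimal sketch, with made-up scores for three candidate tokens: the model's raw scores are divided by the temperature before being turned into probabilities.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw token scores into probabilities, sharpened or flattened by temperature."""
    scaled = [x / temperature for x in logits]
    top = max(scaled)                              # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]                           # made-up scores for three candidate tokens
cautious = softmax_with_temperature(logits, 0.5)   # low temperature: favourite dominates
loose = softmax_with_temperature(logits, 2.0)      # high temperature: long shots get a chance
```

The model and its scores are identical in both calls. Only the spread of the probabilities changes, which is why one fixed model can behave like a clerk or a brainstormer.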

The split that matters for a builder runs down the middle of those four. The first two, pretraining and post-training, happen inside a lab on thousands of GPUs and are effectively fixed for you. The last two, sampling and grounding, are yours. When you ship a product on top of someone else's model, the surface you actually control is the sampling settings you choose and the grounding pipeline you build around them.

Four things to carry out of this section. Hallucination is the default behavior, not a defect to patch later. Personality is installed in post-training; the base model has none of its own. Trustworthy knowledge comes from retrieval, since the weights will otherwise improvise. And capability tracks scale, the one lever a builder does not hold.

The lab owns the first half. You own the second.

Section 03 · Part B · 13 pages

generation.

Noise becomes signal, in steps. A network learns to undo noise; compressing the image first made it cheap enough to run on a laptop.

Pages 22 → 34 Diffusion · VAE · conditioning · DiT · MM-DiT · flow · CFG · samplers

Page 23/Generation

Diffusion.

Noise becomes signal, in steps. Train a network to predict noise; iteratively subtract.

FORWARD  (training, fixed math)
add a small amount of noise each step

 t=0     t=300    t=600    t=1000
┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐
│▣▣▣▣▣│→│▓▣▓▣▣│→│▒▓▒▓▒│→│░░░░░│
│▣▣▣▣▣│ │▣▓▣▓▣│ │▓▒▓▒▓│ │░░░░░│
│▣▣▣▣▣│ │▓▣▓▣▓│ │▒▓▒▓▒│ │░░░░░│
└─────┘ └─────┘ └─────┘ └─────┘
clean   noisy   noisier  pure

REVERSE  (inference, learned)
predict the noise, subtract it

 t=1000  t=600    t=300    t=0
┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐
│░░░░░│→│▒▓▒▓▒│→│▓▣▓▣▣│→│▣▣▣▣▣│
│░░░░░│ │▓▒▓▒▓│ │▣▓▣▓▣│ │▣▣▣▣▣│
│░░░░░│ │▒▓▒▓▒│ │▓▣▓▣▓│ │▣▣▣▣▣│
└─────┘ └─────┘ └─────┘ └─────┘
noise   less     almost   image

network's only job:
predict noise at each step

Two processes · only the reverse is learned · noise is the starting point

Diffusion is trained in two directions, and only one of them is learned. The forward direction is pure bookkeeping: take a clean image and add a little random noise, then add a little more, and repeat for hundreds of steps until nothing is left but static. (The noise is Gaussian, drawn from the familiar bell curve.) This direction is fixed math; nothing is trained. The reverse direction is the network's job. At each step it is shown a noisier image and trained to predict the exact noise that was added to produce it. Once it can do that reliably, you run it the other way: start from pure static, predict the noise, subtract it, repeat — and a coherent image emerges.
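
Both directions can be sketched on a four-number stand-in for an image. This assumes the standard DDPM-style closed form, which this page does not spell out: a single value `alpha_bar` between 0 and 1 sets how much signal survives at a given step. The function names are hypothetical, and the true noise is passed in where a trained network's prediction would go.

```python
import math
import random

def add_noise(x0, alpha_bar, rng):
    """Forward process in closed form: blend the clean signal with Gaussian noise.

    alpha_bar near 1 leaves the signal almost clean; near 0 leaves almost pure static.
    """
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * n
          for x, n in zip(x0, noise)]
    return xt, noise

def remove_predicted_noise(xt, predicted_noise, alpha_bar):
    """Undo the blend above, given a prediction of the noise that was added."""
    return [(x - math.sqrt(1.0 - alpha_bar) * n) / math.sqrt(alpha_bar)
            for x, n in zip(xt, predicted_noise)]

rng = random.Random(0)
x0 = [1.0, -1.0, 0.5, 0.0]                    # stand-in for a clean image
xt, true_noise = add_noise(x0, alpha_bar=0.3, rng=rng)

# A perfect network would predict true_noise exactly, so it is passed in directly.
recovered = remove_predicted_noise(xt, true_noise, alpha_bar=0.3)
```

Training compares the network's prediction with `true_noise` and nudges the weights to shrink the gap. A real sampler does not jump to the answer in one move as this sketch does; it takes many small steps because the network's predictions are imperfect.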

It replaced an older approach. From 2014 to 2021, the leading image generators were GANs, which pit two networks against each other, one generating and one judging. That contest can produce sharp results but is notoriously unstable: it can collapse to a single repeated output, and it is exquisitely sensitive to settings. Diffusion won out because predicting noise is a stable regression problem (fit a number) rather than an adversarial game (beat an opponent). Dhariwal & Nichol 2021 (arXiv:2105.05233) showed diffusion beating GANs on FID, the standard score for how close generated images are to real ones, where lower is better.

Diffusion is not cleaning up a real but damaged photo. There is no hidden original being recovered. The model has learned to produce an image that is consistent with the noise-removal process it was trained on, and that consistency is what makes the output look real. The starting static is not corruption to be fixed; it is the raw material the image is built out of.

The network's only job is to predict the noise. Run it in reverse, step by step, and an image appears.

Page 24/Generation

Latent space.

Compress first. Then diffuse. The single optimization that put image gen on a laptop.

PIXEL SPACE  (humans see)
┌──────────────────────┐
│  1024 × 1024 × 3     │
│  = 3,145,728 numbers │
└──────────┬───────────┘
           ↓
      VAE ENCODE

LATENT SPACE  (diffusion runs here)
┌──────────────────────┐
│  64 × 64 × 4         │
│  = 16,384 numbers    │
└──────────┬───────────┘
           ↓
   ┌────────────────┐
   │   diffusion    │  ← all the work
   │   happens      │     happens here
   │   here         │
   └────────┬───────┘
            ↓
      VAE DECODE

PIXEL SPACE  (back to humans)
┌──────────────────────┐
│  1024 × 1024 × 3     │
└──────────────────────┘

COMPRESSION  3,145,728 / 16,384
             = 192× smaller

SAME FOR     SD · SDXL · SD3
             Flux · Veo · Sora

Rombach 2022 · arXiv:2112.10752 · 192× is why this fits on consumer hardware

Running diffusion directly on pixels is brutally expensive. A 1024×1024 color image is 1024 × 1024 × 3 ≈ three million numbers. The network has to process all of them, and it has to do so at every one of hundreds of denoising steps. That is a datacenter-scale job, not a laptop one.

Rombach et al. 2022 (arXiv:2112.10752, the Stable Diffusion paper) solved it by adding a compression stage. They trained a Variational Autoencoder — a VAE, meaning a network whose encoder squeezes an image down to a small representation and whose decoder expands it back. The squeezed version is the latent, typically 64×64×4 or 128×128×16 numbers (the last figure counts channels, the latent's equivalent of color planes). All of the diffusion now happens in this small latent space. Only at the very end does the VAE's decoder turn the finished latent back into full-resolution pixels.

Compare the two sizes the figure shows: 3,145,728 pixel numbers against 16,384 latent numbers, a 192× reduction. The exact ratio depends on the VAE's downsampling factor and channel count (Stable Diffusion 1, for instance, maps a 512×512×3 image to a 64×64×4 latent, a 48× reduction), but the principle holds. That compression is the main reason image generation runs on consumer hardware at all, and it is used across modern latent-diffusion image and video generators: Stable Diffusion, Flux, SD3, Veo, Sora. The VAE is also not lossless. When a generated image has strange small-scale artifacts, the culprit is often the VAE failing to reconstruct fine detail, not the diffusion model getting the picture wrong.
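
The arithmetic behind the figure, as a short check. The helper name `count_numbers` is hypothetical; the sizes are the ones the figure uses.

```python
def count_numbers(height, width, channels):
    """How many numbers a height x width x channels grid holds."""
    return height * width * channels

pixels = count_numbers(1024, 1024, 3)   # 3,145,728
latent = count_numbers(64, 64, 4)       # 16,384
ratio = pixels / latent                 # 192.0
```

Every denoising step touches `latent` numbers instead of `pixels` numbers, and that saving repeats at each step of the run.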

Run the slow part in a space 192× smaller, then let the VAE expand the result back to pixels.

Page 25/Generation

Text encoders.

The prompt becomes a vector sequence. SD1 used one encoder. SD3 uses three; Flux uses two.

CLIP-L · CLIP-G

Trained on image-text pairs. Strong on visual concepts. Weak on counting and grammar.

T5-XXL

Trained on language alone. Strong on composition, counting, text-in-image. Slower.

Combined

SD3 ships all three; Flux pairs CLIP-L with T5-XXL. Stronger prompt adherence than any single encoder.

PROMPT     "A red wine bottle next to two gold candlesticks on velvet"
                │
                ▼
   ┌──────────┴──────────┐
   ▼          ▼          ▼
 CLIP-L     CLIP-G    T5-XXL
 (768-d)    (1280-d)  (4096-d)
   │          │          │
   └──────────┼──────────┘
              ▼
   concatenated vector sequence → fed to diffusion model

The conditioning stack in 2026 · three frozen encoders · concatenated

Before the diffusion model sees anything, a separate text encoder reads your prompt. It is an encoder in the sense of Page 08: it takes in the whole prompt and outputs a sequence of vectors, one per text token. The encoder is frozen, meaning its weights were trained earlier and are held fixed while the diffusion model trains around it. Those output vectors are the form the diffusion model can actually use, because attention operates on vectors, not on letters.

SD3 ships three encoders at once (Flux pairs CLIP-L with T5-XXL) because each was trained differently and is strong at something different. The CLIP encoders were trained on image-and-caption pairs, so they are strong on visual concepts but weak on grammar and counting. T5 was trained on language alone, so it is strong on composition, counting, and rendering text inside the image, at the cost of speed. Combined, with their output vectors stacked into one sequence, they follow a complex prompt more faithfully than any single encoder can, and grammar, counts, spatial relationships, and legible text-in-image all improve.

This reframes a common shorthand. Saying "the model reads the prompt" hides where the language understanding actually lives. The text encoder reads the prompt and turns it into vectors. The diffusion model only ever attends to those vectors; on its own it has no grasp of English at all. So when a prompt is misunderstood, the encoder is often where it went wrong.

The diffusion model never reads your prompt. It reads the vectors a text encoder made from it.

Page 26/Generation

Two-way talk.

Cross-attention vs MM-DiT. Old: one-way. New: bidirectional, every layer.

CLASSIC CROSS-ATTENTION  (U-Net + text condition)
   image queries  →  read text keys/values  →  one-way per layer

                      text tokens
                      "cat"  "red"  "chair"
   image patch p1    .7    .2     .1
   image patch p2    .1    .6     .3
   image patch p3    .2    .1     .7

   (text never updates; image attends to it)


MM-DiT JOINT SELF-ATTENTION  (SD3, Flux)
   text and image in one sequence  ⟷  every token attends to every other

                "cat"  "red"  "chair"   p1    p2    p3
   "cat"         *      *      *        *     *     *
   "red"         *      *      *        *     *     *
   "chair"       *      *      *        *     *     *
   p1            *      *      *        *     *     *
   p2            *      *      *        *     *     *
   p3            *      *      *        *     *     *

   text refines under image · image refines under text · both, every layer

Esser et al. 2024 · arXiv:2403.03206 · attention matrix shape is the architecture

The older design connects text to image with cross-attention. Query, Key, and Value came up on Page 06. Here the three roles are split across the two modalities: the image patches supply the Queries (what am I looking for), and the text tokens supply the Keys and Values (what I contain, and what I will contribute). At every block, each image patch attends to every text token and pulls in what it needs. But the flow runs one way only — the image reads from the text, and the text never updates in return.

MM-DiT, the design in SD3 and Flux, removes that asymmetry. It places the text tokens and image patches in a single sequence and runs ordinary self-attention over the whole thing, so every token attends to every other token regardless of which modality it came from. Each modality keeps its own weights for forming Queries, Keys, and Values, but they share one attention pass. In the SD3 paper's words, the design "uses separate weights for the two modalities and enables a bidirectional flow of information between image and text tokens."
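
The difference between the two designs shows up in the shape of the attention matrix. This is a toy sketch: the 2-dimensional vectors are made up and assumed to be already projected into Query/Key space, so the separate per-modality weights and the Value step are left out.

```python
import math

def softmax(row):
    top = max(row)
    exps = [math.exp(x - top) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(queries, keys):
    """One row of weights per query: softmax over scaled dot products with every key."""
    scale = math.sqrt(len(keys[0]))
    return [softmax([sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys])
            for q in queries]

# Made-up 2-d vectors for three text tokens and three image patches.
text = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]     # "cat", "red", "chair"
image = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.7]]    # p1, p2, p3

# Cross-attention: image patches ask, text tokens answer. Text never asks.
cross = attention_weights(image, text)           # 3 rows x 3 columns

# Joint self-attention: one sequence, every token asks every token.
joint_sequence = text + image
joint = attention_weights(joint_sequence, joint_sequence)   # 6 rows x 6 columns
```

The cross-attention matrix has rows only for image patches. The joint matrix has rows for text tokens too, and those extra rows are the path by which the text representation gets updated.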

Because information now moves both ways at every layer, the text representation can sharpen under what the image is becoming, and the image can sharpen under the prompt. The visible payoff is closer prompt adherence, more legible text rendered inside images, and steadier handling of prompts with several subjects. SD3 and Flux feel different from SD1 mainly because this conditioning is structurally different, not because the training run was simply larger.

Old: the image reads the prompt once per layer. New: text and image revise each other, every layer.

Page 27/Generation

U-Net.

The original diffusion backbone. Convolutional encoder–decoder with skip connections.

INPUTS  noisy latent + t + text

 64×64 ┓ conv → x-attn ━ skip ━┓ ↑ 64×64
        ↓                       ┃
 32×32 ┓ down → x-attn ━ skip ━┓ ↑ 32×32
        ↓                       ┃
 16×16 ┓ down → x-attn ━ skip ━┓ ↑ 16×16
        ↓                       ┃
  8×8  ┓ down → x-attn ━ skip ━┓ ↑  8×8 
        ↓                       ┃
        └━━━━ bottleneck ━━━━━━━━━━┛
        global structure lives here

OUTPUT     predicted noise tensor
RESOLUTION halves each downsample
SKIPS      preserve detail across squash
X-ATTN     text conditioning at every level

DDPM · Stable Diffusion 1/2 · the convolutional pyramid that powered the 2022 image-gen wave

The first diffusion models — DDPM, Stable Diffusion 1 and 2 — used a backbone called a U-Net. It is convolutional, meaning it processes an image with small filters that slide across it looking for local patterns, rather than with attention. Its shape is the source of its name. The latent enters at full resolution, then is repeatedly downsampled, halved in size stage by stage, until it reaches a small bottleneck in the middle. From there it is upsampled back up through mirror-image stages to the original size. Drawn out, the path down and back up traces a U.

Two features make the U work. Skip connections run straight across it, handing each upsampling stage the detailed map from the matching downsampling stage, so fine spatial detail is not lost in the squeeze. And cross-attention layers (the text-conditioning mechanism from Page 26) are inserted at several resolution levels of the U rather than at a single point, as the figure shows. The combined effect is a natural multi-scale view: local texture lives near the top of the U, while global structure, the overall composition, lives at the bottleneck. The entire 2022 wave of image generation ran on this architecture.

What ended its run was scaling. A U-Net does not grow as cleanly as a pure transformer: it is harder to know where to add capacity, and its custom-built shape resists the simple "make it bigger" recipe that worked so well for language. Removing that limitation is exactly what the next page, DiT, set out to do.

Shrink the image to capture overall structure, then enlarge it back, while skip connections carry the detail across.

Page 28/Generation

DiT.

Diffusion Transformers. Replace the U-Net with a pure transformer on latent patches.

LATENT IMAGE  64 × 64 × 4

split into a 16 × 16 grid of patches

┌────┬────┬────┬────┐
│ p0 │ p1 │ p2 │ p3 │  · · ·
├────┼────┼────┼────┤
│p16 │p17 │p18 │p19 │  · · ·
├────┼────┼────┼────┤
│p32 │ ...
│            ...
├────┼────┼────┼────┤
│p240│    │    │p255│
└────┴────┴────┴────┘

256 patches · 4×4 latent cells each
              ↓
   add position embeddings
              ↓
┌──────────────────────────┐
│   transformer × N        │
│   DiT-XL = 28 blocks     │
└──────────┬───────────────┘
           ↓
   predicted noise per patch

scaling: more Gflops → lower FID

Peebles & Xie 2022 · patches are tokens · the language stack denoises images

Peebles & Xie 2022 (arXiv:2212.09748) made the swap. They kept diffusion exactly as it was, but replaced the U-Net denoiser with a plain transformer (the same kind of block stack from the language section) operating on patches of the latent. The figure shows the move: cut the latent grid into small square patches, treat each patch as a token, add position information so the model knows where each one came from, and run the transformer over that sequence. The paper's own description is "replacing the commonly-used U-Net backbone with a transformer that operates on latent patches." And it scales the way language models do: "DiTs with higher Gflops -- through increased transformer depth/width or increased number of input tokens -- consistently have lower FID" (Gflops being a measure of compute, and FID the image-quality score from Page 23, where lower is better). Their largest model, DiT-XL/2, reached a state-of-the-art FID of 2.27 on class-conditional ImageNet at 256×256.
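
The patch-cutting step can be sketched with nested lists. This uses the figure's sizes (a 64 × 64 × 4 latent and 256 patches); the function name `patchify` and the plain integer positions are illustrative, standing in for the learned projection and position embeddings a real model would use.

```python
def patchify(latent, patch):
    """Cut an H x W x C latent (nested lists) into flat patch tokens, row by row."""
    height, width = len(latent), len(latent[0])
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            token = []
            for r in range(top, top + patch):
                for c in range(left, left + patch):
                    token.extend(latent[r][c])   # append this cell's channel values
            tokens.append(token)
    return tokens

# The figure's sizes: a 64 x 64 x 4 latent cut into patches 4 cells on a side.
latent = [[[0.0] * 4 for _ in range(64)] for _ in range(64)]
tokens = patchify(latent, patch=4)

# One position index per token, standing in for a position embedding.
positions = list(range(len(tokens)))
```

After this step the transformer sees 256 tokens of 64 numbers each, a sequence no different in form from a sentence of 256 word fragments.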

The deeper consequence is generality. Once the denoiser is a transformer, it no longer cares what the patches represent or how they are arranged. Patches laid out across space give you an image. Add patches across time and you get video, from the very same architecture with a longer sequence — which is exactly the bridge to Page 35.

So DiT is the lineage step that made Sora possible. The U-Net was a custom shape built for images. DiT is the same engine that powers GPT, simply pointed at a different job: denoising patches instead of predicting the next word.

Once patches are treated as tokens, the image side runs on the very same transformer as language.

Page 29/Generation

MM-DiT.

Multimodal Diffusion Transformer. SD3 and Flux's backbone.

ONE JOINT SEQUENCE

   TEXT TOKENS                       IMAGE PATCH TOKENS
   [a] [red] [chair] [on] [velvet]   +   [p00] [p01] [p02] · · · [p255]
        │                                       │
        │      SEPARATE WEIGHTS per modality     │
        ▼                                       ▼
   ┌──────────┐                          ┌──────────┐
   │ text Q,K,V│                          │image Q,K,V│
   └─────┬────┘                          └────┬─────┘
         │           SHARED ATTENTION         │
         └───────────────┬────────────────────┘
                         ▼
                  one attention matrix
                  text & image mix
                         │
                         ▼
   ┌──────────────────────────────────────────────┐
   │ text tokens refined under image context      │
   │ image patches refined under prompt context   │
   └──────────────────────────────────────────────┘
                  both, every layer

   Not prompt pasted onto image. One joint sequence.

Esser et al. 2024 · arXiv:2403.03206 · SD3 · Flux · the 2024-2026 backbone

MM-DiT is the multimodal version of the DiT from Page 28. DiT ran a transformer over image patches alone; MM-DiT puts the text tokens into the same sequence as the image patches and runs one transformer over both. Each modality keeps its own weights for producing Queries, Keys, and Values, but they share a single self-attention pass. The bidirectional text-and-image flow described on Page 26 is the direct consequence of that one shared sequence — this page is simply its proper name and home.

SD3, Flux, and a growing list of 2025-2026 models are built on MM-DiT. What you see from it is closer prompt adherence, markedly better text rendered inside images (logos, signs, captions), and steadier handling of prompts that name several subjects at once. Those gains trace to the architecture change, not to a larger training run.

This is why SD3 and Flux are not simply "a newer Stable Diffusion 1." Architecturally they belong to a different family: the DiT lineage that runs a transformer over patches, rather than the U-Net lineage they replaced. Same diffusion idea, different engine underneath.

Text and image share one attention pass, so each reshapes the other at every layer.

Page 30/Generation

Flow matching.

Regress vector fields, not noise. A cleaner parameterization; same destination.

LEARNED VELOCITY FIELD  v(x, t)
   at every point in space and time: which direction does the data flow?

   noise side                            data side
   t = 0                                  t = 1

      ↘   →   →   →   ↗
        ↘   →   →   ↗
          ↘   →   ↗
            ↘ ↑ ↗           ← image
          ↗   →   ↘
        ↗   →   →   ↘
      ↗   →   →   →   ↙

OLD OBJECTIVE  (DDPM-style)
   model predicts the noise added at each step
   requires noise schedule · hand-tuned · brittle

NEW OBJECTIVE  (flow matching · Lipman 2022)
   model predicts the velocity at each point
   no schedule · stable training · standard ODE solvers at sampling

   "A simulation-free approach for training CNFs based on
    regressing vector fields of fixed conditional probability paths."

Lipman et al. 2022 · arXiv:2210.02747 · cousin to diffusion, not replacement

Lipman et al. 2022 (arXiv:2210.02747) reframed what the network is trained to predict. Classic diffusion predicts the noise to remove at each step. Flow matching instead trains the model to predict a velocity: at any point between pure noise and real data, it answers "which direction, and how fast, should this point move to get closer to the data?" Collect those answers across the whole space and you have a vector field, a set of direction arrows like the ones in the figure. The paper describes it as "a simulation-free approach for training CNFs based on regressing vector fields of fixed conditional probability paths."

The practical wins follow from that choice. There is no hand-tuned noise schedule to get right, because the model learns the trajectory directly. Training is more stable. And because the result is a velocity field, generating a sample becomes a standard calculus problem — follow the arrows from noise to data with an off-the-shelf ODE solver, the same kind of routine used to integrate any system of motion over time.
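
Both halves, the training target and the sampling loop, fit in a short sketch. It assumes one common choice of path, a straight-line blend between a noise sample and a data sample, and uses the simplest ODE solver (fixed-step Euler). The names are hypothetical, and `toy_velocity` is a hand-written field for a dataset of one point, standing in for a trained network.

```python
def training_pair(noise, data, t):
    """A point on the straight path at time t, and the velocity to predict there."""
    x_t = [(1.0 - t) * n + t * d for n, d in zip(noise, data)]
    velocity = [d - n for n, d in zip(noise, data)]   # direction and speed toward the data
    return x_t, velocity

def euler_sample(velocity_fn, x, steps):
    """Follow the arrows from t=0 (noise) to t=1 (data) in equal-size steps."""
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Toy "learned" field: every point heads for a single data point.
data_point = [2.0, -1.0]

def toy_velocity(x, t):
    return [(d - xi) / (1.0 - t) for d, xi in zip(data_point, x)]

sample = euler_sample(toy_velocity, [0.3, 0.7], steps=10)
```

Training regresses the network's output onto `velocity` at randomly chosen times. Sampling never looks at a noise schedule; it just calls the field and steps.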

Flow matching and diffusion are not rival paradigms. They are close cousins that reach the same place by different math. The field now uses "diffusion" loosely to cover both, so when you read "a diffusion model" in 2026, it may well have been trained this way.

Diffusion predicts the noise to remove. Flow matching predicts the direction to move. Same destination.

Page 31/Generation

Rectified flow.

Straight-line transport. A straighter path from noise to image means fewer, bigger steps.

CLASSICAL DIFFUSION TRAJECTORY  (DDPM)
   curved path · many small steps required to integrate accurately

   noise  ●─╮
            ╰─●─╮
                ╰─●─╮
                    ╰─●─╮      curved
                        ╰─●─╮     route
                            ╰─●─╮
                                ╰─●  image

   ~50 steps  (DDIM)  ·  ~1000 steps  (DDPM)

RECTIFIED FLOW TRAJECTORY  (SD3, Flux)
   straight path · take big steps without losing fidelity

   noise  ●──────────●──────────●──────────●──────────●  image
          step 1      step 2     step 3     step 4

   ~20 steps  (Flux)  ·  2.5× fewer than SD1's ~50

THE TRAINING TRICK
   training target is the straight line from noise to data
   straight trajectories are cheap to integrate
   the math does more work per step

Liu et al. 2022 · arXiv:2209.03003 · adopted by SD3 and Flux · why 2025 image gen got fast

Rectified flow is a particular kind of flow matching. Flow matching (Page 30) learns a velocity field and samples by following its arrows. If those arrows trace a curved path from noise to image, you have to follow it in many small steps to stay accurate, the way you would trace a winding line with short pen strokes. Rectified flow builds straightness into the training target: each example pairs a noise sample with a data sample, and the model learns the constant velocity along the straight line between them, which pushes the learned paths toward straight lines. The SD3 abstract states it directly: "Rectified flow is a recent generative model formulation that connects data and noise in a straight line."

A straight path is cheap to follow. Each step of the ODE solver can be large without drifting off course, so reaching the same quality needs far fewer steps. Flux reaches a high-quality image in about 20 steps where SD1 needed roughly 50, about 2.5× fewer. The straighter geometry lets the math do more work per step.
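The link between straightness and step count can be checked on a toy problem. The sketch below runs the same Euler solver over two hand-written 2-D velocity fields that both lead from a start point to an end point: one straight, one a quarter-circle. Both fields are hypothetical stand-ins for trained models, and the straight one assumes the usual rectified-flow setup where the velocity is simply end minus start.

```python
import math

START, END = (1.0, 0.0), (0.0, 1.0)   # stand-ins for "noise" and "image"

def euler(field, start, n_steps):
    """Integrate a 2-D velocity field from t=0 to t=1 in n_steps Euler steps."""
    x, y = start
    dt = 1.0 / n_steps
    for _ in range(n_steps):
        vx, vy = field(x, y)
        x, y = x + dt * vx, y + dt * vy
    return x, y

def straight(x, y):
    # Constant velocity: end minus start, the same arrow everywhere.
    return END[0] - START[0], END[1] - START[1]

def curved(x, y):
    # Quarter-circle from START to END: the arrow keeps turning.
    return -math.pi / 2 * y, math.pi / 2 * x

def miss(field, n_steps):
    """Distance between where the solver lands and the true end point."""
    x, y = euler(field, START, n_steps)
    return math.hypot(x - END[0], y - END[1])

# (straight miss, curved miss) for several step budgets
errors = {n: (miss(straight, n), miss(curved, n)) for n in (1, 4, 20, 50)}
```

On the straight field a single step already lands on the end point. On the curved field the miss shrinks only as the step count grows, which is the cost the straight path avoids.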

SD3, Flux, and a lengthening list of 2025-2026 models train with rectified flow. The noticeably faster image generation of the past year is partly faster hardware, partly better samplers (Page 33), and partly this move to a straighter path.

Straighten the path from noise to image, and you can cross it in 20 steps instead of 50.

Page 32/Generation

CFG.

Classifier-Free Guidance. Run the model twice; amplify the difference.

   pred = uncond + scale × (cond − uncond)

prompt: "A red wine bottle on velvet"

   scale 1      generic image                                  →  prompt barely pulls
                ··                                              ··            (no guidance)

   scale 4      balanced                                       →  prompt visible,
                ··········                                      ··            natural texture

   scale 7      typical chat-UI default                        →  clean adherence,
                ··················                              ··            good texture

   scale 12     strong adherence                               →  common UI ceiling,
                ····························                    ··            getting harsh

   scale 20     overcooked                                     →  artifacts, oversaturation,
                ····································            ··            literal-minded composition

COST     each denoise step runs both conditional + unconditional → 2× compute per step

Ho & Salimans 2022 · arXiv:2207.12598 · the slider in every image-gen UI

Classifier-Free Guidance (Ho & Salimans 2022, arXiv:2207.12598) is a trick for making the model follow the prompt more strongly. It begins in training: about 10% of the time the prompt is dropped, so the model learns to denoise both with conditioning (the prompt) and without it. At generation time, every denoising step is then run twice — once given the prompt, once given nothing. Call those two predictions cond and uncond. The model follows an exaggerated version of their difference, as the figure's formula shows: prediction = uncond + scale × (cond − uncond). The larger the scale, the harder the result is pushed in the direction the prompt added.
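Both halves of the trick fit in a few lines. This is a minimal sketch, not any library's API: `drop_prompt` and `guide` are illustrative names, and the two prediction vectors are made-up numbers standing in for what a real model would output with and without the prompt.

```python
import random

def drop_prompt(prompt, p_drop=0.1, rng=random):
    """Training side: swap in an empty prompt about 10% of the time."""
    return "" if rng.random() < p_drop else prompt

def guide(uncond, cond, scale):
    """Sampling side: pred = uncond + scale * (cond - uncond), per element."""
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]

uncond = [0.10, -0.20, 0.05]   # hypothetical prediction given no prompt
cond = [0.30, -0.10, 0.00]     # hypothetical prediction given the prompt

at_1 = guide(uncond, cond, 1.0)   # equals cond: nothing is amplified
at_7 = guide(uncond, cond, 7.0)   # the prompt's push, multiplied by 7
```

At scale 1 the formula collapses to the conditional prediction. Anything above 1 extrapolates past it, in the direction the prompt added.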

That scale is the dial. At scale 1 nothing is amplified: the formula reduces to the plain conditional prediction, the prompt barely pulls, and you get a generic image. Around scale 7 is the typical default: clear adherence with good texture. By 12, a common UI ceiling, the result is getting harsh, and at 20 the image is overcooked: oversaturated, full of artifacts, weirdly literal. This dial is exactly the "CFG scale" slider you see in image-generation interfaces.

CFG is the single largest prompt-adherence lever in image and video generation; the difference between CFG 4 and CFG 12 on one prompt is dramatic. The cost, noted at the bottom of the figure, is compute: running the model twice per step roughly doubles the work. Veo and other video models use the same trick.

One dial for how hard to push the prompt. Too low is generic; too high is overcooked.

Page 33/Generation

Samplers.

Different paths from noise to image. Quality vs. speed.

DDPM          stochastic · 1000 steps · the original
DDIM          deterministic · ~50 steps · SD1 default
Euler · Heun  2nd-order · ~25 steps · fewer corrections
DPM-Solver    12–25 steps · current default
Flow ODE      SD3 · Flux · rectified-flow native

RULE OF THUMB
   low step count   →   sampler choice matters a lot
   high step count  →   samplers converge
   model + budget   →   pick the sampler

   "Best" sampler depends on architecture and your step budget.

The sampler dropdown · Karras et al. 2022 unified the framework · arXiv:2206.00364

The path a sample takes from noise to image is described by a differential equation, an equation that says how the point should change at each instant. A sampler is a numerical solver for that equation: a recipe for taking discrete steps along the path. Different samplers step differently, trading accuracy against speed in different ways. Karras et al. 2022 (arXiv:2206.00364) unified the many competing proposals into one framework, and modern interfaces expose a handful of the common ones in a dropdown.

How much the choice matters depends entirely on your step budget. At a low number of steps, samplers visibly disagree: same prompt, same seed, different result. At a high number of steps they converge toward the same image, because the path is being traced finely enough that the stepping rule stops mattering. So the "best" sampler is not absolute; it depends on the model and on how many steps you are willing to spend. Flux ships with a flow-matching ODE solver native to how it was trained; SD1 ships with DDIM.
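The convergence claim is easy to see on a toy equation. The sketch below implements two of the stepping rules, Euler and Heun, and runs both on a simple decay equation rather than a real diffusion ODE (an assumption made to keep it self-contained). `toy_field`, `run`, and `gap` are illustrative names.

```python
import math

def euler_step(f, x, t, dt):
    """First-order rule: one slope evaluation per step."""
    return x + dt * f(x, t)

def heun_step(f, x, t, dt):
    """Second-order rule: average the slope at both ends of the step."""
    k1 = f(x, t)
    k2 = f(x + dt * k1, t + dt)
    return x + dt * 0.5 * (k1 + k2)

def run(step, f, x0, n_steps):
    """Walk from t=0 to t=1 with the given stepping rule."""
    x, dt = x0, 1.0 / n_steps
    for k in range(n_steps):
        x = step(f, x, k * dt, dt)
    return x

def toy_field(x, t):
    # Toy stand-in for a model's ODE; exact answer is x0 * exp(-2).
    return -2.0 * x

def gap(n_steps):
    """How far apart the two samplers end up for the same start."""
    return abs(run(euler_step, toy_field, 1.0, n_steps)
               - run(heun_step, toy_field, 1.0, n_steps))

low_budget, high_budget = gap(5), gap(500)
```

At 5 steps the two rules land visibly apart. At 500 they agree to about three decimal places, because the step is small enough that the rule hardly matters.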

Sampler choice matters most at exactly the step counts people run in production: around 20 to 30 steps, low enough that the samplers have not yet converged.

The path is fixed; the sampler is how you step along it. At low step counts, the choice shows.

Page 34/Generation·Summary

Gen, in one breath.

Compress · diffuse · denoise · condition · sample. Five components.

VAE        image ↔ latent · 192× compress
Diffusion  noise → latent · learned reverse
Backbone   U-Net → DiT → MM-DiT
Condition  CLIP + T5 + joint attention
Sample     CFG · DDIM/DPM/flow ODE

SAME SKELETON · DIFFERENT MUSCLES
   SD1   VAE + U-Net + CLIP + cross-attn + DDIM
   SDXL  VAE + U-Net + CLIP-L + CLIP-G + cross-attn + DDIM
   SD3   VAE + MM-DiT + CLIP-L + CLIP-G + T5 + joint + rectified flow
   Flux  VAE + MM-DiT + CLIP-L + T5 + joint + rectified flow ODE

The full image-gen stack · two architectural shifts defined 2024–2026

Every modern image generator is the same five pieces, with different choices plugged into each slot. The pieces: a VAE to compress and decompress (Page 24), diffusion to turn noise into a latent (Page 23), a backbone to do the denoising (U-Net or DiT), a conditioning stack to read the prompt (Page 25), and a sampler to walk the path (Page 33). SD1 fills those slots with VAE + U-Net + CLIP + cross-attention + DDIM. Flux fills the same slots with VAE + MM-DiT + CLIP-L + T5-XXL + joint attention + a rectified-flow ODE. The same five slots, different parts in each.

Two of those slots changed enough to define the 2024-2026 era. The backbone moved from U-Net to MM-DiT, joining the DiT lineage. The training objective moved from classical noise-prediction diffusion to rectified flow. Both shifts originated in papers, not products. The faster, sharper image and video tools you have used this year are the downstream consequence of those two changes.

Every image model is the same five pieces. Products differ only by what fills each slot.

Page 35/Bridge

Spacetime patches.

Video is a longer patch sequence run through the same diffusion transformer.

IMAGE  single frame · spatial only
┌────┬────┬────┬────┐
│ p0 │ p1 │ p2 │ p3 │
├────┼────┼────┼────┤
│ p4 │ p5 │ p6 │ p7 │
└────┴────┴────┴────┘
8 patches · sequence length 8

VIDEO  spatial × temporal

t = 0
┌────┬────┬────┬────┐
│ p00│ p01│ p02│ p03│
├────┼────┼────┼────┤
│ p04│ p05│ p06│ p07│
└────┴────┴────┴────┘

t = 1
┌────┬────┬────┬────┐
│ p08│ p09│ p10│ p11│
├────┼────┼────┼────┤
│ p12│ p13│ p14│ p15│
└────┴────┴────┴────┘

t = 2
┌────┬────┬────┬────┐
│ p16│ p17│ p18│ p19│
├────┼────┼────┼────┤
│ p20│ p21│ p22│ p23│
└────┴────┴────┴────┘

sequence = p00, p01, ... p23
attention runs across space + time

p17 (later) attends to p05 (earlier)
   ↳ how coherence works

variable resolution, duration,
aspect ratio · same model, all

Sora · Feb 2024 · a still image is "videos with a single frame"

Page 28 promised that video falls out of the same architecture. The Sora technical report describes the move in its own words: "We turn videos into patches by first compressing videos into a lower-dimensional latent space, and subsequently decomposing the representation into spacetime patches." A spacetime patch is a small chunk of video — a few pixels wide, a few pixels tall, and a few frames long. The diffusion transformer from Page 28 treats each such chunk as one token in a sequence, exactly as it treated flat image patches. And a still image, in the same report, is simply "videos with a single frame."

Because the model only ever sees a sequence of patches, the messy practical variables stop mattering. A longer or larger video simply produces a longer patch sequence, and a transformer accepts sequences of varying length, with compute as the practical limit. Vertical, horizontal, or widescreen, a few seconds or a full minute: the same model takes them all, with no separate code path per format.
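The figure's numbering can be reproduced in a few lines. This sketch flattens a grid of patches into one sequence and counts how the sequence grows with clip size. The patch sizes and latent dimensions are assumptions chosen for illustration, since the Sora report does not publish them, and `patch_ids` and `sequence_length` are hypothetical helper names.

```python
def patch_ids(frames, rows, cols):
    """Flatten a (time, row, col) grid of patches into one sequence,
    frame by frame, row by row, as in the figure."""
    return [(t, r, c)
            for t in range(frames)
            for r in range(rows)
            for c in range(cols)]

def sequence_length(frames, height, width, pt=1, ph=2, pw=2):
    """Number of tokens for a latent of the given size.
    pt, ph, pw are assumed patch sizes in time, height, and width."""
    return (frames // pt) * (height // ph) * (width // pw)

image = patch_ids(frames=1, rows=2, cols=4)   # 8 patches, as in the figure
video = patch_ids(frames=3, rows=2, cols=4)   # 24 patches: p00 .. p23

p17 = video[17]   # (2, 0, 1): third frame, top row, second column
p05 = video[5]    # (0, 1, 1): first frame, bottom row, second column

short_clip = sequence_length(frames=16, height=32, width=32)   # 4096 tokens
long_clip = sequence_length(frames=32, height=32, width=32)    # 8192 tokens
```

Doubling the duration doubles the token count and changes nothing else. The model downstream sees only a longer list.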

Sora does not generate frames one after another. It denoises the entire grid of spacetime patches at once. Frame N+1 is not predicted from frame N. Every frame is denoised in parallel with every other, and because attention runs across the whole sequence, every patch can attend to every other patch — earlier frames, later frames, anywhere in the clip. That all-at-once, all-pairs view is why long-range coherence, such as a character staying the same person across a shot, holds together at all.

A still image is one frame. Video is the same patches, with time as one more axis.

Page 36/Bridge

Emergent physics.

3D, object permanence, world simulation. Phenomena of scale.

NO EXPLICIT MODULES
   [no 3D engine]   [no object database]   [no rigid-body sim]   [no physics priors]

YET, AT SCALE, THESE EMERGE
   camera move                  →   objects stay coherent in 3D
   occlusion                    →   objects persist after leaving frame
   paint stroke                 →   marks stay on the canvas
   character turn               →   identity mostly holds
   liquid pour                  →   plausible flow direction
   multiple shots, same person  →   wardrobe and face hold across cuts

FAILURES THAT REMAIN
   glass shatter                →   wrong fragments, wrong sound
   hands holding tools          →   intermittent contact, mis-grip
   cause / effect chains        →   plausible but not reliable
   long videos (>30s)           →   drift, identity loss

   "These properties emerge without any explicit inductive biases
    for 3D, objects, etc. — they are purely phenomena of scale."

   — Sora technical report, Feb 2024

Emergent ≠ correct · the most direct visual demonstration of the bitter lesson

Sora shows behaviors nobody wrote into it. Move the camera and objects stay consistent in three dimensions. Let an object pass behind another and it survives the occlusion, reappearing intact. Cut between shots and a character mostly keeps the same face and clothes. A painter's brushstrokes stay on the canvas. None of this was coded as a rule. In the report's words: "These properties emerge without any explicit inductive biases for 3D, objects, etc. — they are purely phenomena of scale." An inductive bias is a built-in assumption a designer bakes into a model; Sora was given none of these, and developed them anyway.

The mechanism is the striking part. The team did not build a 3D engine, an object tracker, or a physics simulator. They built one thing: a model that predicts spacetime patches, trained at very large scale. Something that behaves like a physics engine emerged inside it, not because anyone asked for physics, but because crudely tracking how the world moves is the cheapest way to predict what the clean patches should look like.

This is also where to stay careful. Saying "Sora understands physics" claims too much. It models physics well enough to predict patches most of the time, and the team is explicit about where that breaks: "it does not accurately model the physics of many basic interactions, like glass shattering." The figure lists more of these failures. Emergent is not the same as correct: the behavior appeared because of scale, but appearing is not the same as being reliable.

They built a patch predictor. A physics engine emerged.

Page 37/Bridge

The bitter lesson.

Same answer, three times. Language, image, video — one curve.

LANGUAGE  (Kaplan/Chinchilla)      IMAGE  (DiT)                  VIDEO  (Sora compute scan)
   loss                                FID                              sample quality
   3.2│*                                *                                base   ▓▓
   2.8│  *                                *                              4×     ▓▓▓▓▓
   2.4│     *                              *                            32×    ▓▓▓▓▓▓▓▓
   2.0│          *                            *
   1.6│                *                          *
   1.2│                       *
       └───────────────                  └─────────────                   └────────────
        compute  →                         Gflops  →                       compute  →

THE TURN     general method  +  scale  >  hand-built priors
SUTTON 2019  "We have to learn the bitter lesson that building in
              how we think we think does not work in the long run."

THE NUANCE   loss curves don't break · specific benchmarks saturate
              the open question is whether data is the binding constraint

Three domains · one curve · same conclusion · Sutton's bitter lesson, applied three times

Rich Sutton named the pattern in 2019: across the history of AI, general methods that scale with compute have repeatedly beaten clever, hand-crafted approaches built on human insight about the problem. His blunt version: "We have to learn the bitter lesson that building in how we think we think does not work in the long run." It is called bitter because the hand-crafted approaches are the ones researchers are proudest of, and scale keeps winning anyway. What makes the present moment notable is that the current decade ran the same experiment in three separate domains (language, image, and video) and got the same result in each.

The figure puts three curves side by side. Language models follow the Kaplan and Chinchilla scaling laws from Page 14. DiT, the image backbone, follows the same kind of curve: more compute, lower FID. Sora's published comparison shows the same shape for video, holding everything else fixed while scaling compute from a base run to 4× to 32× — sample quality climbs the whole way. Three independent fields, three confirmations of one relationship.

Both extreme readings are wrong. "Scaling is hitting a wall" overstates the case: specific benchmarks do saturate, but the underlying loss curves keep falling. "Just keep adding compute" overstates it the other way, because compute is not the only input. The open question is less whether the laws are breaking than whether data is the binding constraint: whether we run out of high-quality training data before the curves give out.

Three domains, one curve, same answer: general method plus scale beats hand-built priors.

Page 38/Bridge

Same engine.

Self-attention runs everywhere. Different tokens go in; the operation itself is identical.

LLM              attention over text tokens
CLIP · SigLIP    attention over image patches
DiT · MM-DiT     attention over latent patches
Sora             attention over spacetime patches
Gemini · GPT-4o  attention over mixed sequences

UPSTREAM OF ATTENTION     how to tokenize · how to condition · how to compose
DOWNSTREAM OF ATTENTION   how to decode → text · pixels · motion

           the middle is the same

One primitive · five surfaces · the architectural creativity is at the edges

Line up every system in this deck and one thing repeats. A language model runs self-attention over text tokens. CLIP and SigLIP run it over image patches. DiT and MM-DiT run it over latent image patches. Sora runs it over spacetime patches. The multimodal models, Gemini and GPT-4o, run it over a single mixed sequence of text tokens and image tokens projected into the same space. The operation from Page 06 — every element attending to every other — sits inside all of them.
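The point can be shown with one function and different inputs. The sketch below is a deliberately stripped-down attention: it skips the learned query, key, and value projections and the multiple heads that real models use (a simplification for illustration), and uses each token vector directly. The token values are made up. The `causal` flag reproduces the one difference noted on Page 02: language decoders let a token see only earlier positions, while the generation side sees everything.

```python
import math

def attention(tokens, causal=False):
    """Each token becomes a softmax-weighted average of the tokens it can see.
    Simplified: no learned projections, one head."""
    d = len(tokens[0])
    out = []
    for i, q in enumerate(tokens):
        visible = tokens[: i + 1] if causal else tokens
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  for k in visible]
        top = max(scores)                         # subtract max for stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w / total * v[j] for w, v in zip(weights, visible))
                    for j in range(d)])
    return out

text_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]        # made-up 3 tokens
patch_tokens = [[0.2, 0.8], [0.9, 0.1], [0.5, 0.5],
                [0.3, 0.3], [0.7, 0.6], [0.1, 0.9]]       # made-up 6 patches

text_out = attention(text_tokens, causal=True)    # language-style
patch_out = attention(patch_tokens)               # image/video-style
```

The function never asks what the vectors represent. Sequence in, same-length sequence out, whether the rows stand for words, image patches, or spacetime patches.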

What changes from one system to the next is never that operation. It is the two things on either side of it. Upstream sits the tokenization and conditioning: how the world is cut into a sequence, and how a prompt is fed in. Downstream sits the decoder: how the output sequence is turned back into text, pixels, or motion. All the architectural creativity lives at those two edges. The middle stays the same.

One attention operation. Different tokens in, different decoders out.

Page 39/Bridge·For builders

Work with the grain.

The model sits in the middle. What you build is the scaffolding around it, and that scaffolding is the work.

USER INTENT
      │
      ├─ retrieval     embeddings → top-K context           ← Page 20
      ├─ tools         APIs · files · databases · code      ← grounding via execution
      ├─ steering      system prompt · examples · policy    ← Pages 17-19
      ├─ sampling      temperature · top-p · structured     ← Page 15
      └─ evaluation    citations · checks · human review    ← trust UX
            │
            ▼
       ┌─────────┐
       │  MODEL  │   ← shared infrastructure · not yours to redesign
       └────┬────┘
            │
            ▼
   grounded answer  /  image  /  action

THE MODEL'S JOB            YOUR JOB
   fluent autocomplete    →     give it good context
   fingerprint matcher    →     give it good queries
   denoiser               →     give it good conditioning

   scaffold the model · do not redesign it

CoWriter · Mosaic · every serious AI product · the architecture that ages well

If you build products on top of these models — CoWriter on LLMs, Mosaic on SigLIP plus Gemini Vision — the design choices that age well are the ones that respect how the model actually works. Use embeddings for retrieval, because that is what the geometry on Page 11 is for. Use the post-training surface (system prompts, and where available, constitutions) for steering, because that is the layer personality lives on. Use grounding (RAG, tools, citations) for truth, because the model cannot supply it on its own. The model has a grain, and these choices run along it rather than against it.
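As one example of working with the grain, here is a minimal sketch of embedding retrieval: rank stored passages by cosine similarity to a query and keep the top K. The three-number "embeddings" are invented for illustration, since a real embedding model returns vectors with hundreds or thousands of dimensions, and `cosine` and `top_k` are hypothetical helper names rather than any product's API.

```python
import math

def cosine(a, b):
    """Cosine similarity: near 1.0 means same direction, near 0.0 unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query, docs, k=2):
    """Return the names of the k passages closest to the query."""
    ranked = sorted(docs, key=lambda name: cosine(query, docs[name]),
                    reverse=True)
    return ranked[:k]

# Made-up embeddings; a real system would get these from an embedding model.
docs = {
    "refund policy": [0.9, 0.1, 0.0],
    "shipping times": [0.1, 0.9, 0.1],
    "return window": [0.8, 0.2, 0.1],
}
query = [1.0, 0.0, 0.0]   # stand-in for an embedded question about refunds

context = top_k(query, docs, k=2)
```

The retrieved passages are what goes into the prompt as context. The model is then asked to do the one thing it is good at, which is to continue fluently from what it was given.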

Each model type comes down to one job, and your task is to feed it well. A language model is a fluent autocompleter: give it good context, then let it autocomplete. An embedding model is a fingerprint matcher: give it good queries, then let it match. A diffusion model is a denoiser: give it good conditioning, then let it denoise. Three different models, one rule — supply the input the model is built to use, and stay out of its way.

It comes down to a division of labor. The model is shared infrastructure: you did not train it, and you will not redesign it. What you own is everything around it: the retrieval, the prompts, the tools, the checks, the interface. That scaffolding is where a product is won or lost, and it is the part that is yours to build.

The model is shared infrastructure. The scaffolding around it is the part you actually build.

End Plate · FIN · 40 / 40

One substrate. Two stacks.

The engine is attention · models turn fluent before they turn correct · scale finds the priors no one coded · the scaffolding around the model is your job.

Bradley Tangonan 2026-05-28 AI Models · Plain-Language Manual 40 / 40