I watched 3Blue1Brown’s neural network series a few years ago. Beautiful animations. The visuals make the math behind ML more digestible, but it’s still advanced stuff. I’d get the gist while watching, close the tab, and forget everything by the next morning. Over the years I’d bump into other videos and blog posts covering the same concepts. Same result every time. I took that personally.

Most devs like me have a reasonable grasp of the math that shows up in ML. Or at least we think we do. I reached my peak with math almost 15 years ago when I passed an optional course on differential equations in university. The math in ML isn’t actually harder than that. It’s mostly linear algebra and basic calculus, stuff I’ve seen before. But the gap between “I can follow along” and “I can reproduce this” is wider than I expected.
Since AI became part of our daily work, we’ve all been in this spot. We use LLMs every day. We know they “predict the next token.” But what does that actually mean, concretely, in code?
Andrej Karpathy, the guy who coined the term “vibe coding,” is also an AI researcher (OpenAI, Tesla before that) and a well-known online educator on how LLMs work under the hood. He recently published ~200 lines of Python that contain the entire GPT algorithm. A complete transformer with autograd, attention, training loop, and inference. No NumPy, no PyTorch, no dependencies. Just scalar math.
I ported it to Swift. ~350 lines, zero dependencies. I thought porting it line by line would force me to understand every single operation. That I’d come out the other side knowing exactly how a transformer works. I was wrong. I ported the whole thing, and the attention math still reads like hieroglyphics to me.

The transformer internals remain a black box to me. But what I did understand, much better than before, is everything around it. How the training loop works and how 4,192 random numbers slowly become knowledge. How a computation graph gets built, used once, and thrown away. The plumbing, not the physics.
I won’t repeat what Karpathy already covered in his blog post. Instead, I want to give my take on what this code actually does, from the perspective of someone who’s spent 13+ years building apps and has nothing to do with ML research. Data in, graph built, loss computed, parameters nudged, repeat.
What the Code Does, From 30,000 Feet (9,000 Meters)
The code trains on over 32,000 human names and then generates new ones that sound plausible. “maren”, “amaron”, “jaria”. Names that don’t exist but could.
How? It starts with 4,192 random numbers. The training loop walks through the dataset one name at a time. For each name, it feeds the characters through a function called gpt(), computes how wrong the predictions were, figures out which direction to nudge each of those 4,192 numbers to be less wrong, nudges them, and moves to the next name. Do this 1,000 times and the numbers encode patterns about what characters tend to follow what.
Inference is the payoff. Feed the model a start token, let it predict the next character, feed that prediction back in, repeat until it predicts a stop token. Out comes a name that never existed.
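That loop can be sketched in a few lines. The closure below is a stand-in for the real pipeline (gpt() → softmax → weighted sample); here it just walks a canned sequence so the control flow is runnable on its own. Token 26 is the start/stop token, as explained in the next section.

```swift
// Sketch of the inference loop. `nextToken` is a stand-in for the real
// model; here it replays a canned sequence so the loop runs on its own.
let bos = 26 // special start/stop token; 'a' = 0 ... 'z' = 25
let chars = Array("abcdefghijklmnopqrstuvwxyz")

func sampleName(nextToken: (Int) -> Int) -> String {
    var token = bos                  // feed the model a start token
    var name = ""
    while true {
        token = nextToken(token)     // predict the next character
        if token == bos { break }    // stop token: the name is over
        name.append(chars[token])
        if name.count > 16 { break } // safety valve against runaway loops
    }
    return name
}

// Canned "model" that spells out e-m-m-a, then emits the stop token.
var script = [4, 12, 12, 0, bos]
let name = sampleName { _ in script.removeFirst() }
print(name) // emma
```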
Tokenizing: Strings to Numbers
The vocabulary is 27 characters: a-z plus a special BOS (beginning of sequence) token that marks both the start and end of a name. Every character maps to an integer. ‘a’ = 0, ‘b’ = 1, …, ‘z’ = 25, BOS = 26. That’s the tokenizer. The entire thing.
Take “emma”:
["e", "m", "m", "a"] → characters
[BOS, "e", "m", "m", "a", BOS] → wrapped with start/end tokens
[26, 4, 12, 12, 0, 26] → integer token IDs
BOS at the start says “a name is beginning.” BOS at the end says “this name is over.” The model needs to learn both when to start generating characters and when to stop.
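The whole tokenizer fits in a handful of lines. A runnable sketch (encode/decode are my names for illustration, not necessarily the port’s):

```swift
// The entire tokenizer: 27 characters, one integer each.
let chars = Array("abcdefghijklmnopqrstuvwxyz")
let bos = 26 // special token marking both the start and end of a name

func encode(_ name: String) -> [Int] {
    // wrap with BOS on both sides, then map each character to its index
    [bos] + name.map { Int($0.asciiValue! - Character("a").asciiValue!) } + [bos]
}

func decode(_ tokens: [Int]) -> String {
    String(tokens.filter { $0 != bos }.map { chars[$0] })
}

print(encode("emma"))               // [26, 4, 12, 12, 0, 26]
print(decode([26, 4, 12, 12, 0, 26])) // emma
```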
The Training Loop: One Name at a Time
Take “emma” again. The training loop processes it token by token, left to right:
- Given BOS, predict ‘e’
- Given ‘e’, predict ‘m’
- Given ‘m’, predict ‘m’
- Given ‘m’, predict ‘a’
- Given ‘a’, predict BOS (end)
At each position, gpt() is called with the current token. It returns 27 scores (logits), one per character in the vocabulary. softmax() turns those scores into probabilities. The loss for that position is -log(probability of the correct answer). If the model assigned 90% probability to the right character, the loss is low. If it assigned 2%, the loss is high. Nobody tells the model “the answer is ‘m’.” It outputs 27 probabilities, we look at the one for the correct token, and punish proportionally to how low it was.
In Swift, that loop looks like this:
```swift
for posID in 0..<n {
    let tokenID = tokens[posID], targetID = tokens[posID + 1]
    let logits = gpt(tokenID: tokenID, posID: posID, nLayer: nLayer,
                     nHead: nHead, headDim: headDim,
                     keys: &keys, values: &values, stateDict: stateDict)
    let probs = softmax(logits)
    losses.append(-probs[targetID].log())
}
let loss = (1.0 / Double(n)) * losses.reduce(Value(0), +)
```
Each position produces its own little loss value. These get averaged into one scalar: the loss for the entire name.
The Graph That Nobody Explains
Here’s what confused me the most, and what I think Karpathy’s blog post breezes past.
The autograd engine sits in one class: Value. It wraps a Double and records every arithmetic operation.
```swift
final class Value {
    var data: Double
    var grad: Double = 0.0
    let children: [Value]
    let localGrads: [Double]
}
```
Four properties. .data holds the actual number, .grad tracks how much the loss changes if you nudge it. .children and .localGrads record where this value came from and the derivatives of the operation that produced it. During training, .data and .grad are the two that matter.
a + b doesn’t just produce a number. It produces a new Value that remembers it came from a and b, and stores the derivative information:
```swift
static func + (lhs: Value, rhs: Value) -> Value {
    Value(lhs.data + rhs.data, children: [lhs, rhs], localGrads: [1, 1])
}

static func * (lhs: Value, rhs: Value) -> Value {
    Value(lhs.data * rhs.data, children: [lhs, rhs], localGrads: [rhs.data, lhs.data])
}
```
Same for .exp(), .log(), and every other operation. That’s the entire trick. Replace Double with Value throughout the code, and every calculation silently builds a graph behind the scenes.
What simplified it for me: the gpt() function doesn’t care about any of this. It’s just math. If you replaced every Value with a plain Double, the function would still run and produce the 27 numbers. It doesn’t know it’s building a graph. Autograd is not part of the architecture. It’s the learning machinery bolted on top. The graph is a side effect of doing math on Value objects instead of plain numbers.
As gpt() runs for one token position, it performs hundreds of multiplications, additions, exp(), log() calls. Each one creates a new Value node pointing back to its inputs. By the time you’ve processed all positions of “emma” (5 calls to gpt()), you have thousands of transient nodes forming a graph. The 4,192 parameters are leaf nodes at the bottom. They were created with Value(someDouble), not as the result of an operation, so they have no children.
The sub-graphs from each position connect through the KV cache. Later positions attend to key and value nodes computed by earlier ones. These are live Value objects, so the computation graph is already linked across positions before any loss is computed. Then the losses join everything at the top: each position produces a loss Value, and averaging them creates a single root node. One node at the top, thousands of intermediate nodes in the middle, 4,192 parameter leaves at the bottom.
That single root is what makes backward() possible.
What backward() Actually Does
backward() starts at the loss node (the root) and walks down the graph to the leaves, setting .grad on every node along the way. It doesn’t return anything or modify the parameters. It just annotates every node with “how much would the final loss change if this node’s value changed a tiny amount?”
The implementation is a topological sort followed by a reverse walk:
```swift
func backward() {
    var topo: [Value] = []
    var visited = Set<Value>()
    func buildTopo(_ v: Value) {
        guard visited.insert(v).inserted else { return }
        for child in v.children { buildTopo(child) }
        topo.append(v)
    }
    buildTopo(self)
    grad = 1.0
    for v in topo.reversed() {
        for (child, lg) in zip(v.children, v.localGrads) {
            child.grad += lg * v.grad
        }
    }
}
```
After the full walk, every parameter leaf has a .grad value. Positive grad means increasing this parameter increases the loss (bad). Negative grad means increasing it decreases the loss (good).
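The snippets above are enough to trace one backward pass by hand. Assembled into a minimal, self-contained Value (mirroring the class shown earlier), a four-node graph for c = a * b + a gives dc/da = b + 1 = 4 and dc/db = a = 2:

```swift
// Minimal, self-contained Value: just +, *, and backward().
final class Value: Hashable {
    var data: Double
    var grad: Double = 0.0
    let children: [Value]
    let localGrads: [Double]

    init(_ data: Double, children: [Value] = [], localGrads: [Double] = []) {
        self.data = data
        self.children = children
        self.localGrads = localGrads
    }

    static func + (lhs: Value, rhs: Value) -> Value {
        Value(lhs.data + rhs.data, children: [lhs, rhs], localGrads: [1, 1])
    }
    static func * (lhs: Value, rhs: Value) -> Value {
        Value(lhs.data * rhs.data, children: [lhs, rhs], localGrads: [rhs.data, lhs.data])
    }

    // identity-based hashing: equal .data does not mean same graph node
    static func == (lhs: Value, rhs: Value) -> Bool { lhs === rhs }
    func hash(into hasher: inout Hasher) { hasher.combine(ObjectIdentifier(self)) }

    func backward() {
        var topo: [Value] = []
        var visited = Set<Value>()
        func buildTopo(_ v: Value) {
            guard visited.insert(v).inserted else { return }
            for child in v.children { buildTopo(child) }
            topo.append(v)
        }
        buildTopo(self)
        grad = 1.0
        for v in topo.reversed() {
            for (child, lg) in zip(v.children, v.localGrads) {
                child.grad += lg * v.grad
            }
        }
    }
}

let a = Value(2), b = Value(3)
let c = a * b + a     // builds a four-node graph as a side effect
c.backward()
print(a.grad, b.grad) // 4.0 2.0
```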
The Optimizer: Where Learning Happens
The optimizer reads .grad from each parameter and nudges .data in the direction that reduces loss. Gradient descent. Walk downhill.
This code uses Adam (short for “Adaptive Moment Estimation”). Instead of nudging every parameter by the same amount, Adam tracks the recent history of each parameter’s gradients and adjusts the step size accordingly. Some parameters need big steps, others need small ones. Adam figures that out automatically.
After the optimizer runs, it resets all gradients to zero. The parameters carry their updated .data values into the next training step.
The Graph Disappears, the Numbers Stay
After the optimizer runs, the training loop moves to the next name. The loss variable goes out of scope. All those thousands of transient Value nodes become unreachable and get freed (ARC in Swift, garbage collector in Python). Every single step rebuilds the entire graph from scratch.
What survives? The 4,192 parameter Value objects. They’re still referenced by stateDict and params. They carry slightly updated .data values into the next iteration.
Those 4,192 parameters live in stateDict as named matrices ([[Value]]): embedding tables (wte, wpe), attention weights, MLP weights, and a final output projection (lm_head). They’re also flattened into a params array for the optimizer. Since Value is a reference type (final class), params[i] and stateDict["wte"]![row][col] point to the same object in memory. The optimizer mutates .data on the flat array, the model sees the updated values through stateDict next time gpt() runs.
At step 0, these are random garbage. Small random numbers initialized with gaussianRandom(std: 0.08). By step 1,000, they encode patterns like “names often start with a consonant” or “‘mm’ is often followed by a vowel.” The knowledge lives entirely in those 4,192 floats. The graph is scratchwork. The parameters are what actually survives.
The Architecture: Not Even Going to Pretend
The gpt() function implements the Transformer from the 2017 “Attention Is All You Need” paper. It’s ~60 lines. A token ID goes in, 27 numbers come out.
I don’t understand the math inside it. It’s a chain of matrix multiplications, normalizations, and nonlinearities that transforms a single integer into a probability distribution over the next character.
The attention part is where tokens look at each other. Everything else operates on each token independently. That’s about as honest as I can be about the internals.
What I can tell you is that the function is stateless aside from the KV cache it appends to via inout parameters. It takes a token, a position, the learned parameters, and a cache of previously computed values. It returns 27 logits. Call it with the same inputs and the same cache state, get the same outputs. A function with 4,192 knobs.
What Python Gets for Free
This is where the port got fun.
```python
import random
random.seed(42)
x = random.gauss(0, 0.08)
choice = random.choices(range(n), weights=w)
```
Three lines. Python’s standard library gives you seedable Gaussian random and weighted sampling. In Swift:
- Seedable RNG: Swift’s SystemRandomNumberGenerator is not seedable. Useless for reproducibility. I pulled in Point-Free’s Xoshiro256** implementation. Pure bitwise operations, passes statistical tests, 256 bits of state. Point-Free has an entire episode collection on composable randomness in Swift.
- Gaussian distribution: Python has random.gauss() built in. Swift doesn’t. A StackOverflow answer pointed me to the Box-Muller transform, a bit of trigonometry that turns two uniform random numbers into a Gaussian-distributed one.
- Weighted sampling: random.choices(range(n), weights=w) in Python. Eight lines of manual implementation in Swift. Walk the cumulative distribution, subtract weights until you cross zero.
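The last two pieces can be sketched like this; the function names are mine, and the RNG below is the system one where the port would use seeded Xoshiro:

```swift
import Foundation

// Box-Muller: two uniform samples in (0, 1] become one Gaussian sample.
func gaussianRandom<R: RandomNumberGenerator>(std: Double, using rng: inout R) -> Double {
    let u1 = Double.random(in: Double.leastNonzeroMagnitude...1, using: &rng)
    let u2 = Double.random(in: 0..<1, using: &rng)
    return std * sqrt(-2 * log(u1)) * cos(2 * .pi * u2)
}

// Weighted sampling: draw a point in [0, total), then subtract weights
// until it crosses zero. The index where it crosses is the sample.
func weightedChoice<R: RandomNumberGenerator>(weights: [Double], using rng: inout R) -> Int {
    var r = Double.random(in: 0..<weights.reduce(0, +), using: &rng)
    for (i, w) in weights.enumerated() {
        r -= w
        if r < 0 { return i }
    }
    return weights.count - 1 // guard against floating-point edge cases
}

var rng = SystemRandomNumberGenerator()
print(gaussianRandom(std: 0.08, using: &rng))
print(weightedChoice(weights: [0.1, 0.7, 0.2], using: &rng))
```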
Swift Quirks That Surprised Me
Identity-based hashing. The topological sort in backward() needs a Set<Value> to track visited nodes. Python’s set() uses object identity for classes by default. Swift needs explicit Hashable conformance:
```swift
extension Value: Hashable {
    static func == (lhs: Value, rhs: Value) -> Bool { lhs === rhs }
    func hash(into hasher: inout Hasher) {
        hasher.combine(ObjectIdentifier(self))
    }
}
```
Using === (identity) instead of == (value). Two Value objects with the same .data and .grad are NOT the same node in the computation graph. Value-based equality would break the topological sort.
Cross-platform C math. Python: import math. Swift on macOS: import Darwin. But I wanted the port to run on Linux too, so:
```swift
#if canImport(Darwin)
import Darwin
#elseif canImport(Glibc)
import Glibc
#elseif canImport(Musl)
import Musl
#endif
```
If I’d only targeted macOS, it would have been a single import Darwin. Cross-platform Swift means handling the different C standard library wrappers for each platform.
Performance
~350 lines of Swift. Zero dependencies. ~3.8x faster than CPython.
| | Time (1000 steps) | vs. Swift |
|---|---|---|
| Python (CPython 3.14.3) | 65.0 ± 0.6 s | 3.8x |
| Swift (release) | 17.0 ± 0.1 s | 1.0x |
M1 Max. Measured with hyperfine, 3 runs, 1 warmup. This is my version of the “trust me bro” benchmarks that big tech uses in keynotes.
The Swift port is on GitHub.
The Buzzwords, Explained
ML has a vocabulary problem, and I don’t mean the tokenizer kind. I’ve been throwing these terms around for years, often interchangeably, often confidently, often wrong. Writing this section is as much for me as it is for you.
Karpathy has a “Real stuff” section in his blog post that maps microgpt concepts to production-scale LLMs. His version assumes you already know what these words mean. Mine assumes you don’t, because I didn’t.
Model: the gpt() function plus the 4,192 numbers in stateDict. The function is the recipe, the numbers are the ingredients. A trained model is just a function with good numbers.
Architecture: the recipe without the ingredients. The gpt() function alone. It defines how to compute, but can’t do anything without parameters. GPT-2 and GPT-3 use the same recipe with different (and vastly more) ingredients. Newer frontier models, from GPT-4 onward, use Mixture of Experts.
Neural network: same thing as “model.” A function made of layers that transform numbers through matrix multiplications. The “neural” part is historical, loosely inspired by biological neurons. In practice it’s matrix math.
Parameters / Weights: the 4,192 Value objects. The numbers that change during training. “The model has 4,192 parameters” means it has 4,192 adjustable knobs.
Training: the loop that adjusts those knobs. Feed data through the model, measure how wrong it is, compute which direction to nudge each knob, nudge them, repeat.
Inference: using the trained model to generate output. No more learning, just prediction. Feed in a start token, get probabilities out, sample a character, repeat.
Token: the smallest unit the model operates on. In our case, a single character. In production LLMs, subword chunks produced by tokenizers.
Logits: the raw scores the model outputs before they become probabilities. 27 numbers, one per character. Higher score = the model thinks that character is more likely next. Softmax turns them into actual probabilities.
Softmax: takes logits and converts them to probabilities that sum to 1. exp(each) / sum(all exps). The “soft” version of “pick the maximum.”
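That formula, made runnable (the max-subtraction is a standard numerical-stability trick I added; it doesn’t change the result):

```swift
import Foundation

// exp(each) / sum(all exps), with the max subtracted first so exp() never overflows.
func softmax(_ logits: [Double]) -> [Double] {
    let maxLogit = logits.max() ?? 0
    let exps = logits.map { exp($0 - maxLogit) }
    let total = exps.reduce(0, +)
    return exps.map { $0 / total }
}

let probs = softmax([2.0, 1.0, 0.1])
print(probs) // sums to 1; the biggest logit gets the biggest probability
```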
Loss: a single number measuring how wrong the model was. Lower is better. In this code: -log(probability the model gave to the correct answer). If it was 90% confident in the right token, loss is low. If 2%, loss is high.
Gradient: how much the loss changes if you nudge a parameter. The slope. Positive gradient means “increasing this parameter makes things worse.” All 4,192 gradients together tell you which direction is downhill.
Backpropagation / Backward pass: the algorithm that computes all gradients at once by walking the computation graph from loss to parameters. The chain rule from calculus, automated.
Gradient descent: nudging parameters in the direction that reduces loss. Walk downhill. Adam is a smarter version that adapts step sizes per parameter.
Autograd: the engine that makes backpropagation automatic. The Value class, the operator overloads, backward(). You could use the same autograd engine for a completely different architecture. It has nothing to do with GPT specifically.
Transformer: the specific architecture used here, from the 2017 “Attention Is All You Need” paper. The recipe that GPT and most modern LLMs follow.
Attention: the mechanism inside the Transformer where tokens look at each other. “Given what came before, what’s relevant for predicting what comes next?” The only place in the architecture where tokens interact.
References
- microgpt.py, Andrej Karpathy (Python source)
- microgpt blog post, Andrej Karpathy
- microgpt-swift (Swift port)
- “Let’s reproduce GPT-2 (124M)”, Andrej Karpathy (YouTube)
- “Intro to Large Language Models”, Andrej Karpathy (YouTube)
- Neural Networks series, 3Blue1Brown (YouTube)
- “Attention Is All You Need”, Vaswani et al. (2017)
- “Vibe coding”, Andrej Karpathy (X)
- Mixture of Experts, Hugging Face
- Randomness collection, Point-Free
- Xoshiro256** (swift-gen), Point-Free (GitHub)
- Box-Muller Transform, Wolfram MathWorld
- SE-0202: Random Unification, Swift Evolution