The whole path, not just the model: data to generated text
Most from-scratch tutorials stop at the forward pass. This repo deliberately covers the full pipeline, so you see how a model is actually produced, not just defined. It starts by downloading The Pile and preprocessing a subset, then tokenizes with tiktoken's r50k_base encoding (the GPT-2/GPT-3 vocabulary; ChatGPT-era models use cl100k_base instead), inserts <|endoftext|> at document boundaries, and writes the token stream to HDF5 for fast streaming during training [1]. The model itself is built up from primitives: single-head and multi-head causal self-attention (16 heads by default, concatenated and passed through a projection layer), a feed-forward block implemented as a ReLU MLP with 4x expansion, and stacked transformer blocks wired with layer normalization and residual connections [1]. On top of that sits a training loop with periodic evaluation, learning-rate decay, and crash-safe checkpoints. The payoff of teaching the entire chain is that every black box gets opened: a learner who finishes the repo can point to the exact line where attention scores are masked, where embeddings are looked up, and where logits become sampled text.
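To make "the exact line where attention scores are masked" concrete, here is a minimal, dependency-free sketch of single-head causal self-attention. It is not the repo's code: the function names (softmax, causal_attention) and the list-of-lists representation are assumptions chosen for illustration, and a real implementation would use a tensor library and learned query/key/value projections. The masking idea is the same, though: scores for future positions are set to negative infinity before the softmax, so their weights come out as exactly zero.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(q, k, v):
    """Single-head causal self-attention (illustrative sketch only).

    q, k, v are lists of T vectors; q and k share dimension d.
    Returns (outputs, weights), where weights[i][j] is how much
    position i attends to position j.
    """
    T, d = len(q), len(q[0])
    scale = 1.0 / math.sqrt(d)
    weights = []
    for i in range(T):
        scores = []
        for j in range(T):
            if j > i:
                # The causal mask: position i may not look at the future.
                scores.append(float("-inf"))
            else:
                dot = sum(a * b for a, b in zip(q[i], k[j]))
                scores.append(scale * dot)
        weights.append(softmax(scores))

    # Each output is the attention-weighted sum of the value vectors.
    width = len(v[0])
    outputs = [
        [sum(weights[i][j] * v[j][c] for j in range(T)) for c in range(width)]
        for i in range(T)
    ]
    return outputs, weights


# Toy input: three positions, two dimensions, with q = k = v for brevity.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = causal_attention(x, x, x)
for row in weights:
    print([round(w, 3) for w in row])
```

The printed weight matrix is lower triangular: the first position can attend only to itself, so its output equals its own value vector, while later positions mix information from everything at or before them.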



