Building a GPT-style transformer from scratch

Strategic Overview

  • 01. FareedKhan-dev/train-llm-from-scratch builds a GPT-style transformer in plain PyTorch with no high-level libraries, walking the full path from raw data download and preprocessing through training to text generation.
  • 02. Training data comes from The Pile, an 825GB open dataset spanning 22 sub-datasets; the project uses a 5-10% subset, tokenized with tiktoken (r50k_base) and stored in HDF5.
  • 03. A ~13M-parameter model trains within one day on a single free Tesla T4 (16GB), roughly the scale at which output begins to show correct grammar and spelling; the same code scales to multi-billion-parameter configs by changing a few config values.
  • 04. The repo extends past pretraining into post-training, with guides for SFT, a reward model, PPO, DPO, and GRPO.

The whole path, not just the model: data to generated text

Most from-scratch tutorials stop at the forward pass. This repo deliberately covers the full pipeline so you see how a model is actually produced, not just defined. It starts with downloading The Pile and preprocessing a subset, then tokenizes with tiktoken's r50k_base encoding (the roughly 50k-token BPE vocabulary shared by GPT-2 and the original GPT-3 models), inserts <|endoftext|> at document boundaries, and writes the token stream to HDF5 for fast streaming during training [1]. The model itself is built up from primitives: single and multi-head causal self-attention (16 heads by default, concatenated through a projection layer), a 4x-expansion ReLU MLP feed-forward block, and stacked transformer blocks wired with layer norm and residual connections [1]. On top of that sits a training loop with periodic evaluation, learning-rate decay, and crash-safe checkpoints. The payoff of teaching the entire chain is that every black box gets opened: a learner who finishes the repo can point to the exact line where attention scores are masked, where embeddings are looked up, and where logits become sampled text.
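
To make the masking step concrete, here is a minimal, dependency-free sketch of single-head causal self-attention in plain Python. It is not the repo's code, which uses PyTorch tensors and 16 heads by default; the names `softmax` and `causal_attention` and the toy inputs are illustrative assumptions, and the learned query/key/value projections are left out to keep the mask visible.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats (entries may be -inf)."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(q, k, v):
    """Single-head scaled dot-product attention with a causal mask.

    q, k, v are lists of T vectors (already projected; hypothetical inputs).
    Returns (outputs, weights), where weights[i][j] is how much position i
    attends to position j.
    """
    T, d = len(q), len(q[0])
    scale = 1.0 / math.sqrt(d)
    outputs, weights = [], []
    for i in range(T):
        # Scaled dot product of query i with every key j.
        scores = [sum(a * b for a, b in zip(q[i], k[j])) * scale for j in range(T)]
        # Causal mask: position i may not look at any later position j > i.
        scores = [s if j <= i else float("-inf") for j, s in enumerate(scores)]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of the value vectors.
        outputs.append([sum(w[j] * v[j][c] for j in range(T)) for c in range(len(v[0]))])
    return outputs, weights


# Toy sequence of three 2-dimensional "token" vectors, reused as q, k and v.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, w = causal_attention(x, x, x)
print(w[0])  # the first position can only attend to itself
```

Setting future scores to negative infinity before the softmax gives those positions exactly zero weight, which is what keeps each token's prediction from seeing the tokens that follow it.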

The accessibility collapse, and its hard ceiling

The headline shift is economic. A ~13M-parameter model trains within one day on a single free Tesla T4 (16GB) on Colab or Kaggle, and that is roughly the scale where output starts producing correct grammar and spelling [1]. The repo's own scale table shows how far the same code stretches with more hardware: a T4 handles ~13M+ parameters, an RTX 4090 (24GB) reaches ~4B, and an A100 (40GB) reaches ~6B-8B [1]. Karpathy's build-nanogpt makes the same point from the other direction: reproducing GPT-2 (124M) is now, in his words, 'a matter of ~1hr and ~$10' [2]. But the ceiling is real and steep. The repo's training corpus is a 5-10% slice of an 825GB dataset [1][3], and frontier-scale models need orders of magnitude more data and compute than a free notebook provides. What collapsed is the cost of building a working transformer end to end; what did not collapse is the cost of building a competitive one.
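
A back-of-the-envelope parameter count shows where the weights of a model at this scale actually sit. The sketch below assumes a standard GPT-2-style layout (learned position embeddings, linear layers with biases, a 4x MLP, two layer norms per block) and a hypothetical configuration; apart from the 50,257-token size of the r50k_base vocabulary, none of the numbers are taken from the repo's config, and `gpt_param_count` is an illustrative helper rather than a function from the project.

```python
def gpt_param_count(vocab_size, context_length, d_model, n_layers, tie_embeddings=False):
    """Approximate parameter count for a GPT-2-style decoder (assumed layout)."""
    tok_emb = vocab_size * d_model          # token embedding table
    pos_emb = context_length * d_model      # learned position embeddings
    # Attention: Q, K, V and output projections, each d_model x d_model plus bias.
    attn = 4 * (d_model * d_model + d_model)
    # MLP: expand to 4*d_model, then project back, both with biases.
    mlp = (d_model * 4 * d_model + 4 * d_model) + (4 * d_model * d_model + d_model)
    norms = 2 * 2 * d_model                 # two layer norms (scale + shift) per block
    per_block = attn + mlp + norms
    final_norm = 2 * d_model
    # Output head maps d_model back to the vocabulary unless it shares the embedding.
    head = 0 if tie_embeddings else vocab_size * d_model
    return tok_emb + pos_emb + n_layers * per_block + final_norm + head


# Hypothetical small configuration, chosen only for illustration.
total = gpt_param_count(vocab_size=50257, context_length=512, d_model=128, n_layers=4)
tied = gpt_param_count(vocab_size=50257, context_length=512, d_model=128, n_layers=4,
                       tie_embeddings=True)
print(f"untied: {total:,}  tied: {tied:,}")
```

Under these assumed values the two vocabulary-sized matrices account for most of the total, which is typical of very small models paired with a ~50k-token vocabulary; the transformer blocks only start to dominate as `d_model` and `n_layers` grow.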

Why 'from scratch' still matters when you could just call from_pretrained

The implicit argument of the project is pedagogical: writing the attention, multi-head projection, feed-forward, embedding, residual, and layer-norm code yourself is the only way to see how they fit together, rather than calling a high-level wrapper that hides them [1]. This mirrors the design of Karpathy's nanoGPT and build-nanogpt, which teach GPT as clean, readable PyTorch precisely so the architecture is legible rather than abstracted away [2][4]. Community reception in learning circles is positive but clear-eyed about the trade-off: the value is education, not deployment. The honest framing from practitioners is that you will learn more about backpropagation and the mechanics of training than about LLMs as products, and that building it yourself instead of loading a pretrained checkpoint is worth it specifically for the understanding it buys.

The contrarian read: understanding versus utility

Not everyone agrees the exercise is the right move for a working product. The sharpest community critique is that a tiny from-scratch model is not useful on your own data: if you actually want a model that performs on a custom task, supervised fine-tuning of an open base model is the more efficient path than pretraining 13M parameters from zero. Related objections note that vanilla transformers are data-inefficient compared with modern architectures, that the setup can be tied to a narrow Linux/CUDA environment, and that the from-scratch genre arrives years after the canonical teaching material. The repo partly answers this by extending into post-training, shipping guides for SFT, a reward model, PPO, DPO, and GRPO so the journey does not end at a raw pretrained base [1]. The resolution is to be honest about intent: build from scratch to understand the machine, fine-tune an open model to get a useful one.

Historical Context

2020-12-31
EleutherAI released The Pile, the 825GB open dataset of 22 sub-datasets that this repo draws its training corpus from.

Power Map

Key Players

Fareed Khan

Author of the FareedKhan-dev/train-llm-from-scratch repository, which implements a GPT-style transformer and the full data-to-generation pipeline in plain PyTorch.

Andrej Karpathy

Author of nanoGPT and build-nanogpt, the canonical from-scratch GPT teaching projects in raw PyTorch that this repo's approach closely mirrors.

EleutherAI

Creator of The Pile, the 825GB open dataset of 22 sub-datasets used as the training corpus for the repo.

Akshay Pachaar

AI educator who amplified the repo on X, framing building an LLM from scratch as a way to see exactly how attention, embeddings, residuals, and layer norm fit together.

Fact Check

[1] FareedKhan-dev/train-llm-from-scratch
[2] karpathy/build-nanogpt
[3] The Pile (dataset) - Wikipedia
[4] karpathy/nanoGPT

THE SIGNAL.

Analysts

On reproducing GPT-2 (124M) from scratch, Karpathy notes it is now 'a matter of ~1hr and ~$10,' underscoring how cheap and fast training a real GPT-class model has become.

Andrej Karpathy
Author of nanoGPT / build-nanogpt

The Crowd

"An ElevenLabs research engineer showed how to build an LLM from scratch in a single workshop "GPT-2 - the very same code OpenAI once called 'too dangerous for humanity.' Turns out it's just a few hundred lines of PyTorch." this is code you can run on your laptop today 16GB of"

@rewind0266

"Anthropic pays $750,000+ a year for engineers who can build LLMs from scratch. Not how to prompt them. Not how to fine-tune them. Not how to build RAG pipelines. But how to build them from scratch. This 2-hour Stanford lecture teaches you everything. Scaling laws. Data"

@sairahul12288

"Built a GPT From Scratch! (And You Can Too!) - From Zero to Modern LLM"

@u/Old-Till-4931420

"This guy literally explains how to build your own ChatGPT (for free)"

@u/Pristine-Elevator1986661

Broadcast

Let's build GPT: from scratch, in code, spelled out.

Let's reproduce GPT-2 (124M)

Build an LLM from Scratch 4: Implementing a GPT model from Scratch To Generate Text
