The 243-Line Python Secret: Training GPT From Scratch—No Libraries, Just Pure Math
The Pure Algorithm: Deconstructing GPT in 243 Lines
The landscape of modern Artificial Intelligence is often obscured by layers of abstraction—powerful frameworks like PyTorch and TensorFlow that handle the gritty calculus of deep learning behind the scenes. However, as shared by @karpathy on February 11, 2026, at 9:14 PM UTC, a radical act of demystification has occurred: the entire core of a Generative Pre-trained Transformer (GPT) model, capable of both training and inference, has been distilled into a mere 243 lines of pure, dependency-free Python code.
The Core Philosophy: Isolating the Essential Mechanics
This project isn't about creating the fastest or largest language model; it’s about surgical precision. The driving philosophy here is to isolate the absolute mathematical mechanics underpinning the Transformer architecture. By stripping away all optimizations, custom CUDA kernels, and third-party library hooks, what remains is the irreducible essence of how attention and sequence modeling function mathematically. It forces the reader to confront the core equations that drive modern generative AI.
Why 243 Lines? The Minimum Viable Model
The precise count of 243 lines is not arbitrary. It represents the calculated minimum required to execute both the forward pass (prediction) and the backward pass (learning/gradient updates) for a small, yet functional, Transformer block. This constraint acts as a powerful pedagogical tool.
- The Crux: Every line serves a direct mathematical purpose: matrix multiplication, masking, normalization, or activation.
- Contrast with Standard Frameworks: In a standard PyTorch setup, the same functionality is spread across boilerplate for data loaders, device placement, and automatic differentiation, with thousands of lines of framework code doing the heavy lifting underneath. Here, those steps are implemented manually, revealing the work the frameworks normally hide.
Unveiling the NanoGPT Core: Training from Zero
To achieve this purity, every component usually delegated to a specialized library had to be reconstructed from fundamental Python operations: the array math that would normally be handed to NumPy is written out as loops and basic arithmetic over plain lists.
Tokenization and Vocabulary: Mapping Language to Numbers
Without relying on established tokenizers like BPE or WordPiece from Hugging Face, the implementation defaults to the simplest form of text representation: character-level mapping. Raw text is converted into integers based on a predefined vocabulary derived solely from the training corpus. This highlights the foundational step: computers only understand numbers, and the first hurdle is a direct, manual mapping of characters to indices, establishing the input space for the network.
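As a rough illustration, a character-level mapping of this kind can be built in a few lines of dependency-free Python; the corpus string and the encode/decode helper names below are illustrative, not taken from the original script.

```python
# Minimal character-level tokenizer sketch (illustrative names, toy corpus).
corpus = "hello world"                          # stand-in for the real training text
vocab = sorted(set(corpus))                     # unique characters define the vocabulary
stoi = {ch: i for i, ch in enumerate(vocab)}    # character -> integer index
itos = {i: ch for ch, i in stoi.items()}        # integer index -> character

def encode(text):
    return [stoi[ch] for ch in text]

def decode(indices):
    return "".join(itos[i] for i in indices)

print(encode("hello"))            # [3, 2, 4, 4, 5] for this toy corpus
print(decode(encode("hello")))    # "hello"
```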
Positional Encoding: The Math of Sequence
One of the most critical innovations of the Transformer is its ability to understand word order without recurrence. This is handled entirely by positional encodings, which are mathematically hardcoded.
- Trigonometric Implementation: The 243-line version explicitly computes these positional embeddings, typically using sine and cosine functions of varying frequencies, and adds them directly to the token embeddings. This reveals that sequence understanding is simply an additive mathematical signal injected into the input representation (see the sketch below).
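A minimal sketch of that sinusoidal scheme in plain Python might look like the following; block_size and d_model are illustrative hyperparameter names, not values from the original script.

```python
import math

def positional_encoding(block_size, d_model):
    # one row of sin/cos values per position, computed with nothing but the math module
    pe = [[0.0] * d_model for _ in range(block_size)]
    for pos in range(block_size):
        for i in range(0, d_model, 2):
            freq = 1.0 / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(pos * freq)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(pos * freq)
    return pe

def add_positions(token_emb, pe):
    # the positional signal is simply added, element by element, to the token embeddings
    return [[t + p for t, p in zip(e_row, p_row)] for e_row, p_row in zip(token_emb, pe)]
```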
The Self-Attention Mechanism: The Engine of Context
The heart of the GPT model is the self-attention mechanism. In this minimalist implementation, the complex steps are laid bare (and sketched in code after this list):
- Query (Q), Key (K), Value (V) Projections: Simple matrix multiplications map the input embeddings into these three conceptual spaces.
- Scoring: The Q and K matrices are multiplied ($\text{QK}^T$), and the resulting scores are scaled by $\sqrt{d_k}$, the square root of the key dimension.
- Masked Softmax: Crucially, the causal mask—the mechanism ensuring the model cannot look at future tokens—is manually enforced before the final softmax normalization.
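Under those assumptions, a single attention head can be written against plain Python lists; the helper names and the single-head simplification below are illustrative rather than the original code.

```python
import math

def matmul(A, B):
    # naive matrix multiply over lists of lists
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def causal_self_attention(x, Wq, Wk, Wv):
    # x is a T x d list of lists; Wq, Wk, Wv are d x d projection matrices
    T, d = len(x), len(x[0])
    Q, K, V = matmul(x, Wq), matmul(x, Wk), matmul(x, Wv)
    # scores: Q K^T scaled by sqrt(d); the causal mask blanks out future positions
    scores = [[sum(q * k for q, k in zip(Q[t], K[s])) / math.sqrt(d)
               if s <= t else float("-inf")
               for s in range(T)]
              for t in range(T)]
    weights = [softmax(row) for row in scores]   # masked softmax, row by row
    return matmul(weights, V)                    # weighted sum of the value vectors
```

Setting future positions to negative infinity before the softmax is exactly how the causal mask guarantees each token attends only to itself and its predecessors.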
Feed-Forward Network (FFN): Amplification and Refinement
Following attention, the data passes through a standard Feed-Forward Network. Even here, purity is maintained: residual connections are explicitly added, and the non-linear activation function (often GELU or ReLU) is calculated step-by-step, emphasizing the additive nature of information flow and stabilization techniques like layer normalization.
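The pieces named here are small enough to write out directly. In the sketch below, W1 (hidden-by-d) and W2 (d-by-hidden) are stored as lists of rows, and all names are illustrative rather than lifted from the original script.

```python
import math

def gelu(v):
    # tanh approximation of GELU, applied element-wise
    return [0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))
            for x in v]

def layer_norm(v, eps=1e-5):
    # layer normalization (learnable scale/shift omitted for brevity)
    mean = sum(v) / len(v)
    var = sum((x - mean) ** 2 for x in v) / len(v)
    return [(x - mean) / math.sqrt(var + eps) for x in v]

def ffn(v, W1, b1, W2, b2):
    # expand to the hidden dimension, apply the non-linearity, project back down
    hidden = gelu([sum(w * x for w, x in zip(row, v)) + b for row, b in zip(W1, b1)])
    out = [sum(w * h for w, h in zip(row, hidden)) + b for row, b in zip(W2, b2)]
    # residual connection: add the sub-layer's input back onto its output
    return [x + o for x, o in zip(v, out)]
```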
The Training Loop: Gradient Descent Without the Magic
Perhaps the most telling aspect of this 243-line wonder is how it handles the actual learning process, bypassing PyTorch’s autograd.
Forward Pass Logic: Chaining the Operations
The forward pass involves a sequential execution of the custom Transformer blocks, culminating in the output layer that generates logits across the vocabulary space. Each step relies on the mathematically defined output of the previous step, creating a long chain of differentiable operations.
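Reusing the illustrative helpers from the sketches above (add_positions, causal_self_attention, layer_norm, ffn), that chain might be expressed as a single function; the params dictionary layout here is an assumption for illustration only.

```python
def forward(tokens, params):
    # token embedding lookup plus the additive positional signal
    x = add_positions([params["wte"][t] for t in tokens], params["pe"])
    for block in params["blocks"]:
        # attention sub-layer with a residual connection around it
        attn = causal_self_attention([layer_norm(v) for v in x],
                                     block["Wq"], block["Wk"], block["Wv"])
        x = [[a + b for a, b in zip(x_row, a_row)] for x_row, a_row in zip(x, attn)]
        # feed-forward sub-layer (the ffn sketch adds its own residual)
        x = [ffn(layer_norm(v), block["W1"], block["b1"], block["W2"], block["b2"]) for v in x]
    # final projection onto the vocabulary: one row of logits per input position
    return [[sum(w * xi for w, xi in zip(row, v)) for row in params["lm_head"]] for v in x]
```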
Loss Calculation: Measuring the Error Explicitly
The error signal, the cross-entropy loss, is calculated directly. This involves manually comparing the model’s predicted probability distribution (the output logits after a final softmax) against the one-hot representation of the true next token. This simple mathematical distance calculation is the objective function the entire network strives to minimize.
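Because the target is one-hot, cross-entropy collapses to the negative log-probability assigned to the true next token. A self-contained sketch with toy numbers:

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target_index):
    # with a one-hot target, the loss is just -log of the probability of the true token
    probs = softmax(logits)
    return -math.log(probs[target_index])

logits = [2.0, -1.0, 0.5]           # raw scores over a toy 3-token vocabulary
print(cross_entropy(logits, 0))     # ~0.24: the model already favours token 0
```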
Backward Pass Implementation: The Chain Rule in Action
This is where the minimalist approach demands the deepest mathematical insight. Instead of calling .backward(), the gradients must be manually calculated and propagated backward through every operation—from the loss function, through the final layer’s weights, back through the masking operation, and finally out through the initial Q, K, V transformations. This manual backpropagation vividly illustrates the calculus underpinning deep learning optimization.
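A self-contained sketch of one such manual step, using toy numbers: the gradient of softmax-plus-cross-entropy with respect to the logits is the well-known "probabilities minus one-hot", and the chain rule then carries it back through a final linear layer. The variable names are illustrative.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# forward: logits = W @ h for a single hidden vector h (W stored as one row per vocab entry)
h = [0.2, -0.4, 0.7]
W = [[0.1, 0.3, -0.2], [0.0, -0.5, 0.4]]            # toy 2-token vocabulary, d = 3
logits = [sum(w * x for w, x in zip(row, h)) for row in W]
probs = softmax(logits)

# backward step 1: dL/dlogits for softmax + cross-entropy is simply probs - one_hot(target)
target = 1
dlogits = [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]

# backward step 2: chain rule through the linear layer,
# dL/dW[i][j] = dlogits[i] * h[j]  and  dL/dh[j] = sum_i dlogits[i] * W[i][j]
dW = [[dl * x for x in h] for dl in dlogits]
dh = [sum(dlogits[i] * W[i][j] for i in range(len(W))) for j in range(len(h))]
# dh would be propagated further back through the earlier layers

# a single step of plain gradient descent on the weights
lr = 0.1
W = [[w - lr * g for w, g in zip(w_row, g_row)] for w_row, g_row in zip(W, dW)]
```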
Inference: Generating Text One Token at a Time
Once trained, the model shifts from error correction to creation, a process entirely dependent on iterating the forward pass.
Sampling Strategy: Introducing Controlled Randomness
For text generation, the final logits are converted into probabilities. The resulting text is not always the single most likely word (greedy sampling), as that often leads to repetitive text. Instead, the implementation utilizes methods like temperature scaling or top-k/top-p sampling, introduced as simple mathematical adjustments to the probability distribution before selection.
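A sketch of temperature scaling combined with top-k filtering, using only the standard-library math and random modules; the temperature and k values are illustrative defaults, not settings from the original script.

```python
import math
import random

def sample_next_token(logits, temperature=0.8, k=3):
    scaled = [v / temperature for v in logits]        # temperature < 1 sharpens the distribution
    cutoff = sorted(scaled, reverse=True)[k - 1]      # keep only the k highest-scoring tokens
    scaled = [v if v >= cutoff else float("-inf") for v in scaled]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # draw one token index according to the adjusted distribution
    return random.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [1.2, 0.3, -0.5, 2.1, 0.0]                   # toy scores over a 5-token vocabulary
print(sample_next_token(logits))                      # most often 3, sometimes 0
```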
Iterative Generation: The Feedback Loop
The core inference loop is elegant:
- Feed the starting prompt tokens into the model.
- Generate the probability distribution for the next token.
- Sample the next token based on the distribution.
- Crucially: Append the newly sampled token to the input sequence.
- Repeat until a stopping condition (like an <EOS> token or length limit) is met.
This simple feedback structure demonstrates how context is built dynamically, one mathematically determined token at a time.
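Put together, and reusing the illustrative forward and sample_next_token helpers sketched earlier, the loop might read as follows; max_new_tokens, block_size, and eos_id are assumed names for this sketch.

```python
def generate(prompt_tokens, params, max_new_tokens=100, block_size=64, eos_id=None):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        context = tokens[-block_size:]            # clip the context to the model's window
        logits = forward(context, params)[-1]     # only the distribution for the next token matters
        next_token = sample_next_token(logits)
        tokens.append(next_token)                 # feed the sample straight back into the input
        if eos_id is not None and next_token == eos_id:
            break                                 # optional end-of-sequence stop
    return tokens
```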
Beyond Efficiency: The Pedagogical Value of Purity
Why undertake such an exercise, sacrificing computational speed for code brevity? The motivation is profound clarity.
Clarity Over Speed: The Cost of Abstraction
While this 243-line implementation will train significantly slower than a library-optimized counterpart, the trade-off yields unparalleled clarity. It transforms GPT from a black-box algorithm into a transparent mathematical procedure. For anyone seeking to truly invent within the Transformer space, this code serves as the foundational blueprint.
Accessibility: Democratizing Understanding
For researchers or students in environments with limited computational access or strict dependency restrictions, this dependency-free nature ensures that the fundamental concepts of attention, masking, and backpropagation remain accessible. It proves that the breakthrough was mathematical, not infrastructural.
Future Implications: Informing Architectural Innovation
By understanding the minimum required syntax for Transformers, researchers are better equipped to innovate beyond the standard structure. When an iteration is built on a known, minimal base, the impact of any modification—whether to attention scoring or normalization placement—becomes immediately traceable to its mathematical consequence. This purity fuels deeper architectural exploration.
This report is based on the update shared on X. We've synthesized the core insights to keep you ahead of the curve.
