FP8 Tames GPT-2: 7-Year-Old Monster Now Cheaper Than MNIST, But Is the Magic Worth the Math Headache?

Antriksh Tewari · 2/8/2026 · 2-5 min read
FP8 tames GPT-2! Train this classic model faster and cheaper than ever. Discover the math headache and real-world speedup for low-precision training.

The New Economics of GPT-2: From Monster to MNIST

The landscape of large language models has shifted so dramatically that relics of the past are now essentially digital freebies. In a recent assessment shared by @karpathy on February 3, 2026, the training economics for the formidable GPT-2 model—once considered an almost insurmountable barrier to entry—have completely inverted. Through aggressive use of contemporary hardware, particularly spot instances on cutting-edge accelerators, the cost to reproduce a full training run for the 7-year-old model has plummeted to an astonishing figure, hovering around the $20 mark. This represents a staggering democratization of AI capability, placing a once-vaunted architecture within reach of hobbyists and small research labs globally.

This financial collapse transforms GPT-2 from a technological leviathan into a mere computational benchmark. Just seven years prior, the model generated significant controversy, deemed by some to be "too dangerous to release" due to its nascent but alarming fluency. Today, however, its accessibility renders it a utility comparable to fundamental machine learning datasets like MNIST. If a model that once demanded significant capital and institutional backing can now be trained for the price of a takeout meal, what does that imply for the perceived complexity and danger of the next generation of foundational models? The era of the prohibitively expensive "monster" model seems to be yielding to an age where even giants become cheap, reproducible exercises.

FP8 Integration: Speed vs. Complexity Trade-Offs

The pursuit of faster training times, even on a model as "old" as GPT-2, drove experimentation with lower-precision formats, specifically FP8 (8-bit floating point). The immediate goal was a tangible speedup, and initial measurements did show promise: a 4.3% improvement in the overall "time to GPT-2," bringing the training wall-clock time down to 2.91 hours on the test setup. This headline figure suggests that migrating to FP8 could unlock significant gains for much larger, contemporary architectures.

However, the implementation process revealed that the theoretical peak performance promised by FP8 hardware—such as double the raw FLOPS on H100s—does not translate cleanly to real-world LLM training, especially at the scale of GPT-2. The reality introduced significant friction. Researchers encountered substantial overhead from necessary scale conversions required to manage the dynamic range inherent in lower precision training. Furthermore, at the relatively modest computational scale of the GPT-2 architecture, the crucial matrix multiplications (GEMMs) were apparently not large enough to fully amortize this implementation overhead, making the trade-off questionable.
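To make that scale-conversion overhead concrete, here is a minimal sketch (an illustration, not the actual training code behind the post) of what a tensor-scaled FP8 matmul involves, assuming a recent PyTorch build that exposes the torch.float8_e4m3fn dtype; the amax reduction, the scale multiply, and the dtype casts are exactly the extra work a plain BF16 GEMM never pays.

```python
import torch

FP8_MAX = 448.0  # largest finite magnitude of torch.float8_e4m3fn

def to_fp8_tensorwise(x: torch.Tensor):
    """Cast to FP8 using a single per-tensor scale."""
    amax = x.abs().max().clamp(min=1e-12)      # probe the dynamic range
    scale = FP8_MAX / amax                     # map amax onto the FP8 range
    return (x * scale).to(torch.float8_e4m3fn), scale

# Quantize both GEMM operands, multiply, then undo the scales.
# On H100-class hardware a fused scaled-FP8 matmul kernel would run here;
# the matmul is emulated in BF16 purely to keep the sketch portable.
a, b = torch.randn(256, 512), torch.randn(512, 128)
a8, sa = to_fp8_tensorwise(a)
b8, sb = to_fp8_tensorwise(b)
out = (a8.to(torch.bfloat16) / sa) @ (b8.to(torch.bfloat16) / sb)
```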

This leads directly to the Precision Dilemma: reducing the bit-width inherently diminishes the information content carried by each number used during the forward and backward passes. Raw throughput may rise, but the quality of each individual training step suffers. A numerically poorer step may require more total steps (a longer horizon) to converge to the same result, potentially eroding the speedup gained from the hardware itself. This tension between raw speed and numerical stability defines the current struggle with deploying aggressive low-precision techniques.
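A rough way to quantify that loss of information per value: a floating-point format with $m$ stored mantissa bits carries a relative rounding error of about $2^{-(m+1)}$. The figures below are standard properties of the formats, not measurements from the post (E5M2 is the FP8 variant typically used for gradients).

$$u_{\text{BF16}} = 2^{-8} \approx 0.4\% \;\; (m = 7), \qquad u_{\text{E4M3}} = 2^{-4} \approx 6.3\% \;\; (m = 3), \qquad u_{\text{E5M2}} = 2^{-3} = 12.5\% \;\; (m = 2)$$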

Comparative Scaling Recipes: Row-wise vs. Tensor-wise

To navigate the inherent quality degradation of FP8, different scaling recipes were tested to manage the dynamic range required for stability. The results highlighted a critical divergence in how precision adjustments affect convergence.

Row-wise Scaling Performance

When applying row-wise scaling, the resulting loss curves remained remarkably close to those achieved with the more stable BF16 format, suggesting good numerical preservation. The trade-off, however, was that despite the similar quality, the step time under this recipe ended up being net slower than the baseline. It appears that maintaining numerical parity came at the cost of throughput.
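A minimal sketch of what the row-wise recipe computes (an illustration under the same PyTorch/float8_e4m3fn assumption as above, not the source's implementation): every row gets its own amax and scale, which preserves each row's dynamic range but adds a vector of reductions and scale multiplies per operand.

```python
import torch

FP8_MAX = 448.0  # largest finite magnitude of torch.float8_e4m3fn

def to_fp8_rowwise(x: torch.Tensor):
    """Cast to FP8 with one scale per row instead of one per tensor."""
    amax = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)  # per-row amax
    scale = FP8_MAX / amax                                      # shape (rows, 1)
    return (x * scale).to(torch.float8_e4m3fn), scale

x = torch.randn(256, 512)
x8, scales = to_fp8_rowwise(x)           # scales.shape == (256, 1)
x_back = x8.to(torch.bfloat16) / scales  # dequantize for inspection
```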

Tensor-wise Scaling Performance

Conversely, the tensor-wise scaling recipe immediately delivered computational gains, showing a speedup of roughly 7.3% in step time. The significant drawback, as expected from the reduced numerical robustness, was that the loss curves visibly separated from the BF16 benchmark, indicating that each individual training step was of demonstrably worse quality.
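The failure mode is easy to reproduce in isolation. In the hypothetical experiment below (same PyTorch/float8_e4m3fn assumption; the magnitudes are contrived for illustration), a single large-magnitude row forces the shared tensor-wise scale so low that a quiet row underflows to zero, while per-row scales keep both rows at ordinary FP8 rounding error.

```python
import torch

FP8_MAX = 448.0  # largest finite magnitude of torch.float8_e4m3fn

def roundtrip(x: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Quantize to FP8 with the given scale(s), then dequantize."""
    return (x * scale).to(torch.float8_e4m3fn).to(torch.float32) / scale

def mean_rel_error(x: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
    """Per-row mean relative error of the quantized round-trip."""
    return ((x - q).abs() / x.abs().clamp(min=1e-12)).mean(dim=-1)

# Two rows with wildly different magnitudes, e.g. different channels/tokens.
x = torch.stack([1000.0 * torch.randn(512),   # "loud" row
                 0.001 * torch.randn(512)])   # "quiet" row

tensor_scale = FP8_MAX / x.abs().max()                    # one shared scale
row_scale = FP8_MAX / x.abs().amax(dim=-1, keepdim=True)  # one scale per row

print(mean_rel_error(x, roundtrip(x, tensor_scale)))  # quiet row: ~100% error
print(mean_rel_error(x, roundtrip(x, row_scale)))     # both rows: a few percent
```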

Net Speedup and Horizon Adjustment

The path forward often involves a calculated gamble on convergence. The strategy employed was to accept the poorer-quality tensor-wise steps and compensate by extending the training horizon, running a larger total number of steps and betting that the per-step time savings would outweigh the extra steps needed for equivalent convergence. After tuning these recipes and horizon adjustments, the overall realized net speedup settled around a modest 5%. While positive, this result falls short of the transformative gains often associated with moving to next-generation precisions.
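As a sanity check on how these figures fit together: with a per-step speedup $s$ and a fractional horizon extension $h$, the net speedup is $(1+s)/(1+h) - 1$. The implied horizon extension of roughly 2% below is back-solved from the reported numbers, not stated in the post.

$$\text{net} = \frac{1+s}{1+h} - 1 \quad\Longrightarrow\quad h = \frac{1+s}{1+\text{net}} - 1 \approx \frac{1.073}{1.05} - 1 \approx 2.2\%$$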

Benchmarking Against Larger Models and Future Outlook

The modest 5% net speedup achieved on GPT-2 raises critical questions when benchmarked against broader industry results. On significantly larger, more current models, the potential of FP8 looks much closer to realization. The torchao paper, for instance, reports an impressive 25% speedup for FP8 training on the Llama 3-8B model. This contrasts sharply with the ~7.3% raw step-time speedup measured here (before accounting for the longer horizon needed to preserve model quality), suggesting that the overhead challenges are far more pronounced on smaller models, where the computational intensity per parameter update is lower.
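A back-of-the-envelope model (ignoring kernel-launch and memory-system effects, and not taken from either source) makes the scale argument plausible: for an $M \times K$ by $K \times N$ GEMM, the quantization work touches roughly $MK + KN$ elements while the matmul performs $2MNK$ FLOPs, so the relative overhead shrinks as the matrices grow.

$$\frac{\text{scaling work}}{\text{GEMM FLOPs}} \approx \frac{MK + KN}{2MNK} = \frac{1}{2N} + \frac{1}{2M}$$

With Llama 3-8B's hidden dimension roughly five times GPT-2's, the same FP8 bookkeeping is spread over far more useful compute.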

The saga of FP8 implementation is clearly unfinished. While the low-hanging fruit of raw speedup on a foundational model like GPT-2 proved thorny due to implementation overhead and numerical complexity, the path forward likely involves surgical precision. Future gains will almost certainly come from highly selective application—choosing only the layers where massive GEMMs dominate computation and where numerical stability can be meticulously managed—rather than a blanket application across the entire network. The quest for perfect speed without sacrificing accuracy continues, pushing the boundaries of numerical engineering one layer at a time.
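A rough sketch of what such selective application could look like in PyTorch follows; the module names, the size threshold, and the filter policy are illustrative assumptions rather than anything from the post, and the resulting list would be handed to whichever FP8 conversion utility is in use (libraries such as torchao support filtering modules this way).

```python
import torch.nn as nn

class TinyBlock(nn.Module):
    """Stand-in for a transformer block, just to have linears of mixed sizes."""
    def __init__(self, d_model: int = 768):
        super().__init__()
        self.attn_proj = nn.Linear(d_model, 3 * d_model)  # large GEMM
        self.mlp_up = nn.Linear(d_model, 4 * d_model)     # large GEMM
        self.gate = nn.Linear(d_model, 1)                 # tiny GEMM

# Arbitrary placeholder threshold, not a tuned value from the post.
MIN_WEIGHT_ELEMENTS = 1 << 20

def should_use_fp8(module: nn.Module) -> bool:
    """Pick only linears whose GEMMs are big enough to amortize FP8 overhead."""
    return (isinstance(module, nn.Linear)
            and module.in_features * module.out_features >= MIN_WEIGHT_ELEMENTS)

model = TinyBlock()
fp8_targets = [name for name, m in model.named_modules() if should_use_fp8(m)]
print(fp8_targets)  # ['attn_proj', 'mlp_up'] -- 'gate' stays in BF16
```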


Source: https://x.com/karpathy/status/2018804068874064198

