Building a Minimal LLM Pretraining Framework
Over the past several weeks, I developed a fully functional, self-contained codebase for training large language models (LLMs) from scratch. Rather than relying on large frameworks, I focused on re-implementing core mechanisms of transformer models using only PyTorch.
Key Features
- Pretraining loops with checkpointing, validation, mixed-precision training, and gradient accumulation (see the sketches after this list).
- Transformer architectures with rotary position embeddings (RoPE), no position embeddings (NoPE), multi-head attention, and feed-forward MLPs.
- Mixture of Experts (MoE) models with both shared and routed expert configurations.
- KV (key-value) caching for efficient decoding.
- Parameter-efficient fine-tuning support (e.g., LoRA).
- Streaming dataset loading via Hugging Face Datasets.
- Config-driven architecture scaling from small models to billion-parameter systems.
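To make these features concrete, the rest of this section shows a few minimal sketches. First, the mixed-precision plus gradient-accumulation pattern at the heart of a pretraining loop. The `model`, `loader`, and `optimizer` names are stand-ins (assuming the model returns a scalar loss when given labels), not the repo's actual API:

```python
import torch

accum_steps = 8                                   # micro-batches per optimizer step
scaler = torch.cuda.amp.GradScaler()              # rescales fp16 gradients

for step, (input_ids, labels) in enumerate(loader):
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = model(input_ids, labels=labels)    # forward pass in mixed precision
        loss = loss / accum_steps                 # average over accumulated micro-batches

    scaler.scale(loss).backward()                 # gradients accumulate across iterations

    if (step + 1) % accum_steps == 0:
        scaler.unscale_(optimizer)                # return grads to true scale for clipping
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        scaler.step(optimizer)                    # skips the update if grads overflowed
        scaler.update()
        optimizer.zero_grad(set_to_none=True)
```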
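Rotary position embeddings rotate each (even, odd) channel pair of the query and key vectors by a position-dependent angle. A standalone sketch of the idea, not the repo's exact implementation:

```python
import torch

def rope_angles(seq_len: int, head_dim: int, base: float = 10000.0):
    # One rotation frequency per (even, odd) channel pair.
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    positions = torch.arange(seq_len).float()
    freqs = torch.outer(positions, inv_freq)       # (seq_len, head_dim // 2)
    return freqs.cos(), freqs.sin()

def apply_rope(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor) -> torch.Tensor:
    # x: (batch, n_heads, seq_len, head_dim); rotate each channel pair by its angle.
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x_even * cos - x_odd * sin
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out

q = torch.randn(1, 8, 16, 64)                      # (batch, heads, seq, head_dim)
cos, sin = rope_angles(seq_len=16, head_dim=64)
q_rotated = apply_rope(q, cos, sin)                # same shape, position info mixed in
```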
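The MoE block combines a shared expert that processes every token with a router that sends each token to its top-k routed experts. A compact, deliberately loop-based (unoptimized) sketch of that configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEBlock(nn.Module):
    def __init__(self, d_model: int = 256, d_ff: int = 512, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        def mlp():
            return nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
        self.top_k = top_k
        self.shared = mlp()                                   # always-on shared expert
        self.experts = nn.ModuleList([mlp() for _ in range(n_experts)])
        self.router = nn.Linear(d_model, n_experts, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:      # x: (batch, seq, d_model)
        scores = F.softmax(self.router(x), dim=-1)            # (batch, seq, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)        # top-k experts per token
        weights = weights / weights.sum(dim=-1, keepdim=True)
        routed = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            for k in range(self.top_k):
                mask = idx[..., k] == e                       # tokens whose k-th choice is expert e
                if mask.any():
                    routed[mask] += weights[..., k][mask].unsqueeze(-1) * expert(x[mask])
        return self.shared(x) + routed                        # shared path plus routed path

y = MoEBlock()(torch.randn(2, 10, 256))                       # -> (2, 10, 256)
```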
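KV caching stores the key and value tensors from earlier decoding steps, so each new token needs only one attention call against the accumulated history instead of recomputing the full prefix. A minimal sketch:

```python
import torch
import torch.nn.functional as F

def attend_with_cache(q, k_new, v_new, cache):
    # q, k_new, v_new: (batch, n_heads, 1, head_dim) for the newest token only.
    if cache is None:
        k, v = k_new, v_new
    else:
        k = torch.cat([cache["k"], k_new], dim=2)   # append along the sequence axis
        v = torch.cat([cache["v"], v_new], dim=2)
    out = F.scaled_dot_product_attention(q, k, v)   # new token attends over the full history
    return out, {"k": k, "v": v}                    # return the grown cache

cache = None
for _ in range(4):                                  # decode four tokens one at a time
    q, k, v = (torch.randn(1, 8, 1, 64) for _ in range(3))
    out, cache = attend_with_cache(q, k, v, cache)
print(cache["k"].shape)                             # torch.Size([1, 8, 4, 64])
```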
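For parameter-efficient fine-tuning, LoRA freezes a pretrained linear layer and learns a low-rank update alongside it. A sketch of the wrapper idea, not the repo's fine-tuning API:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                 # freeze the pretrained layer
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))   # zero init: no change at start
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())

layer = LoRALinear(nn.Linear(512, 512))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)                                    # 8192: only the low-rank A and B matrices
```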
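Finally, streaming via Hugging Face Datasets lets training start without downloading a full corpus; the dataset and tokenizer names below are examples, not necessarily the project's defaults:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
stream = load_dataset("allenai/c4", "en", split="train", streaming=True)

for i, example in enumerate(stream):
    ids = tokenizer(example["text"], truncation=True, max_length=1024)["input_ids"]
    # ...pack `ids` into fixed-length training sequences here...
    if i == 2:
        break
```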
Why Build a Framework?
Modern LLM research often abstracts away crucial design decisions behind large-scale libraries. By re-implementing from first principles, this project:
- Deepens intuition about architecture and optimization trade-offs.
- Provides a lightweight platform for experimenting with novel methods.
- Surfaces internals that are often hidden, enabling easier debugging and extension.
Example Usage
Launching training after setup is as simple as:
```
uv run python3 train/pretrain.py
```
Model size, architecture, and training hyperparameters are all controlled through a JSON config file.
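As a rough illustration of how such a config can drive scaling (the field names below are hypothetical, not the repo's actual schema), the same code path can build anything from a toy model to a billion-parameter one just by swapping JSON values:

```python
import json
from dataclasses import dataclass

# Hypothetical schema -- the real config file may use different field names.
@dataclass
class ModelConfig:
    vocab_size: int = 32000
    d_model: int = 768
    n_layers: int = 12
    n_heads: int = 12
    max_seq_len: int = 2048
    learning_rate: float = 3e-4

def load_config(path: str) -> ModelConfig:
    with open(path) as f:
        return ModelConfig(**json.load(f))

# A larger run is then just a different JSON file, e.g. one with
# {"d_model": 2048, "n_layers": 24, "n_heads": 16, ...} and the same training code.
```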
Design Choices and Scope
- Single-GPU focus: no multi-node or distributed training (by design).
- Tokenizer abstraction: Integrates existing Hugging Face tokenizers without custom pre-tokenization pipelines.
- Dataset modularity: Assumes datasets are loaded via Hugging Face for simplicity and portability.
Repository Link
The full codebase, setup instructions, and examples are available here.