Building a Minimal LLM Pretraining Framework

Over the past several weeks, I developed a fully functional, self-contained codebase for training large language models (LLMs) from scratch. Rather than relying on large frameworks, I focused on re-implementing core mechanisms of transformer models using only PyTorch.

Key Features

  • Pretraining loops with checkpointing, validation, mixed-precision training, and gradient accumulation (see the training-step sketch after this list).
  • Transformer architectures with rotary position embeddings (RoPE), no positional embeddings (NoPE), multi-head attention, and feedforward MLPs (a RoPE sketch follows below).
  • Mixture of Experts (MoE) models with both shared and routed expert configurations.
  • KV (key-value) caching for efficient decoding (sketched below).
  • Parameter-efficient fine-tuning support (e.g., LoRA).
  • Streaming dataset loading via Hugging Face Datasets.
  • Config-driven architecture scaling from small models to billion-parameter systems.
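
To make the first item concrete, the sketch below shows what a mixed-precision training step with gradient accumulation and periodic checkpointing can look like in plain PyTorch. It is illustrative only: the function, model, and loader names are placeholders, not the repository's actual API.

import torch
import torch.nn.functional as F

def train_steps(model, loader, optimizer, accum_steps=8, device="cuda"):
    # Illustrative only: bfloat16 autocast plus gradient accumulation over
    # accum_steps micro-batches, with a periodic checkpoint save.
    model.train()
    optimizer.zero_grad(set_to_none=True)
    for step, (inputs, targets) in enumerate(loader):
        inputs, targets = inputs.to(device), targets.to(device)
        with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
            logits = model(inputs)                              # (batch, seq, vocab)
            loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                                   targets.reshape(-1))
        (loss / accum_steps).backward()                         # accumulate gradients
        if (step + 1) % accum_steps == 0:
            optimizer.step()                                    # one update per accumulation window
            optimizer.zero_grad(set_to_none=True)
        if (step + 1) % 1000 == 0:                              # periodic checkpoint
            torch.save({"model": model.state_dict(),
                        "optimizer": optimizer.state_dict()}, "checkpoint.pt")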
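
RoPE encodes position by rotating pairs of query/key channels through a position-dependent angle. The sketch below uses the common rotate-half formulation with a hypothetical helper name; the repository's exact implementation may differ.

import torch

def apply_rope(x, base=10000.0):
    # x: (batch, heads, seq, head_dim) with an even head_dim.
    # Rotates channel pairs by an angle that grows with position.
    b, h, seq, d = x.shape
    half = d // 2
    freqs = base ** (-torch.arange(0, half, device=x.device) / half)       # (half,)
    angles = torch.arange(seq, device=x.device)[:, None] * freqs[None, :]  # (seq, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)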
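
KV caching avoids recomputing key and value projections for tokens that were already processed: each decoding step appends its new keys and values to a cache and attends over the whole cache. A minimal sketch, with hypothetical names:

import torch

def attend_with_cache(q_new, k_new, v_new, cache=None):
    # q_new, k_new, v_new: (batch, heads, new_tokens, head_dim) for the tokens
    # decoded this step; cache holds keys/values from earlier steps.
    if cache is not None:
        k = torch.cat([cache["k"], k_new], dim=2)   # reuse cached projections
        v = torch.cat([cache["v"], v_new], dim=2)
    else:
        k, v = k_new, v_new
    # For single-token decoding no causal mask is needed: the new token may
    # attend to everything already in the cache.
    out = torch.nn.functional.scaled_dot_product_attention(q_new, k, v)
    return out, {"k": k, "v": v}                    # updated cache for the next step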

Why Build a Framework?

Modern LLM research often abstracts away crucial design decisions behind large-scale libraries. By re-implementing from first principles, this project:

  • Deepens intuition about architecture and optimization trade-offs.
  • Provides a lightweight platform for experimenting with novel methods.
  • Surfaces internals that are often hidden, enabling easier debugging and extension.

Example Usage

Launching training after setup is as simple as:

uv run python3 train/pretrain.py

Model size, architecture, and training hyperparameters are all controlled through a JSON config file.
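
The exact schema is defined in the repository; the snippet below only illustrates the idea of a config-driven setup, and every field name and value here is hypothetical.

{
  "model": {
    "n_layers": 12,
    "n_heads": 12,
    "d_model": 768,
    "vocab_size": 32000,
    "rope": true
  },
  "training": {
    "batch_size": 32,
    "grad_accum_steps": 8,
    "lr": 3e-4,
    "max_steps": 100000,
    "mixed_precision": "bf16"
  }
}

Scaling a run up or down then amounts to editing a few numbers rather than touching the training code.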

Design Choices and Scope

  • Single-GPU focus: No multi-node or distributed training (by design).
  • Tokenizer abstraction: Integrates existing Hugging Face tokenizers without custom pre-tokenization pipelines (see the sketch after this list).
  • Dataset modularity: Assumes datasets are loaded via Hugging Face for simplicity and portability.
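
For the tokenizer and dataset choices above, the standard Hugging Face pattern is enough. The sketch below is illustrative only: the tokenizer, dataset, and sequence length are placeholders, not the repository's defaults.

from datasets import load_dataset
from transformers import AutoTokenizer

# Placeholder tokenizer and dataset; swap in whatever the config specifies.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
stream = load_dataset("wikitext", "wikitext-103-raw-v1",
                      split="train", streaming=True)

def tokenize(example):
    return tokenizer(example["text"], truncation=True, max_length=1024)

tokenized = stream.map(tokenize)   # lazily tokenizes samples as they are streamed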

Repository Link

The full codebase, setup instructions, and examples are available here.