Building a Minimal LLM Pretraining Framework
Over the past several weeks, I developed a fully functional, self-contained codebase for training large language models (LLMs) from scratch. Rather than relying on large frameworks, I focused on re-implementing core mechanisms of transformer models using only PyTorch.
Key Features
- Pretraining loops with checkpointing, validation, mixed-precision training, and gradient accumulation (see the sketches after this list).
- Transformer architectures with rotary position embeddings (RoPE), no position embeddings (NoPE), multi-head attention, and feed-forward MLPs.
- Mixture of Experts (MoE) models with both shared and routed expert configurations.
- KV (key-value) caching for efficient decoding.
- Parameter-efficient fine-tuning support (e.g., LoRA).
- Streaming dataset loading via Hugging Face Datasets.
- Config-driven architecture scaling from small models to billion-parameter systems.
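To make these features concrete, the rest of this section shows a few minimal sketches. First, the mixed-precision plus gradient-accumulation pattern at the heart of a pretraining loop. The `model`, `loader`, and `optimizer` names are stand-ins (assuming the model returns a scalar loss when given labels), not the repo's actual API:

```python
import torch

accum_steps = 8                                   # micro-batches per optimizer step
scaler = torch.cuda.amp.GradScaler()              # rescales fp16 gradients

for step, (input_ids, labels) in enumerate(loader):
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = model(input_ids, labels=labels)    # forward pass in mixed precision
        loss = loss / accum_steps                 # average over accumulated micro-batches

    scaler.scale(loss).backward()                 # gradients accumulate across iterations

    if (step + 1) % accum_steps == 0:
        scaler.unscale_(optimizer)                # return grads to true scale for clipping
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        scaler.step(optimizer)                    # skips the update if grads overflowed
        scaler.update()
        optimizer.zero_grad(set_to_none=True)
```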
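Rotary position embeddings rotate each (even, odd) channel pair of the query and key vectors by a position-dependent angle. A standalone sketch of the idea, not the repo's exact implementation:

```python
import torch

def rope_angles(seq_len: int, head_dim: int, base: float = 10000.0):
    # One rotation frequency per (even, odd) channel pair.
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    positions = torch.arange(seq_len).float()
    freqs = torch.outer(positions, inv_freq)       # (seq_len, head_dim // 2)
    return freqs.cos(), freqs.sin()

def apply_rope(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor) -> torch.Tensor:
    # x: (batch, n_heads, seq_len, head_dim); rotate each channel pair by its angle.
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x_even * cos - x_odd * sin
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out

q = torch.randn(1, 8, 16, 64)                      # (batch, heads, seq, head_dim)
cos, sin = rope_angles(seq_len=16, head_dim=64)
q_rotated = apply_rope(q, cos, sin)                # same shape, position info mixed in
```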
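The MoE block combines a shared expert that processes every token with a router that sends each token to its top-k routed experts. A compact, deliberately loop-based (unoptimized) sketch of that configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEBlock(nn.Module):
    def __init__(self, d_model: int = 256, d_ff: int = 512, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        def mlp():
            return nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
        self.top_k = top_k
        self.shared = mlp()                                   # always-on shared expert
        self.experts = nn.ModuleList([mlp() for _ in range(n_experts)])
        self.router = nn.Linear(d_model, n_experts, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:      # x: (batch, seq, d_model)
        scores = F.softmax(self.router(x), dim=-1)            # (batch, seq, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)        # top-k experts per token
        weights = weights / weights.sum(dim=-1, keepdim=True)
        routed = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            for k in range(self.top_k):
                mask = idx[..., k] == e                       # tokens whose k-th choice is expert e
                if mask.any():
                    routed[mask] += weights[..., k][mask].unsqueeze(-1) * expert(x[mask])
        return self.shared(x) + routed                        # shared path plus routed path

y = MoEBlock()(torch.randn(2, 10, 256))                       # -> (2, 10, 256)
```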
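KV caching stores the key and value tensors from earlier decoding steps, so each new token needs only one attention call against the accumulated history instead of recomputing the full prefix. A minimal sketch:

```python
import torch
import torch.nn.functional as F

def attend_with_cache(q, k_new, v_new, cache):
    # q, k_new, v_new: (batch, n_heads, 1, head_dim) for the newest token only.
    if cache is None:
        k, v = k_new, v_new
    else:
        k = torch.cat([cache["k"], k_new], dim=2)   # append along the sequence axis
        v = torch.cat([cache["v"], v_new], dim=2)
    out = F.scaled_dot_product_attention(q, k, v)   # new token attends over the full history
    return out, {"k": k, "v": v}                    # return the grown cache

cache = None
for _ in range(4):                                  # decode four tokens one at a time
    q, k, v = (torch.randn(1, 8, 1, 64) for _ in range(3))
    out, cache = attend_with_cache(q, k, v, cache)
print(cache["k"].shape)                             # torch.Size([1, 8, 4, 64])
```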
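For parameter-efficient fine-tuning, LoRA freezes a pretrained linear layer and learns a low-rank update alongside it. A sketch of the wrapper idea, not the repo's fine-tuning API:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                 # freeze the pretrained layer
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))   # zero init: no change at start
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())

layer = LoRALinear(nn.Linear(512, 512))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)                                    # 8192: only the low-rank A and B matrices
```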
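Finally, streaming via Hugging Face Datasets lets training start without downloading a full corpus; the dataset and tokenizer names below are examples, not necessarily the project's defaults:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
stream = load_dataset("allenai/c4", "en", split="train", streaming=True)

for i, example in enumerate(stream):
    ids = tokenizer(example["text"], truncation=True, max_length=1024)["input_ids"]
    # ...pack `ids` into fixed-length training sequences here...
    if i == 2:
        break
```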
Why Build a Framework?
Modern LLM research often abstracts away crucial design decisions behind large-scale libraries. By re-implementing from first principles, this project:
- Deepens intuition about architecture and optimization trade-offs.
- Provides a lightweight platform for experimenting with novel methods.
- Surfaces internals that are often hidden, enabling easier debugging and extension.
Example Usage
Launching training after setup is as simple as:
```
uv run python3 train/pretrain.py
```
Model size, architecture, and training hyperparameters are all controlled through a JSON config file.
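As a rough illustration of how such a config can drive scaling (the field names below are hypothetical, not the repo's actual schema), the same code path can build anything from a toy model to a billion-parameter one just by swapping JSON values:

```python
import json
from dataclasses import dataclass

# Hypothetical schema -- the real config file may use different field names.
@dataclass
class ModelConfig:
    vocab_size: int = 32000
    d_model: int = 768
    n_layers: int = 12
    n_heads: int = 12
    max_seq_len: int = 2048
    learning_rate: float = 3e-4

def load_config(path: str) -> ModelConfig:
    with open(path) as f:
        return ModelConfig(**json.load(f))

# A larger run is then just a different JSON file, e.g. one with
# {"d_model": 2048, "n_layers": 24, "n_heads": 16, ...} and the same training code.
```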
Design Choices and Scope
- Single-GPU focus: no multi-node or distributed training (by design).
- Tokenizer abstraction: Integrates existing Hugging Face tokenizers without custom pre-tokenization pipelines.
- Dataset modularity: Assumes datasets are loaded via Hugging Face for simplicity and portability.
Repository Link
The full codebase, setup instructions, and examples are available here.