Open-Source Project
Mini-LLM Pretraining Framework
A fully self-contained codebase for training transformer language models from scratch. Rather than depending on large frameworks, the project re-implements the core mechanisms of modern transformers using only PyTorch — the goal is to surface the design decisions that production frameworks abstract away.
- PyTorch
- Transformers
- MoE
- RoPE / NoPE
- LoRA
- KV cache
Key Features
- Pretraining loops with checkpointing, validation, mixed-precision training, and gradient accumulation (see the sketches after this list).
- Transformer architectures with rotary position embeddings (RoPE) or no positional embeddings (NoPE), multi-head attention, and feedforward MLPs.
- Mixture-of-Experts (MoE) models with both shared and routed expert configurations.
- KV-cache for efficient autoregressive decoding.
- Parameter-efficient fine-tuning (e.g. LoRA).
- Streaming dataset loading via Hugging Face Datasets.
- Config-driven scaling from small models to billion-parameter systems.
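The sketches below illustrate several of these mechanisms in minimal form. They are simplified examples written against plain PyTorch, not excerpts from the repository; names such as `model`, `loader`, and `accum_steps` are placeholders. The first shows the shape of a pretraining step with bfloat16 autocast and gradient accumulation.

```python
import torch
import torch.nn.functional as F

def train_epoch(model, optimizer, loader, accum_steps=8, device="cuda"):
    model.train()
    optimizer.zero_grad(set_to_none=True)
    for step, (inputs, targets) in enumerate(loader):
        inputs, targets = inputs.to(device), targets.to(device)
        # Mixed precision: run the forward pass in bfloat16 to save memory and time.
        with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
            logits = model(inputs)  # (batch, seq, vocab)
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
        # Gradient accumulation: average gradients over `accum_steps` micro-batches
        # to approximate a larger batch on a single GPU.
        (loss / accum_steps).backward()
        if (step + 1) % accum_steps == 0:
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            optimizer.step()
            optimizer.zero_grad(set_to_none=True)
            # Checkpointing would periodically torch.save() model and optimizer state here.
```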
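Rotary position embeddings encode position by rotating pairs of query/key features through a position-dependent angle. The sketch below is a generic formulation rather than the repository's exact variant; NoPE simply skips this step and relies on causal attention alone.

```python
import torch

def apply_rope(x, base=10000.0):
    # x: (batch, heads, seq, head_dim); rotate feature pairs by position-dependent angles.
    _, _, seq, d = x.shape
    half = d // 2
    inv_freq = 1.0 / (base ** (torch.arange(half, device=x.device).float() / half))
    angles = torch.arange(seq, device=x.device).float()[:, None] * inv_freq[None, :]  # (seq, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```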
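For the MoE layers, one common shared-plus-routed arrangement looks roughly like the following (class and argument names are illustrative): every token passes through the shared experts, while a learned router sends it to its top-k routed experts.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, n_shared=1, n_routed=8, top_k=2):
        super().__init__()
        def make_expert():
            return nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.shared = nn.ModuleList(make_expert() for _ in range(n_shared))  # always-on experts
        self.routed = nn.ModuleList(make_expert() for _ in range(n_routed))  # chosen per token
        self.router = nn.Linear(d_model, n_routed)
        self.top_k = top_k

    def forward(self, x):  # x: (tokens, d_model)
        out = sum(expert(x) for expert in self.shared)
        # Router probabilities, keeping only the top-k experts per token.
        weights, idx = F.softmax(self.router(x), dim=-1).topk(self.top_k, dim=-1)
        for e_id, expert in enumerate(self.routed):
            token_idx, slot = (idx == e_id).nonzero(as_tuple=True)  # tokens sent to this expert
            if token_idx.numel():
                out = out.index_add(0, token_idx,
                                    weights[token_idx, slot].unsqueeze(-1) * expert(x[token_idx]))
        return out
```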
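The KV-cache matters at decode time: without it, every new token re-encodes the entire prefix. The loop below assumes, hypothetically, a forward signature that accepts and returns `past_kv` and a Hugging Face-style tokenizer; the repository's actual interface may differ.

```python
import torch

@torch.no_grad()
def generate(model, tokenizer, prompt, max_new_tokens=64, device="cuda"):
    ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
    past_kv = None
    for _ in range(max_new_tokens):
        # With a cache, only the newest token needs to be fed after the first step.
        inputs = ids if past_kv is None else ids[:, -1:]
        logits, past_kv = model(inputs, past_kv=past_kv)
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy decoding
        ids = torch.cat([ids, next_id], dim=-1)
    return tokenizer.decode(ids[0])
```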
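LoRA freezes a pretrained weight matrix and learns a low-rank update beside it. One standard way to express this, with illustrative names:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen linear layer with a trainable low-rank update: W x + (alpha/r) * B A x."""
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # only the adapter matrices are trained
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init => no change at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```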
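Streaming keeps the data pipeline from having to download a corpus in full; Hugging Face Datasets iterates over remote shards on the fly. The dataset name below is only an example, not necessarily what the configs point at.

```python
from datasets import load_dataset

stream = load_dataset("HuggingFaceFW/fineweb", split="train", streaming=True)
for example in stream.take(2):
    print(example["text"][:80])  # inspect a couple of raw documents
```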
Why Build This
Modern LLM research often hides crucial design decisions behind large-scale libraries. Re-implementing from first principles:
- Deepens intuition about architectural and optimization trade-offs.
- Provides a lightweight platform for experimenting with novel methods.
- Surfaces internals that are usually hidden, making debugging and extension easier.
Usage
After setup, training launches with a single command:
uv run python3 train/pretrain.py
Model size, architecture, and training hyperparameters are all controlled through a JSON config file.
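For a sense of what this looks like, a config in this spirit might resemble the following; the field names and values here are hypothetical, not the repository's actual schema.

```json
{
  "model": { "d_model": 768, "n_layers": 12, "n_heads": 12, "vocab_size": 32000, "positional_encoding": "rope" },
  "moe": { "enabled": false, "n_shared_experts": 1, "n_routed_experts": 8, "top_k": 2 },
  "train": { "learning_rate": 3e-4, "micro_batch_size": 16, "grad_accum_steps": 8, "max_steps": 100000, "precision": "bf16" }
}
```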
Scope & Design Choices
- Single-GPU focus. No multi-node or distributed training — by design.
- Tokenizer abstraction. Integrates existing Hugging Face tokenizers without custom pre-tokenization pipelines.
- Dataset modularity. Datasets are loaded through Hugging Face for portability.
Repository
Full code, setup instructions, and examples: github.com/asantucci/language-model