Training Nanochat on Modal: Build Your Own ChatGPT from Scratch
View Complete Python Script
So you've fine-tuned models with LoRA, trained small GPTs, even tackled multi-GPU training. Now let's go all the way and build ChatGPT from absolute scratch: tokenizer training, base model pretraining, conversation fine-tuning, the works.
We're talking about the complete Andrej Karpathy nanochat speedrun. Everything from "raw text on the internet" to "functioning ChatGPT clone that can have conversations and use tools."
Why Nanochat?
Here's the thing - if you want to truly understand how ChatGPT works, you can't just fine-tune a pre-trained model. You need to see the entire pipeline.
Andrej Karpathy built nanochat as the most comprehensive educational implementation of modern LLM training. It's not a toy - it's the real deal, just scaled down to be understandable and trainable on reasonable hardware.
What makes nanochat special:

- Complete pipeline - Every single step from tokenizer to deployment
- Production techniques - Same methods used by OpenAI, Anthropic, etc.
- Educational focus - Clean, readable code with excellent documentation
- Proven results - Actually produces working chat models with tool use
- Reasonable scale - Train on 4-8 GPUs instead of thousands
This is basically "here's how we built ChatGPT" but in a form you can actually run and understand. The original repo assumes you have a local GPU cluster. We're taking that exact pipeline and making it run on Modal's serverless infrastructure.
What We're Building
This isn't just training a model. We're building the entire stack:
- Stage 1: Download Dataset - Grab FineWeb-edu (100B tokens of high-quality text)
- Stage 2: Train Tokenizer - Build a custom BPE tokenizer (like GPT-4 uses)
- Stage 3: Base Pretraining - Train a GPT on raw internet text
- Stage 4: Midtraining - Teach conversation format and tool use
- Stage 5: Supervised Fine-tuning - Train on specific tasks (code, math, chat)
- Stage 6: Reinforcement Learning - Optional GRPO on math problems
- Stage 7: Evaluation - Measure performance on real benchmarks
- Stage 8: Inference - Chat with your model (CLI and web UI)
The beauty of this pipeline? Each stage is independent. Screw up fine-tuning? Just re-run that step. Want to try different hyperparameters? Preprocessing is already done.
Here's what the complete flow looks like:
+----------------------+
|  Download FineWeb    |   (CPU - cheap, run once)
|  100B tokens         |
+----------+-----------+
           |
+----------v-----------+
|  Train Tokenizer     |   (1 GPU - 30-60 min)
|  Custom BPE 65K      |
+----------+-----------+
           |
+----------v-----------+
|  Pretrain Base       |   (4-8 GPUs - 2-4 hours)
|  Language Model      |   <-- The big one
+----------+-----------+
           |
+----------v-----------+
|  Midtrain            |   (4 GPUs - 30-45 min)
|  Conversation        |
+----------+-----------+
           |
+----------v-----------+
|  SFT                 |   (4 GPUs - 30-45 min)
|  Task-specific       |
+----------+-----------+
           |
+----------v-----------+
|  RL (Optional)       |   (4 GPUs - 30-45 min)
|  Math reasoning      |
+----------+-----------+
           |
      +----+----+
      |         |
+-----v----+ +--v-------+
|  Chat    | |  Eval    |
|  CLI/Web | |  Metrics |
+----------+ +----------+
Total time for the full speedrun: ~4-5 hours on 8x A100-80GB. Total cost: ~$100-150 for the complete pipeline.
Compare this to OpenAI spending millions... yeah, we're doing pretty well.
Getting Started

Install Modal
You know the drill:
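pip install modal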
Authenticate
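modal setup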
Or with API keys:
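# paste the credentials from your Modal dashboard (placeholders below)
modal token set --token-id <your-token-id> --token-secret <your-token-secret>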
Set Up Your Secrets
For this pipeline, you'll want:

- Hugging Face token - For downloading datasets
- Weights & Biases API key - For tracking training (highly recommended!)
Create the Modal secret:
modal secret create nanochat-secrets \
WANDB_API_KEY=xxxxxxxxxxxxx \
HUGGINGFACE_TOKEN=hf_xxxxxxxxxxxxx
Get your tokens:

- HF token: hf.co/settings/tokens
- W&B key: wandb.ai/authorize
Or use a .env file (local development):
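A minimal .env looks like this (the variable names mirror the Modal secret above):

WANDB_API_KEY=xxxxxxxxxxxxx
HUGGINGFACE_TOKEN=hf_xxxxxxxxxxxxx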
The script tries .env first, then falls back to Modal secrets.
Clone Nanochat
This is important - we need the nanochat repo locally:
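# clone it next to your Modal script
git clone https://github.com/karpathy/nanochat.git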
Your folder structure should be:
ServerLessFinetuning/
├── TrainNanochatModal.py   # Your Modal script
└── nanochat/               # Cloned repo
    ├── scripts/
    ├── nanochat/
    └── rustbpe/
Why clone it? Modal copies this entire directory into the container image. This way we have all of nanochat's scripts and utilities available.
Understanding the Pipeline
Let me walk you through what makes this different from the other tutorials. This isn't fine-tuning - this is building everything from scratch.
The Full Speedrun
Nanochat is designed as a "speedrun" - train a working ChatGPT in one go. Here's what each stage does:
Stage 1: Dataset Download
- Downloads FineWeb-edu-100B from HuggingFace
- 240 shards x 250M characters = ~60B characters
- Enough to train a 561M parameter model (Chinchilla optimal)

Stage 2: Tokenizer Training
- Trains a BPE tokenizer on 2B characters
- Creates a 65K vocab (2^16 tokens)
- Same approach as GPT-4

Stage 3: Base Pretraining
- Trains a GPT from random initialization
- Uses the Muon optimizer (better than Adam for LLMs)
- Learns language from raw text

Stage 4: Midtraining
- Teaches conversation format (user/assistant turns)
- Adds tool use (calculator)
- Trains on SmolTalk + MMLU + GSM8K

Stage 5: Supervised Fine-tuning
- Task-specific training
- MMLU (knowledge), ARC (reasoning), GSM8K (math), HumanEval (code)

Stage 6: Reinforcement Learning (Optional)
- GRPO/REINFORCE on math problems
- Improves reasoning through self-play

Stage 7: Evaluation
- CORE metric (comprehensive benchmark)
- Task-specific evals (ARC, GSM8K, HumanEval, MMLU)

Stage 8: Inference
- Chat CLI for interactive testing
- FastAPI web UI for demos
Configuration and Setup
Alright, let's dive into the code. The configuration is straightforward but important.
App and Volumes
from modal import App, Image as ModalImage, Volume, Secret
app = App("nanochat-training")
# Two volumes: one for data, one for checkpoints
data_volume = Volume.from_name("nanochat-data", create_if_missing=True)
checkpoint_volume = Volume.from_name("nanochat-checkpoints", create_if_missing=True)
VOLUME_CONFIG = {
"/data": data_volume, # Dataset and cache
"/checkpoints": checkpoint_volume, # Model checkpoints
}
Why two volumes?

- Data volume: FineWeb shards, tokenizer, eval data (~30GB)
- Checkpoint volume: Model weights, training state (~20GB)
- Separation makes it easier to manage and debug
Configuration Constants
MINUTES = 60
HOURS = 60 * 60
# GPU type - can be changed based on your needs
GPU_TYPE = "a100-80gb"
# Multi-GPU configuration (nanochat supports 1-8 GPUs)
NUM_GPUS_BASE = 4 # Base pretraining
NUM_GPUS_MID = 4 # Midtraining
NUM_GPUS_SFT = 4 # Supervised fine-tuning
NUM_GPUS_RL = 4 # Reinforcement learning
NUM_GPUS_EVAL = 4 # Evaluation
NUM_GPUS_TOKENIZER = 1 # Tokenizer (single GPU)
NUM_GPUS_INFERENCE = 1 # Inference (single GPU)
BASE_DIR = "/data/.cache/nanochat" # Everything goes here
GPU scaling:

- 1 GPU: Full pipeline takes ~24 hours (~$84)
- 4 GPUs: Full pipeline takes ~6 hours (~$96)
- 8 GPUs: Full pipeline takes ~4 hours (~$112)
The sweet spot is 4 GPUs - good parallelism without too much communication overhead.
Secrets Setup
# Try .env first, then Modal secrets
try:
nanochat_secret = Secret.from_dotenv()
print("Loaded secrets from .env file")
except Exception:
try:
nanochat_secret = Secret.from_name("nanochat-secrets")
print("Loaded secrets from Modal")
except Exception:
nanochat_secret = None
print("No secrets found - WandB logging disabled")
This is graceful - works with or without secrets. If you don't have W&B, training still works, you just don't get the nice dashboards.
Building the Container Image
This is the most complex image we've built yet. We need CUDA, PyTorch, Rust (for the tokenizer), and all of nanochat's dependencies.
Why This Image is Different
NANOCHAT_IMAGE = (
# NVIDIA CUDA 12.8 with Python 3.11
ModalImage.from_registry("nvidia/cuda:12.8.1-devel-ubuntu24.04", add_python="3.11")
# System dependencies
.apt_install("git", "build-essential", "curl", "wget", "unzip")
# Install Rust (needed for tokenizer)
.run_commands(
"curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y",
"echo 'source $HOME/.cargo/env' >> $HOME/.bashrc",
)
# Install uv (fast Python package installer)
.run_commands(
"curl -LsSf https://astral.sh/uv/install.sh | sh",
"echo 'export PATH=\"$HOME/.cargo/bin:$PATH\"' >> $HOME/.bashrc",
)
# Copy nanochat repo into the image
.add_local_dir(local_path="nanochat", remote_path="/root/nanochat", copy=True)
# Set working directory
.workdir("/root/nanochat")
# Install Python dependencies AND build Rust tokenizer
.run_commands(
"bash -c 'source $HOME/.cargo/env && uv sync && uv run maturin develop --release --manifest-path rustbpe/Cargo.toml'"
)
# Environment variables
.env({
"OMP_NUM_THREADS": "1",
"NANOCHAT_BASE_DIR": "/data/.cache/nanochat",
"HF_HOME": "/data/.cache/huggingface",
})
)
Key points:
- Rust installation: Nanochat's tokenizer is written in Rust for speed. We need the full Rust toolchain.
- uv sync: Reads pyproject.toml and installs all dependencies in a virtual environment. Much faster than pip.
- maturin develop: Builds the Rust tokenizer and makes it importable from Python. This is the magic that makes nanochat's tokenizer so fast.
- add_local_dir: Copies your local nanochat clone into the image. This is why you need to clone it first.
Build time warning: The first build takes 15-20 minutes. The Rust tokenizer compilation is the slow part. But Modal caches everything - subsequent runs are instant.
Helper Functions
Before we get to the stages, let's look at the helper functions that make everything work.
Setup Functions
def setup_base_dir():
"""Create directory structure for nanochat artifacts."""
import os
os.makedirs(BASE_DIR, exist_ok=True)
os.makedirs(f"{BASE_DIR}/base_data", exist_ok=True)
os.makedirs(f"{BASE_DIR}/tokenizer", exist_ok=True)
os.makedirs(f"{BASE_DIR}/checkpoints", exist_ok=True)
os.makedirs(f"{BASE_DIR}/eval_bundle", exist_ok=True)
os.makedirs(f"{BASE_DIR}/report", exist_ok=True)
def setup_secrets():
"""Set up environment variables from secrets."""
import os
if "WANDB_API_KEY" in os.environ:
print("WandB API key found")
else:
print("WandB API key not found - logging disabled")
if "HUGGINGFACE_TOKEN" in os.environ:
os.environ["HF_TOKEN"] = os.environ["HUGGINGFACE_TOKEN"]
print("HuggingFace token found")
else:
print("HuggingFace token not found")
Simple but important - creates all the directories nanochat expects and sets up authentication.
Torchrun Helper
def run_torchrun_command(script: str, num_gpus: int, extra_args: list = None):
"""Run nanochat script with torchrun for multi-GPU training."""
import subprocess
if extra_args is None:
extra_args = []
extra_args_str = " ".join(extra_args) if extra_args else ""
cmd = f"cd /root/nanochat && uv run torchrun --standalone --nproc_per_node={num_gpus} -m {script}"
if extra_args:
cmd += f" -- {extra_args_str}"
print(f"Running: {cmd}")
result = subprocess.run(["bash", "-c", cmd], capture_output=False, text=True)
if result.returncode != 0:
raise RuntimeError(f"Command failed with code {result.returncode}")
return result
What torchrun does:

- Spawns one process per GPU
- Sets up distributed training (NCCL backend)
- Handles rank assignment and synchronization
- Makes multi-GPU training "just work"
This is how nanochat does distributed training without writing custom distributed code.
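For example, a 4-GPU base-training call expands to a command along these lines (illustrative, following the template above):

cd /root/nanochat && uv run torchrun --standalone --nproc_per_node=4 -m scripts.base_train -- --depth=20 --device_batch_size=32 --run=dummy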
Stage 1: Dataset Download
Let's start with the data. We're downloading FineWeb-edu, which is basically "the good parts of the internet."
@app.function(
image=NANOCHAT_IMAGE,
volumes=VOLUME_CONFIG,
timeout=2 * HOURS,
# No GPU - saves money!
)
def download_dataset(num_shards: int = 240):
"""
Download FineWeb dataset shards from HuggingFace.
Each shard is ~250M characters (~100MB compressed).
- Full speedrun: 240 shards (~60B characters, ~24GB)
- Testing: 8 shards (~2B characters, ~800MB)
"""
import subprocess
setup_base_dir()
print("=" * 80)
print(f"DOWNLOADING FINEWEB DATASET - {num_shards} SHARDS")
print("=" * 80)
print(f"Total data: ~{num_shards * 250 / 1000:.1f}B characters (~{num_shards * 100 / 1024:.1f}GB)")
print()
result = subprocess.run(
["bash", "-c", f"cd /root/nanochat && uv run python -m nanochat.dataset -n {num_shards}"],
capture_output=True,
text=True,
)
print(result.stdout)
if result.stderr:
print("STDERR:", result.stderr)
if result.returncode != 0:
raise RuntimeError(f"Dataset download failed with code {result.returncode}")
data_volume.commit()
print("\n" + "=" * 80)
print(f"Downloaded {num_shards} shards successfully")
print("=" * 80)
return {
"status": "completed",
"num_shards": num_shards,
"data_dir": f"{BASE_DIR}/base_data",
}
How many shards do you need?
| Model Size | Parameters | Chinchilla Tokens | Characters Needed | Shards |
|---|---|---|---|---|
| Tiny test | ~100M | 2B | ~10B | 40 |
| Quick test | ~200M | 4B | ~20B | 80 |
| Speedrun | 561M (d20) | 11B | ~54B | 240 |
| GPT-2 grade | ~1B (d26) | 20B | ~96B | 400 |
For testing, use 8 shards. For the real speedrun, use 240.
Run it:
# Full speedrun dataset
modal run TrainNanochatModal.py::download_dataset
# Quick test dataset
modal run TrainNanochatModal.py::download_dataset --num-shards=8
First download takes 30-60 minutes depending on internet speed. Cached after that.
Stage 2: Tokenizer Training
Now we train a custom tokenizer on the FineWeb data. This is similar to how GPT-4's tokenizer was trained.
@app.function(
image=NANOCHAT_IMAGE,
gpu=f"{GPU_TYPE}:{NUM_GPUS_TOKENIZER}",
volumes=VOLUME_CONFIG,
timeout=2 * HOURS,
)
def train_tokenizer(
max_chars: int = 2_000_000_000, # 2 billion characters
vocab_size: int = 65536, # 65K vocab (2^16)
doc_cap: int = 10000, # Max chars per document
):
"""
Train a custom BPE tokenizer on FineWeb data.
Training takes 30-60 minutes on a single GPU.
"""
import subprocess
setup_base_dir()
print("=" * 80)
print("TRAINING CUSTOM BPE TOKENIZER")
print("=" * 80)
print(f"Max characters: {max_chars:,}")
print(f"Vocabulary size: {vocab_size:,}")
print(f"Document cap: {doc_cap:,}")
print()
cmd = f"cd /root/nanochat && uv run python -m scripts.tok_train --max_chars={max_chars} --vocab_size={vocab_size} --doc_cap={doc_cap}"
print(f"Running: {cmd}")
result = subprocess.run(["bash", "-c", cmd], capture_output=False, text=True)
if result.returncode != 0:
raise RuntimeError(f"Tokenizer training failed with code {result.returncode}")
data_volume.commit()
checkpoint_volume.commit()
print("\n" + "=" * 80)
print("Tokenizer training completed")
print(f"Tokenizer saved to {BASE_DIR}/tokenizer/")
print("=" * 80)
return {
"status": "completed",
"max_chars": max_chars,
"vocab_size": vocab_size,
"tokenizer_dir": f"{BASE_DIR}/tokenizer",
}
Why train a custom tokenizer?
You might wonder "why not use GPT-2's tokenizer?" Here's why:
- Domain-specific: FineWeb has different token distributions than GPT-2's training data
- Efficiency: Custom tokenizers compress better (lower bits per byte)
- Control: You decide vocab size, special tokens, etc.
- Learning: Understanding tokenization is crucial for understanding LLMs
The training process:

- Samples 2B characters from FineWeb
- Runs Byte Pair Encoding (BPE) to find common subwords
- Creates a 65K vocabulary
- Saves the tokenizer model and merges
Run it:
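modal run TrainNanochatModal.py::train_tokenizer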
Takes 30-60 minutes on a single GPU. You can also evaluate it:
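modal run TrainNanochatModal.py::evaluate_tokenizer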
This shows you the compression ratio and other metrics.
Stage 3: Base Model Pretraining
Here's the big one. We're training a GPT from scratch on internet text. This is where most of the compute goes.
Model Architecture
Nanochat uses a simple but effective architecture:
depth: int = 20 # Model depth parameter
# Derived dimensions:
model_dim = depth * 64 # 20 * 64 = 1280
num_heads = model_dim / 128 # 1280 / 128 = 10
num_layers = depth # 20 layers
# Total parameters: ~561M
Model sizes:

- depth=12: ~200M params (quick test)
- depth=20: ~561M params (default speedrun)
- depth=26: ~1B params (GPT-2 grade)
The Training Function
@app.function(
image=NANOCHAT_IMAGE,
gpu=f"{GPU_TYPE}:{NUM_GPUS_BASE}",
volumes=VOLUME_CONFIG,
secrets=[nanochat_secret] if nanochat_secret else [],
timeout=8 * HOURS,
)
def train_base_model(
depth: int = 20,
device_batch_size: int = 32,
max_iterations: int = -1, # -1 = auto from Chinchilla
wandb_run: str = "dummy",
):
"""
Pretrain the base GPT model on FineWeb.
Model sizes: depth=20 (561M params), depth=26 (1B params)
Training duration: ~2-3 hours on 8 GPUs, ~16-24 hours on 1 GPU
"""
import subprocess
import os
setup_base_dir()
setup_secrets()
# Download eval bundle if needed (for CORE metric)
eval_bundle_path = f"{BASE_DIR}/eval_bundle"
if not os.path.exists(eval_bundle_path):
print("Downloading eval bundle...")
subprocess.run([
"curl", "-L", "-o", "eval_bundle.zip",
"https://karpathy-public.s3.us-west-2.amazonaws.com/eval_bundle.zip",
], check=True)
subprocess.run(["unzip", "-q", "eval_bundle.zip"], check=True)
subprocess.run(["mv", "eval_bundle", eval_bundle_path], check=True)
subprocess.run(["rm", "eval_bundle.zip"], check=True)
print("=" * 80)
print("PRETRAINING BASE MODEL ON FINEWEB")
print("=" * 80)
print(f"Model depth: {depth}")
    # rough non-embedding estimate: 12 * layers * model_dim^2, with model_dim = depth * 64
    print(f"Estimated parameters: ~{12 * depth * (depth * 64) ** 2 // 1_000_000}M (non-embedding)")
print(f"Device batch size: {device_batch_size}")
print(f"Number of GPUs: {NUM_GPUS_BASE}")
print(f"WandB run: {wandb_run}")
print()
extra_args = [
f"--depth={depth}",
f"--device_batch_size={device_batch_size}",
f"--run={wandb_run}",
]
if max_iterations > 0:
extra_args.append(f"--num_iterations={max_iterations}")
run_torchrun_command("scripts.base_train", NUM_GPUS_BASE, extra_args)
checkpoint_volume.commit()
print("\n" + "=" * 80)
print("Base model training completed")
print(f"Checkpoints saved to {BASE_DIR}/checkpoints/base/")
print("=" * 80)
return {
"status": "completed",
"depth": depth,
"device_batch_size": device_batch_size,
"num_gpus": NUM_GPUS_BASE,
"checkpoint_dir": f"{BASE_DIR}/checkpoints/base",
}
What happens during training:
- Initialization: Random weights, Muon optimizer setup
- Training loop: Read batches, forward pass, backward pass, update weights
- Evaluation: Every N steps, compute validation loss and CORE metric
- Checkpointing: Save model state every N steps
- Logging: Send metrics to W&B (if configured)
The Muon Optimizer:
Nanochat uses Muon for linear layers and AdamW for embeddings. Muon is a newer optimizer that:

- Converges faster than Adam for transformers
- Uses less memory than Adam (no second-moment state for the linear layers)
- Stays stable at larger learning rates
It's basically "what if we applied optimizer improvements from recent research?"
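For intuition, here's a minimal sketch of a Muon-style update for a single 2D weight matrix. This is an illustration of the general idea (momentum followed by Newton-Schulz orthogonalization), not nanochat's actual optimizer code, and the quintic coefficients are the ones used in common Muon implementations - treat them as an assumption.

import torch

def newton_schulz(G: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    """Approximately orthogonalize G via a Newton-Schulz iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315   # coefficients from common Muon implementations (assumption)
    X = G / (G.norm() + eps)
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_update(weight: torch.Tensor, grad: torch.Tensor, momentum: torch.Tensor,
                lr: float = 0.01, beta: float = 0.95) -> None:
    """One Muon-style step: momentum, orthogonalize, apply."""
    momentum.mul_(beta).add_(grad)      # classic SGD momentum
    update = newton_schulz(momentum)    # replace the raw update with an orthogonalized version
    weight.add_(update, alpha=-lr)      # descend along the orthogonalized direction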
Training Duration and Cost
| Setup | GPUs | Time | Cost |
|---|---|---|---|
| Quick test (d12, 8 shards) | 1 | ~2 hours | ~$7 |
| Quick test (d12, 8 shards) | 4 | ~30 min | ~$7 |
| Full speedrun (d20, 240 shards) | 1 | ~20 hours | ~$70 |
| Full speedrun (d20, 240 shards) | 4 | ~5 hours | ~$70 |
| Full speedrun (d20, 240 shards) | 8 | ~3 hours | ~$84 |
| GPT-2 grade (d26, 400 shards) | 8 | ~10 hours | ~$280 |
Run it:
# Quick test (make sure it works)
modal run TrainNanochatModal.py::train_base_model \
--depth=12 \
--device-batch-size=32 \
--max-iterations=1000
# Full speedrun
modal run TrainNanochatModal.py::train_base_model \
--depth=20 \
--device-batch-size=32 \
--wandb-run="my-speedrun-$(date +%Y%m%d)"
Monitor training:

- Modal dashboard shows real-time logs and GPU utilization
- W&B shows loss curves, learning rate schedule, CORE metric
- All 4 (or 8) GPUs should be at ~95-100% utilization
Evaluation
After training (or during), you can evaluate:
# CORE metric (comprehensive benchmark)
modal run TrainNanochatModal.py::evaluate_base_model
# Validation loss (bits per byte)
modal run TrainNanochatModal.py::evaluate_base_loss
The CORE metric measures general language understanding. A good d20 model gets ~0.35-0.40. For comparison:

- Random: 0.0
- Llama 3 8B: ~0.50
- GPT-4: ~0.60
Stage 4: Midtraining

Now we teach the model how to have conversations. This is where it learns special tokens like <|user_start|>, <|assistant_end|>, etc.
@app.function(
image=NANOCHAT_IMAGE,
gpu=f"{GPU_TYPE}:{NUM_GPUS_MID}",
volumes=VOLUME_CONFIG,
secrets=[nanochat_secret] if nanochat_secret else [],
timeout=2 * HOURS,
)
def train_mid_model(
device_batch_size: int = 32,
wandb_run: str = "dummy",
):
"""
Midtrain the model on conversation data.
Teaches conversation tokens, tool use, and multiple choice format.
Duration: ~30-45 minutes on 8 GPUs
"""
setup_secrets()
print("=" * 80)
print("MIDTRAINING - TEACHING CONVERSATION TOKENS")
print("=" * 80)
print(f"Device batch size: {device_batch_size}")
print(f"Number of GPUs: {NUM_GPUS_MID}")
print()
extra_args = [
f"--device_batch_size={device_batch_size}",
f"--run={wandb_run}",
]
run_torchrun_command("scripts.mid_train", NUM_GPUS_MID, extra_args)
checkpoint_volume.commit()
print("\n" + "=" * 80)
print("Midtraining completed")
print(f"Checkpoints saved to {BASE_DIR}/checkpoints/mid/")
print("=" * 80)
return {
"status": "completed",
"checkpoint_dir": f"{BASE_DIR}/checkpoints/mid",
}
What midtraining teaches:
- Conversation format
- Tool use
- Multiple choice

Training data:

- SmolTalk: 460K conversations
- MMLU auxiliary: 100K multiple choice
- GSM8K: 8K math problems with calculator
Run it:
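modal run TrainNanochatModal.py::train_mid_model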
Takes 30-45 minutes on 4-8 GPUs, costs ~$5-7.
Stage 5: Supervised Fine-tuning
Now we specialize the model on specific tasks: knowledge, reasoning, math, code, and chat.
@app.function(
image=NANOCHAT_IMAGE,
gpu=f"{GPU_TYPE}:{NUM_GPUS_SFT}",
volumes=VOLUME_CONFIG,
secrets=[nanochat_secret] if nanochat_secret else [],
timeout=2 * HOURS,
)
def train_sft_model(
device_batch_size: int = 4,
num_epochs: int = 1,
wandb_run: str = "dummy",
source: str = "mid",
):
"""
Supervised fine-tuning on task-specific data.
Trains on MMLU, ARC, GSM8K, HumanEval, and SmolTalk.
Duration: ~30-45 minutes on 8 GPUs
"""
setup_secrets()
print("=" * 80)
print("SUPERVISED FINE-TUNING")
print("=" * 80)
print(f"Source: {source}")
print(f"Device batch size: {device_batch_size}")
print(f"Number of GPUs: {NUM_GPUS_SFT}")
print(f"Epochs: {num_epochs}")
print()
extra_args = [
f"--device_batch_size={device_batch_size}",
f"--num_epochs={num_epochs}",
f"--run={wandb_run}",
f"--source={source}",
]
run_torchrun_command("scripts.chat_sft", NUM_GPUS_SFT, extra_args)
checkpoint_volume.commit()
print("\n" + "=" * 80)
print("SFT completed")
print(f"Checkpoints saved to {BASE_DIR}/checkpoints/sft/")
print("=" * 80)
return {
"status": "completed",
"checkpoint_dir": f"{BASE_DIR}/checkpoints/sft",
}
Training mixture:
| Dataset | Examples | Task | Weight |
|---|---|---|---|
| MMLU | ~14K | General knowledge | 25% |
| ARC-Easy | ~2.4K | Science reasoning | 15% |
| ARC-Challenge | ~1.2K | Hard science | 15% |
| GSM8K | ~7.5K | Math with tools | 20% |
| HumanEval | ~164 | Code generation | 5% |
| SmolTalk | ~50K | Conversations | 20% |
The mixture is designed to produce a well-rounded model that can chat, reason, do math, and write code.
Run it:
# Start from midtrained model (recommended)
modal run TrainNanochatModal.py::train_sft_model --source=mid
# Or start from base model (skip midtraining)
modal run TrainNanochatModal.py::train_sft_model --source=base
Takes 30-45 minutes on 4-8 GPUs, costs ~$5-7.
Stage 6: Reinforcement Learning (Optional)
This is optional but improves math reasoning significantly.
@app.function(
image=NANOCHAT_IMAGE,
gpu=f"{GPU_TYPE}:{NUM_GPUS_RL}",
volumes=VOLUME_CONFIG,
secrets=[nanochat_secret] if nanochat_secret else [],
timeout=2 * HOURS,
)
def train_rl_model(
device_batch_size: int = 8,
num_epochs: int = 1,
wandb_run: str = "dummy",
source: str = "sft",
):
"""
Reinforcement learning on GSM8K (optional).
Uses GRPO/REINFORCE to improve math reasoning.
Duration: ~30-45 minutes on 8 GPUs
"""
setup_secrets()
print("=" * 80)
print("REINFORCEMENT LEARNING ON GSM8K")
print("=" * 80)
print(f"Source: {source}")
print(f"Device batch size: {device_batch_size}")
print(f"Number of GPUs: {NUM_GPUS_RL}")
print(f"Epochs: {num_epochs}")
print()
extra_args = [
f"--device_batch_size={device_batch_size}",
f"--num_epochs={num_epochs}",
f"--run={wandb_run}",
f"--source={source}",
]
run_torchrun_command("scripts.chat_rl", NUM_GPUS_RL, extra_args)
checkpoint_volume.commit()
print("\n" + "=" * 80)
print("RL training completed")
print(f"Checkpoints saved to {BASE_DIR}/checkpoints/rl/")
print("=" * 80)
return {
"status": "completed",
"checkpoint_dir": f"{BASE_DIR}/checkpoints/rl",
}
How RL works here:
- Model generates multiple answers to a math problem
- Answers that get the right solution get positive reward
- Answers that get the wrong solution get negative reward
- Update model to generate more correct answers
This is simplified GRPO (Group Relative Policy Optimization). It's like PPO but simpler and works well for math.
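To make the idea concrete, here's a minimal sketch of a group-relative policy-gradient update. This is an illustration of the GRPO/REINFORCE idea, not nanochat's actual training code; sample_fn and reward_fn are placeholder interfaces.

import torch

def grpo_step(model, optimizer, prompt, sample_fn, reward_fn, group_size=8):
    """One simplified GRPO-style update on a single math prompt."""
    logprobs, rewards = [], []
    for _ in range(group_size):
        # sample_fn returns (generated answer, sum of token log-probs under the model)
        answer, logprob = sample_fn(model, prompt)
        logprobs.append(logprob)
        rewards.append(reward_fn(prompt, answer))   # e.g. 1.0 if the final answer is correct, else 0.0

    rewards = torch.tensor(rewards, dtype=torch.float32)
    # Group-relative advantage: compare each sample against the group mean
    advantages = rewards - rewards.mean()
    if rewards.std() > 0:
        advantages = advantages / rewards.std()

    # REINFORCE-style loss: push up log-probs of better-than-average answers
    loss = -(advantages.detach() * torch.stack(logprobs)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()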
Expected improvement:

- GSM8K accuracy: 60% → 75% after RL
- ARC/MMLU: Stays about the same (RL only trains on math)
Run it:
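modal run TrainNanochatModal.py::train_rl_model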
Takes 30-45 minutes on 4-8 GPUs, costs ~$5-7.
Stage 7: Evaluation
Now let's measure how good our model actually is. Nanochat includes comprehensive evaluation on real benchmarks.
@app.function(
image=NANOCHAT_IMAGE,
gpu=f"{GPU_TYPE}:{NUM_GPUS_EVAL}",
volumes=VOLUME_CONFIG,
timeout=2 * HOURS,
)
def evaluate_chat_model(
source: str = "sft",
tasks: str = "all",
):
"""
Evaluate the chat model on benchmark tasks.
Available tasks: ARC-Easy, ARC-Challenge, GSM8K, HumanEval, MMLU, ChatCORE
"""
print("=" * 80)
print(f"EVALUATING CHAT MODEL - {source.upper()}")
print("=" * 80)
print(f"Tasks: {tasks}")
print()
extra_args = ["-i", source]
if tasks != "all":
extra_args.extend(["-a", tasks])
run_torchrun_command("scripts.chat_eval", NUM_GPUS_EVAL, extra_args)
checkpoint_volume.commit()
print("\n" + "=" * 80)
print(f"Evaluation of {source} model completed")
print("=" * 80)
return {
"status": "completed",
"source": source,
"tasks": tasks,
}
Available benchmarks:
| Benchmark | What it measures | Good score |
|---|---|---|
| ARC-Easy | Elementary science | >60% |
| ARC-Challenge | Hard science | >35% |
| GSM8K | Grade school math | >60% |
| HumanEval | Code generation | >20% |
| MMLU | General knowledge | >50% |
| ChatCORE | Conversation quality | >0.40 |
Run evaluation:
# Evaluate all tasks
modal run TrainNanochatModal.py::evaluate_chat_model --source=sft
# Evaluate specific tasks
modal run TrainNanochatModal.py::evaluate_chat_model \
--source=rl \
--tasks="GSM8K,ARC-Challenge"
# Compare mid vs sft vs rl
modal run TrainNanochatModal.py::evaluate_chat_model --source=mid
modal run TrainNanochatModal.py::evaluate_chat_model --source=sft
modal run TrainNanochatModal.py::evaluate_chat_model --source=rl
Takes 30-60 minutes depending on which tasks you run. Results are saved to the volume and logged to W&B.
Stage 8: Inference
Finally, let's chat with our model! Nanochat provides both a CLI and a web UI.
Chat CLI
@app.function(
image=NANOCHAT_IMAGE,
gpu=f"{GPU_TYPE}:{NUM_GPUS_INFERENCE}",
volumes=VOLUME_CONFIG,
timeout=1 * HOURS,
)
def chat_cli(
source: str = "sft",
prompt: str = "",
temperature: float = 0.6,
top_k: int = 50,
):
"""Chat with the model via command line interface."""
import subprocess
print("=" * 80)
print(f"CHAT CLI - {source.upper()} MODEL")
print("=" * 80)
cmd = f"cd /root/nanochat && uv run python -m scripts.chat_cli -i {source} -t {temperature} -k {top_k}"
if prompt:
escaped_prompt = prompt.replace('"', '\\"')
cmd += f' -p "{escaped_prompt}"'
print(f"Running: {cmd}")
result = subprocess.run(["bash", "-c", cmd], capture_output=False, text=True)
if result.returncode != 0:
raise RuntimeError(f"Chat CLI failed with code {result.returncode}")
return {
"status": "completed",
"source": source,
"prompt": prompt,
}
Use it:
# Single question
modal run TrainNanochatModal.py::chat_cli \
--source=sft \
--prompt="Why is the sky blue?"
# Interactive mode (empty prompt)
modal run TrainNanochatModal.py::chat_cli --source=sft
# With different sampling
modal run TrainNanochatModal.py::chat_cli \
--source=rl \
--temperature=0.8 \
--top-k=100
Chat Web UI
For a more user-friendly interface:
@app.function(
image=NANOCHAT_IMAGE,
gpu=f"{GPU_TYPE}:{NUM_GPUS_INFERENCE}",
volumes=VOLUME_CONFIG,
timeout=4 * HOURS,
max_containers=2,
)
def chat_web(
source: str = "sft",
port: int = 8000,
temperature: float = 0.8,
top_k: int = 50,
max_tokens: int = 512,
):
"""Serve the chat model via a web UI."""
import subprocess
print("=" * 80)
print(f"STARTING WEB UI - {source.upper()} MODEL")
print("=" * 80)
print(f"Port: {port}")
print(f"Temperature: {temperature}")
print(f"Top-k: {top_k}")
print(f"Max tokens: {max_tokens}")
print()
cmd = f"cd /root/nanochat && uv run python -m scripts.chat_web -i {source} -p {port} -t {temperature} -k {top_k} -m {max_tokens} --host 0.0.0.0"
print(f"Running: {cmd}")
print("\n" + "=" * 80)
print(f"Web UI will be available at: http://localhost:{port}")
print("=" * 80)
print()
result = subprocess.run(["bash", "-c", cmd], capture_output=False, text=True)
if result.returncode != 0:
raise RuntimeError(f"Web server failed with code {result.returncode}")
return {
"status": "completed",
"source": source,
"port": port,
}
Deploy it:
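modal run TrainNanochatModal.py::chat_web --source=sft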
Modal gives you a URL. Open it in your browser and you have a ChatGPT-like interface!
Running the Complete Pipeline
Alright, let's put it all together. Here's how to run the full speedrun from scratch.
The Main Pipeline
@app.local_entrypoint()
def main(
run_download: bool = True,
run_tokenizer: bool = True,
run_base: bool = True,
run_mid: bool = True,
run_sft: bool = True,
run_rl: bool = False,
run_eval: bool = True,
run_inference: bool = True,
num_data_shards: int = 240,
depth: int = 20,
device_batch_size_base: int = 32,
device_batch_size_sft: int = 4,
wandb_run: str = "dummy",
):
"""
Run the complete nanochat pipeline from scratch.
Configuration modes:
- Full Speedrun (4h, $96): num_data_shards=240, depth=20
- Quick Test (1h, $24): num_data_shards=8, depth=12
- GPT-2 Grade (12h, $288): num_data_shards=450, depth=26
"""
print("=" * 80)
print("NANOCHAT TRAINING PIPELINE")
print("=" * 80)
print(f"Mode: {'Speedrun' if num_data_shards >= 240 else 'Quick Test'}")
print(f"Data shards: {num_data_shards}")
print(f"Model depth: {depth}")
print(f"WandB run: {wandb_run}")
print("=" * 80)
print()
if run_download:
print("Stage 1/8: Downloading dataset...")
download_dataset.remote(num_shards=num_data_shards)
if run_tokenizer:
print("\nStage 2/8: Training tokenizer...")
train_tokenizer.remote()
print("Evaluating tokenizer...")
evaluate_tokenizer.remote()
if run_base:
print("\nStage 3/8: Training base model...")
train_base_model.remote(
depth=depth, device_batch_size=device_batch_size_base, wandb_run=wandb_run
)
if run_eval:
print("Evaluating base model (CORE)...")
evaluate_base_model.remote()
print("Evaluating base model (loss)...")
evaluate_base_loss.remote()
if run_mid:
print("\nStage 4/8: Midtraining (conversation tokens)...")
train_mid_model.remote(
device_batch_size=device_batch_size_base, wandb_run=wandb_run
)
if run_eval:
print("Evaluating mid model...")
evaluate_chat_model.remote(source="mid")
if run_sft:
print("\nStage 5/8: Supervised fine-tuning...")
train_sft_model.remote(
device_batch_size=device_batch_size_sft, wandb_run=wandb_run, source="mid"
)
if run_eval:
print("Evaluating SFT model...")
evaluate_chat_model.remote(source="sft")
if run_rl:
print("\nStage 6/8: Reinforcement learning...")
train_rl_model.remote(wandb_run=wandb_run)
if run_eval:
print("Evaluating RL model...")
evaluate_chat_model.remote(source="rl", tasks="GSM8K")
if run_inference:
print("\nStage 7/8: Testing inference...")
final_source = "rl" if run_rl else "sft"
chat_cli.remote(source=final_source, prompt="Why is the sky blue?")
print("\n" + "=" * 80)
print("PIPELINE COMPLETED")
print("=" * 80)
print()
print("Next steps:")
print("1. Chat via CLI: modal run TrainNanochatModal.py::chat_cli --source=sft")
print("2. Launch Web UI: modal run TrainNanochatModal.py::chat_web --source=sft")
print("3. Run more evals: modal run TrainNanochatModal.py::evaluate_chat_model --source=sft")
print()
Quick Test Run (1 hour, $24)
This trains a tiny model on a small dataset. Good for making sure everything works.
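modal run TrainNanochatModal.py \
    --num-data-shards=8 \
    --depth=12 \
    --wandb-run="quick-test-$(date +%Y%m%d)"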
Full Speedrun (4 hours, $96)
modal run TrainNanochatModal.py \
--num-data-shards=240 \
--depth=20 \
--wandb-run="speedrun-$(date +%Y%m%d)" \
--run-rl=False
This is the default nanochat speedrun. Produces a working ChatGPT with good performance.
GPT-2 Grade Model (12 hours, $288)
modal run TrainNanochatModal.py \
--num-data-shards=450 \
--depth=26 \
--wandb-run="gpt2-grade-$(date +%Y%m%d)" \
--run-rl=True
This trains a 1B parameter model. Comparable to GPT-2 in quality.
Running Stages Individually
You can also run each stage separately:
# 1. Download dataset
modal run TrainNanochatModal.py::download_dataset --num-shards=240
# 2. Train tokenizer
modal run TrainNanochatModal.py::train_tokenizer
modal run TrainNanochatModal.py::evaluate_tokenizer
# 3. Train base model
modal run TrainNanochatModal.py::train_base_model \
--depth=20 \
--wandb-run="my-run"
# 4. Evaluate base
modal run TrainNanochatModal.py::evaluate_base_model
modal run TrainNanochatModal.py::evaluate_base_loss
# 5. Midtraining
modal run TrainNanochatModal.py::train_mid_model
# 6. SFT
modal run TrainNanochatModal.py::train_sft_model
# 7. Optional RL
modal run TrainNanochatModal.py::train_rl_model
# 8. Evaluate chat model
modal run TrainNanochatModal.py::evaluate_chat_model --source=sft
# 9. Chat with it
modal run TrainNanochatModal.py::chat_cli --source=sft
modal run TrainNanochatModal.py::chat_web --source=sft
This is great for development - run expensive stages once, then iterate on later stages.
Hyperparameter Tuning

Model Size

The depth parameter controls model size:
depth = 12 # ~200M params (quick test)
depth = 16 # ~350M params
depth = 20 # ~561M params (default speedrun)
depth = 24 # ~800M params
depth = 26 # ~1B params (GPT-2 grade)
depth = 32 # ~1.8B params (requires more GPUs)
Scaling laws:

- Parameters ≈ 12 × depth × (depth × 64)² for the transformer blocks, plus embeddings
- Training tokens ≈ parameters × 20 (Chinchilla optimal)
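As a sanity check, here's a rough parameter estimate in Python. This is a sketch; the embedding term assumes an untied 65K-vocab token embedding plus LM head, which is an approximation rather than nanochat's exact accounting.

def estimate_params(depth: int, vocab_size: int = 65536) -> int:
    """Rough GPT parameter count for model_dim = depth * 64."""
    model_dim = depth * 64
    block_params = 12 * depth * model_dim ** 2       # attention + MLP weights across all layers
    embedding_params = 2 * vocab_size * model_dim    # token embedding + LM head (assumed untied)
    return block_params + embedding_params

print(estimate_params(20) / 1e6)  # ~561M for the default speedrun depth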
Batch Size
device_batch_size = 32 # Per GPU batch size
# Effective batch size = device_batch_size * num_gpus * gradient_accumulation
# nanochat uses gradient_accumulation=1 by default
Guidelines:

- 1 GPU: batch_size=16-32
- 4 GPUs: batch_size=32-64 (per device)
- 8 GPUs: batch_size=32-64 (per device)
Learning Rate
Base model uses Muon optimizer with these defaults:
learning_rate = 0.01 # Muon (much higher than Adam!)
lr_schedule = "cosine" # Cosine decay to 10% of peak
warmup_ratio = 0.1 # Warm up for 10% of training
SFT and RL, by contrast, use AdamW.
Dataset Size
| Model | Parameters | Optimal Tokens | Shards Needed |
|---|---|---|---|
| d12 | 200M | 4B | ~80 |
| d16 | 350M | 7B | ~140 |
| d20 | 561M | 11B | ~220 |
| d26 | 1B | 20B | ~400 |
Add 10-20% buffer because some tokens are filtered out.
Cost Breakdown
Based on Modal pricing (~$3.50/hr for A100-80GB):
Full Speedrun (d20, 240 shards, 4 GPUs)
| Stage | GPUs | Duration | Cost |
|---|---|---|---|
| Download dataset | CPU | 30 min | $0.01 |
| Train tokenizer | 1 | 45 min | $2.50 |
| Base pretraining | 4 | 5 hours | $70 |
| Midtraining | 4 | 30 min | $7 |
| SFT | 4 | 30 min | $7 |
| RL (optional) | 4 | 30 min | $7 |
| Evaluation | 4 | 30 min | $7 |
| Total | | ~8 hours | ~$100 |
Quick Test (d12, 8 shards, 4 GPUs)
| Stage | GPUs | Duration | Cost |
|---|---|---|---|
| Download dataset | CPU | 5 min | $0.01 |
| Train tokenizer | 1 | 30 min | $1.75 |
| Base pretraining | 4 | 30 min | $7 |
| Midtraining | 4 | 15 min | $3.50 |
| SFT | 4 | 15 min | $3.50 |
| Total | | ~2 hours | ~$16 |
Storage Costs
- Volumes: Free up to 50GB
- This project: ~30-40GB = $0/month
Monitoring and Debugging

Real-time Monitoring

When you run training, Modal prints a URL for the run in your terminal.
The dashboard shows:

- Real-time logs from all GPUs
- GPU utilization (should be 95-100%)
- Memory usage
- Cost accumulation
- Function status
Weights & Biases

If you set up W&B, check wandb.ai/<username>/nanochat-modal.
Charts to watch:
Base training:

- Training loss (should decrease smoothly)
- Validation loss (should track training loss)
- CORE metric (should increase over time)
- Learning rate (should follow the cosine schedule)

SFT/RL:

- Task-specific metrics (ARC accuracy, GSM8K accuracy, etc.)
- Loss curves
- Gradient norms
Checking Outputs
# List what's in the volume
modal volume ls nanochat-data /data/.cache/nanochat
# Check checkpoints
modal volume ls nanochat-checkpoints /data/.cache/nanochat/checkpoints
# Download something
modal volume get nanochat-checkpoints \
/data/.cache/nanochat/checkpoints/sft \
./local-checkpoint
GPU Utilization
For multi-GPU training, all GPUs should be utilized. If only 1 GPU shows activity:
- Check that NUM_GPUS_BASE is set correctly
- Check that torchrun is spawning multiple processes
- Check for errors in the logs
Common Issues and Solutions

"nanochat directory not found"
Error: FileNotFoundError: nanochat
Fix: Re-clone the repo (see below) and make sure it's in the same directory as TrainNanochatModal.py.
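git clone https://github.com/karpathy/nanochat.git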
CUDA Out of Memory
Error: CUDA out of memory
Solutions:
- Reduce the per-device batch size
- Use fewer GPUs (counter-intuitive, but each GPU needs memory)
- Reduce the model size (lower depth)
- Use A100-80GB instead of A100-40GB
Rust Compilation Fails
Error: Errors during rustbpe compilation
Fix: Usually means Rust wasn't installed correctly. The image build includes Rust installation, so this should work. If it doesn't:
- Check the image build logs for Rust installation errors
- Make sure you're using the devel CUDA image (not runtime)
- Try rebuilding: modal build TrainNanochatModal.py
Training Loss Not Decreasing
Symptoms: Loss stays flat or increases
Checks:
- Verify data is loading:
  - Check the logs for "Loaded N batches"
  - Verify the dataset was downloaded
- Check the learning rate:
  - Might be too low (increase it)
  - Check W&B for the LR schedule
- Model size vs dataset size:
  - Tiny model + huge dataset = might not converge
  - Huge model + tiny dataset = will overfit
- Multi-GPU issues:
  - Verify all GPUs are being used
  - Check for NCCL errors in the logs
Secrets Not Found
Error: Modal Secret "nanochat-secrets" not found
Fix:
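modal secret create nanochat-secrets \
    WANDB_API_KEY=xxxxxxxxxxxxx \
    HUGGINGFACE_TOKEN=hf_xxxxxxxxxxxxx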
Or use a .env file (the script tries that first).
Image Build Timeout
Error: Image build exceeds timeout
Fix: First build takes 15-20 minutes (Rust compilation). This is normal. If it times out:
- Increase timeout in Modal dashboard settings
- Or just wait - Modal caches completed layers, so re-running continues from where it failed
Advanced Tips and Tricks

Resume from Checkpoint
If training crashes, resume from the last checkpoint:
# Nanochat automatically saves checkpoints
# Just re-run the same command - it resumes automatically
modal run TrainNanochatModal.py::train_base_model --depth=20
Custom Datasets
To train on your own data:
- Format it as FineWeb shards (parquet files)
- Place them in /data/.cache/nanochat/base_data
- Update num_shards to match your data

Or modify nanochat/dataset.py to load from your source.
Experiment with Optimizers

Edit nanochat/scripts/base_train.py to try different optimizers:

- Muon (default, best for base training)
- AdamW (standard, works everywhere)
- Lion (newer, sometimes faster)
Multi-Node Training

For models >2B parameters, you might want multiple machines. Modal supports multi-node setups as well - check the Modal docs for the current clustered/multi-node options.
Quantization
For inference, you can quantize the model:
# Note: this assumes the nanochat checkpoint has been exported to a
# Hugging Face-compatible format, and that bitsandbytes is installed.
from transformers import AutoModelForCausalLM

# int8 quantization
model = AutoModelForCausalLM.from_pretrained(
    checkpoint_path,
    load_in_8bit=True,
)

# int4 quantization (even faster)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint_path,
    load_in_4bit=True,
)
Expected Results
After training the full speedrun (d20, 240 shards), you should get approximately:
Base Model (after pretraining)
| Metric | Expected | Good | Excellent |
|---|---|---|---|
| CORE | 0.35 | 0.38 | 0.40 |
| Validation loss | 2.5 | 2.4 | 2.3 |
| Bits per byte | 1.2 | 1.15 | 1.10 |
After SFT
| Benchmark | Expected | Good | Excellent |
|---|---|---|---|
| ARC-Easy | 60% | 65% | 70% |
| ARC-Challenge | 30% | 35% | 40% |
| GSM8K | 40% | 50% | 60% |
| HumanEval | 15% | 20% | 25% |
| MMLU | 45% | 50% | 55% |
After RL (optional)
| Benchmark | Expected Improvement |
|---|---|
| GSM8K | +10-15% |
| Others | No change (RL only trains on math) |
For context:

- GPT-2 (1.5B): MMLU ~35%, GSM8K ~10%
- Llama 2 7B: MMLU ~45%, GSM8K ~15%
- Your d20 model (561M): MMLU ~50%, GSM8K ~40-60%
Not bad for a model you trained in 4 hours!
What's Next?
You've built a complete LLM training pipeline from scratch. Here's what you can do next:
1. Scale Up
# Train a 1B parameter model
modal run TrainNanochatModal.py \
--num-data-shards=450 \
--depth=26 \
--wandb-run="gpt2-grade"
2. Custom Data
- Collect your own dataset
- Train a domain-specific model
- E.g., code-only model, medical model, creative writing model
3. Longer Training
# Train for 2x tokens (better convergence)
modal run TrainNanochatModal.py::train_base_model \
--depth=20 \
--max-iterations=20000
4. Deploy for Production

Add vLLM serving (like in the Gemma tutorial):

- OpenAI-compatible API
- Auto-scaling
- High throughput
5. Experiment with Architecture
- Try different model widths
- Add mixture of experts
- Experiment with different attention mechanisms
6. Advanced Fine-tuning
- DPO (Direct Preference Optimization)
- ORPO (Odds Ratio Preference Optimization)
- Longer RL training
Resources
- Nanochat GitHub - The original repo
- Andrej's Video - Building GPT from scratch
- Modal Documentation - Everything about Modal
- Modal GPU Types - All available GPUs and pricing
- PyTorch Distributed - Understanding multi-GPU training
- Chinchilla Paper - Optimal compute budget scaling
Wrapping Up
You just did what most people think requires a PhD and millions in compute: you trained a language model from absolute scratch.
Not fine-tuning. Not adapter training. Full pretraining - tokenizer, base model, everything.
And you did it in ~4 hours for ~$100. On Modal's serverless infrastructure. No cluster to manage, no DevOps nightmares, no month-long training runs.
The Unsloth tutorial showed you highly optimized fine-tuning. The Axolotl tutorial showed you production-scale multi-GPU training. This tutorial showed you the complete pipeline - everything from raw text to functioning ChatGPT.
This is the deepest you can go in understanding LLMs. You now know exactly how GPT works because you built one yourself.
The nanochat pipeline is used by researchers, educators, and companies. It's the real deal, just scaled to be accessible. And Modal made it trivial to run.
Got questions? Hit me up on Twitter @adithya_s_k!
Now go train your own ChatGPT. You have the power.