Build + train replay

Train LLaMA from Scratch

No API, no pretrained weights: rebuild LLaMA's decoder-only architecture (RMSNorm / RoPE / GQA / SwiGLU / KV cache) from scratch in PyTorch and train it. The foundation under everything else.

Assemble a LLaMA decoder block layer by layer (RMSNorm + RoPE + GQA + SwiGLU + KV cache), then train it: the loss curve falls and the sample generations go from gibberish to coherent text.

LLaMA · Transformer · RoPE · RMSNorm · PyTorch

Why this local version exists

The architecture components (RMSNorm, RoPE, GQA, SwiGLU, KV cache) match the real LLaMA structure as taught in the course's LLaMA architecture video series. The config and the loss/step numbers are illustrative; no training runs in the browser.

Interactive Preview

Build LLaMA from scratch, then train

Assemble a LLaMA decoder block layer by layer (RMSNorm + RoPE + GQA + SwiGLU + KV cache), then train — watch the loss fall and generations get coherent.

Decoder Block × 8

Token Embedding

RMSNorm

RoPE Self-Attention (GQA)

RMSNorm

SwiGLU FFN

final RMSNorm → LM Head (tied)

dim 512 · 8 layers · 8 heads / 4 KV (GQA) · vocab 32000 · illustrative config
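To make the illustrative config concrete, here is a rough parameter count in plain Python. The page does not state the FFN hidden size, so the sketch assumes the common LLaMA-style rule (two thirds of 4 × dim, rounded up to a multiple of 256); `Config`, `ffn_hidden`, and `count_params` are hypothetical names used only for this example.

```python
from dataclasses import dataclass


@dataclass
class Config:
    dim: int = 512
    n_layers: int = 8
    n_heads: int = 8
    n_kv_heads: int = 4      # GQA: two query heads share each KV head
    vocab_size: int = 32000
    multiple_of: int = 256   # assumption: FFN rounding, not stated on the page


def ffn_hidden(cfg: Config) -> int:
    # Assumed LLaMA-style sizing: 2/3 of 4*dim, rounded up to multiple_of.
    hidden = int(2 * (4 * cfg.dim) / 3)
    return cfg.multiple_of * ((hidden + cfg.multiple_of - 1) // cfg.multiple_of)


def count_params(cfg: Config) -> int:
    head_dim = cfg.dim // cfg.n_heads
    kv_dim = cfg.n_kv_heads * head_dim
    attn = 2 * cfg.dim * cfg.dim + 2 * cfg.dim * kv_dim  # wq, wo + wk, wv
    ffn = 3 * cfg.dim * ffn_hidden(cfg)                  # gate, up, down
    norms = 2 * cfg.dim                                  # two RMSNorm weights
    per_layer = attn + ffn + norms
    embed = cfg.vocab_size * cfg.dim                     # shared with the tied LM head
    return cfg.n_layers * per_layer + embed + cfg.dim    # + final RMSNorm


cfg = Config()
total = count_params(cfg)
kv_values_per_token = 2 * cfg.n_kv_heads * (cfg.dim // cfg.n_heads)
print(f"FFN hidden: {ffn_hidden(cfg)}, params: {total / 1e6:.1f}M")
print(f"KV cache values per token per layer: {kv_values_per_token}")
```

Under these assumptions, the tied embedding accounts for a large share of the total, and GQA with 4 KV heads halves the per-token KV cache relative to 8 full KV heads.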

training loss

sample generation (same prompt)

Once training starts, generations move from gibberish to coherent.

What to try

Build the model and watch the decoder block assemble layer by layer.

Start training and watch the loss curve fall step by step.

Watch the same-prompt generation go from gibberish to a coherent sentence.
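The three steps above can be reproduced at toy scale outside the browser. The sketch below is not the demo's model: it uses a hypothetical stand-in (an embedding plus a linear head) on a repeating token sequence, purely to show the shape of a next-token training loop and a same-prompt greedy generation before and after training. The sizes, learning rate, and step count are arbitrary assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
vocab, dim = 32, 64  # toy sizes, not the demo's config

# Stand-in model (assumption): embedding + linear head, no attention.
model = nn.Sequential(nn.Embedding(vocab, dim), nn.Linear(dim, vocab))

# Toy data: 0, 1, ..., 31, 0, 1, ... so the next token is predictable.
tokens = (torch.arange(257) % vocab).unsqueeze(0)  # (1, 257)
inputs, targets = tokens[:, :-1], tokens[:, 1:]    # shifted by one position


@torch.no_grad()
def greedy(model, prompt, n_new):
    # Append the argmax token n_new times (no sampling, no KV cache).
    idx = prompt.clone()
    for _ in range(n_new):
        next_id = model(idx)[:, -1].argmax(-1, keepdim=True)
        idx = torch.cat([idx, next_id], dim=1)
    return idx


prompt = torch.tensor([[0, 1, 2]])
before = greedy(model, prompt, 8)  # untrained: effectively arbitrary tokens

opt = torch.optim.AdamW(model.parameters(), lr=1e-2)
losses = []
for step in range(200):
    logits = model(inputs)  # (1, 256, vocab)
    loss = F.cross_entropy(logits.reshape(-1, vocab), targets.reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
    losses.append(loss.item())

after = greedy(model, prompt, 8)
print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
print("before:", before.tolist())
print("after: ", after.tolist())
```

A real run would swap the stand-in for the full decoder stack and a tokenized corpus; the loop itself keeps the same structure.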

What this demo proves

You own the foundation — implement LLaMA from scratch, not just import a model.

You understand each modern component: RMSNorm / RoPE / GQA / SwiGLU / KV cache.

You connect the layers of the stack: the KV cache underpins prompt caching, and the architecture grounds fine-tuning and RL choices.
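To illustrate why the KV cache matters, here is a minimal sketch showing that attending one token at a time against cached keys and values gives the same result as a full causal pass. It works on random tensors rather than real projections, and `gqa_attend` is a hypothetical helper; in an actual model the cached keys would already have RoPE applied.

```python
import math
import torch


def gqa_attend(q, k, v, offset):
    # q: (heads, tq, d); k, v: (kv_heads, tk, d).
    # offset is the absolute position of the first query token.
    rep = q.shape[0] // k.shape[0]
    k = k.repeat_interleave(rep, dim=0)  # share each KV head across rep query heads
    v = v.repeat_interleave(rep, dim=0)
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])
    q_pos = torch.arange(q.shape[1]).unsqueeze(1) + offset
    k_pos = torch.arange(k.shape[1]).unsqueeze(0)
    scores = scores.masked_fill(k_pos > q_pos, float("-inf"))  # causal mask
    return scores.softmax(-1) @ v


torch.manual_seed(0)
heads, kv_heads, seq, d = 8, 4, 6, 64
q = torch.randn(heads, seq, d)
k = torch.randn(kv_heads, seq, d)
v = torch.randn(kv_heads, seq, d)

# Full pass: every position attends to its own prefix at once.
full = gqa_attend(q, k, v, offset=0)

# Incremental pass: append one key/value per step and reuse the rest.
k_cache = torch.empty(kv_heads, 0, d)
v_cache = torch.empty(kv_heads, 0, d)
steps = []
for t in range(seq):
    k_cache = torch.cat([k_cache, k[:, t:t + 1]], dim=1)
    v_cache = torch.cat([v_cache, v[:, t:t + 1]], dim=1)
    steps.append(gqa_attend(q[:, t:t + 1], k_cache, v_cache, offset=t))
incremental = torch.cat(steps, dim=1)

max_diff = (full - incremental).abs().max().item()
print("max difference:", max_diff)
```

Prompt caching builds on the same idea: the keys and values for a shared prefix are computed once and reused by later requests. Note that the cache stores only the 4 KV heads, which is where GQA saves memory.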

Architecture

Decoder-only · RMSNorm · RoPE · GQA · SwiGLU · KV cache (Pre-Norm)
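As a rough picture of how these pieces fit together, here is a minimal pre-norm decoder block in PyTorch. It is a sketch under stated assumptions, not the demo's actual code: the FFN hidden size (1536), the RMSNorm epsilon, the RoPE base (10000), and the half-split rotation layout are assumed, and the KV cache is left out for brevity.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        # Scale by the reciprocal root-mean-square; no mean subtraction, no bias.
        inv_rms = torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
        return x * inv_rms * self.weight


def rope(x, base=10000.0):
    # x: (batch, heads, seq, head_dim). Rotates feature pairs (i, i + d/2)
    # by an angle that grows with position (half-split layout, an assumption).
    t, d = x.shape[-2], x.shape[-1]
    inv_freq = 1.0 / (base ** (torch.arange(0, d, 2, dtype=torch.float32) / d))
    angles = torch.outer(torch.arange(t, dtype=torch.float32), inv_freq)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., : d // 2], x[..., d // 2:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)


class GQAttention(nn.Module):
    def __init__(self, dim, n_heads, n_kv_heads):
        super().__init__()
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.head_dim = dim // n_heads
        self.wq = nn.Linear(dim, n_heads * self.head_dim, bias=False)
        self.wk = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.wv = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.wo = nn.Linear(n_heads * self.head_dim, dim, bias=False)

    def forward(self, x):
        b, t, _ = x.shape
        q = self.wq(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.wk(x).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        v = self.wv(x).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        q, k = rope(q), rope(k)  # positions enter through q and k only
        rep = self.n_heads // self.n_kv_heads
        k = k.repeat_interleave(rep, dim=1)  # each KV head serves rep query heads
        v = v.repeat_interleave(rep, dim=1)
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.head_dim)
        mask = torch.triu(torch.ones(t, t, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(mask, float("-inf"))  # causal mask
        out = scores.softmax(-1) @ v
        return self.wo(out.transpose(1, 2).reshape(b, t, -1))


class SwiGLU(nn.Module):
    def __init__(self, dim, hidden):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden, bias=False)
        self.w_up = nn.Linear(dim, hidden, bias=False)
        self.w_down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))


class DecoderBlock(nn.Module):
    def __init__(self, dim=512, n_heads=8, n_kv_heads=4, hidden=1536):
        super().__init__()
        self.attn_norm = RMSNorm(dim)
        self.attn = GQAttention(dim, n_heads, n_kv_heads)
        self.ffn_norm = RMSNorm(dim)
        self.ffn = SwiGLU(dim, hidden)

    def forward(self, x):
        # Pre-norm: normalize the input of each sublayer, then add the residual.
        x = x + self.attn(self.attn_norm(x))
        return x + self.ffn(self.ffn_norm(x))


torch.manual_seed(0)
block = DecoderBlock()
x = torch.randn(2, 5, 512)
y = block(x)
print(y.shape)
```

Stacking eight such blocks between a token embedding and a final RMSNorm plus LM head gives the structure drawn in the preview.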

From scratch

PyTorch, no pretrained weights; everything from tensors to the training loop is hand-written

Best signal

Foundational depth that underpins fine-tuning, RL, and agents