Train LLaMA from Scratch
No API, no pretrained weights: rebuild LLaMA's decoder-only architecture (RMSNorm / RoPE / GQA / SwiGLU / KV cache) from scratch in PyTorch and train it. The foundation under everything else.
Assemble a LLaMA decoder block layer by layer (RMSNorm + RoPE + GQA + SwiGLU + KV cache), then train: the loss curve falls and the sample generation goes from gibberish to coherent.
Why this local version exists
The architecture components (RMSNorm, RoPE, GQA, SwiGLU, KV cache) follow the real LLaMA structure as covered in the course's LLaMA architecture video series. The config and the loss/step numbers are illustrative; no training runs in the browser.
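To make one of those components concrete, here is a minimal sketch of RoPE in PyTorch. The helper names `rope_angles` and `apply_rope`, the base of 10000, and the interleaved (even, odd) channel pairing are assumptions made for illustration; implementations differ on the pairing convention. The point it shows is that the query/key score depends only on the offset between positions.

```python
import torch


def rope_angles(head_dim: int, positions: torch.Tensor, base: float = 10000.0):
    # One rotation frequency per pair of channels (assumed base of 10000).
    inv_freq = base ** (-torch.arange(0, head_dim, 2).float() / head_dim)
    angles = positions.float()[:, None] * inv_freq[None, :]  # (T, head_dim / 2)
    return angles.cos(), angles.sin()


def apply_rope(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor) -> torch.Tensor:
    """x: (T, head_dim). Rotate each (even, odd) channel pair by its angle."""
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x_even * cos - x_odd * sin
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out


torch.manual_seed(0)
head_dim = 64
q = torch.randn(1, head_dim)
k = torch.randn(1, head_dim)


def score(q_pos: int, k_pos: int) -> torch.Tensor:
    # Dot product between a query at q_pos and a key at k_pos after rotation.
    q_rot = apply_rope(q, *rope_angles(head_dim, torch.tensor([q_pos])))
    k_rot = apply_rope(k, *rope_angles(head_dim, torch.tensor([k_pos])))
    return (q_rot * k_rot).sum()


near = score(5, 2)          # offset 3
shifted = score(105, 102)   # same offset, both positions moved by 100
```

`near` and `shifted` agree up to floating-point error, and the rotation leaves vector norms unchanged, which is why RoPE can be applied to queries and keys without rescaling the attention scores.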
Build LLaMA from scratch, then train
Assemble a LLaMA decoder block layer by layer (RMSNorm + RoPE + GQA + SwiGLU + KV cache), then train — watch the loss fall and generations get coherent.
Decoder Block × 8
Token Embedding
RMSNorm
RoPE Self-Attention (GQA)
RMSNorm
SwiGLU FFN
dim 512 · 8 layers · 8 heads / 4 KV (GQA) · vocab 32000 · illustrative config
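As a rough sanity check on that config, the parameter count can be worked out by hand. This is a sketch under stated assumptions: the config line does not give the FFN hidden size, so 1536 is assumed here (the LLaMA convention of 2/3 of 4 × dim, rounded up to a multiple of 256), all linear layers are assumed bias-free, and `LlamaConfig` and `count_params` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class LlamaConfig:
    dim: int = 512
    n_layers: int = 8
    n_heads: int = 8
    n_kv_heads: int = 4
    vocab_size: int = 32000
    ffn_hidden: int = 1536  # assumption: not stated in the config line


def count_params(cfg: LlamaConfig, tie_embeddings: bool = False) -> int:
    head_dim = cfg.dim // cfg.n_heads
    kv_dim = cfg.n_kv_heads * head_dim  # GQA: K and V projections are narrower
    attn = 2 * cfg.dim * cfg.dim + 2 * cfg.dim * kv_dim  # wq, wo plus wk, wv
    ffn = 3 * cfg.dim * cfg.ffn_hidden  # SwiGLU: gate, up, down
    norms = 2 * cfg.dim  # two RMSNorm weight vectors per block
    per_layer = attn + ffn + norms
    embed = cfg.vocab_size * cfg.dim
    head = 0 if tie_embeddings else embed  # output projection
    return cfg.n_layers * per_layer + embed + cfg.dim + head  # + final norm


cfg = LlamaConfig()
total = count_params(cfg)
tied = count_params(cfg, tie_embeddings=True)
print(f"{total:,} parameters untied, {tied:,} with tied embeddings")
```

Under these assumptions the model comes to roughly 58M parameters untied, with more than half sitting in the embedding and output matrices because the vocabulary is large relative to dim.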
training loss
sample generation (same prompt)
Once training starts, generations move from gibberish to coherent text.
What to try
Build the model and watch the decoder block assemble layer by layer.
Start training and watch the loss curve fall step by step.
Watch the same-prompt generation go from gibberish to a coherent sentence.
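The training step behind a falling loss curve can be sketched as follows. This is not the demo's code: the model is a deliberately tiny stand-in (an embedding followed by a linear head) so the example runs in seconds, and the batch, vocabulary size, learning rate, and step count are all assumed values. A real run would place the decoder blocks between the embedding and the output head.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

vocab_size, dim, seq_len = 64, 32, 16  # toy sizes, not the demo config

# Stand-in model: embedding -> linear head (no decoder blocks).
model = nn.Sequential(nn.Embedding(vocab_size, dim), nn.Linear(dim, vocab_size))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-2)

# One fixed toy batch of random token ids; targets are inputs shifted by one.
tokens = torch.randint(0, vocab_size, (8, seq_len + 1))
inputs, targets = tokens[:, :-1], tokens[:, 1:]

losses = []
for step in range(200):
    logits = model(inputs)  # (batch, seq, vocab)
    # Next-token prediction: cross-entropy over every position.
    loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(f"first loss {losses[0]:.3f}, last loss {losses[-1]:.3f}")
```

Because the batch is fixed, the model is simply memorizing it; the loss starts near ln(64) and falls, which is the same qualitative shape the demo's illustrative curve shows.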
What this demo proves
You own the foundation — implement LLaMA from scratch, not just import a model.
You understand each modern component: RMSNorm / RoPE / GQA / SwiGLU / KV cache.
You connect the layers: the KV cache is what prompt caching builds on, and the architecture informs fine-tuning and RL choices.
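A minimal sketch of why the KV cache works, assuming the demo's head layout of 8 query heads sharing 4 K/V heads. The function name `gqa_attention` and the tensor layout are illustrative choices, and RoPE and the projection matrices are left out to keep the focus on caching. The sketch checks that decoding one token at a time against a growing cache reproduces full causal attention.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

n_heads, n_kv_heads, head_dim = 8, 4, 64
group = n_heads // n_kv_heads  # query heads sharing each K/V head


def gqa_attention(q, k, v, causal):
    """q: (n_heads, Tq, d); k, v: (n_kv_heads, Tk, d)."""
    # Expand each K/V head so it serves `group` consecutive query heads.
    k = k.repeat_interleave(group, dim=0)
    v = v.repeat_interleave(group, dim=0)
    scores = q @ k.transpose(-2, -1) / head_dim ** 0.5
    if causal:
        t_q, t_k = scores.shape[-2:]
        mask = torch.tril(torch.ones(t_q, t_k)).bool()
        scores = scores.masked_fill(~mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v


seq_len = 6
q = torch.randn(n_heads, seq_len, head_dim)
k = torch.randn(n_kv_heads, seq_len, head_dim)
v = torch.randn(n_kv_heads, seq_len, head_dim)

# Reference: full causal attention over the whole sequence at once.
full = gqa_attention(q, k, v, causal=True)

# Incremental decoding: append one step's K/V to the cache, attend with one query.
k_cache = torch.empty(n_kv_heads, 0, head_dim)
v_cache = torch.empty(n_kv_heads, 0, head_dim)
steps = []
for t in range(seq_len):
    k_cache = torch.cat([k_cache, k[:, t : t + 1]], dim=1)
    v_cache = torch.cat([v_cache, v[:, t : t + 1]], dim=1)
    # No mask needed: the cache only ever holds positions <= t.
    steps.append(gqa_attention(q[:, t : t + 1], k_cache, v_cache, causal=False))
cached = torch.cat(steps, dim=1)
```

The cache stores only 4 K/V heads rather than 8, which is the memory saving GQA buys. Prompt caching reuses the same idea across requests: the cached K/V for a shared prefix does not need to be recomputed.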
Architecture
Decoder-only · Pre-Norm · RMSNorm · RoPE · GQA · SwiGLU · KV cache
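One way to assemble these pieces into a single Pre-Norm decoder block is sketched below. The class names, the FFN hidden size of 1536, and the bias-free linear layers are assumptions, not the demo's confirmed code; RoPE and the KV cache are omitted to keep the block short.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        # Divide by the root-mean-square of the features; no mean, no bias.
        return self.weight * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)


class GQAttention(nn.Module):
    def __init__(self, dim, n_heads, n_kv_heads):
        super().__init__()
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.head_dim = dim // n_heads
        self.wq = nn.Linear(dim, n_heads * self.head_dim, bias=False)
        self.wk = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.wv = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.wo = nn.Linear(n_heads * self.head_dim, dim, bias=False)

    def forward(self, x):
        b, t, _ = x.shape
        q = self.wq(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.wk(x).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        v = self.wv(x).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        # RoPE would be applied to q and k here; omitted in this sketch.
        rep = self.n_heads // self.n_kv_heads
        k, v = k.repeat_interleave(rep, dim=1), v.repeat_interleave(rep, dim=1)
        scores = q @ k.transpose(-2, -1) / self.head_dim ** 0.5
        # Causal mask: position i may not attend to positions after i.
        mask = torch.triu(torch.ones(t, t, device=x.device), diagonal=1).bool()
        out = F.softmax(scores.masked_fill(mask, float("-inf")), dim=-1) @ v
        return self.wo(out.transpose(1, 2).reshape(b, t, -1))


class SwiGLU(nn.Module):
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden_dim, bias=False)
        self.w_up = nn.Linear(dim, hidden_dim, bias=False)
        self.w_down = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x):
        # SiLU-gated linear unit followed by the down projection.
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))


class DecoderBlock(nn.Module):
    def __init__(self, dim=512, n_heads=8, n_kv_heads=4, ffn_hidden=1536):
        super().__init__()
        self.attn_norm = RMSNorm(dim)
        self.attn = GQAttention(dim, n_heads, n_kv_heads)
        self.ffn_norm = RMSNorm(dim)
        self.ffn = SwiGLU(dim, ffn_hidden)

    def forward(self, x):
        # Pre-Norm: normalize each sublayer's input, then add the residual.
        x = x + self.attn(self.attn_norm(x))
        return x + self.ffn(self.ffn_norm(x))


torch.manual_seed(0)
block = DecoderBlock()
x = torch.randn(2, 10, 512)
y = block(x)
n_params = sum(p.numel() for p in block.parameters())
```

Pre-Norm here means the residual stream itself is never normalized inside the block; only the copies fed to attention and the FFN are. A full model would stack this block, add the token embedding in front, and finish with a final RMSNorm and output projection.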
From scratch
PyTorch, no pretrained weights; everything from tensors to the training loop is hand-written
Best signal
Foundational depth that underpins fine-tuning, RL, and agents