GRPO Reasoning Trainer (GSM8K · Qwen2.5-0.5B)
Reproducing DeepSeek-R1's GRPO with TRL's GRPOTrainer on Qwen2.5-0.5B: five verifiable rewards teach the model to emit a reasoning chain before its answer on GSM8K. Runs on one GPU.
A live reward calculator, not a replay. Edit a model completion and the gold answer; all five reward functions (correctness / int / strict_format / soft_format / xmlcount) — ported verbatim from the notebook — recompute in your browser.
Why this local version exists
The five reward functions are pure string logic, so they run client-side exactly as GRPOTrainer scores them. Actual GRPO training still needs a GPU and the 0.5B base model; only the scoring runs here, and it is the same scoring used during training.
GRPO reward calculator — the real five functions
Edit a model completion and the gold answer. All five reward functions — ported verbatim from the notebook — recompute live. This is the exact scoring GRPOTrainer applies to each sampled completion before computing the group-relative advantage.
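As a rough illustration of what "pure string logic" means here, the answer-based rewards can be sketched in a few lines of Python. This is a minimal sketch, not the notebook's code: the helper names are hypothetical, the functions score one completion at a time (a trainer would pass batches), and the 0.5 value for the integer reward is an assumption. Only the +2.0 for correctness is stated on this page.

```python
def extract_xml_answer(text: str) -> str:
    """Return the text between the last <answer> tag and the next </answer> tag.

    If the tags are missing, the whole (stripped) text is returned.
    """
    answer = text.split("<answer>")[-1]
    answer = answer.split("</answer>")[0]
    return answer.strip()


def correctness_reward(completion: str, gold: str) -> float:
    # +2.0 when the extracted answer equals the gold answer string exactly.
    return 2.0 if extract_xml_answer(completion) == gold.strip() else 0.0


def int_reward(completion: str, value: float = 0.5) -> float:
    # value=0.5 is an assumed weight, not taken from this page.
    return value if extract_xml_answer(completion).isdigit() else 0.0


completion = "<reasoning>\n2 + 3 = 5\n</reasoning>\n<answer>\n5\n</answer>\n"
answer_score = correctness_reward(completion, "5") + int_reward(completion)
```

Because extraction falls back to the whole text when no tags are present, a bare "5" still earns the correctness reward in this sketch, while a sentence such as "The answer is 5" does not.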
Presets
Gold answer
Model completion (editable)
Total reward: 3.500 (sum of the five reward functions)
correctness: extracted "5" vs gold "5"
int: answer is a pure integer
strict_format: exact <reasoning>\n…\n</reasoning>\n<answer>…
soft_format: loosely contains both blocks
xmlcount: 0.125 per tag − trailing-text penalty
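The three structure-based rewards described above could look roughly like the following Python sketch. The regular expressions, the 0.5 weights, and the 0.001-per-character trailing penalty are all assumptions made for illustration; only the 0.125-per-tag figure comes from this page, and the notebook's exact patterns may differ.

```python
import re

# Assumed patterns: strict requires the exact newline layout, soft only
# requires a reasoning block followed by an answer block.
STRICT = re.compile(
    r"<reasoning>\n.*?\n</reasoning>\n<answer>\n.*?\n</answer>\n?", re.DOTALL
)
SOFT = re.compile(r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>", re.DOTALL)


def strict_format_reward(text: str, value: float = 0.5) -> float:
    # The whole completion must match the strict layout.
    return value if STRICT.fullmatch(text) else 0.0


def soft_format_reward(text: str, value: float = 0.5) -> float:
    # Both blocks must appear somewhere, in order.
    return value if SOFT.search(text) else 0.0


def xmlcount_reward(text: str, per_tag: float = 0.125,
                    penalty_per_char: float = 0.001) -> float:
    score = 0.0
    # Partial credit: each tag that appears exactly once earns per_tag.
    for tag in ("<reasoning>", "</reasoning>", "<answer>", "</answer>"):
        if text.count(tag) == 1:
            score += per_tag
    # Penalize any non-whitespace text left after the closing answer tag.
    if text.count("</answer>") == 1:
        trailing = text.split("</answer>")[-1].strip()
        score -= penalty_per_char * len(trailing)
    return score


clean = "<reasoning>\n2 + 3 = 5\n</reasoning>\n<answer>\n5\n</answer>\n"
broken = clean.replace("</answer>", "")  # simulate deleting the closing tag
```

In this sketch, `broken` fails both regular expressions and keeps only three of the four per-tag credits, which mirrors the collapse you see in the calculator when the closing tag is deleted.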
In training, GRPOTrainer scores every sampled completion this way, then normalizes the totals within the group of num_generations=16 to get each completion's advantage. Try the presets: a clean answer scores ~3.6; correct-but-untagged loses every format reward.
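The group normalization mentioned above can be sketched as follows. This is a simplified illustration under stated assumptions: it uses the population standard deviation and a small epsilon of 1e-4 to avoid division by zero, whereas the exact variant inside GRPOTrainer may differ, and the example uses a group of four totals rather than 16 to keep it readable.

```python
from statistics import mean, pstdev


def group_advantages(rewards: list[float], eps: float = 1e-4) -> list[float]:
    """Normalize total rewards within one group of sampled completions.

    Each advantage is (reward - group mean) / (group std + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std, not sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical totals for four completions sampled from the same prompt.
totals = [3.5, 2.0, 0.5, 2.0]
advantages = group_advantages(totals)
```

Completions that score above the group mean get a positive advantage and those below get a negative one. If every completion in a group earns the same total, all advantages are zero and that group contributes no learning signal.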
What to try
Load a preset, then edit the completion — every reward updates live as you type.
Delete the </answer> tag and watch the format rewards collapse; fix the number and watch correctness jump to +2.0.
Note that correctness (+2.0) dominates, while the four format rewards shape the <reasoning>/<answer> structure.
What this demo proves
You can run the GRPO recipe behind DeepSeek-R1 end to end, on a model small enough to actually inspect.
You can design a stack of verifiable reward functions and explain how each shapes behavior.
You can tell an honest story: demonstrating a mechanism on a 0.5B model while being clear about what is and isn't claimed.
Stack
TRL GRPOTrainer · Qwen2.5-0.5B · GSM8K · single GPU
Rewards
5 funcs: correctness + int + strict/soft format + xmlcount
Real result
Reward alone induces reasoning chains on a 0.5B model
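For orientation, the stack summarized above might be wired together along these lines. This is a hedged sketch rather than the notebook's training script: the `build_trainer` helper, the output directory, the batch size, and the completion length are hypothetical choices, the prompt is the raw question with no formatting instructions, and the imports are deferred so the snippet can be read and loaded without TRL installed. It assumes `reward_funcs` is a list of callables that accept batches of prompts and completions plus extra dataset columns as keyword arguments and return one float per completion.

```python
def build_trainer(reward_funcs):
    """Assemble a GRPOTrainer for Qwen2.5-0.5B on GSM8K (illustrative only)."""
    # Imported lazily: these need the `datasets` and `trl` packages installed.
    from datasets import load_dataset
    from trl import GRPOConfig, GRPOTrainer

    dataset = load_dataset("openai/gsm8k", "main", split="train")
    # GSM8K solutions end with "#### <number>"; keep only the final number.
    dataset = dataset.map(
        lambda row: {
            "prompt": row["question"],
            "answer": row["answer"].split("####")[-1].strip(),
        }
    )

    args = GRPOConfig(
        output_dir="qwen2.5-0.5b-grpo-gsm8k",  # hypothetical path
        num_generations=16,                    # group size named on this page
        per_device_train_batch_size=16,        # assumed; must divide evenly by the group size
        max_completion_length=256,             # assumed
    )
    return GRPOTrainer(
        model="Qwen/Qwen2.5-0.5B",
        reward_funcs=reward_funcs,
        args=args,
        train_dataset=dataset,
    )
```

Calling `build_trainer(my_reward_funcs).train()` would then start training on a machine with a suitable GPU, where `my_reward_funcs` stands for your own list of reward callables.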