Back to case study
GRPO training step replay

GRPO Reasoning Trainer (GSM8K · Qwen2.5-0.5B)

Reproducing DeepSeek-R1's GRPO with TRL's GRPOTrainer on Qwen2.5-0.5B: five verifiable rewards teach the model to emit a reasoning chain before its answer on GSM8K. Runs on one GPU.

A live reward calculator, not a replay. Edit a model completion and the gold answer; all five reward functions (correctness / int / strict_format / soft_format / xmlcount) — ported verbatim from the notebook — recompute in your browser.

GRPOTRLDeepSeek-R1GSM8KQwen2.5
GRPO Reasoning Trainer (GSM8K · Qwen2.5-0.5B)

Why this local version exists

The five reward functions are pure string logic, so they run client-side exactly as GRPOTrainer scores them. Real GRPO training still needs a GPU + the 0.5B base model — the scoring you see here is the real thing.

Live · runs in your browser

GRPO reward calculator — the real five functions

Edit a model completion and the gold answer. All five reward functions — ported verbatim from the notebook — recompute live. This is the exact scoring GRPOTrainer applies to each sampled completion before computing the group-relative advantage.

Presets

Gold answer

Model completion (editable)

Total reward

3.500

sum of the five reward functions

correctness+2.000

extracted "5" vs gold "5"

int+0.500

answer is a pure integer

strict_format+0.500

exact <reasoning>\n…\n</reasoning>\n<answer>…

soft_format+0.000

loosely contains both blocks

xmlcount+0.500

0.125 per tag − trailing-text penalty

In training, GRPOTrainer scores every sampled completion this way, then normalizes the totals within the group of num_generations=16 to get each completion's advantage. Try the presets: a clean answer scores ~3.6; correct-but-untagged loses every format reward.

What to try

Load a preset, then edit the completion — every reward updates live as you type.

Delete the </answer> tag and watch the format rewards collapse; fix the number and watch correctness jump to +2.0.

Note correctness (+2.0) dominates while the four format rewards shape the <reasoning>/<answer> structure.

What this demo proves

You can run the GRPO recipe behind DeepSeek-R1 end to end, on a model small enough to actually inspect.

You can design a stack of verifiable reward functions and explain how each shapes behavior.

You tell an honest story — demonstrating a mechanism on a 0.5B model, clear about what is and isn't claimed.

Stack

TRL GRPOTrainer · Qwen2.5-0.5B · GSM8K · single GPU

Rewards

5 funcs: correctness + int + strict/soft format + xmlcount

Real result

Reward alone induces reasoning chains on a 0.5B model