GRPO Reasoning Trainer (GSM8K · Qwen2.5-0.5B)
Reproducing DeepSeek-R1's GRPO algorithm with TRL's GRPOTrainer on Qwen2.5-0.5B: five verifiable reward functions teach the model to emit a reasoning chain before its answer on GSM8K math problems.
A minimal, reproducible reproduction of the GRPO recipe behind DeepSeek-R1's reasoning behavior — run on a 0.5B model so the whole loop fits on a single consumer GPU. The point is to watch GRPO induce a <reasoning>/<answer> chain from reward alone, not to chase a leaderboard.
Overview
This project reproduces Group Relative Policy Optimization (GRPO) — the RL algorithm DeepSeek used for R1 —
on a deliberately small model: Qwen2.5-0.5B-Instruct. It uses Hugging Face TRL's GRPOTrainer and the
GSM8K grade-school math dataset. The whole point is observable: before training, the 0.5B model just blurts
an answer; after a short GRPO run, it learns to emit a structured <reasoning>…</reasoning><answer>…</answer>
chain and reason its way to the answer — driven entirely by reward, with no supervised reasoning traces.
It runs on one GPU (~17 GB, a few hours on a 3090). That accessibility is the feature: it makes the GRPO recipe something you can actually run and inspect, rather than a 100× larger thing you read about.
Why GRPO
The honest case for GRPO over PPO and DPO:
- No critic network. PPO needs a separately trained value model. GRPO uses the mean reward of a sampled group as an implicit baseline — no critic, half the model memory.
- No preference pairs. DPO needs curated chosen/rejected pairs. GRPO only needs a reward function, and for verifiable tasks like GSM8K the reward (is the final number correct?) is essentially free.
- Reasoning-friendly. Sample K completions per prompt, reward each, normalize advantages within the group — completions that reason their way to the right answer get pushed up relative to those that don't.
The setup (real config)
model = AutoModelForCausalLM.from_pretrained(
"Qwen/Qwen2.5-0.5B-Instruct", torch_dtype=torch.bfloat16
).to("cuda")
dataset = load_dataset("openai/gsm8k", "main")["train"] # grade-school math, #### marks the gold answer
training_args = GRPOConfig(
learning_rate=5e-6, lr_scheduler_type="cosine", warmup_ratio=0.1,
per_device_train_batch_size=1, gradient_accumulation_steps=4,
num_generations=16, # K completions sampled per prompt — the heart of GRPO
max_prompt_length=256, max_completion_length=200,
num_train_epochs=1, max_grad_norm=0.1, bf16=True,
use_vllm=False,
)
The model is prompted with a system instruction to answer in a fixed structure:
<reasoning>
...
</reasoning>
<answer>
...
</answer>
Reward design (five functions)
GRPO's behavior is entirely shaped by the reward stack. This reproduction wires five reward functions into
GRPOTrainer, deliberately split so each signal is observable:
| Reward function | Signal | Value |
|---|---|---|
correctness_reward_func | extracted <answer> exactly matches the GSM8K gold | +2.0 else 0 |
int_reward_func | the answer is a pure integer | +0.5 else 0 |
strict_format_reward_func | output matches the exact <reasoning>\n…\n</reasoning>\n<answer>… regex | +0.5 else 0 |
soft_format_reward_func | output loosely contains both blocks | +0.5 else 0 |
xmlcount_reward_func | graded credit per well-formed tag (0.125 each), minus a tiny penalty for trailing text | up to +0.5 |
Correctness dominates; the four format rewards shape the structure of the reasoning so it stays parseable.
This staircase — strict format, soft format, tag-counting — is what reliably pulls a tiny model toward clean
<reasoning>/<answer> output instead of free-form text.
What actually happens
This is an honest, demo-scale run, so the headline is qualitative, not a benchmark number:
- Before GRPO: asked "Joy can read 8 pages in 20 minutes. How many hours to read 120 pages?", the base 0.5B model just emits a bare answer — no reasoning, often no structure.
- After GRPO (1 epoch on GSM8K): the model reliably produces a
<reasoning>block that works the problem step by step, then a<answer>block with the number. Reward alone taught it to show its work.
The notebook doesn't claim an accuracy SOTA — a 0.5B model on a few hours of GRPO won't beat much. What it demonstrates cleanly is the mechanism: a group of sampled completions + verifiable rewards is enough to induce reasoning-chain behavior, which is exactly the recipe DeepSeek-R1 scaled up.
Honest scope
- Model: 0.5B (Qwen2.5-0.5B-Instruct) — ~100× smaller than the R1-scale models this recipe targets.
- Data/compute: full GSM8K, 1 epoch, ~17 GB VRAM, a few hours on a single 3090.
- No vLLM, no distributed training —
use_vllm=False, single device. This is a learning reproduction, not a production trainer.
What this project signals
- You understand the GRPO recipe behind DeepSeek-R1 well enough to run it end to end, not just describe it.
- You can design a stack of verifiable reward functions and reason about how each shapes behavior.
- You can tell an honest story — demonstrating a mechanism on a small model, and being clear about what is and isn't claimed.
The live demo actually runs
Not a replay: the demo is a live reward calculator. Edit a model completion and the gold answer, and all five reward functions (correctness / int / strict_format / soft_format / xmlcount) — ported verbatim from the notebook — recompute in your browser. It's the exact scoring GRPOTrainer applies before the group-relative advantage. Bilingual (EN/中文).