GRPO Reasoning Trainer (GSM8K · Qwen2.5-0.5B)

A minimal, reproducible reproduction of the GRPO recipe behind DeepSeek-R1's reasoning behavior — run on a 0.5B model so the whole loop fits on a single consumer GPU. The point is to watch GRPO induce a <reasoning>/<answer> chain from reward alone, not to chase a leaderboard.

Overview

This project reproduces Group Relative Policy Optimization (GRPO) — the RL algorithm DeepSeek used for R1 — on a deliberately small model: Qwen2.5-0.5B-Instruct. It uses Hugging Face TRL's GRPOTrainer and the GSM8K grade-school math dataset. The whole point is observable: before training, the 0.5B model just blurts an answer; after a short GRPO run, it learns to emit a structured <reasoning>…</reasoning><answer>…</answer> chain and reason its way to the answer — driven entirely by reward, with no supervised reasoning traces.

It runs on one GPU (~17 GB, a few hours on a 3090). That accessibility is the feature: it makes the GRPO recipe something you can actually run and inspect, rather than a 100× larger thing you read about.

Why GRPO

The honest case for GRPO over PPO and DPO:

No critic network. PPO needs a separately trained value model. GRPO uses the mean reward of a sampled group as an implicit baseline — no critic, half the model memory.
No preference pairs. DPO needs curated chosen/rejected pairs. GRPO only needs a reward function, and for verifiable tasks like GSM8K the reward (is the final number correct?) is essentially free.
Reasoning-friendly. Sample K completions per prompt, reward each, normalize advantages within the group — completions that reason their way to the right answer get pushed up relative to those that don't.

The setup (real config)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-0.5B-Instruct", torch_dtype=torch.bfloat16
).to("cuda")

dataset = load_dataset("openai/gsm8k", "main")["train"]  # grade-school math, #### marks the gold answer

training_args = GRPOConfig(
    learning_rate=5e-6, lr_scheduler_type="cosine", warmup_ratio=0.1,
    per_device_train_batch_size=1, gradient_accumulation_steps=4,
    num_generations=16,          # K completions sampled per prompt — the heart of GRPO
    max_prompt_length=256, max_completion_length=200,
    num_train_epochs=1, max_grad_norm=0.1, bf16=True,
    use_vllm=False,
)

The model is prompted with a system instruction to answer in a fixed structure:

<reasoning>
...
</reasoning>
<answer>
...
</answer>

Reward design (five functions)

GRPO's behavior is entirely shaped by the reward stack. This reproduction wires five reward functions into GRPOTrainer, deliberately split so each signal is observable:

Reward function	Signal	Value
`correctness_reward_func`	extracted `<answer>` exactly matches the GSM8K gold	+2.0 else 0
`int_reward_func`	the answer is a pure integer	+0.5 else 0
`strict_format_reward_func`	output matches the exact `<reasoning>\n…\n</reasoning>\n<answer>…` regex	+0.5 else 0
`soft_format_reward_func`	output loosely contains both blocks	+0.5 else 0
`xmlcount_reward_func`	graded credit per well-formed tag (0.125 each), minus a tiny penalty for trailing text	up to +0.5

Correctness dominates; the four format rewards shape the structure of the reasoning so it stays parseable. This staircase — strict format, soft format, tag-counting — is what reliably pulls a tiny model toward clean <reasoning>/<answer> output instead of free-form text.

What actually happens

This is an honest, demo-scale run, so the headline is qualitative, not a benchmark number:

Before GRPO: asked "Joy can read 8 pages in 20 minutes. How many hours to read 120 pages?", the base 0.5B model just emits a bare answer — no reasoning, often no structure.
After GRPO (1 epoch on GSM8K): the model reliably produces a <reasoning> block that works the problem step by step, then a <answer> block with the number. Reward alone taught it to show its work.

The notebook doesn't claim an accuracy SOTA — a 0.5B model on a few hours of GRPO won't beat much. What it demonstrates cleanly is the mechanism: a group of sampled completions + verifiable rewards is enough to induce reasoning-chain behavior, which is exactly the recipe DeepSeek-R1 scaled up.

Honest scope

Model: 0.5B (Qwen2.5-0.5B-Instruct) — ~100× smaller than the R1-scale models this recipe targets.
Data/compute: full GSM8K, 1 epoch, ~17 GB VRAM, a few hours on a single 3090.
No vLLM, no distributed training — use_vllm=False, single device. This is a learning reproduction, not a production trainer.

What this project signals

You understand the GRPO recipe behind DeepSeek-R1 well enough to run it end to end, not just describe it.
You can design a stack of verifiable reward functions and reason about how each shapes behavior.
You can tell an honest story — demonstrating a mechanism on a small model, and being clear about what is and isn't claimed.

Demo strategy

The live demo actually runs

Not a replay: the demo is a live reward calculator. Edit a model completion and the gold answer, and all five reward functions (correctness / int / strict_format / soft_format / xmlcount) — ported verbatim from the notebook — recompute in your browser. It's the exact scoring GRPOTrainer applies before the group-relative advantage. Bilingual (EN/中文).

Public preview can be enabled later without redesigning the case-study layout