GSPO training step replay

Qwen3-VL Visual RL with Unsloth + GSPO

Single-GPU visual RL: Unsloth + GSPO fine-tunes Qwen3-VL 8B on MathVista visual math, lifting output-format compliance from 77% to 84% with format + correctness rewards.

A live reward calculator, not a replay. Edit a VLM completion and the gold answer, and the two real reward functions recompute in your browser: formatting (λ=0.3, with the addCriterion penalty) and correctness (λ=1.0, exact 2.0 / numeric 1.5).

Qwen3-VL · GSPO · Unsloth · LoRA · TRL · MathVista


Why this local version exists

The two reward functions are pure string logic, ported verbatim, so they run client-side. The real before/after eval (accuracy 5%→6%, format 77%→84%) comes from the project's own records.

Live · runs in your browser

GSPO reward calculator: the two real functions

Edit a VLM completion and the gold answer. The formatting reward (λ=0.3, with the real addCriterion penalty) and the correctness reward (λ=1.0; exact 2.0 / numeric 1.5 / else 0), both ported verbatim from the notebook, recompute live. This is the exact reward each sampled completion gets before the GSPO update.
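To make the formatting reward concrete, here is a minimal Python sketch of the rule as the calculator describes it: +1 for exactly one <REASONING> block, +1 for exactly one <SOLUTION> block, and −2 for addCriterion spam. It is not the notebook's code. The closing-tag names, the function name, and above all the spam trigger (addCriterion plus newlines making up at least half of the characters) are assumptions; the page only states that the penalty exists and costs 2.

```python
import re

# Assumed tag syntax: the page shows <REASONING> and <SOLUTION>;
# the matching closing tags are a guess.
REASONING_RE = re.compile(r"<REASONING>(.*?)</REASONING>", re.DOTALL)
SOLUTION_RE = re.compile(r"<SOLUTION>(.*?)</SOLUTION>", re.DOTALL)


def formatting_reward(completion: str) -> float:
    """Unweighted formatting score for one completion (hypothetical helper)."""
    score = 0.0
    # +1 only when there is exactly one block of each kind.
    if len(REASONING_RE.findall(completion)) == 1:
        score += 1.0
    if len(SOLUTION_RE.findall(completion)) == 1:
        score += 1.0
    # Assumed spam trigger: "addCriterion" and newlines account for
    # at least half of the completion's characters.
    if completion:
        stripped = completion.replace("addCriterion", "").replace("\n", "")
        if (len(completion) - len(stripped)) / len(completion) >= 0.5:
            score -= 2.0
    return score


good = "<REASONING>Add the two bars.</REASONING><SOLUTION>991</SOLUTION>"
spam = "addCriterion\n" * 20
print(formatting_reward(good), formatting_reward(spam))
```

The score returned here is the raw value; the λ=0.3 weight is applied when the two rewards are combined.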

Presets

Gold answer

VLM completion (editable)

Total reward

2.60

0.3 · format + 1.0 · correctness

formatting · λ=0.3 → 0.60

one <REASONING> +1 · one <SOLUTION> +1

correctness · λ=1.0 → 2.00

exact string match "991" → 2.0
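The breakdown above is a plain weighted sum. As a worked check, assuming the raw formatting score is 2.0 (one tag of each kind) and the raw correctness score is 2.0 (exact match), the sketch below reproduces the displayed total. The constant and function names are hypothetical.

```python
FORMAT_WEIGHT = 0.3       # λ for the formatting reward
CORRECTNESS_WEIGHT = 1.0  # λ for the correctness reward


def total_reward(format_score: float, correctness_score: float) -> float:
    """Weighted sum of the two raw reward scores."""
    return FORMAT_WEIGHT * format_score + CORRECTNESS_WEIGHT * correctness_score


# 0.3 * 2.0 + 1.0 * 2.0
print(f"{total_reward(2.0, 2.0):.2f}")  # prints 2.60
```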

Note the gap between this reward and the eval metric: the eval records mark 991.0 as incorrect (strict string match), whereas the correctness reward gives it 1.5 (numeric match). That difference is why the format-compliance gain (77%→84%) is a cleaner signal than the raw accuracy gain (5%→6%).
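The difference between the two checks can be sketched in a few lines of Python. This is an illustration, not the project's code: the answer extraction from a <SOLUTION> block, the closing-tag name, and all function names are assumptions. Only the scores (2.0 exact, 1.5 numeric, 0 otherwise) and the strict string match used by the eval come from the page.

```python
import re
from typing import Optional

# Assumed tag syntax; the closing tag is a guess.
SOLUTION_RE = re.compile(r"<SOLUTION>(.*?)</SOLUTION>", re.DOTALL)


def extract_answer(completion: str) -> Optional[str]:
    """Return the text of the first <SOLUTION> block, or None if absent."""
    match = SOLUTION_RE.search(completion)
    return match.group(1).strip() if match else None


def correctness_reward(completion: str, gold: str) -> float:
    """2.0 for an exact string match, 1.5 for a numeric match, else 0."""
    answer = extract_answer(completion)
    if answer is None:
        return 0.0
    if answer == gold.strip():
        return 2.0
    try:
        if float(answer) == float(gold):
            return 1.5
    except ValueError:
        pass  # one side is not a number, so no numeric credit
    return 0.0


def strict_eval_correct(completion: str, gold: str) -> bool:
    """Eval-style check: only an exact string match counts."""
    return extract_answer(completion) == gold.strip()


completion = "<SOLUTION>991.0</SOLUTION>"
print(correctness_reward(completion, "991"), strict_eval_correct(completion, "991"))
```

Under these assumptions, a completion answering 991.0 collects partial reward during training while still counting as a miss in the eval.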

What to try

Load a preset, then edit the completion — both rewards update live.

Try the "numeric match" preset: 991.0 earns correctness 1.5 even though strict eval marks it wrong.

Paste addCriterion spam and watch the formatting reward take the −2 penalty.

What this demo proves

You can run the modern LLM-RL stack (TRL GRPO/GSPO), not textbook PPO.

You know how GSPO differs from GRPO — sequence-level importance sampling, one config flag.

You report honest before/after metrics (acc 5%→6%, format 77%→84%) from auditable records, not a single hero number.
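The sequence-level importance sampling named in the list above can be illustrated numerically. The sketch below is not the TRL implementation. It only contrasts GRPO's per-token ratios with GSPO's single length-normalised ratio per completion, using made-up log-probabilities and hypothetical function names. In TRL's `GRPOConfig` the switch appears to be the `importance_sampling_level` option (`"token"` versus `"sequence"`); treat that as an assumption to verify against the installed TRL version.

```python
import math
from typing import List


def token_ratios(new_logps: List[float], old_logps: List[float]) -> List[float]:
    """GRPO-style: one importance ratio per token."""
    return [math.exp(n - o) for n, o in zip(new_logps, old_logps)]


def sequence_ratio(new_logps: List[float], old_logps: List[float]) -> float:
    """GSPO-style: one ratio per completion, normalised by its length.

    Equals (p_new(y|x) / p_old(y|x)) ** (1 / len(y)), which is the
    geometric mean of the per-token ratios.
    """
    diffs = [n - o for n, o in zip(new_logps, old_logps)]
    return math.exp(sum(diffs) / len(diffs))


# Made-up per-token log-probabilities for a three-token completion.
new_logps = [-1.0, -2.0, -0.5]
old_logps = [-1.2, -1.9, -0.5]

print(token_ratios(new_logps, old_logps))    # three separate ratios
print(sequence_ratio(new_logps, old_logps))  # one shared ratio
```

With the sequence-level ratio, every token in a completion is weighted (and clipped) by the same value, rather than each token carrying its own ratio.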

Stack

Unsloth 4-bit + LoRA · TRL GRPOTrainer · Qwen3-VL 8B

Rewards

format (λ=0.3, addCriterion penalty) + correctness (λ=1.0)

Real result

Format compliance 77% → 84% on held-out eval
