Qwen3-VL Visual RL with Unsloth + GSPO
Single-GPU visual RL: Unsloth + GSPO fine-tunes Qwen3-VL 8B on MathVista visual math, lifting output-format compliance from 77% to 84% with format + correctness rewards.
A live reward calculator, not a replay. Edit a VLM completion and the gold answer; the two real reward functions — formatting (λ=0.3, with the addCriterion penalty) and correctness (λ=1.0, exact 2.0 / numeric 1.5) — recompute in your browser.
Why this local version exists
The two reward functions are pure string logic, ported verbatim, so they run client-side. The real before/after eval (accuracy 5%→6%, format 77%→84%) comes from the project's own records.
GSPO reward calculator — the real two functions
Edit a VLM completion and the gold answer. The formatting reward (λ=0.3, with the real addCriterion penalty) and correctness reward (λ=1.0, exact 2.0 / numeric 1.5 / else 0) — ported verbatim from the notebook — recompute live. This is the exact reward each sampled completion gets before the GSPO update.
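As a rough illustration of what the calculator recomputes, here is a minimal Python sketch of the two rewards and their weighted sum. The weights (0.3 and 1.0), the 2.0 / 1.5 / 0 correctness tiers, and the size of the addCriterion penalty come from this page. The tag names, the +1.0-per-tag-pair format scoring, and the 50% threshold that triggers the penalty are assumptions made for the example, not the ported code itself.

```python
import re

# Assumed tag names; the page does not state which delimiters the notebook uses.
REASONING_START, REASONING_END = "<REASONING>", "</REASONING>"
SOLUTION_START, SOLUTION_END = "<SOLUTION>", "</SOLUTION>"

FORMAT_WEIGHT = 0.3       # lambda for the formatting reward (from the page)
CORRECTNESS_WEIGHT = 1.0  # lambda for the correctness reward (from the page)


def _spans(text, start, end):
    """Return every non-greedy span found between the start and end tags."""
    return re.findall(re.escape(start) + r"(.*?)" + re.escape(end), text, re.DOTALL)


def formatting_reward(completion):
    """Assumed scoring: +1.0 for each tag pair that appears exactly once."""
    score = 0.0
    if len(_spans(completion, REASONING_START, REASONING_END)) == 1:
        score += 1.0
    if len(_spans(completion, SOLUTION_START, SOLUTION_END)) == 1:
        score += 1.0
    # addCriterion penalty: -2.0 when "addCriterion" plus newlines make up
    # at least half of the characters (the 0.5 threshold is an assumption).
    if completion:
        stripped = completion.replace("addCriterion", "").replace("\n", "")
        if (len(completion) - len(stripped)) / len(completion) >= 0.5:
            score -= 2.0
    return score


def correctness_reward(completion, gold):
    """2.0 for an exact string match, 1.5 for a numeric match, else 0.0."""
    answers = _spans(completion, SOLUTION_START, SOLUTION_END)
    if len(answers) != 1:
        return 0.0
    predicted = answers[0].strip()
    if predicted == gold.strip():
        return 2.0
    try:
        if float(predicted) == float(gold):
            return 1.5
    except ValueError:
        pass  # at least one side is not a number, so no numeric credit
    return 0.0


def total_reward(completion, gold):
    """Weighted sum: 0.3 * format + 1.0 * correctness."""
    return (FORMAT_WEIGHT * formatting_reward(completion)
            + CORRECTNESS_WEIGHT * correctness_reward(completion, gold))
```

Under these assumptions, a well-formed completion ending in `<SOLUTION>991</SOLUTION>` against the gold answer `991` scores 0.3 · 2.0 + 1.0 · 2.0 = 2.60, and the same completion with `991.0` scores 0.3 · 2.0 + 1.0 · 1.5 = 2.10.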
Presets
Gold answer
VLM completion (editable)
Total reward
2.60
0.3 · format + 1.0 · correctness
exact string match "991" → 2.0
Note the gap between this reward and the eval metric: the eval records mark 991.0 as incorrect because they use strict string matching, yet the calculator awards it 1.5 for a numeric match. Because strict matching undercounts numerically correct answers, the format-compliance gain (77%→84%) is a more trustworthy signal than the raw accuracy gain (5%→6%).
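The mismatch described above can be reproduced in a few lines. This is a hedged sketch: `strict_match` and `numeric_match` are hypothetical helper names standing in for the eval-side check and the reward-side numeric fallback, and the three predictions are toy inputs, not project records.

```python
def strict_match(predicted: str, gold: str) -> bool:
    """Eval-style check: the two strings must be identical."""
    return predicted == gold


def numeric_match(predicted: str, gold: str) -> bool:
    """Reward-style fallback: both strings must parse to the same number."""
    try:
        return float(predicted) == float(gold)
    except ValueError:
        return False


gold = "991"
for predicted in ["991", "991.0", "990"]:
    # Print the prediction next to both verdicts to expose the disagreement.
    print(predicted, strict_match(predicted, gold), numeric_match(predicted, gold))
```

Only `991.0` splits the two checks: it fails the strict comparison but passes the numeric one, which is the case where the reward gives partial credit and the eval counts a miss.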
What to try
Load a preset, then edit the completion — both rewards update live.
Try the "numeric match" preset: 991.0 earns correctness 1.5 even though strict eval marks it wrong.
Paste addCriterion spam and watch the formatting reward take the −2 penalty.
What this demo proves
You can run the modern LLM-RL stack (TRL GRPO/GSPO), not textbook PPO.
You know how GSPO differs from GRPO: importance sampling is applied at the sequence level rather than per token, and the switch is a single config flag.
You report honest before/after metrics (acc 5%→6%, format 77%→84%) from auditable records, not a single hero number.
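To make the sequence-level importance sampling point above concrete, the sketch below contrasts the two ratios on toy per-token log-probabilities. It uses only the standard library and hypothetical function names. The GRPO-style ratio is computed per token, while the GSPO-style ratio is the length-normalized likelihood ratio of the whole completion. In recent TRL releases this choice appears to be exposed as the `importance_sampling_level` option on `GRPOConfig`; that name is an assumption here, since the page only says "one config flag".

```python
import math


def token_level_ratios(new_logps, old_logps):
    """GRPO-style: one importance ratio per generated token."""
    return [math.exp(n - o) for n, o in zip(new_logps, old_logps)]


def sequence_level_ratio(new_logps, old_logps):
    """GSPO-style: a single length-normalized ratio for the whole completion."""
    mean_diff = sum(n - o for n, o in zip(new_logps, old_logps)) / len(new_logps)
    return math.exp(mean_diff)


# Toy log-probabilities of one 3-token completion under the new and old policy.
new_logps = [-1.0, -2.0, -0.5]
old_logps = [-1.2, -1.5, -0.5]

print(token_level_ratios(new_logps, old_logps))
print(sequence_level_ratio(new_logps, old_logps))
```

The sequence-level value equals the geometric mean of the token-level ratios, so one outlier token moves a single shared weight for the completion instead of getting its own, possibly extreme, per-token weight.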
Stack
Unsloth 4-bit + LoRA · TRL GRPOTrainer · Qwen3-VL 8B
Rewards
format (λ=0.3, addCriterion penalty) + correctness (λ=1.0)
Real result
Format compliance 77% → 84% on held-out eval