Qwen3-VL Visual RL with Unsloth + GSPO
Reinforcement learning for a vision-language model on a single consumer GPU: Unsloth + GSPO fine-tunes Qwen3-VL 8B on MathVista visual math, teaching it to emit structured reasoning and a numeric answer.
A reproducible, single-GPU walkthrough of vision-language reinforcement learning. The base model is Qwen3-VL 8B, loaded in 4-bit through Unsloth and tuned with GSPO (sequence-level GRPO) on MathVista. The goal is methodological clarity — data → reward → training → before/after evaluation — not a leaderboard number.
Overview
The task is deliberately concrete: give a vision-language model an image of a chart or figure, and have it solve the math question about it — writing out its reasoning first, then a single numeric answer. The base policy is Qwen3-VL 8B Instruct, and it is nudged toward better behavior with reinforcement learning rather than more supervised data.
What makes the project practical is that the whole loop runs on one consumer GPU (e.g. RTX 4090 / 3090). That is
possible because of Unsloth: 4-bit quantization plus LoRA cuts the memory footprint by more than half, and
Unsloth ships a FastVisionModel wrapper that already exposes the GRPO/GSPO interface and reward-function hooks. The
effort goes into data and reward design, not into the plumbing of getting images into the model.
Pipeline
The notebook is organized as eight stages, each independently inspectable:
- Environment — Unsloth,
transformers 4.57.0,trl 0.22.2, bitsandbytes, PEFT. - Model load —
FastVisionModel.from_pretrained("unsloth/Qwen3-VL-8B-Instruct-unsloth-bnb-4bit"), 4-bit,max_seq_length = 16384(VLMs need a long context: image tokens + prompt + long reasoning). - LoRA wrap —
r = 16,lora_alpha = 16, applied to the language, attention and MLP layers only. The vision encoder is frozen — current vLLM weight-sharing does not support LoRA on vision layers, and we are tuning how the model reasons about what it sees, not its visual acuity. - Data —
AI4Math/MathVista(testmini), filtered to numeric answers, images resized to 512×512 RGB, then reshaped into a<REASONING>…</REASONING><SOLUTION>…</SOLUTION>chat prompt via the Qwen-VL chat template. 100 samples are held out for evaluation. - Baseline — run the untuned model over the eval set and record per-sample
pred / correct / format_ok. - Training —
GRPOTrainerwith two reward functions; GSPO is switched on through config (see below). - Post-training — re-evaluate on the same held-out set and diff against the baseline records.
- Export — save the LoRA adapter, verify the A/B matrices are non-zero, optionally merge to 16/4-bit or GGUF.
Reward design
RL outcomes live or die on the reward. This project keeps two interpretable scalar rewards:
- Formatting reward (weight
0.3):+1for exactly one<REASONING>…</REASONING>block,+1for exactly one<SOLUTION>…</SOLUTION>block. It also subtracts2.0when the output degenerates into repeatedaddCriteriontokens — a known Qwen-VL quirk where the model spams a config-like token; penalizing it keeps the policy from collapsing into gibberish. - Correctness reward (weight
1.0): extract the<SOLUTION>value, then2.0for an exact string match against ground truth,1.5for a numeric match (3vs3.0),0otherwise.
The two are combined as R_total = 0.3 · R_format + 1.0 · R_correct.
What GSPO actually changes
In this codebase GSPO is not a separate algorithm or trainer — it is GRPO with the importance ratio computed at the sequence level. The training config makes the switch explicit:
training_args = GRPOConfig(
learning_rate = 5e-6,
lr_scheduler_type = "cosine",
optim = "adamw_8bit",
per_device_train_batch_size = 8,
gradient_accumulation_steps = 4,
num_generations = 4, # K candidates per prompt
max_prompt_length = 1024,
max_completion_length = 1024,
num_train_epochs = 1,
max_grad_norm = 0.1,
# ↓ this is what turns GRPO into GSPO
importance_sampling_level = "sequence",
mask_truncated_completions = False,
loss_type = "dr_grpo",
)
For each prompt the trainer samples num_generations = 4 completions, scores them with the two reward functions,
normalizes advantages within the group, and applies the update with the importance correction at the whole-sequence
level rather than per token. Sequence-level credit assignment is steadier on long visual-reasoning chains, where a
single mis-sampled token would otherwise dominate a token-level gradient.
Results
Honest, demo-scale numbers from the held-out 100-sample eval (short training run on a single GPU):
| Metric | Before RL | After RL |
|---|---|---|
| Answer accuracy | 5.0% | 6.0% |
| Format compliance | 77.0% | 84.0% |
The accuracy needle barely moves — MathVista is hard and this is a deliberately short run — but the format
compliance jump (77% → 84%) is the honest signal: GSPO reliably pushed the policy toward the required
<REASONING>/<SOLUTION> structure and away from the addCriterion failure mode. The project ships the actual
baseline_records.json / after_records.json so every before/after pair is auditable rather than asserted.
What this project signals
- Working knowledge of the modern LLM-RL stack — TRL's GRPO/GSPO, not textbook PPO.
- Multimodal RL on a budget — Unsloth 4-bit + LoRA makes VLM RL runnable on one consumer GPU.
- Reward engineering — two interpretable rewards, including a real-world failure-mode penalty.
- Evaluation discipline — fixed held-out set, per-sample records, before/after diffing instead of a single number.
The live demo actually runs
Not a replay: the demo is a live reward calculator. Edit a VLM completion and the gold answer, and the two real reward functions (formatting with the addCriterion penalty, λ=0.3; correctness exact-2.0/numeric-1.5, λ=1.0) — ported verbatim from the notebook — recompute in your browser. The real before/after eval (accuracy 5%→6%, format 77%→84%) comes from the project's own records. Bilingual (EN/中文).