Qwen3-VL Visual RL with Unsloth + GSPO

A reproducible, single-GPU walkthrough of vision-language reinforcement learning. The base model is Qwen3-VL 8B, loaded in 4-bit through Unsloth and tuned with GSPO (sequence-level GRPO) on MathVista. The goal is methodological clarity — data → reward → training → before/after evaluation — not a leaderboard number.

Overview

The task is deliberately concrete: give a vision-language model an image of a chart or figure, and have it solve the math question about it — writing out its reasoning first, then a single numeric answer. The base policy is Qwen3-VL 8B Instruct, and it is nudged toward better behavior with reinforcement learning rather than more supervised data.

What makes the project practical is that the whole loop runs on one consumer GPU (e.g. RTX 4090 / 3090). That is possible because of Unsloth: 4-bit quantization plus LoRA cuts the memory footprint by more than half, and Unsloth ships a FastVisionModel wrapper that already exposes the GRPO/GSPO interface and reward-function hooks. The effort goes into data and reward design, not into the plumbing of getting images into the model.

Pipeline

The notebook is organized as eight stages, each independently inspectable:

Environment — Unsloth, transformers 4.57.0, trl 0.22.2, bitsandbytes, PEFT.
Model load — FastVisionModel.from_pretrained("unsloth/Qwen3-VL-8B-Instruct-unsloth-bnb-4bit"), 4-bit, max_seq_length = 16384 (VLMs need a long context: image tokens + prompt + long reasoning).
LoRA wrap — r = 16, lora_alpha = 16, applied to the language, attention and MLP layers only. The vision encoder is frozen — current vLLM weight-sharing does not support LoRA on vision layers, and we are tuning how the model reasons about what it sees, not its visual acuity.
Data — AI4Math/MathVista (testmini), filtered to numeric answers, images resized to 512×512 RGB, then reshaped into a <REASONING>…</REASONING><SOLUTION>…</SOLUTION> chat prompt via the Qwen-VL chat template. 100 samples are held out for evaluation.
Baseline — run the untuned model over the eval set and record per-sample pred / correct / format_ok.
Training — GRPOTrainer with two reward functions; GSPO is switched on through config (see below).
Post-training — re-evaluate on the same held-out set and diff against the baseline records.
Export — save the LoRA adapter, verify the A/B matrices are non-zero, optionally merge to 16/4-bit or GGUF.

Reward design

RL outcomes live or die on the reward. This project keeps two interpretable scalar rewards:

Formatting reward (weight 0.3): +1 for exactly one <REASONING>…</REASONING> block, +1 for exactly one <SOLUTION>…</SOLUTION> block. It also subtracts 2.0 when the output degenerates into repeated addCriterion tokens — a known Qwen-VL quirk where the model spams a config-like token; penalizing it keeps the policy from collapsing into gibberish.
Correctness reward (weight 1.0): extract the <SOLUTION> value, then 2.0 for an exact string match against ground truth, 1.5 for a numeric match (3 vs 3.0), 0 otherwise.

The two are combined as R_total = 0.3 · R_format + 1.0 · R_correct.

What GSPO actually changes

In this codebase GSPO is not a separate algorithm or trainer — it is GRPO with the importance ratio computed at the sequence level. The training config makes the switch explicit:

training_args = GRPOConfig(
    learning_rate = 5e-6,
    lr_scheduler_type = "cosine",
    optim = "adamw_8bit",
    per_device_train_batch_size = 8,
    gradient_accumulation_steps = 4,
    num_generations = 4,        # K candidates per prompt
    max_prompt_length = 1024,
    max_completion_length = 1024,
    num_train_epochs = 1,
    max_grad_norm = 0.1,
    # ↓ this is what turns GRPO into GSPO
    importance_sampling_level = "sequence",
    mask_truncated_completions = False,
    loss_type = "dr_grpo",
)

For each prompt the trainer samples num_generations = 4 completions, scores them with the two reward functions, normalizes advantages within the group, and applies the update with the importance correction at the whole-sequence level rather than per token. Sequence-level credit assignment is steadier on long visual-reasoning chains, where a single mis-sampled token would otherwise dominate a token-level gradient.

Results

Honest, demo-scale numbers from the held-out 100-sample eval (short training run on a single GPU):

Metric	Before RL	After RL
Answer accuracy	5.0%	6.0%
Format compliance	77.0%	84.0%

The accuracy needle barely moves — MathVista is hard and this is a deliberately short run — but the format compliance jump (77% → 84%) is the honest signal: GSPO reliably pushed the policy toward the required <REASONING>/<SOLUTION> structure and away from the addCriterion failure mode. The project ships the actual baseline_records.json / after_records.json so every before/after pair is auditable rather than asserted.

What this project signals

Working knowledge of the modern LLM-RL stack — TRL's GRPO/GSPO, not textbook PPO.
Multimodal RL on a budget — Unsloth 4-bit + LoRA makes VLM RL runnable on one consumer GPU.
Reward engineering — two interpretable rewards, including a real-world failure-mode penalty.
Evaluation discipline — fixed held-out set, per-sample records, before/after diffing instead of a single number.

Demo strategy

The live demo actually runs

Not a replay: the demo is a live reward calculator. Edit a VLM completion and the gold answer, and the two real reward functions (formatting with the addCriterion penalty, λ=0.3; correctness exact-2.0/numeric-1.5, λ=1.0) — ported verbatim from the notebook — recompute in your browser. The real before/after eval (accuracy 5%→6%, format 77%→84%) comes from the project's own records. Bilingual (EN/中文).

Public preview can be enabled later without redesigning the case-study layout