veRL PPO Training
Classic PPO on a single GPU with ByteDance's veRL (HybridFlow): four models (Actor/Critic/Reference/Reward) orchestrated by Ray, trained on GSM8K with a rule reward (a regex on the #### answer marker). Paired with a close reading of InstructGPT's three-stage RLHF pipeline.
Replays one PPO iteration on a GSM8K problem: the Actor rolls out a CoT (#### 72), the rule reward scores 1.0, the Critic values it, advantage = reward − value, then Actor/Critic update — four model roles lighting up around the loop.
Why this local version exists
The four model roles, the seven-step loop, the reward regex (#### …), and the step:42 metrics (0.296 / 1702 tok/s) are from the veRL course "LLM RL 强化学习训练入门" (roughly, "Introduction to LLM Reinforcement Learning Training"). No real training runs in the browser.
One PPO iteration (veRL · GSM8K)
Replays the veRL PPO loop on Qwen2.5-0.5B over GSM8K: four model roles (Actor/Reference/Reward/Critic) light up in turn, with a rule reward scoring 0/1 on the #### answer.
Four model roles (Ray)
FSDP + vLLM · HybridFlow · single H800
GSM8K problem
Natalia sold 48 clips in April, half as many in May. Total over both months?
1. Rollout: Actor generates a CoT answer
2. Reward: rule function scores (regex on ####)
3. Critic: estimates value
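4. Advantage: A = reward − value (what GAE reduces to when γ = λ = 1 and the only reward arrives at the end of the response)
5. Actor update (clipped surrogate objective)
6. Critic update (squared error, (value − reward)²)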
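Steps 4 to 6 can be sketched for a single scalar sample as follows. This is a minimal illustration, not veRL's implementation: the function names are hypothetical, the clip range eps = 0.2 is an assumed value (a common PPO default), and real training works on per-token tensors with full GAE.

```python
import math


def advantage(reward: float, value: float) -> float:
    """Step 4: A = reward - value (simplified, single terminal reward)."""
    return reward - value


def clipped_objective(logp_new: float, logp_old: float, adv: float,
                      eps: float = 0.2) -> float:
    """Step 5: PPO clipped surrogate for one sample (to be maximized).

    eps = 0.2 is an assumption for this sketch, not a value from the demo.
    """
    ratio = math.exp(logp_new - logp_old)          # pi_new / pi_old
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    # Take the pessimistic (smaller) of the unclipped and clipped terms.
    return min(ratio * adv, clipped * adv)


def critic_loss(value: float, reward: float) -> float:
    """Step 6: squared error between the value estimate and the reward."""
    return (value - reward) ** 2


# Example: reward 1.0 from the rule function, Critic estimated 0.5.
adv = advantage(reward=1.0, value=0.5)
objective = clipped_objective(logp_new=-1.0, logp_old=-1.0, adv=adv)
loss = critic_loss(value=0.5, reward=1.0)
```

With an unchanged policy the ratio is 1, so the objective equals the advantage. Once the ratio moves outside [1 − eps, 1 + eps] in the direction the advantage favors, the clipped term caps the objective, which removes the incentive to push the policy further in one update.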
reward: take the last 300 chars of the response → match /#### (\-?[0-9\.\,]+)/ → 1.0 if the captured number equals the reference answer / 0.0 otherwise
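A minimal Python sketch of this rule reward, using the 300-character window and the regex quoted above. The function names are hypothetical, and two details are assumptions of this sketch rather than facts stated on this page: the last match in the window is used, and thousands separators are stripped before comparing.

```python
import re

CLIP_CHARS = 300  # only the tail of the response is searched
ANSWER_RE = re.compile(r"#### (\-?[0-9\.\,]+)")


def extract_answer(response: str):
    """Return the number after the last '#### ' marker, or None if absent."""
    tail = response[-CLIP_CHARS:]
    matches = ANSWER_RE.findall(tail)
    if not matches:
        return None
    # Assumption: keep the final match and drop thousands separators.
    return matches[-1].replace(",", "")


def rule_reward(response: str, ground_truth: str) -> float:
    """1.0 if the extracted answer equals the reference string, else 0.0."""
    answer = extract_answer(response)
    return 1.0 if answer is not None and answer == ground_truth else 0.0


cot = "May: 48 / 2 = 24 clips. Total: 48 + 24 = 72 clips.\n#### 72"
score = rule_reward(cot, "72")
```

Because the comparison is an exact string match, a response that reasons correctly but omits the #### marker scores 0.0, which is part of what makes the signal clean and easy to verify.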
What to try
Run one PPO step and watch Actor → Reward → Critic → Advantage → updates in sequence.
Note the rule reward: a regex on the #### answer gives a clean 1.0 / 0.0.
See the four roles (Actor/Reference/Reward/Critic) light up at the right stage.
What this demo proves
You can run classic RLHF PPO end-to-end on an industrial framework (veRL + Ray), on a single GPU.
You understand reward design: verifiable tasks (math) use a rule reward, not a trained RM.
You can place PPO within RLHF (SFT → RM → PPO) and read the InstructGPT source.
Framework
veRL (HybridFlow) · Ray · FSDP + vLLM · Qwen2.5-0.5B on GSM8K
Four roles
Actor (policy) · Critic (value) · Reference (KL) · Reward (rule)
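One common way the Reference role enters classic RLHF PPO is a per-token KL penalty subtracted from the reward, which keeps the Actor close to the frozen Reference policy. The sketch below assumes that formulation. The function name, the coefficient beta = 0.001, and the log-ratio KL estimate are illustrative assumptions, not settings taken from this demo or from veRL.

```python
def kl_penalized_rewards(logp_actor, logp_ref, task_reward, beta=0.001):
    """Per-token rewards: -beta * KL estimate, plus the task reward on the
    final token.

    logp_actor / logp_ref: log-probabilities of the sampled tokens under the
    Actor and the frozen Reference model (equal-length, non-empty lists).
    beta is a hypothetical coefficient chosen for this sketch.
    """
    if not logp_actor or len(logp_actor) != len(logp_ref):
        raise ValueError("need equal-length, non-empty log-prob lists")
    # Simple KL estimate per token: log pi_actor - log pi_ref.
    kl = [a - r for a, r in zip(logp_actor, logp_ref)]
    rewards = [-beta * k for k in kl]
    # The rule reward (1.0 / 0.0) is credited at the last token only.
    rewards[-1] += task_reward
    return rewards


shaped = kl_penalized_rewards(
    logp_actor=[-1.0, -2.0],
    logp_ref=[-1.5, -2.0],
    task_reward=1.0,
    beta=0.1,
)
```

Tokens where the Actor assigns a higher probability than the Reference get a small negative reward, so drifting away from the Reference has a cost that the task reward must outweigh.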
RLHF origin
InstructGPT: SFT → RM → PPO; human labelers preferred outputs from the 1.3B InstructGPT model over those of the 175B GPT-3