PPO loop replay

veRL PPO Training

Classic PPO on a single GPU with ByteDance's veRL (HybridFlow): four model roles (Actor, Critic, Reference, Reward) orchestrated with Ray, trained on GSM8K with a rule-based reward (a regex on the #### answer). Paired with a close reading of InstructGPT's three-stage RLHF.

Replays one PPO iteration on a GSM8K problem: the Actor rolls out a chain-of-thought answer ending in #### 72, the rule reward scores it 1.0, the Critic estimates its value, the advantage is computed as reward − value, and then the Actor and Critic update. The four model roles light up in turn around the loop.

veRL · PPO · RLHF · Ray · vLLM


Why this local version exists

The four model roles, the seven-step loop, the reward regex (#### …), and the step:42 metrics (0.296 / 1702 tok/s) come from the veRL course "LLM RL 强化学习训练入门" ("Introduction to LLM RL Training"). No real training runs in the browser; the demo only replays the loop.

Interactive Preview

One PPO iteration (veRL · GSM8K)

Replays the veRL PPO loop on Qwen2.5-0.5B over GSM8K: four model roles (Actor/Reference/Reward/Critic) light up in turn, with a rule reward scoring 0/1 on the #### answer.

Four model roles (Ray)

Actor

Reference

Reward

Critic

FSDP + vLLM · HybridFlow · single H800

GSM8K problem

Natalia sold 48 clips in April, half as many in May. Total over both months?

1. Rollout: Actor generates a CoT answer

2. Reward: rule function scores (regex on ####)

3. Critic: estimates value

4. Advantage: A = reward − value (GAE)

5. Actor update (Clipped Objective)

6. Critic update ((value−reward)²)
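Steps 4 to 6 can be sketched in a few lines of plain Python. This is a minimal illustration, not veRL's implementation: the function names, the defaults gamma = lam = 1.0, and clip_eps = 0.2 are assumptions chosen so that the advantage reduces to the "reward − value" shorthand used above.

```python
import math

def gae_advantages(rewards, values, gamma=1.0, lam=1.0):
    """Generalized Advantage Estimation over one rollout.

    rewards[t] and values[t] are per-token; the value after the last
    token is taken to be 0. With gamma = lam = 1 (an assumption here),
    each advantage equals (sum of future rewards) - values[t].
    """
    advantages = [0.0] * len(rewards)
    next_value = 0.0
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]  # TD error
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages

def clipped_policy_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """PPO clipped surrogate objective, negated so it can be minimized."""
    losses = []
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)  # pi_new / pi_old for this token
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        losses.append(-min(ratio * adv, clipped * adv))
    return sum(losses) / len(losses)

def value_loss(values, returns):
    """Mean squared error between Critic values and return targets."""
    return sum((v - r) ** 2 for v, r in zip(values, returns)) / len(values)

# Hypothetical three-token rollout: reward 1.0 arrives on the last token.
rewards = [0.0, 0.0, 1.0]
values = [0.4, 0.5, 0.7]
advantages = gae_advantages(rewards, values)
returns = [a + v for a, v in zip(advantages, values)]
```

With these made-up numbers the advantages come out to roughly 0.6, 0.5, and 0.3, that is, 1.0 minus each value estimate, and every return target is 1.0, so the Critic loss is the squared gap between value and reward.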

Reward: take the last 300 characters of the rollout → match /#### (\-?[0-9\.\,]+)/ → 1.0 if the extracted answer is correct, 0.0 otherwise
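The rule reward described above might look like the following. The regex and the 300-character tail come from the text; the function names, the choice to take the last match, and the comma stripping are assumptions made for this sketch.

```python
import re

# Pattern quoted in the text: "#### " followed by an optionally signed number.
ANSWER_RE = re.compile(r"#### (\-?[0-9\.\,]+)")

def extract_answer(solution, tail_chars=300):
    """Return the last '#### <number>' answer found in the tail, or None."""
    tail = solution[-tail_chars:]
    matches = ANSWER_RE.findall(tail)
    if not matches:
        return None
    # Assumption: drop thousands separators so "1,000" compares equal to "1000".
    return matches[-1].replace(",", "")

def rule_reward(solution, ground_truth):
    """1.0 if the extracted answer equals the ground truth string, else 0.0."""
    answer = extract_answer(solution)
    return 1.0 if answer is not None and answer == ground_truth else 0.0

rollout = (
    "Natalia sold 48 clips in April and 48 / 2 = 24 clips in May. "
    "48 + 24 = 72\n#### 72"
)
score = rule_reward(rollout, "72")  # 1.0
```

A rollout that reaches the right number but omits the #### marker scores 0.0 under this rule, so the reward checks the output format as well as the arithmetic.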

What to try

Run one PPO step and watch Actor → Reward → Critic → Advantage → updates in sequence.

Note the rule reward: a regex on the #### answer gives a clean 1.0 / 0.0.

See the four roles (Actor/Reference/Reward/Critic) light up at the right stage.

What this demo proves

You can run classic RLHF PPO end-to-end on an industrial framework (veRL + Ray), on a single GPU.

You understand reward design: verifiable tasks (math) use a rule reward, not a trained RM.

You can place PPO within RLHF (SFT → RM → PPO) and read the InstructGPT source.

Framework

veRL (HybridFlow) · Ray · FSDP + vLLM · Qwen2.5-0.5B on GSM8K

Four roles

Actor (policy) · Critic (value) · Reference (KL) · Reward (rule)

RLHF origin

InstructGPT: SFT → RM → PPO; human labelers preferred outputs from the 1.3B model over those from 175B GPT-3
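The Reference role's KL term can be illustrated with the per-token penalty used in InstructGPT-style RLHF, where the reward is reduced in proportion to how far the Actor's log-probabilities drift from the Reference model's. This is a sketch under assumptions: the function name and the beta value are hypothetical, and it makes no claim about how veRL is configured to apply KL.

```python
def kl_shaped_rewards(score, logp_actor, logp_ref, beta=0.001):
    """Per-token rewards: a KL penalty on every token, plus the scalar
    score (for example the 1.0 / 0.0 rule reward) on the final token.

    logp_actor[t] and logp_ref[t] are log-probabilities of the sampled
    token under the Actor and the frozen Reference model.
    """
    rewards = [-beta * (la - lr) for la, lr in zip(logp_actor, logp_ref)]
    rewards[-1] += score
    return rewards

# Hypothetical two-token rollout: the Actor drifts from the Reference on token 2.
shaped = kl_shaped_rewards(1.0, [-1.0, -2.0], [-1.0, -2.5], beta=0.1)
```

Here the first token carries no penalty because both models agree, while the second token's reward is the score 1.0 minus 0.1 × 0.5.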
