Back to case study
The model × design formula

Harness Engineering in Practice

Output quality = model capability × design level. The four pillars of engineering an agent runtime (codebase-as-truth / mechanized constraints / feedback loops / entropy mgmt). Measured: model unchanged, the Harness alone lifts Terminal Bench 52.8% → 66.5%.

Install the four Harness pillars one by one and watch the benchmark climb from a bare model (52.8%) to a full Harness (66.5%) — with the model held constant.

Harness EngineeringClaude CodeHooksAgent Runtime
Harness Engineering in Practice

Why this local version exists

The 52.8% / 66.5% endpoints are the real LangChain measurements on Terminal Bench 2.0 (same GPT-5.2-Codex, harness-only change) cited in the course; the climb between is illustrative. Pillar names come from the Harness Engineering deck.

Interactive Preview

Same model — lift the benchmark with the Harness

Output quality = model capability × design level. Install the four Harness pillars — model unchanged (GPT-5.2-Codex) — and Terminal Bench 2.0 goes 52.8% → 66.5%.

① Codebase as source of truth

CLAUDE.md / AGENTS.md (~100 lines) — declarative project knowledge.

② Mechanized architectural constraints

"CLAUDE.md is advice, Hooks are law": PreToolUse / PostToolUse enforcement.

③ Feedback loops

Four levels: instant (Hooks) → build (CI/CD) → two cross-session layers.

④ Entropy management

Fight doc drift / architecture erosion / style drift / duplication.

Terminal Bench 2.0

52.8%

bare model 52.8%full Harness 66.5%

The climb is illustrative; 52.8% / 66.5% are the real measured endpoints from the course.

Harness completeness

0/4 pillars

What to try

Install the Harness and watch the four pillars light up in sequence.

Note the endpoints: 52.8% bare → 66.5% with the full Harness, model unchanged.

Compare against a model upgrade (+6.8pp) — the Harness is ~2× that gain.

What this demo proves

You internalize the "output quality = model capability × design level" multiplier.

You can name and apply the four pillars: codebase-as-truth, mechanized constraints (Hooks), feedback loops, entropy management.

You argue from measured benchmark data, not vibes — and can pick deep vs light Harness platforms by scenario.

Headline result

Terminal Bench 2.0: 52.8% → 66.5% (+13.7pp) from the Harness alone

Four pillars

Codebase-as-truth · mechanized constraints (Hooks) · feedback loops · entropy mgmt

Best signal

Agent-runtime engineering judgment, measured not vibed