Context Engineering Middleware

"Stuff everything into a 200K window" is an anti-pattern — Context Rot: the fuller the window, the more the model ignores middle content (the "fishbowl model"). Context engineering is disciplined management of that window: six functional modules × five strategies, all implemented as stackable LangChain middleware.

The six context modules

An agent's context window decomposes into six functional modules — you can't optimize what you can't size:

System prompt assembly
Conversation history
Memory retrieval & injection
Tool context
Task state / scratchpad
External knowledge / RAG

Conversation history + tool context together are often >50% — the two biggest "taps," so optimization targets them first.

The five strategies

Strategy	What it does
Write	Offload info to a scratchpad / CLAUDE.md / todo.md / files — out of the window
Select	JIT load on demand: RAG / glob / grep / dynamic tool assembly
Compress	Keep "decision + why," drop execution detail
Isolate	Sub-agent context partitioning
Cache	Reuse a static prefix — saves money, not window

Compress isn't one trick — it's a family

The richest family, four sub-techniques:

Compaction: whole-context compress & restart
Hard truncation / LLM summary
Tool Result Clearing: drop verbose raw tool output
Observation Masking: mask old observations

Core principle: keep "what decision and why," drop raw execution detail.

All realized as LangChain middleware

The strategies aren't hand-rolled if-else — they're stackable middleware:

# trim_messages           hard truncation, zero cost
# SummarizationMiddleware  LLM summary (placed in messages, not system, to preserve the cache prefix)
# SubAgentMiddleware       from the deepagents package, auto-injects a task tool for delegation

Note SummarizationMiddleware puts the summary in messages, not system — editing system would break the prefix cache.

Decision priority: Cache first, Isolate later

Which strategy first? "Zero-cost first, complex on demand":

1. Cache               saves 45–90%, day one
2. Compress/tool-result clearing   zero cost, day one
3. Compress/observation masking + trim   first week
4. Isolate/sub-agent   on demand (has architecture cost)
5. Write + Select      on demand

Prompt-caching specifics

Anthropic: cache-read is only 10% of standard (90% savings); min cache unit 1024 tokens (Sonnet 4.5/4, Opus 4.1/4) / 2048 (Sonnet 4.6) / 4096 (Opus 4.5/4.6, Haiku 4.5); TTL default 5 min (some 1 hr); cacheable fields system / messages / tools
DeepSeek: prefix caching is server-side automatic (prompt_cache_hit_tokens / miss_tokens), warming up around the 4th–6th request
Three cache rules: system prompt first, prefix byte-stable, zero code changes

What this signals

Understanding Context Rot: "bigger window ≠ better," so manage actively instead of stuffing
A six-modules × five-strategies framework: you can size each module and name the right strategy
Engineering it: strategies = stackable middleware, not ad-hoc scripts
Ordered savings: Cache first (90% off) → Isolate on demand — zero-cost before complex

Demo strategy

What the demo replays

The demo replays the decision priority: six modules fill the window → stack 'Cache → tool-result clearing → observation masking + trim → Isolate → Write/Select' in order, dropping both window tokens and relative cost. The priority order, the five strategies, and the 90%-off cache mechanics come from the Part 9 courseware; specific numbers are illustrative.

Public preview can be enabled later without redesigning the case-study layout