Move past fitting tokens into the window and learn the discipline of context engineering: curating the smallest high-signal token set, fighting context rot, and applying write, select, compress, and isolate strategies plus tool-result pruning and sub-agent isolation.
Long-context window management covers the mechanics of making a big context window work: KV-cache math, RoPE scaling, prefill-versus-decode bottlenecks, and where the lost-in-the-middle penalty bites. It answers "how do I fit more tokens into one model call and serve it cheaply?"
Here the engineering question changes: "given that I can fit a lot of tokens, what should go in the window, and what should stay out?" That discipline is called context engineering. Window management provides capacity; context engineering decides what evidence, tools, history, and state deserve that capacity.
The shift matters because more tokens don't guarantee better answers. A model may accept a very large prompt yet perform worse on a workload than it does with a smaller curated packet. The job is no longer only "make room." It's also to compare curation policies on answer quality, latency, and cost.
Picture an incident agent investigating why deploy RUN-842 failed its canary. It reads the deploy record, checks CI logs, searches rollback runbooks, and follows request traces. Every tool call dumps its raw output back into the conversation. After forty turns the context holds the original alert, four full runbooks (most of which were dead ends), six multi-thousand-token trace exports, three issue-search results, and one early hallucinated guess that a database migration lock caused the outage.
The agent now performs worse than it did at turn five. It re-runs the dead-end issue search, cites a migration lock that doesn't exist, and picks a verbose, irrelevant rollback paragraph over the trace span that matters. The window is nowhere near full, yet the agent is failing.
This isn't a window-management problem. The KV cache fits, latency is fine, and RoPE isn't stretched. The context has accumulated low-signal and even wrong tokens, and the model is dutifully attending to all of them. Context engineering is the set of techniques that would have prevented this failure. The failed-canary agent will anchor the rest of the chapter.
For a few years the applied-AI conversation was dominated by prompt engineering: finding the right words and phrasing for a single instruction. Anthropic frames context engineering as the natural successor to that practice.[1] Prompt engineering is about writing one good instruction. Context engineering is the broader discipline of curating and maintaining the entire set of tokens present during inference: the system prompt, tool definitions, retrieved documents, conversation history, tool results, and any memory loaded back in.
Anthropic's framing is worth memorizing because it gives you a single guiding principle:
Find the smallest set of high-signal tokens that maximize the likelihood of your desired outcome.[1]
That's the whole job in one sentence. Every technique below is a way to push toward that minimal high-signal set. A 2025 survey of the field collects the same strategies under a formal taxonomy, confirming this is now a named subdiscipline rather than a collection of tips.[2]
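To make "the entire set of tokens" concrete, the sketch below tallies how one model call's window is split across the components named above. It is a rough illustration under stated assumptions: `estimate_tokens` is a hypothetical stand-in that counts whitespace-separated words rather than real tokens, and the sample window contents are invented for the failed-canary example.

```python
def estimate_tokens(text: str) -> int:
    # Hypothetical proxy: counts whitespace-separated words, not real tokens.
    return len(text.split())

def context_inventory(parts: dict[str, str]) -> list[tuple[str, int, float]]:
    """Return (component, tokens, share) rows, largest component first."""
    counts = {name: estimate_tokens(text) for name, text in parts.items()}
    total = sum(counts.values()) or 1  # avoid dividing by zero on an empty window
    rows = [(name, count, count / total) for name, count in counts.items()]
    return sorted(rows, key=lambda row: row[1], reverse=True)

# Illustrative window contents for one model call (not real data).
window = {
    "system_prompt": "Investigate the failed canary and cite evidence.",
    "tool_definitions": "deploy_lookup trace_lookup runbook_search",
    "retrieved_documents": "rollback runbook section 4 " * 50,
    "conversation_history": "user asked about RUN-842 " * 10,
    "tool_results": "trace span export line " * 200,
    "memory": "auth callback errors confirmed",
}

for name, tokens, share in context_inventory(window):
    print(f"{name:22s} {tokens:5d} {share:6.1%}")
```

In this toy window the raw tool results take the largest share, which makes them the first place to look when hunting for low-signal tokens.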
The reason not to assume that "just add more" works is empirical, not stylistic. Chroma's Context Rot report evaluated 18 models across increasing input lengths and reported non-uniform performance as input grew, including on simple retrieval and copying tasks.[3] The magnitude and shape depend on model, task, and distractors; additional tokens are a hypothesis to evaluate, not free signal.
Anthropic describes the same engineering concern with an "attention budget" mental model and recommends seeking the smallest high-signal token set that supports the desired outcome.[1] Padding a window with low-signal tokens always increases input cost and can lower workload quality; a paired evaluation of the padded and curated contexts on the same tasks should determine when.
Context rot is one reason behind the techniques below. Cost, latency, stale state, and contradictory evidence are others. Curation should be an explicit candidate policy with quality checks, not an article of faith.
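One way to treat extra tokens as a hypothesis is to bucket evaluation results by input length and compare accuracy per bucket. The sketch below assumes you already have graded trials with an input token count; the `Trial` type, the bucket edges, and the sample records are hypothetical and are not measurements from the Context Rot report.

```python
from bisect import bisect_left
from dataclasses import dataclass

@dataclass
class Trial:
    input_tokens: int
    correct: bool

def accuracy_by_length(trials: list[Trial], edges: list[int]) -> dict[str, float]:
    """Group trials into length buckets and return accuracy per non-empty bucket.

    `edges` must be sorted ascending; each edge is the inclusive upper bound
    of its bucket, and a final bucket catches everything above the last edge.
    """
    labels = [f"<={edge}" for edge in edges] + [f">{edges[-1]}"]
    hits = [0] * len(labels)
    seen = [0] * len(labels)
    for trial in trials:
        index = bisect_left(edges, trial.input_tokens)
        seen[index] += 1
        hits[index] += int(trial.correct)
    return {label: hits[i] / seen[i] for i, label in enumerate(labels) if seen[i]}

# Made-up records for illustration only.
trials = [
    Trial(1_200, True), Trial(1_800, True),
    Trial(6_500, True), Trial(7_900, False),
    Trial(30_000, False), Trial(42_000, True),
]
by_length = accuracy_by_length(trials, edges=[2_000, 8_000])
print(by_length)
```

A flat profile across buckets suggests length is not the problem for that workload; a drop in the longer buckets is a reason to test a curated candidate against the baseline.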
Before fixing context, you need vocabulary for how it breaks. Drew Breunig describes four useful failure-mode labels for long contexts.[4] They are diagnostic categories, not a formal completeness proof. Each one can show up in our failed-canary agent.
| Failure mode | What it is | Symptom in the failed-canary agent |
|---|---|---|
| Poisoning | A hallucination or error enters the context and is then referenced repeatedly | The early wrong guess about a migration lock keeps getting cited |
| Distraction | The context grows so long the model over-focuses on its history and stops forming new plans | The agent re-runs the dead-end issue search instead of trying something new |
| Confusion | Superfluous content (often too many tools) drives a low-quality response | With dozens of tools loaded, the agent picks a cluster-admin tool |
| Clash | New information or instructions conflict with earlier ones in the context | A tool defined in XML contradicts the system rule to answer only in JSON |
Don't turn reported examples into universal thresholds. Breunig cites an agent anecdote where long history encouraged repeated actions and a tool-use experiment where reducing tool count improved one model's benchmark result.[4] Those observations justify testing history pruning and tool gating on your model, tools, and task distribution.

```python
def diagnose_context(signals: set[str]) -> tuple[str, str]:
    # Routes are checked in order, so the earliest matching signal wins.
    routes = [
        ("wrong_fact_repeated", "poisoning", "remove disproven spans and rebuild notes"),
        ("old_trace_replayed", "distraction", "compact old history and retain decisions"),
        ("irrelevant_tool_called", "confusion", "gate tools for the current phase"),
        ("rules_disagree", "clash", "reconcile conflicting instructions"),
    ]
    for signal, mode, action in routes:
        if signal in signals:
            return mode, action
    return "unknown", "inspect trace and add an evaluation case"

mode, action = diagnose_context({"wrong_fact_repeated", "old_trace_replayed"})
print(f"diagnosis={mode}; next_action={action}")
```

```text
diagnosis=poisoning; next_action=remove disproven spans and rebuild notes
```

Naming failure modes tells you what went wrong. LangChain organizes agent context strategies into four useful buckets: write, select, compress, and isolate.[5] A useful mental model from that work: the LLM is like a CPU and its context window is like RAM, a working set to manage deliberately. The buckets classify many common tactics without claiming they exhaust every design.
The cheapest token is the one you never put in the window. Write means persisting information outside the context so it doesn't consume the attention budget until it's needed.[5] The classic pattern is a scratchpad: the agent writes notes, plans, or intermediate findings to a file or a state field, then reloads only the relevant note later. Agentic memory works the same way, persisting durable facts across sessions.[6]
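The scratchpad pattern can be sketched as a small note store that lives outside the prompt. Everything here is an assumption for illustration: the `Scratchpad` class, its tag-based lookup, and the JSON file layout are hypothetical, and an agent framework may provide its own state or memory mechanism instead.

```python
import json
import tempfile
from pathlib import Path

class Scratchpad:
    """Hypothetical note store persisted to a JSON file outside the context window."""

    def __init__(self, path: Path) -> None:
        self.path = path
        if not self.path.exists():
            self.path.write_text("[]")

    def write(self, text: str, tags: set[str]) -> None:
        # Append the note on disk; nothing is added to the prompt here.
        notes = json.loads(self.path.read_text())
        notes.append({"text": text, "tags": sorted(tags)})
        self.path.write_text(json.dumps(notes))

    def reload(self, topic: str) -> list[str]:
        # Only notes tagged with the current topic re-enter the window.
        notes = json.loads(self.path.read_text())
        return [note["text"] for note in notes if topic in note["tags"]]

with tempfile.TemporaryDirectory() as tmp:
    pad = Scratchpad(Path(tmp) / "notes.json")
    pad.write("auth callback errors confirmed", {"auth", "RUN-842"})
    pad.write("migration lock ruled out", {"database", "RUN-842"})
    auth_notes = pad.reload("auth")
    run_notes = pad.reload("RUN-842")
    print(auth_notes)
```

Writing is unconditional and cheap; the selective step is `reload`, which decides how many of the stored notes spend attention budget on a given turn.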
For the failed-canary agent, a write strategy means: instead of leaving four full runbooks in the conversation, the agent records "auth callback errors confirmed; migration lock ruled out" as a one-line note and drops the raw tool results. The finding survives; the tokens don't.

```python
def promote_findings(tool_results: list[dict]) -> tuple[list[str], list[str]]:
    notes, discarded_raw = [], []
    for result in tool_results:
        if result["confirmed"]:
            # Keep the one-line finding; queue the raw output for removal.
            notes.append(f"{result['source']}: {result['finding']}")
            discarded_raw.append(result["raw_output"])
    return notes, discarded_raw

notes, discarded = promote_findings([
    {"source": "deploy_RUN_842", "finding": "auth callback errors confirmed", "confirmed": True, "raw_output": "..." * 600},
    {"source": "db_lock_check", "finding": "migration lock ruled out", "confirmed": True, "raw_output": "..." * 900},
])
print("notes:", notes)
print("raw_results_to_remove:", len(discarded))
```

```text
notes: ['deploy_RUN_842: auth callback errors confirmed', 'db_lock_check: migration lock ruled out']
raw_results_to_remove: 2
```

Select means retrieving only the tokens relevant to the current step.[5] Retrieval-augmented generation (RAG) is the canonical example, and you already studied it: a high-recall retriever surfaces a handful of relevant chunks instead of dumping the whole corpus. But selection applies to more than documents. Tool selection is a select problem too: the confusion failure mode above is what happens when you fail to gate tools and load all 50 instead of the 5 this task needs.

```python
TOOLS_BY_PHASE = {
    "investigate": {"deploy_lookup", "trace_lookup", "runbook_search"},
    "resolve": {"runbook_search", "rollback_advisor", "page_oncall"},
}

def tools_for_phase(phase: str, available: set[str]) -> list[str]:
    # An unknown phase gets no tools rather than the full registry.
    allowed = TOOLS_BY_PHASE.get(phase, set())
    return sorted(allowed & available)

available = {"deploy_lookup", "trace_lookup", "runbook_search", "rollback_advisor", "cluster_admin"}
print("investigate tools:", tools_for_phase("investigate", available))
```

```text
investigate tools: ['deploy_lookup', 'runbook_search', 'trace_lookup']
```

Prompt caching, which you met earlier, is an orthogonal efficiency tactic: if a stable prefix is reused, a provider may avoid repeating part of the prefill computation.[7] Caching does not select better evidence or reduce the number of tokens the model reasons over. It improves eligible repeated-prefix economics only when the cache policy and provider support it.

```python
import hashlib

def prefix_key(system_prompt: str, stable_docs: str) -> str:
    # Any change to the prefix text yields a different key.
    payload = system_prompt + "\n" + stable_docs
    return hashlib.sha256(payload.encode()).hexdigest()[:12]

stable = prefix_key("Use cited policy only.", "Policy version: 7")
same_prefix = prefix_key("Use cited policy only.", "Policy version: 7")
changed_prefix = prefix_key("Use cited policy only.", "Policy version: 8")
print("reuse eligible:", stable == same_prefix)
print("changed source invalidates candidate:", stable != changed_prefix)
```

```text
reuse eligible: True
changed source invalidates candidate: True
```

When information has to stay in the window, compress reduces it to the required tokens.[5] Two common candidate patterns follow.
The first is compaction: when a conversation approaches the budget, summarize it and start a fresh window seeded with that summary.[1] The agent keeps its working knowledge but sheds the verbose transcript that produced it.
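A minimal compaction sketch follows, assuming a token budget, a trigger ratio, and a summarizer supplied by the caller. The names are hypothetical: `summarize` stands in for a model call, `estimate_tokens` is a word-count proxy rather than a real tokenizer, and the 0.8 trigger and `keep_recent=2` defaults are arbitrary illustrative choices.

```python
from typing import Callable

def estimate_tokens(text: str) -> int:
    # Hypothetical proxy: whitespace-separated words, not a real tokenizer.
    return len(text.split())

def maybe_compact(
    messages: list[str],
    budget: int,
    summarize: Callable[[list[str]], str],
    trigger: float = 0.8,
    keep_recent: int = 2,
) -> list[str]:
    """Return messages unchanged, or a fresh window seeded with a summary."""
    used = sum(estimate_tokens(message) for message in messages)
    split = len(messages) - keep_recent
    if used < trigger * budget or split <= 0:
        return messages
    older, recent = messages[:split], messages[split:]
    # The summary replaces the verbose transcript; recent turns stay verbatim.
    return [f"[summary] {summarize(older)}"] + recent

def toy_summarizer(older: list[str]) -> str:
    # Placeholder for a model call: keep the first sentence of each message.
    return " ".join(message.split(".")[0] + "." for message in older)

transcript = [
    "Alert fired for RUN-842. " + "log line " * 40,
    "Auth callback errors confirmed. " + "trace row " * 40,
    "Checking rollback runbook section 4.",
    "Latest canary trace shows the same auth failure.",
]
compacted = maybe_compact(transcript, budget=200, summarize=toy_summarizer)
print(compacted[0])
print(len(transcript), "->", len(compacted), "messages")
```

A summarizer can drop the detail the next decision needs, so the summary itself belongs in the evaluation alongside the answers it produces.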
The second is tool-result pruning, a low-risk candidate fix for our failed-canary agent. Anthropic describes clearing old tool results as a light-touch form of compaction: once the relevant finding has been captured, old raw results can often leave the active context.[1] Preserve evidence that the next decision still needs, and compare quality before adopting an aggressive pruning policy.

```python
def prune_results(results: list[dict], keep_recent: int) -> list[str]:
    retained = []
    cutoff = max(0, len(results) - keep_recent)
    for index, result in enumerate(results):
        # Replace only old results whose finding has already been recorded.
        if index < cutoff and result["finding_recorded"]:
            retained.append(f"[pruned raw output] finding={result['finding']}")
        else:
            retained.append(result["raw"])
    return retained

history = [
    {"raw": "old trace span" * 100, "finding": "auth errors confirmed", "finding_recorded": True},
    {"raw": "latest canary trace", "finding": "rollback review needed", "finding_recorded": False},
]
pruned = prune_results(history, keep_recent=1)
print(pruned[0])
print(pruned[1])
```

```text
[pruned raw output] finding=auth errors confirmed
latest canary trace
```

Isolate means splitting context across focused workers so no single window has to hold every exploratory trace.[5] A lead agent delegates a focused subtask, such as "check linked issues and rollback notes for every failed-canary exception", to a worker with its own clean context window. The worker does noisy exploration in isolation and returns a bounded evidence summary to the lead.[1]
Isolation can reduce distraction and confusion because the lead doesn't need every intermediate search result. It also adds coordination overhead and creates clash risk when worker outputs disagree, so isolate is a candidate for separable exploration, not a default.

```python
def accept_handoff(handoff: dict, token_limit: int = 400) -> bool:
    # A handoff must carry every required field and stay within the token limit.
    required = {"claim", "evidence", "next_check", "tokens"}
    return required <= handoff.keys() and handoff["tokens"] <= token_limit

handoff = {
    "claim": "auth callback failure qualifies for staged rollback",
    "evidence": "trace span 2026-05-28 and rollback runbook section 4",
    "next_check": "quote rollback blast radius for RUN-842",
    "tokens": 86,
}
print(f"bounded handoff accepted: {accept_handoff(handoff)}")
```

```text
bounded handoff accepted: True
```

With the taxonomy in hand, the running example fixes itself. Instead of letting the window grow monotonically, the agent runs a curation step before each model call: it keeps the current alert and evidence, writes durable findings to notes, gates the tool set, and drops poisoned or irrelevant tokens.
The window now holds the alert, current evidence, a short notes block, and a clean tool set. In the concrete script below, it shrinks an illustrative 12.4K-token raw set to a 3.2K-token active window plus short external notes. Lower input cost is immediate; better task performance still requires evaluation.
The logic is simple enough to simulate with a tiny script. This version turns the failed-canary agent's messy transcript into a compact working set by keeping the current alert and evidence, writing durable findings to notes, and dropping poisoned or irrelevant tokens.

```python
from dataclasses import dataclass

@dataclass
class ContextItem:
    name: str
    kind: str
    tokens: int
    signal: int
    keep: str  # "window", "notes", or "drop"

items = [
    ContextItem("alert_RUN_842", "task", 180, 10, "window"),
    ContextItem("latest_trace_span", "evidence", 420, 10, "window"),
    ContextItem("rollback_runbook", "evidence", 1800, 9, "window"),
    ContextItem("scratchpad", "notes", 260, 8, "window"),
    ContextItem("deploy_lookup_tool", "tool", 240, 8, "window"),
    ContextItem("trace_lookup_tool", "tool", 260, 8, "window"),
    ContextItem("old_trace_export", "log", 8200, 2, "drop"),
    ContextItem("cluster_admin_tool", "tool", 360, 1, "drop"),
    ContextItem("issue_search_tool", "tool", 410, 1, "drop"),
    ContextItem("wrong_migration_lock_guess", "poison", 120, 0, "drop"),
    ContextItem("auth_errors_confirmed", "finding", 90, 7, "notes"),
    ContextItem("migration_lock_ruled_out", "finding", 96, 7, "notes"),
]

def summarize(selection):
    return ", ".join(item.name for item in selection)

# Partition the raw transcript by destination.
raw_total = sum(item.tokens for item in items)
window_items = [item for item in items if item.keep == "window"]
notes_items = [item for item in items if item.keep == "notes"]
dropped_items = [item for item in items if item.keep == "drop"]

curated_total = sum(item.tokens for item in window_items)
notes_total = sum(item.tokens for item in notes_items)

print(f"raw_tokens={raw_total}")
print(f"curated_window_tokens={curated_total}")
print(f"external_notes_tokens={notes_total}")
print(f"removed_tokens={raw_total - curated_total - notes_total}")
print("window:", summarize(window_items))
print("notes:", summarize(notes_items))
print("dropped:", summarize(dropped_items))
```

```text
raw_tokens=12436
curated_window_tokens=3160
external_notes_tokens=186
removed_tokens=9090
window: alert_RUN_842, latest_trace_span, rollback_runbook, scratchpad, deploy_lookup_tool, trace_lookup_tool
notes: auth_errors_confirmed, migration_lock_ruled_out
dropped: old_trace_export, cluster_admin_tool, issue_search_tool, wrong_migration_lock_guess
```

The same packing can be driven by an explicit token budget: required items are admitted first, then optional items in priority order, skipping any that don't fit.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    tokens: int
    priority: int
    required: bool = False

def pack_working_set(candidates: list[Candidate], budget: int) -> list[str]:
    # Required items sort first; within each group, higher priority comes first.
    ordered = sorted(candidates, key=lambda item: (not item.required, -item.priority))
    selected, used = [], 0
    for item in ordered:
        if used + item.tokens <= budget:
            selected.append(item.name)
            used += item.tokens
        elif item.required:
            # Fail loudly rather than silently dropping required context.
            raise ValueError(f"required item does not fit: {item.name}")
    return selected

items = [
    Candidate("alert RUN-842", 180, 10, required=True),
    Candidate("latest trace span", 420, 10, required=True),
    Candidate("rollback runbook", 1_800, 9),
    Candidate("stale trace export", 8_200, 1),
]
print(pack_working_set(items, budget=3_000))
```

```text
['alert RUN-842', 'latest trace span', 'rollback runbook']
```

Large-window model offerings make it tempting to treat curation as obsolete.[8] Bigger windows raise the capacity ceiling; they don't establish quality for an overloaded agent trace. Frameworks such as LangGraph expose short- and long-term memory plus summarization or deletion patterns because state management remains an application responsibility.[9]
The practical upshot: when an agent underperforms, inspect the transmitted context alongside model and window choices. Name a suspected failure mode, apply a bounded candidate change, and measure whether it improves the task.
| Symptom | Likely cause | Fix |
|---|---|---|
| Agent gets worse over a long session despite spare window | Distraction from accumulated stale history | Compact the transcript; prune old tool results |
| Agent keeps citing a wrong fact | Context poisoning: an early error is being re-referenced | Remove the bad tokens from history; don't just add a correction |
| Agent picks irrelevant tools or ignores the right one | Context confusion from overlapping active tools | Gate tools per phase; evaluate a smaller active set |
| Model violates a format rule when a Model Context Protocol (MCP) tool is attached | Context clash between tool instructions and system rules | Reconcile instructions or isolate the tool behind a sub-agent |
| Costs balloon and latency rises with no quality gain | Stuffing the window instead of curating it | Apply the smallest-high-signal-set principle: select and compress |
| Sub-agent answers conflict with each other | Isolation without reconciliation | Have the lead agent resolve clashes before acting |
Use this checklist before shipping an agent that handles long sessions:
When a long-running agent underperforms, use this sequence:

```python
def rebuild_notes(notes: list[dict]) -> list[str]:
    # Drop disproven notes entirely instead of appending a correction.
    return [
        note["text"]
        for note in notes
        if note["status"] != "disproven"
    ]

notes = [
    {"text": "auth callback errors confirmed", "status": "confirmed"},
    {"text": "migration lock caused RUN-842", "status": "disproven"},
    {"text": "database lock ruled out", "status": "confirmed"},
]
print("rebuilt notes:", rebuild_notes(notes))
```

```text
rebuilt notes: ['auth callback errors confirmed', 'database lock ruled out']
```

Then gate the curated context on a paired check of quality, cost, and remaining poisoned references:

```python
def approve_curation(
    baseline_accuracy: float,
    curated_accuracy: float,
    baseline_tokens: int,
    curated_tokens: int,
    poisoned_references_after: int,
) -> bool:
    # All three conditions must hold before the curated context ships.
    quality_ok = curated_accuracy >= baseline_accuracy
    cost_ok = curated_tokens < baseline_tokens
    poisoning_removed = poisoned_references_after == 0
    return quality_ok and cost_ok and poisoning_removed

approved = approve_curation(
    baseline_accuracy=0.82,
    curated_accuracy=0.87,
    baseline_tokens=12_436,
    curated_tokens=3_346,
    poisoned_references_after=0,
)
print(f"curated context approved: {approved}")
```

```text
curated context approved: True
```

Long-context window management gave you the machinery to hold many tokens. Context engineering adds the judgment to decide which tokens earn their place. The next chapter shifts to model architecture, where sparse routing changes the compute-per-token economics that all of this sits on top of.
Before moving on, make sure you can do all of these without hand-waving:
Because window size sets capacity, not quality. Context-rot evaluations report that reliability can decline as input grows, so a technically valid prompt still deserves comparison with a curated candidate. The relevant measurements are task quality, cost, latency, and whether the context contains the evidence needed for the answer.
Compress when the information must remain available to the same reasoning thread but is too verbose, for example a long transcript that you summarize or stale tool results you prune. Isolate when a subtask is noisy and self-contained, such as a broad search that will generate dozens of intermediate results the main agent never needs to see. Isolation buys the cleanest separation but adds coordination overhead and a risk of clash between sub-agent outputs, so prefer compression for in-thread bloat and reserve isolation for genuinely separable, exploration-heavy subtasks.
That's context poisoning: an error entered the context and is now being re-referenced on later turns. Appending a correction may not help, because both the wrong fact and the correction stay in view. A strong candidate fix is to remove disproven notes from the rebuilt context (or compact the transcript without them), then run an evaluation that checks whether the wrong fact is still re-cited.
There's no universal limit. Treat overlapping tool descriptions and irrelevant tool calls as symptoms to measure. Test a phase-gated tool set against the full registry on representative tasks, then retain the smallest set that preserves coverage and improves selection accuracy. That's a selection fix for the confusion failure mode.
Keep only the current alert, latest trace span, applicable rollback runbook, short scratchpad, and small tool set needed for this phase. Move durable findings such as "auth callback errors confirmed" or "migration lock ruled out" into notes so the knowledge survives without bloating the active window. Drop stale logs, irrelevant tools, and any poisoned guesses that the agent has already proven false. That answer uses the full taxonomy at once: write out durable facts, select current evidence, compress or prune old history, and isolate anything noisy enough to deserve its own clean sub-window.
[1] Anthropic. Effective context engineering for AI agents. 2025.
[2] Multiple authors. A Survey of Context Engineering for Large Language Models. 2025. arXiv preprint.
[3] Hong, K., Troynikov, A., & Huber, J. Context Rot: How Increasing Input Tokens Impacts LLM Performance. 2025.
[4] Breunig, D. How Long Contexts Fail. 2025.
[5] LangChain. Context Engineering for Agents. 2025.
[6] Letta. Agent Memory: How to Build Agents that Learn and Remember. 2026.
[7] Anthropic. Prompt caching. 2026. Official documentation.
[8] Anthropic. 1M context is now generally available for Opus 4.6 and Sonnet 4.6. 2026.
[9] LangChain. LangGraph Memory Overview. 2026.