LeetLLM
© 2026 LeetLLM. All rights reserved.


Context Engineering

Move past fitting tokens into the window and learn the discipline of context engineering: curating the smallest high-signal token set, fighting context rot, and applying write, select, compress, and isolate strategies plus tool-result pruning and sub-agent isolation.


Long-context window management covers the mechanics of making a big context window work: KV-cache math, RoPE scaling, prefill-versus-decode bottlenecks, and where the lost-in-the-middle penalty bites. It answers "how do I fit more tokens into one model call and serve it cheaply?"

Once that capacity exists, the engineering question changes: "given that I can fit a lot of tokens, what should go in the window, and what should stay out?" That discipline is called context engineering. Window management provides capacity; context engineering decides what evidence, tools, history, and state deserve that capacity.

The shift matters because more tokens don't guarantee better answers. A model may accept a very large prompt yet perform worse on a workload than it does with a smaller curated packet. The job is no longer only "make room." It's also to compare curation policies on answer quality, latency, and cost.

Motivation: the agent that got worse as it learned more

Picture an incident agent investigating why deploy RUN-842 failed its canary. It reads the deploy record, checks CI logs, searches rollback runbooks, and follows request traces. Every tool call dumps its raw output back into the conversation. After forty turns the context holds: the original alert, four full runbooks (most of which were a dead end), six multi-thousand-token trace exports, three issue-search results, and one early hallucinated guess that a database migration lock caused the outage.

The agent now performs worse than it did at turn five. It re-runs the dead-end issue search, cites a migration lock that doesn't exist, and picks a verbose, irrelevant rollback paragraph over the trace span that matters. The window is nowhere near full, yet the agent is failing.

This isn't a window-management problem. The KV cache fits, latency is fine, and RoPE isn't stretched. The context has accumulated low-signal and even wrong tokens, and the model is dutifully attending to all of them. Context engineering is the set of techniques that would have prevented this failure. The failed-canary agent will anchor the rest of the chapter.

Prompt engineering grew up into context engineering

For a few years the applied-AI conversation was dominated by prompt engineering: finding the right words and phrasing for a single instruction. Anthropic frames context engineering as the natural successor to that practice.[1] (Reference 1: Effective context engineering for AI agents, https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents) Prompt engineering is about writing one good instruction. Context engineering is the broader discipline of curating and maintaining the entire set of tokens present during inference: the system prompt, tool definitions, retrieved documents, conversation history, tool results, and any memory loaded back in.

Anthropic's framing is worth memorizing because it gives you a single guiding principle:

Find the smallest set of high-signal tokens that maximize the likelihood of your desired outcome.[1]

That's the whole job in one sentence. Every technique below is a way to push toward that minimal high-signal set. A 2025 survey of the field collects the same strategies under a formal taxonomy, confirming this is now a named subdiscipline rather than a collection of tips.[2] (Reference 2: A Survey of Context Engineering for Large Language Models, https://arxiv.org/abs/2507.13334)

Why curation needs evaluation: context rot

The reason you shouldn't assume "just add more" is empirical, not stylistic. Chroma's Context Rot report evaluated 18 models across increasing input lengths and reported non-uniform performance as input grew, including on simple retrieval and copying tasks.[3] (Reference 3: Context Rot: How Increasing Input Tokens Impacts LLM Performance, https://research.trychroma.com/context-rot) The magnitude and shape depend on model, task, and distractors; additional tokens are a hypothesis to evaluate, not free signal.

Anthropic describes the same engineering concern with an "attention budget" mental model and recommends seeking the smallest high-signal token set that supports the desired outcome.[1] Padding a window with low-signal tokens always increases input cost and can lower workload quality; a paired evaluation should determine when.

Context rot is one reason behind the techniques below. Cost, latency, stale state, and contradictory evidence are others. Curation should be an explicit candidate policy with quality checks, not an article of faith.
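A paired evaluation of this kind can be sketched in a few lines. The code below is illustrative only: `EvalCase`, `paired_eval`, the two policies, and the toy `fake_model` are hypothetical stand-ins for a real model call and grader, and whitespace word counts stand in for real token counts.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    question: str
    expected: str
    evidence: list[str]     # spans that actually support the answer
    distractors: list[str]  # low-signal spans a padded policy would also send

def padded_policy(case: EvalCase) -> list[str]:
    # "Just add more": send everything, distractors first.
    return case.distractors + case.evidence

def curated_policy(case: EvalCase) -> list[str]:
    # Send only the supporting evidence.
    return list(case.evidence)

def fake_model(question: str, context: list[str]) -> str:
    # Toy stand-in for a model: answers from the first span containing "cause:".
    for span in context:
        if "cause:" in span:
            return span.split("cause:")[1].strip()
    return "unknown"

def paired_eval(
    cases: list[EvalCase],
    policies: dict[str, Callable[[EvalCase], list[str]]],
    answer_fn: Callable[[str, list[str]], str],
) -> dict[str, dict[str, float]]:
    report = {}
    for name, build in policies.items():
        correct, tokens = 0, 0
        for case in cases:
            context = build(case)
            tokens += sum(len(span.split()) for span in context)  # crude token proxy
            correct += answer_fn(case.question, context) == case.expected
        report[name] = {"accuracy": correct / len(cases), "tokens": tokens}
    return report

cases = [
    EvalCase("why did RUN-842 fail?", "auth callback",
             ["trace span cause: auth callback"],
             ["early guess cause: migration lock"]),
    EvalCase("why did CI fail?", "expired token",
             ["ci log cause: expired token"],
             ["runbook section on rollback steps"]),
]
report = paired_eval(cases, {"padded": padded_policy, "curated": curated_policy}, fake_model)
for name, row in report.items():
    print(name, row)
```

The same cases run under both policies, so any quality or cost gap is attributable to the context policy rather than to the case mix. With a real model, the comparison would need enough cases to separate a genuine difference from noise.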

Four failure modes of an overloaded context

Before fixing context, you need vocabulary for how it breaks. Drew Breunig describes four useful failure-mode labels for long contexts.[4] (Reference 4: How Long Contexts Fail, https://www.dbreunig.com/2025/06/22/how-contexts-fail-and-how-to-fix-them.html) They are diagnostic categories, not a formal completeness proof. Each one can show up in our failed-canary agent.

| Failure mode | What it is | Symptom in the failed-canary agent |
| --- | --- | --- |
| Poisoning | A hallucination or error enters the context and is then referenced repeatedly | The early wrong guess about a migration lock keeps getting cited |
| Distraction | The context grows so long the model over-focuses on its history and stops forming new plans | The agent re-runs the dead-end issue search instead of trying something new |
| Confusion | Superfluous content (often too many tools) drives a low-quality response | With dozens of tools loaded, the agent picks a cluster-admin tool |
| Clash | New information or instructions conflict with earlier ones in the context | A tool defined in XML contradicts the system rule to answer only in JSON |

Don't turn reported examples into universal thresholds. Breunig cites an agent anecdote where long history encouraged repeated actions and a tool-use experiment where reducing tool count improved one model's benchmark result.[4] Those observations justify testing history pruning and tool gating on your model, tools, and task distribution.

failure-mode-router.py

```python
def diagnose_context(signals: set[str]) -> tuple[str, str]:
    # Routes are checked in order; the first matching signal wins.
    routes = [
        ("wrong_fact_repeated", "poisoning", "remove disproven spans and rebuild notes"),
        ("old_trace_replayed", "distraction", "compact old history and retain decisions"),
        ("irrelevant_tool_called", "confusion", "gate tools for the current phase"),
        ("rules_disagree", "clash", "reconcile conflicting instructions"),
    ]
    for signal, mode, action in routes:
        if signal in signals:
            return mode, action
    return "unknown", "inspect trace and add an evaluation case"

mode, action = diagnose_context({"wrong_fact_repeated", "old_trace_replayed"})
print(f"diagnosis={mode}; next_action={action}")
```

Output

```text
diagnosis=poisoning; next_action=remove disproven spans and rebuild notes
```
[Figure: four compact cards for context poisoning, distraction, confusion, and clash, each with a short failure cue and matching curation move.]
The first job is diagnosis. Name the failure mode, then reach for the matching move instead of blindly stuffing more tokens into the window.

The write, select, compress, isolate taxonomy

Naming failure modes tells you what went wrong. LangChain organizes agent context strategies into four useful buckets: write, select, compress, and isolate.[5] (Reference 5: Context Engineering for Agents, https://blog.langchain.com/context-engineering-for-agents/) A useful mental model from that work: the LLM is like a CPU and its context window is like RAM, a working set to manage deliberately. The buckets classify many common tactics without claiming they exhaust every design.

[Figure: compact context-engineering loop showing working memory at center, four moves around it, and short examples that map techniques to write, select, compress, and isolate.]
Most context decisions reduce to four moves. Save durable state outside the window, load only current evidence, shrink what must stay, and isolate noisy exploration into separate windows.

Write: keep state outside the window

The cheapest token is the one you never put in the window. Write means persisting information outside the context so it doesn't consume the attention budget until it's needed.[5] The classic pattern is a scratchpad: the agent writes notes, plans, or intermediate findings to a file or a state field, then reloads only the relevant note later. Agentic memory works the same way, persisting durable facts across sessions.[6] (Reference 6: Agent Memory: How to Build Agents that Learn and Remember, https://www.letta.com/blog/agent-memory)

For the failed-canary agent, a write strategy means: instead of leaving four full runbooks in the conversation, the agent records "auth callback errors confirmed; migration lock ruled out" as a one-line note and drops the raw tool results. The finding survives; the tokens don't.

write-durable-findings.py

```python
def promote_findings(tool_results: list[dict]) -> tuple[list[str], list[str]]:
    notes, discarded_raw = [], []
    for result in tool_results:
        if result["confirmed"]:
            notes.append(f"{result['source']}: {result['finding']}")
            # Raw output is only marked for removal once its finding is recorded.
            discarded_raw.append(result["raw_output"])
    return notes, discarded_raw

notes, discarded = promote_findings([
    {"source": "deploy_RUN_842", "finding": "auth callback errors confirmed", "confirmed": True, "raw_output": "..." * 600},
    {"source": "db_lock_check", "finding": "migration lock ruled out", "confirmed": True, "raw_output": "..." * 900},
])
print("notes:", notes)
print("raw_results_to_remove:", len(discarded))
```

Output

```text
notes: ['deploy_RUN_842: auth callback errors confirmed', 'db_lock_check: migration lock ruled out']
raw_results_to_remove: 2
```
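The other half of the scratchpad pattern, reloading only the relevant note later, can be sketched the same way. This is a minimal illustration that assumes notes live in a JSON file and carry topic tags; the `Scratchpad` class, its method names, and the tag scheme are hypothetical and not taken from any framework.

```python
import json
import tempfile
from pathlib import Path

class Scratchpad:
    """Hypothetical file-backed note store kept outside the model context."""

    def __init__(self, path: Path):
        self.path = path
        if not self.path.exists():
            self.path.write_text("[]")

    def write(self, text: str, tags: set[str]) -> None:
        # Append one note; the raw tool output it came from is not stored.
        notes = json.loads(self.path.read_text())
        notes.append({"text": text, "tags": sorted(tags)})
        self.path.write_text(json.dumps(notes))

    def load(self, wanted: set[str]) -> list[str]:
        # Reload only the notes whose tags overlap the current step's topics.
        notes = json.loads(self.path.read_text())
        return [note["text"] for note in notes if wanted & set(note["tags"])]

with tempfile.TemporaryDirectory() as tmp:
    pad = Scratchpad(Path(tmp) / "run_842_notes.json")
    pad.write("auth callback errors confirmed", {"auth"})
    pad.write("migration lock ruled out", {"database"})
    auth_notes = pad.load({"auth"})

print("reloaded for auth step:", auth_notes)
```

Only the note tagged for the current step returns to the window; the database finding stays on disk until a step asks for it.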

Select: pull in only what this step needs

Select means retrieving only the tokens relevant to the current step.[5] Retrieval-augmented generation (RAG) is the canonical example, and you already studied it: a high-recall retriever surfaces a handful of relevant chunks instead of dumping the whole corpus. But selection applies to more than documents. Tool selection is a select problem too: the confusion failure mode above is what happens when you fail to gate tools and load all 50 instead of the 5 this task needs.

phase-specific-tool-gate.py

```python
TOOLS_BY_PHASE = {
    "investigate": {"deploy_lookup", "trace_lookup", "runbook_search"},
    "resolve": {"runbook_search", "rollback_advisor", "page_oncall"},
}

def tools_for_phase(phase: str, available: set[str]) -> list[str]:
    # Unknown phases get no tools rather than every tool.
    allowed = TOOLS_BY_PHASE.get(phase, set())
    return sorted(allowed & available)

available = {"deploy_lookup", "trace_lookup", "runbook_search", "rollback_advisor", "cluster_admin"}
print("investigate tools:", tools_for_phase("investigate", available))
```

Output

```text
investigate tools: ['deploy_lookup', 'runbook_search', 'trace_lookup']
```
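Document selection has the same shape as tool gating. The sketch below is a toy, assuming a plain word-overlap score in place of a real retriever or embedding model; `select_chunks` and the chunk names are hypothetical.

```python
def select_chunks(query: str, chunks: dict[str, str], top_k: int = 2) -> list[str]:
    query_terms = set(query.lower().split())
    scored = []
    for name, text in chunks.items():
        overlap = len(query_terms & set(text.lower().split()))
        if overlap:  # chunks with no shared terms never enter the window
            scored.append((overlap, name))
    # Highest overlap first; ties broken by name for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:top_k]]

chunks = {
    "trace_span": "auth callback returned 401 during canary",
    "rollback_runbook": "staged rollback steps for canary failure",
    "db_runbook": "migration lock recovery procedure",
    "onboarding_doc": "how to request laptop access",
}
selected = select_chunks("why did the canary auth callback fail", chunks)
print("selected chunks:", selected)
```

The two unrelated documents never reach the model, which is the point: selection is a filter applied before the call, not a hint the model is asked to follow.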

Prompt caching, which you met earlier, is an orthogonal efficiency tactic: if a stable prefix is reused, a provider may avoid repeating part of the prefill computation.[7] (Reference 7: Prompt caching, https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching) Caching does not select better evidence or reduce the number of tokens the model reasons over. It improves eligible repeated-prefix economics only when the cache policy and provider support it.

cacheable-prefix-check.py

```python
import hashlib

def prefix_key(system_prompt: str, stable_docs: str) -> str:
    payload = system_prompt + "\n" + stable_docs
    return hashlib.sha256(payload.encode()).hexdigest()[:12]

stable = prefix_key("Use cited policy only.", "Policy version: 7")
same_prefix = prefix_key("Use cited policy only.", "Policy version: 7")
changed_prefix = prefix_key("Use cited policy only.", "Policy version: 8")
print("reuse eligible:", stable == same_prefix)
print("changed source invalidates candidate:", stable != changed_prefix)
```

Output

```text
reuse eligible: True
changed source invalidates candidate: True
```

Compress: shrink what must stay

When information has to stay in the window, compress reduces it to the required tokens.[5] Two common candidate patterns follow.

The first is compaction: when a conversation approaches the budget, summarize it and start a fresh window seeded with that summary.[1] The agent keeps its working knowledge but sheds the verbose transcript that produced it.
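A compaction trigger can be sketched as follows. This is a minimal illustration under stated assumptions: `count_tokens` is a crude whitespace count, the 80% threshold is arbitrary, and `toy_summarize` is a hypothetical stand-in for a model-written summary.

```python
from typing import Callable

def count_tokens(text: str) -> int:
    # Crude whitespace count; a real system would use the model's tokenizer.
    return len(text.split())

def compact_if_needed(
    messages: list[str],
    budget: int,
    summarize: Callable[[list[str]], str],
    threshold: float = 0.8,
) -> list[str]:
    used = sum(count_tokens(message) for message in messages)
    if used < threshold * budget:
        return messages  # still comfortably under budget: leave the transcript alone
    # Start a fresh window seeded only with the summary.
    return [f"[summary of {len(messages)} earlier messages] {summarize(messages)}"]

def toy_summarize(messages: list[str]) -> str:
    # Stand-in for a model call: keep only lines marked as decisions.
    return " ".join(message for message in messages if message.startswith("DECISION:"))

transcript = [
    "DECISION: auth callback errors confirmed",
    "raw trace " * 50,
    "DECISION: migration lock ruled out",
]
compacted = compact_if_needed(transcript, budget=120, summarize=toy_summarize)
print(compacted[0])
```

The summary is lossy by design, so the quality of the summarizer decides what the agent still knows afterward. That is why compaction belongs in the evaluation loop rather than being switched on blindly.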

The second is tool-result pruning, a low-risk candidate fix for our failed-canary agent. Anthropic describes clearing old tool results as a light-touch form of compaction: once the relevant finding has been captured, old raw results can often leave the active context.[1] Preserve evidence that the next decision still needs, and compare quality before adopting an aggressive pruning policy.

tool-result-pruning-policy.py

```python
def prune_results(results: list[dict], keep_recent: int) -> list[str]:
    retained = []
    cutoff = max(0, len(results) - keep_recent)
    for index, result in enumerate(results):
        # Prune only old results whose finding is already recorded.
        if index < cutoff and result["finding_recorded"]:
            retained.append(f"[pruned raw output] finding={result['finding']}")
        else:
            retained.append(result["raw"])
    return retained

history = [
    {"raw": "old trace span" * 100, "finding": "auth errors confirmed", "finding_recorded": True},
    {"raw": "latest canary trace", "finding": "rollback review needed", "finding_recorded": False},
]
pruned = prune_results(history, keep_recent=1)
print(pruned[0])
print(pruned[1])
```

Output

```text
[pruned raw output] finding=auth errors confirmed
latest canary trace
```

Isolate: split work across focused windows

Isolate means splitting context across focused workers so no single window has to hold every exploratory trace.[5] A lead agent delegates a focused subtask, such as "check linked issues and rollback notes for every failed-canary exception", to a worker with its own clean context window. The worker does noisy exploration in isolation and returns a bounded evidence summary to the lead.[1]

Isolation can reduce distraction and confusion because the lead doesn't need every intermediate search result. It also adds coordination overhead and creates clash risk when worker outputs disagree, so isolate is a candidate for separable exploration, not a default.

isolated-handoff-contract.py

```python
def accept_handoff(handoff: dict, token_limit: int = 400) -> bool:
    required = {"claim", "evidence", "next_check", "tokens"}
    return required <= handoff.keys() and handoff["tokens"] <= token_limit

handoff = {
    "claim": "auth callback failure qualifies for staged rollback",
    "evidence": "trace span 2026-05-28 and rollback runbook section 4",
    "next_check": "quote rollback blast radius for RUN-842",
    "tokens": 86,
}
print(f"bounded handoff accepted: {accept_handoff(handoff)}")
```

Output

```text
bounded handoff accepted: True
```
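Because isolation creates clash risk when workers disagree, the lead can check handoffs for conflicts before acting. The sketch below assumes each handoff also carries a `topic` and a `verdict` field; those fields, the worker names, and `find_clashes` are hypothetical additions for illustration and are not part of the handoff contract above.

```python
from collections import defaultdict

def find_clashes(handoffs: list[dict]) -> dict[str, set[str]]:
    # Group verdicts by topic; more than one distinct verdict is a clash.
    verdicts: dict[str, set[str]] = defaultdict(set)
    for handoff in handoffs:
        verdicts[handoff["topic"]].add(handoff["verdict"])
    return {topic: found for topic, found in verdicts.items() if len(found) > 1}

handoffs = [
    {"worker": "trace_worker", "topic": "rollback", "verdict": "staged rollback"},
    {"worker": "runbook_worker", "topic": "rollback", "verdict": "full rollback"},
    {"worker": "issue_worker", "topic": "root_cause", "verdict": "auth callback failure"},
]
clashes = find_clashes(handoffs)
print("needs reconciliation:", sorted(clashes))
```

Exact string comparison is the simplest possible conflict test. Real worker summaries would likely need normalized verdict labels or a judge step to detect disagreements phrased differently.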

Putting it together: rebuilding the failed-canary agent's context

With the taxonomy in hand, the running example fixes itself. Instead of letting the window grow monotonically, the agent runs a curation step before each model call:

  1. Select only the tools relevant to the current phase (an investigation needs deploy lookup and traces, not cluster administration), then evaluate whether tool-call accuracy improves.
  2. Write durable findings to a scratchpad ("auth callback errors confirmed; migration lock ruled out") and drop the raw tool results.
  3. Compress by pruning trace exports older than a few turns and compacting the transcript once it grows large.
  4. Isolate the noisy "check every linked issue" exploration into a sub-agent that returns a one-paragraph summary.
  5. Detect poisoning: when the early migration-lock guess is identified as wrong, remove it from history so it stops being cited.

The window now holds the alert, current evidence, a short notes block, and a clean tool set. In the concrete script below, it shrinks an illustrative 12.4K-token raw set to a 3.2K-token active window plus short external notes. Lower input cost is immediate; better task performance still requires evaluation.

[Figure: flow chart splitting a 12,436-token raw agent trace into a 3,160-token active window, 186 external note tokens, and 9,090 removed stale, irrelevant, or poisoned tokens.]
Curation keeps current evidence and active tools, writes two durable findings to notes, and removes 73% of the raw trace. The smaller window remains a candidate until evaluation confirms that quality holds.
[Figure: diagram showing raw history flowing through "Curate: write / select / compress / isolate" into a small high-signal window, followed by "Evaluate quality, latency, cost".]

The logic is simple enough to simulate with a tiny script. This version turns the failed-canary agent's messy transcript into a compact working set by keeping the current alert and evidence, writing durable findings to notes, and dropping poisoned or irrelevant tokens.

incident-agent-context-curation.py

```python
from dataclasses import dataclass

@dataclass
class ContextItem:
    name: str
    kind: str
    tokens: int
    signal: int
    keep: str

items = [
    ContextItem("alert_RUN_842", "task", 180, 10, "window"),
    ContextItem("latest_trace_span", "evidence", 420, 10, "window"),
    ContextItem("rollback_runbook", "evidence", 1800, 9, "window"),
    ContextItem("scratchpad", "notes", 260, 8, "window"),
    ContextItem("deploy_lookup_tool", "tool", 240, 8, "window"),
    ContextItem("trace_lookup_tool", "tool", 260, 8, "window"),
    ContextItem("old_trace_export", "log", 8200, 2, "drop"),
    ContextItem("cluster_admin_tool", "tool", 360, 1, "drop"),
    ContextItem("issue_search_tool", "tool", 410, 1, "drop"),
    ContextItem("wrong_migration_lock_guess", "poison", 120, 0, "drop"),
    ContextItem("auth_errors_confirmed", "finding", 90, 7, "notes"),
    ContextItem("migration_lock_ruled_out", "finding", 96, 7, "notes"),
]

def summarize(selection):
    return ", ".join(item.name for item in selection)

raw_total = sum(item.tokens for item in items)
window_items = [item for item in items if item.keep == "window"]
notes_items = [item for item in items if item.keep == "notes"]
dropped_items = [item for item in items if item.keep == "drop"]

curated_total = sum(item.tokens for item in window_items)
notes_total = sum(item.tokens for item in notes_items)

print(f"raw_tokens={raw_total}")
print(f"curated_window_tokens={curated_total}")
print(f"external_notes_tokens={notes_total}")
print(f"removed_tokens={raw_total - curated_total - notes_total}")
print("window:", summarize(window_items))
print("notes:", summarize(notes_items))
print("dropped:", summarize(dropped_items))
```

Output

```text
raw_tokens=12436
curated_window_tokens=3160
external_notes_tokens=186
removed_tokens=9090
window: alert_RUN_842, latest_trace_span, rollback_runbook, scratchpad, deploy_lookup_tool, trace_lookup_tool
notes: auth_errors_confirmed, migration_lock_ruled_out
dropped: old_trace_export, cluster_admin_tool, issue_search_tool, wrong_migration_lock_guess
```
A budget packer applies the same idea under a hard token limit: required items are placed first, then the highest-priority candidates that still fit.

working-set-budget-packer.py

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    tokens: int
    priority: int
    required: bool = False

def pack_working_set(candidates: list[Candidate], budget: int) -> list[str]:
    # Required items sort first, then by descending priority.
    ordered = sorted(candidates, key=lambda item: (not item.required, -item.priority))
    selected, used = [], 0
    for item in ordered:
        if used + item.tokens <= budget:
            selected.append(item.name)
            used += item.tokens
        elif item.required:
            raise ValueError(f"required item does not fit: {item.name}")
    return selected

items = [
    Candidate("alert RUN-842", 180, 10, required=True),
    Candidate("latest trace span", 420, 10, required=True),
    Candidate("rollback runbook", 1_800, 9),
    Candidate("stale trace export", 8_200, 1),
]
print(pack_working_set(items, budget=3_000))
```

Output

```text
['alert RUN-842', 'latest trace span', 'rollback runbook']
```

Why this matters as windows grow

Large-window model offerings make it tempting to treat curation as obsolete.[8] (Reference 8: 1M context is now generally available for Opus 4.6 and Sonnet 4.6, https://claude.com/blog/1m-context-ga) Bigger windows raise the capacity ceiling; they don't establish quality for an overloaded agent trace. Frameworks such as LangGraph expose short- and long-term memory plus summarization or deletion patterns because state management remains an application responsibility.[9] (Reference 9: LangGraph Memory Overview, https://docs.langchain.com/oss/python/concepts/memory)

The practical upshot: when an agent underperforms, inspect the transmitted context alongside model and window choices. Name a suspected failure mode, apply a bounded candidate change, and measure whether it improves the task.

Common pitfalls

| Symptom | Likely cause | Fix |
| --- | --- | --- |
| Agent gets worse over a long session despite spare window | Distraction from accumulated stale history | Compact the transcript; prune old tool results |
| Agent keeps citing a wrong fact | Context poisoning: an early error is being re-referenced | Remove the bad tokens from history; don't just add a correction |
| Agent picks irrelevant tools or ignores the right one | Context confusion from overlapping active tools | Gate tools per phase; evaluate a smaller active set |
| Model violates a format rule when a Model Context Protocol (MCP) tool is attached | Context clash between tool instructions and system rules | Reconcile instructions or isolate the tool behind a sub-agent |
| Costs balloon and latency rises with no quality gain | Stuffing the window instead of curating it | Apply the smallest-high-signal-set principle: select and compress |
| Sub-agent answers conflict with each other | Isolation without reconciliation | Have the lead agent resolve clashes before acting |

Use this checklist before shipping an agent that handles long sessions:

  • Does the context shrink or stay bounded across turns, or does it only grow? Unbounded growth invites distraction and rot.
  • Are tool results pruned or compacted once they are no longer needed for the next step?
  • Does a phase-gated active tool set outperform loading every available tool on your evaluation cases?
  • When a sub-agent is used, does it return a distilled summary rather than its full transcript?
  • Do you have an eval that catches poisoning, that is, a wrong fact persisting and being re-cited across turns?
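Two of the checklist items lend themselves to automated checks. The sketch below is illustrative: `context_is_bounded`, `poison_recurrence`, the 4,000-token limit, and the sample turns are all hypothetical, and substring matching stands in for whatever claim detection a real eval would use.

```python
def context_is_bounded(tokens_per_turn: list[int], limit: int) -> bool:
    # True only if no turn sent more than `limit` tokens to the model.
    return all(tokens <= limit for tokens in tokens_per_turn)

def poison_recurrence(turns: list[str], disproven_claim: str, corrected_at: int) -> int:
    """Count turns after the correction that still cite the disproven claim."""
    later_turns = turns[corrected_at + 1:]
    return sum(disproven_claim in turn.lower() for turn in later_turns)

turns = [
    "guess: migration lock caused the outage",
    "trace shows auth callback errors",
    "correction: migration lock ruled out by db_lock_check",
    "rollback advised because migration lock caused the outage",
]
recurrences = poison_recurrence(turns, "migration lock caused", corrected_at=2)
bounded = context_is_bounded([1200, 2100, 3400, 5200], limit=4000)
print(f"poisoned citations after correction: {recurrences}")
print(f"context stayed bounded: {bounded}")
```

A nonzero recurrence count after the correction turn is the signature of poisoning: the fix was added to history, but the wrong claim was never removed from it.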

Diagnostic playbook

When a long-running agent underperforms, use this sequence:

  1. Inspect the actual context, not the window size or cost alone.
  2. Name the failure mode: poisoning, distraction, confusion, or clash.
  3. Choose the matching move: write, select, compress, or isolate.
  4. Rebuild the next call around the smallest high-signal working set.
  5. Compare baseline and curated calls on accuracy, latency, token cost, and failure recurrence.
  6. Only after that ask whether you still need a different model, window, or architecture.
For poisoning, the matching move can be as small as rebuilding the notes without anything marked disproven.

remove-disproven-facts.py

```python
def rebuild_notes(notes: list[dict]) -> list[str]:
    return [
        note["text"]
        for note in notes
        if note["status"] != "disproven"
    ]

notes = [
    {"text": "auth callback errors confirmed", "status": "confirmed"},
    {"text": "migration lock caused RUN-842", "status": "disproven"},
    {"text": "database lock ruled out", "status": "confirmed"},
]
print("rebuilt notes:", rebuild_notes(notes))
```

Output

```text
rebuilt notes: ['auth callback errors confirmed', 'database lock ruled out']
```
The comparison in step 5 can be encoded as a release gate that approves a curated context only if quality holds, cost drops, and no poisoned references remain.

curation-release-gate.py

```python
def approve_curation(
    baseline_accuracy: float,
    curated_accuracy: float,
    baseline_tokens: int,
    curated_tokens: int,
    poisoned_references_after: int,
) -> bool:
    quality_ok = curated_accuracy >= baseline_accuracy
    cost_ok = curated_tokens < baseline_tokens
    poisoning_removed = poisoned_references_after == 0
    return quality_ok and cost_ok and poisoning_removed

approved = approve_curation(
    baseline_accuracy=0.82,
    curated_accuracy=0.87,
    baseline_tokens=12_436,
    curated_tokens=3_346,
    poisoned_references_after=0,
)
print(f"curated context approved: {approved}")
```

Output

```text
curated context approved: True
```

Long-context window management gave you the machinery to hold many tokens. Context engineering adds the judgment to decide which tokens earn their place. The next chapter shifts to model architecture, where sparse routing changes the compute-per-token economics that all of this sits on top of.

Mastery check

Before moving on, make sure you can do all of these without hand-waving:

  • Explain how context engineering differs from prompt engineering and from window management, and why all three are distinct skills.
  • Define context rot and use it to justify evaluating a curated alternative even when a prompt fits.
  • Diagnose an agent failure as poisoning, distraction, confusion, or clash from its symptoms, and name a fix for each.
  • Classify RAG, summarization, scratchpads, and focused workers into write, select, compress, or isolate, and explain why prompt caching is a different kind of lever: it changes the cost of a repeated prefix, not what the model sees.
  • Rebuild the failed-canary agent's next call around a small high-signal working set and explain what stays in the window, what moves to notes, and what gets dropped.
  • Explain why a larger supported window doesn't remove the need for curation evaluation.

Evaluation rubric

  • Strong: You name the failure mode, pick the matching move, and specify what stays in the active window, what moves outside it, and what should be deleted.
  • Partial: You notice bloat and suggest summarization or a bigger window, but you don't diagnose poisoning, distraction, confusion, or clash with enough specificity.
  • Weak: You treat "it still fits" as proof the context is healthy, or you leave poisoned and stale tokens in place because capacity remains.

Follow-up questions

Our agent has a 1M-token window. Why bother curating context at all?

Because window size sets capacity, not quality. Context-rot evaluations report that reliability can decline as input grows, so a technically valid prompt still deserves comparison with a curated candidate. The relevant measurements are task quality, cost, latency, and whether the context contains the evidence needed for the answer.
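
One way to check this on your own traces is to bucket evaluation results by input size and compare accuracy per bucket. The sketch below is a minimal illustration: the record fields (`input_tokens`, `correct`), the bucket edges, and the sample records are all assumptions made for the example, not measured data or part of any particular eval harness.

```python
from collections import defaultdict

# Hypothetical bucket upper bounds in tokens; pick edges that match your traffic.
BUCKET_EDGES = [4_000, 32_000, 128_000]


def bucket_label(input_tokens: int) -> str:
    """Return the label of the first bucket whose upper bound covers the input."""
    for edge in BUCKET_EDGES:
        if input_tokens <= edge:
            return f"<={edge}"
    return f">{BUCKET_EDGES[-1]}"


def accuracy_by_input_size(results: list[dict]) -> dict[str, float]:
    """Group eval records by input size and compute accuracy per bucket."""
    totals: dict[str, int] = defaultdict(int)
    hits: dict[str, int] = defaultdict(int)
    for record in results:
        label = bucket_label(record["input_tokens"])
        totals[label] += 1
        hits[label] += int(record["correct"])
    return {label: hits[label] / totals[label] for label in totals}


# Illustrative records only, not measured data.
results = [
    {"input_tokens": 3_000, "correct": True},
    {"input_tokens": 3_500, "correct": True},
    {"input_tokens": 90_000, "correct": True},
    {"input_tokens": 110_000, "correct": False},
]
print(accuracy_by_input_size(results))
```

If accuracy drops in the larger buckets on comparable tasks, that is a signal to test a curated candidate, even though every call fits the window.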

How do you decide between compressing the context and isolating it into a sub-agent?

Compress when the information must remain available to the same reasoning thread but is too verbose, for example a long transcript that you summarize or stale tool results you prune. Isolate when a subtask is noisy and self-contained, such as a broad search that will generate dozens of intermediate results the main agent never needs to see. Isolation buys the cleanest separation but adds coordination overhead and a risk of clash between sub-agent outputs, so prefer compression for in-thread bloat and reserve isolation for genuinely separable, exploration-heavy subtasks.
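
The rule of thumb above can be written as a small decision helper. This is a sketch under stated assumptions: the `WorkItem` fields and the `choose_move` function are hypothetical names introduced for illustration, and real systems would weigh coordination overhead more carefully than three booleans allow.

```python
from dataclasses import dataclass


@dataclass
class WorkItem:
    name: str
    needed_in_thread: bool  # must the lead agent keep reasoning over this material?
    self_contained: bool    # can the subtask run without the lead agent's history?
    noisy: bool             # will it produce many intermediate results?


def choose_move(item: WorkItem) -> str:
    """Reserve isolation for separable, noisy subtasks; otherwise compress."""
    if item.self_contained and item.noisy and not item.needed_in_thread:
        return "isolate"
    return "compress"


items = [
    WorkItem("long decision transcript", needed_in_thread=True, self_contained=False, noisy=False),
    WorkItem("broad linked-issue search", needed_in_thread=False, self_contained=True, noisy=True),
]
for item in items:
    print(f"{item.name}: {choose_move(item)}")
```

Defaulting to compression reflects the guidance that isolation carries extra coordination cost and should be the exception.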

An agent keeps repeating a wrong fact about an incident. Which failure mode is this and how do you fix it?

That's context poisoning: an error entered the context and is now being re-referenced on later turns. Appending a correction may not help, because both the wrong fact and the correction stay in view. A strong candidate fix is to remove disproven notes from the rebuilt context (or compact the transcript without them), then run an evaluation that checks whether the wrong fact is still re-cited.
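
A re-citation check can be as simple as counting outputs that still mention a disproven claim. The sketch below assumes plain substring matching and hypothetical sample outputs; it will miss paraphrases, so treat it as a starting point instead of a complete evaluator.

```python
def recitation_count(outputs: list[str], disproven_phrases: list[str]) -> int:
    """Count outputs that still mention any disproven phrase (case-insensitive)."""
    lowered = [phrase.lower() for phrase in disproven_phrases]
    return sum(
        1
        for output in outputs
        if any(phrase in output.lower() for phrase in lowered)
    )


disproven = ["migration lock caused RUN-842"]

# Illustrative agent outputs, not real traces.
before = [
    "Likely root cause: migration lock caused RUN-842.",
    "Retrying because migration lock caused RUN-842 earlier.",
]
after = ["Auth callback errors are confirmed; checking the rollback runbook."]

print("re-citations before:", recitation_count(before, disproven))
print("re-citations after:", recitation_count(after, disproven))
```

The resulting count could feed the `poisoned_references_after` argument of the release gate shown earlier in this lesson.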

How many tools is too many to load into one agent's context?

There's no universal limit. Treat overlapping tool descriptions and irrelevant tool calls as symptoms to measure. Test a phase-gated tool set against the full registry on representative tasks, then retain the smallest set that preserves coverage and improves selection accuracy. That's a selection fix for the confusion failure mode.
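
The comparison described above can be sketched as follows. The tool names, phase tags, and agent choices are hypothetical and exist only to show the shape of the test; in practice the choices would come from running representative tasks against each tool set.

```python
# Hypothetical registry: tool name -> phases in which the tool is useful.
TOOL_PHASES = {
    "trace_lookup": {"investigation"},
    "log_search": {"investigation"},
    "rollback_deploy": {"remediation"},
    "cluster_admin": {"maintenance"},
}


def tools_for_phase(phase: str) -> list[str]:
    """Return only the tools tagged for the current phase, in sorted order."""
    return sorted(name for name, phases in TOOL_PHASES.items() if phase in phases)


def selection_accuracy(chosen: list[str], expected: list[str]) -> float:
    """Fraction of tasks where the agent picked the expected tool."""
    matches = sum(1 for got, want in zip(chosen, expected) if got == want)
    return matches / len(expected)


expected = ["trace_lookup", "log_search", "trace_lookup"]
# Illustrative agent choices, not measured results.
full_registry_choices = ["trace_lookup", "cluster_admin", "trace_lookup"]
gated_choices = ["trace_lookup", "log_search", "trace_lookup"]

print("investigation tools:", tools_for_phase("investigation"))
print("full registry accuracy:", round(selection_accuracy(full_registry_choices, expected), 2))
print("phase-gated accuracy:", round(selection_accuracy(gated_choices, expected), 2))
```

Keep the gated set only if it also preserves coverage, meaning every task in the sample can still be completed with the tools that remain.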

You're curating the failed-canary agent's next call. What belongs in the window, what should move to notes, and what should be dropped?

Keep only the current alert, latest trace span, applicable rollback runbook, short scratchpad, and small tool set needed for this phase. Move durable findings such as "auth callback errors confirmed" or "migration lock ruled out" into notes so the knowledge survives without bloating the active window. Drop stale logs, irrelevant tools, and any poisoned guesses that the agent has already proven false. That answer uses the full taxonomy at once: write out durable facts, select current evidence, compress or prune old history, and isolate anything noisy enough to deserve its own clean sub-window.
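
The keep, move, and drop decisions above can be sketched as a small partitioning function. The `status` and `durable` tags are assumptions introduced for this example; a real agent would derive them from its own trace metadata.

```python
def curate(items: list[dict]) -> dict[str, list[str]]:
    """Sort context items into window, notes, or dropped based on simple tags."""
    plan: dict[str, list[str]] = {"window": [], "notes": [], "dropped": []}
    for item in items:
        if item["status"] in {"disproven", "stale", "irrelevant"}:
            plan["dropped"].append(item["name"])  # poisoned or useless tokens
        elif item["durable"]:
            plan["notes"].append(item["name"])    # survives outside the window
        else:
            plan["window"].append(item["name"])   # needed for this call
    return plan


# Hypothetical tagging of the failed-canary context.
items = [
    {"name": "current alert", "status": "current", "durable": False},
    {"name": "latest trace span", "status": "current", "durable": False},
    {"name": "rollback runbook", "status": "current", "durable": False},
    {"name": "auth callback errors confirmed", "status": "confirmed", "durable": True},
    {"name": "migration-lock guess", "status": "disproven", "durable": False},
    {"name": "old trace logs", "status": "stale", "durable": False},
    {"name": "cluster-admin tool", "status": "irrelevant", "durable": False},
]
plan = curate(items)
for bucket, names in plan.items():
    print(f"{bucket}: {names}")
```

Checking the drop conditions first means a disproven item never reaches notes, even if it was once marked durable.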

Complete the lesson

Mastery Check

Answer every question, then check your score. Score above 75% to mark this lesson complete.

1. An incident agent already fits within the model window and serving latency is acceptable, but each call includes raw trace exports, irrelevant tool definitions, and a disproven migration-lock guess. Which engineering move targets the actual problem?
2. A trace audit finds these symptoms: a disproven migration-lock cause is still cited; the agent re-runs an old issue search; it picks a cluster-admin tool from an overloaded tool set; and an XML tool instruction conflicts with a JSON-only system rule. Which mapping is correct?
3. A team wants to append every linked issue and trace export to a prompt because the model's hard limit is far away. What does context rot imply they should measure before adopting that policy?
4. During a curation pass, the agent persists 'auth callback errors confirmed' to notes, loads only investigation-phase tools, replaces old raw trace spans with recorded findings, and sends linked-issue exploration to a worker that returns a bounded summary. Which labels match those four actions in order?
5. Use compression when verbose information must remain available to the same reasoning thread, and isolation when a separable subtask will produce noisy intermediate traces the lead agent does not need. An incident agent has both a long decision transcript and a broad linked-issue search. Which plan best applies these context-engineering rules?
6. The team notices that the same system prompt and policy version are sent on many calls, so the provider may cache the stable prefix. They also see the agent choosing irrelevant tools because all 50 tool descriptions are loaded. What conclusion follows?
7. The next call must resolve deploy RUN-842. The available context includes the alert, latest trace span, applicable rollback runbook, a short scratchpad, phase-relevant lookup tools, confirmed findings, stale trace logs, irrelevant issue-search and cluster-admin tools, and a disproven migration-lock guess. Which curation plan preserves needed evidence while reducing confusion and poisoning?
8. A long-session incident agent suddenly underperforms. The trace still fits in the selected model's window, and no one has inspected the actual messages sent to the model. What should the team do before switching models or expanding the window?


Next Step
Continue to Mixture of Experts Architecture

There, you'll examine sparse expert routing, capacity-versus-compute trade-offs, and the serving measurements needed to evaluate an MoE deployment.

References

Effective context engineering for AI agents

Anthropic · 2025

A Survey of Context Engineering for Large Language Models

Multiple authors · 2025 · arXiv preprint

Context Rot: How Increasing Input Tokens Impacts LLM Performance

Hong, K., Troynikov, A., & Huber, J. · 2025

How Long Contexts Fail

Breunig, D. · 2025

Context Engineering for Agents

LangChain · 2025

Agent Memory: How to Build Agents that Learn and Remember

Letta · 2026

Prompt caching

Anthropic · 2026 · Official documentation

1M context is now generally available for Opus 4.6 and Sonnet 4.6

Anthropic · 2026

LangGraph Memory Overview

LangChain · 2026