LeetLLM
© 2026 LeetLLM. All rights reserved.

🤖HardLLM Agents & Tool Use

AI Coding Workflow with Agents

Scope coding-agent tasks, isolate execution, keep patches on branches, verify behavior, and preserve human merge ownership.

27 min read

Human-in-the-loop systems pause agents before high-risk actions. A coding agent creates two kinds of risk: it can run commands during its work, and it can propose source changes that become consequential once reviewed, merged, or deployed. A small patch can alter auth, expose a secret, break billing, or change deployment behavior.

AI coding agents can still write a pull request while you work on something else.

That sounds useful until the diff touches the wrong files, skips the failing test, updates a dependency for no reason, and hides a security problem behind a confident summary.

Adding an agent means adding an automated code-generation worker to your engineering workflow. If you tell it to "fix the billing webhook" and leave the task unbounded, it may return code that looks plausible but changes the wrong behavior. The engineering skill isn't "ask for code and trust the summary." It's designing a workflow where the agent can help without bypassing execution, review, or deployment controls.

Code Generation & Sandboxing and Human-in-the-Loop Agent Architecture covered the two controls this workflow needs: generated code runs in isolation, and risky agent actions require approval gates. Here, you'll apply those controls to an isolated workspace, branch, test, review, and deploy loop.

The workflow below is the operating model for the rest of the article: the agent works in a restricted execution environment and writes only to a review branch. Tests, redacted evidence, review, and merge ownership remain explicit.

[Figure: AI coding agent workflow with explicit scope, restricted execution and branch work, redacted check evidence, and human merge ownership.]
A branch makes source changes reviewable; a restricted workspace limits what commands can affect while the patch is produced. Both controls are needed before a reviewer decides whether a diff can merge.

What a coding agent is

First, separate two things people often blur together. A coding assistant suggests code as you type, inside your editor. A coding agent takes a scoped task, reads the repository, plans changes, edits files, runs commands, and produces a branch or pull request with little keystroke-by-keystroke input from you. The agent loop needs the strongest controls because it can change files and run commands.

A coding agent, then, is an AI system that can inspect a repository, plan changes, edit files, run commands, and produce a branch or patch.

Modern tools expose this across several surfaces, and their features change quickly. GitHub documents Copilot coding agent as working in an ephemeral GitHub Actions-powered environment where it can make changes, execute tests and linters, and open a pull request for review.[1] Anthropic's Claude Code docs describe an agentic tool that reads a codebase, edits files, and runs commands across terminal and other surfaces.[2] OpenAI's Codex app documentation describes local, worktree, and cloud task modes, including isolated worktrees for parallel tasks.[3] Product surfaces differ; the stable lesson is to control execution, bound diffs, collect evidence, and keep merge authority outside the agent.

A useful workflow is plan-then-edit: let the agent inspect the necessary code, ask it for a bounded plan, correct that plan, and only then authorize an edit pass. A plan isn't proof of correctness, but catching the wrong approach before a large diff is much cheaper than untangling unrelated changes afterward.
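The plan-then-edit gate can be sketched as a small check that runs before any edit pass is unlocked. This is a minimal illustration, not a feature of any particular tool: the `Plan` record, the `authorize_edit_pass` helper, and the `OWNED` prefixes are hypothetical names chosen for this example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Plan:
    summary: str
    files_to_edit: tuple[str, ...]

# Hypothetical ownership prefixes for the task; a real contract would supply these.
OWNED = ("web/src/app/settings/", "web/src/lib/settings.ts")

def authorize_edit_pass(plan: Plan, owned_prefixes: tuple[str, ...], human_approved: bool) -> str:
    """Unlock editing only for a non-empty, in-scope plan that a human approved."""
    if not plan.files_to_edit:
        return "rejected: plan names no files"
    # str.startswith accepts a tuple and matches if any prefix matches.
    outside = [path for path in plan.files_to_edit if not path.startswith(owned_prefixes)]
    if outside:
        return "rejected: plan leaves scope: " + ", ".join(outside)
    if not human_approved:
        return "waiting: plan needs human approval"
    return "edit pass authorized"

in_scope = Plan("Fix settings save bug", ("web/src/lib/settings.ts",))
drifting = Plan("Fix settings save bug", ("web/src/lib/settings.ts", "auth/middleware.ts"))

print(authorize_edit_pass(in_scope, OWNED, human_approved=False))
print(authorize_edit_pass(in_scope, OWNED, human_approved=True))
print(authorize_edit_pass(drifting, OWNED, human_approved=True))
```

Checking scope before approval keeps the human from spending attention on a plan that would be rejected mechanically anyway.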

The core loop is familiar from agent design. In coding work, every transition creates a reviewable artifact.

[Figure: Coding agent control loop from scoped task through repo reading, restricted patch execution, checks, pull request evidence, human review, and merge ownership.]
The useful loop is explicit: scope the task, patch in a restricted workspace, collect redacted evidence from checks, and let review send changes back to patching instead of accepting scope drift.

An agent can move through many steps, but the human still owns the merge.

How agents plan and correct themselves

A coding agent doesn't write code in a single shot the way a human might type an email. It works in a loop: read the problem, plan a step, take an action, observe the result, and decide what to do next.

This loop follows the ReAct pattern, which interleaves reasoning with actions.[4] The agent might read a test file, notice the failure is in a validation function, edit the function, run the tests again, see a new error, and fix that too. Each round gives the agent local feedback, which is why bounded tasks with fast tests work so well.

The thing that closes this loop is a verifier: an automatic check the agent can run after each edit to decide whether it got closer. In coding work, tests, type checkers, and linters are the cheapest, most reliable verifiers you have. A failing test that the agent can rerun turns a vague "is this right?" into a concrete pass or fail signal. No verifier means the agent is guessing, and you're reviewing guesses.

A verifier loop also needs a stopping rule. This tiny simulation takes successive test outcomes as input and returns either the first verified patch or an escalation after its repair budget is exhausted.

bound-agent-repair-loop.py
```python
def run_repair_loop(test_results: list[str], max_attempts: int = 3) -> str:
    # Only the first max_attempts outcomes count toward the repair budget.
    for attempt, result in enumerate(test_results[:max_attempts], start=1):
        if result == "pass":
            return f"verified on attempt {attempt}"
    return "stop: request human diagnosis"

print("recovering patch:", run_repair_loop(["fail", "pass"]))
print("stuck patch:", run_repair_loop(["fail", "fail", "fail", "pass"]))
```
Output
```text
recovering patch: verified on attempt 2
stuck patch: stop: request human diagnosis
```

A second pattern, Reflection, lets an agent critique its own draft before showing it to you.[5] After producing a diff, the agent can run a second pass to check for security risks, missing tests, or style violations. That's the difference between a single oversized agent that tries to do everything at once and a workflow where separate passes handle exploration, implementation, review, and verification.

Modern agents pull repository context through tool protocols such as the Model Context Protocol (MCP), which lets them query files, documentation, and APIs without you copy-pasting text into a chat window.[6] The shift from writing better prompts to giving the agent better data is called context engineering. A well-scoped task with precise file ownership and a clean test suite gives the agent far more signal than a giant prompt full of instructions.

Good tasks are bounded

Bad task:

Improve the app and clean up anything you find.

That prompt invites scope drift.

Better task:

Fix the account settings save bug.

Scope:

  • Own files under web/src/app/settings/ and web/src/lib/settings.ts.
  • Don't edit auth, billing, deployment, or unrelated styling.
  • Work without production credentials or external side effects.

Acceptance:

  • Reproduce failing save behavior with a test.
  • Fix the bug.
  • Run pnpm test --run settings and pnpm lint.
  • Summarize files changed and remaining risk.

Execution:

  • Use the restricted project runner.
  • Don't install dependencies, run deploys, or call production systems.

That prompt gives the agent a useful box.

| Prompt field | Why it helps |
| --- | --- |
| Goal | Defines the behavior and the code area. |
| Scope | Limits file ownership and blast radius. |
| Non-goals | Prevents surprise cleanup. |
| Acceptance checks | Gives the agent a finish line. |
| Commands | Makes verification explicit. |
| Execution policy | Prevents the patch task from gaining unrelated capabilities. |
| Risk notes | Tells review where to look. |

Bounded tasks reduce merge conflicts and make review possible.
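One way to keep every prompt field present is to hold the task as structured data and refuse to render a prompt with gaps. The sketch below is an illustration under assumptions: `TaskContract`, `missing_fields`, and `render_prompt` are hypothetical names, and the risk note text is invented for the example.

```python
from dataclasses import dataclass, fields, replace

@dataclass(frozen=True)
class TaskContract:
    goal: str
    scope: tuple[str, ...]
    non_goals: tuple[str, ...]
    acceptance_checks: tuple[str, ...]
    commands: tuple[str, ...]
    execution_policy: tuple[str, ...]
    risk_notes: tuple[str, ...]

def missing_fields(contract: TaskContract) -> list[str]:
    """Names of fields left empty (empty string or empty tuple)."""
    return [f.name for f in fields(contract) if not getattr(contract, f.name)]

def render_prompt(contract: TaskContract) -> str:
    gaps = missing_fields(contract)
    if gaps:
        raise ValueError("task contract is missing: " + ", ".join(gaps))
    lines = [contract.goal, ""]
    for f in fields(contract):
        if f.name == "goal":
            continue
        lines.append(f.name.replace("_", " ").capitalize() + ":")
        lines.extend(f"  - {item}" for item in getattr(contract, f.name))
    return "\n".join(lines)

contract = TaskContract(
    goal="Fix the account settings save bug.",
    scope=("Own files under web/src/app/settings/ and web/src/lib/settings.ts.",),
    non_goals=("Don't edit auth, billing, deployment, or unrelated styling.",),
    acceptance_checks=("Reproduce failing save behavior with a test.", "Fix the bug."),
    commands=("pnpm test --run settings", "pnpm lint"),
    execution_policy=("Use the restricted project runner.",),
    risk_notes=("Illustrative note: call out any change to validation rules.",),
)

print(render_prompt(contract))
print("gaps without risk notes:", missing_fields(replace(contract, risk_notes=())))
```

Failing fast on an empty field is a deliberate choice here: a prompt that silently omits scope or non-goals looks complete to the agent.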

Split agent roles

Don't use one vague agent pass for everything.

Define the database schema and API contracts yourself first, then assign a bounded worker to fill in the implementation while you review every change at each step. If you hand an agent a vague task like "Build me an access-control service" and disappear, it's likely to fail.

Use roles with clear ownership boundaries:

| Role | Good use | Bad use |
| --- | --- | --- |
| Explorer | Read code and answer one question. | Making broad edits without ownership. |
| Worker | Implement a bounded change. | Changing architecture without review. |
| Reviewer | Find bugs, risks, and missing tests. | Rubber-stamping its own diff. |
| Verifier | Run checks and inspect output. | Treating command success as product success. |

This separation matters because agents are good at producing plausible work. A reviewer agent shouldn't be asked to defend the same implementation it just wrote.

Even when one tool provides one interface, you can still run the workflow in passes:

  1. Ask for a short codebase read.
  2. Assign a bounded patch.
  3. Ask for a review of the diff.
  4. Run tests yourself or in CI.
  5. Inspect the behavior before merge.
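The passes above can be recorded and checked mechanically. The following sketch assumes a hypothetical `PassRecord` with an `actor` identifier for whichever session or system ran each pass; it flags a reviewer that shares an actor with the worker and verification that was not run by CI or a human.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PassRecord:
    role: str   # "explorer", "worker", "reviewer", or "verifier"
    actor: str  # hypothetical identifier for the session or system that ran the pass

REQUIRED_ORDER = ("explorer", "worker", "reviewer", "verifier")
TRUSTED_VERIFIERS = ("ci", "human")

def check_passes(passes: list[PassRecord]) -> list[str]:
    findings: list[str] = []
    if tuple(p.role for p in passes) != REQUIRED_ORDER:
        findings.append("passes missing or out of order")
    by_role = {p.role: p.actor for p in passes}
    # A reviewer should not defend the implementation it just wrote.
    if "reviewer" in by_role and by_role["reviewer"] == by_role.get("worker"):
        findings.append("reviewer is the same actor as worker")
    if by_role.get("verifier") not in TRUSTED_VERIFIERS:
        findings.append("verification not run by ci or a human")
    return findings

separated = [
    PassRecord("explorer", "session-1"),
    PassRecord("worker", "session-2"),
    PassRecord("reviewer", "session-3"),
    PassRecord("verifier", "ci"),
]
self_reviewed = [
    PassRecord("explorer", "session-1"),
    PassRecord("worker", "session-2"),
    PassRecord("reviewer", "session-2"),
    PassRecord("verifier", "session-2"),
]

print("separated:", check_passes(separated) or "ok")
print("self-reviewed:", check_passes(self_reviewed))
```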

Restricted workspace to branch workflow

A controlled workflow separates two boundaries: a restricted environment for commands executed while producing a patch, and version control for reviewing the proposed source changes.

[Figure: Restricted workspace to branch workflow showing issue scope, sandbox and branch controls, small commits, CI checks, human review, and rejection when scope drifts.]
A branch is an integration and rollback boundary, not a sandbox. Restrict command execution and credentials first; then use the branch to review and reject source-code drift before merge.

The branch is an integration boundary. It makes proposed changes inspectable before they enter a protected branch.

The review wrapper provides:

  • diff inspection
  • rollback path
  • CI checks
  • review comments
  • audit trail
  • separation between proposing a change and the privileged act of merging it

It doesn't stop an agent command from reading local credentials, making network calls, installing a dependency, or altering external infrastructure. That requires a restricted runner: an ephemeral workspace, least-privilege credentials, constrained network and command policy, and explicit review before privileged operations. If an agent edits directly on main, you also lose much of the integration boundary.
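One narrow piece of a restricted runner, keeping credentials out of the child process environment, can be sketched with the standard library. This is an illustration under assumptions: `ENV_ALLOWLIST`, `scrubbed_env`, and `run_restricted` are hypothetical names, and an environment scrub alone does not restrict the filesystem or network, which still need an operating-system sandbox.

```python
import os
import subprocess
import sys

# Assumed allowlist; SYSTEMROOT is included only so Python can start on Windows.
ENV_ALLOWLIST = ("PATH", "HOME", "LANG", "SYSTEMROOT")

def scrubbed_env(parent: dict[str, str]) -> dict[str, str]:
    """Copy only allowlisted variables, so tokens in the parent never reach the child."""
    return {name: value for name, value in parent.items() if name in ENV_ALLOWLIST}

def run_restricted(argv: list[str], parent_env: dict[str, str],
                   timeout_s: float = 60.0) -> subprocess.CompletedProcess:
    # argv is passed as a list and no shell is used, so nothing is string-interpolated.
    return subprocess.run(
        argv,
        env=scrubbed_env(parent_env),
        capture_output=True,
        text=True,
        timeout=timeout_s,
        check=False,
    )

# Simulate a parent process that has a production token mounted.
parent = dict(os.environ, PROD_API_KEY="example-not-a-real-key")
probe = [sys.executable, "-c", "import os; print('PROD_API_KEY' in os.environ)"]
result = run_restricted(probe, parent)
print("secret visible to child:", result.stdout.strip())
```

An allowlist is used rather than a blocklist because new secrets tend to appear under names nobody anticipated.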

Execution boundary

The task contract should state what the agent may execute as well as what it may edit. For a billing webhook test fix, an agent may need to run unit tests and read local source files; it doesn't need production secrets, package publishing, infrastructure changes, or outbound calls to production systems.

| Surface | Default rule for a bounded patch |
| --- | --- |
| Filesystem | Write only to working tree or worktree for assigned branch. |
| Credentials | Provide no production tokens; use fixtures or scoped test credentials only when needed. |
| Network | Deny by default or allow only approved package/test endpoints. |
| Commands | Allow repository test, lint, and format commands; review installs, migrations, and generated scripts. |
| External effects | Block deploys, releases, emails, payment writes, and infrastructure mutations. |

This admission check is intentionally small. It takes an argument vector rather than an interpolated shell string, then admits only commands named in the task contract. In a real runner, execute the approved argument vector without shell concatenation and keep the operating-system sandbox in place.

admit-coding-agent-command.py
```python
CONTRACT_COMMANDS = {
    ("pnpm", "test", "--run", "billing"),
    ("pnpm", "lint"),
    ("pnpm", "exec", "tsc", "--noEmit"),
    ("git", "diff"),
}
BLOCKED_PROGRAMS = {
    "curl", "npm", "pip", "publish", "terraform", "uv",
}

def admit_command(argv: tuple[str, ...], available_secrets: tuple[str, ...]) -> str:
    if available_secrets:
        return "blocked: credentials mounted"
    if not argv:
        return "blocked: empty command"
    program = argv[0].removeprefix("./")
    if program in BLOCKED_PROGRAMS:
        return "blocked: privileged or expanding command"
    if argv in CONTRACT_COMMANDS:
        return "allowed in restricted runner"
    return "needs explicit review"

print("test:", admit_command(("pnpm", "test", "--run", "billing"), ()))
print("infra:", admit_command(("terraform", "apply"), ()))
print("secreted test:", admit_command(("pnpm", "test", "--run", "billing"), ("PROD_API_KEY",)))
```
Output
```text
test: allowed in restricted runner
infra: blocked: privileged or expanding command
secreted test: blocked: credentials mounted
```

File ownership example

For parallel agent work, write ownership down.

```text
Worker A owns:
- `web/src/app/settings/page.tsx`
- `web/src/lib/settings.ts`
- `web/src/lib/settings.test.ts`

Worker B owns:
- `web/src/app/billing/page.tsx`
- `web/src/lib/billing.ts`
- `web/src/lib/billing.test.ts`

Both workers:
- Must not edit shared auth middleware.
- Must not change package versions.
- Must run targeted tests before reporting done.
```

This style prevents agents from overwriting each other and makes it obvious when a diff goes outside scope.

Ownership can be checked before two patches are combined. This example takes proposed changed files from two workers and detects the shared file that neither handoff resolved.

detect-overlapping-agent-edits.py
```python
worker_a = {"web/src/lib/settings.ts", "web/src/lib/settings.test.ts"}
worker_b = {"web/src/lib/billing.ts", "web/src/lib/settings.ts"}

overlap = sorted(worker_a & worker_b)
print("merge status:", "blocked" if overlap else "ready")
print("overlap:", ", ".join(overlap) if overlap else "none")
```
Output
```text
merge status: blocked
overlap: web/src/lib/settings.ts
```

Tests aren't optional

Agent-generated code often looks correct.

That doesn't mean behavior is correct.

Benchmarks such as SWE-bench evaluate agents on real GitHub issues because repository work is full of hidden context, failing tests, and edge cases that don't appear in isolated coding tasks.[7] SWE-agent and Agentless both study how agent-computer interfaces and repair strategies affect software engineering performance on real tasks.[8][9]

Read SWE-bench Verified numbers carefully. Verified tasks come from public GitHub history, which creates contamination risk as code enters training sets. SWE-bench Pro includes substantially different, longer-horizon tasks and a private-repository subset intended to reduce that risk.[10] OpenAI reported in February 2026 that Verified was no longer a sound frontier evaluation for its models and recommended SWE-bench Pro for more informative measurement.[11] The practical reading is stable: a benchmark score isn't evidence that an agent fixed your bug. Your own failing test is.

For day-to-day engineering, the lesson is simple: require evidence.

| Change type | Minimum evidence |
| --- | --- |
| Bug fix | Repro test fails before and passes after. |
| Feature | Tests cover normal path and error path. |
| Refactor | Existing tests pass and behavior diff is explained. |
| UI change | Screenshot or Playwright smoke when practical. |
| API change | Contract test or curl example. |
| Security change | Threat-specific test or manual review note. |

The agent's summary isn't evidence.

The diff, tests, redacted logs, screenshots, and reproduced behavior are evidence.

A reproduction should demonstrate the reported defect before the patch and the intended invariant after it. Here the same billing webhook retry creates two invoices before idempotency handling and one after it.

prove-billing-webhook-retry-fix.py
```python
events = ["billing_event_78291", "billing_event_78291"]

def create_without_idempotency(retries: list[str]) -> list[str]:
    # Every delivery creates an invoice, so a retry duplicates the side effect.
    return [f"invoice_for_{retry}" for retry in retries]

def create_with_idempotency(retries: list[str]) -> list[str]:
    # dict.fromkeys keeps the first occurrence of each event id, in order.
    return [f"invoice_for_{retry}" for retry in dict.fromkeys(retries)]

print("before patch invoices:", len(create_without_idempotency(events)))
print("after patch invoices:", len(create_with_idempotency(events)))
```
Output
```text
before patch invoices: 2
after patch invoices: 1
```

Command output is useful only after it's safe to attach to a pull request. This redactor takes captured test output and removes a credential value before the evidence packet is shown to reviewers.

redact-test-evidence.py
```python
import re

def redact_output(output: str) -> str:
    return re.sub(r"PROD_API_KEY=\S+", "PROD_API_KEY=[redacted]", output)

captured = "billing webhook retry failed PROD_API_KEY=secret-abc123 webhook_id=78291"
review_output = redact_output(captured)
print(review_output)
print("secret visible:", "secret-abc123" in review_output)
```
Output
```text
billing webhook retry failed PROD_API_KEY=[redacted] webhook_id=78291
secret visible: False
```

This example catches one known credential shape for teaching purposes. Production evidence pipelines should redact against the secrets mounted in the runner, apply a secret scanner, and cap attached output before it reaches a pull request.
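A step closer to that production shape is to redact by the values actually mounted in the runner and then cap the attachment size. The sketch below assumes a hypothetical `redact_and_cap` helper and an arbitrary `max_chars` default; a secret scanner would still run as a separate step and is not shown.

```python
def redact_and_cap(output: str, mounted_secrets: dict[str, str], max_chars: int = 2000) -> str:
    """Replace every mounted secret value, then truncate what remains."""
    redacted = output
    # Longest values first, so a secret that contains a shorter one is removed whole.
    for name, value in sorted(mounted_secrets.items(), key=lambda item: len(item[1]), reverse=True):
        if value:
            redacted = redacted.replace(value, f"[redacted:{name}]")
    # Cap after redaction so truncation can never leave half a secret behind.
    if len(redacted) > max_chars:
        dropped = len(redacted) - max_chars
        redacted = redacted[:max_chars] + f"\n[truncated {dropped} chars]"
    return redacted

captured_log = "retry failed token=secret-abc123 " + "x" * 50
evidence = redact_and_cap(captured_log, {"PROD_API_KEY": "secret-abc123"}, max_chars=60)
print(evidence)
print("secret visible:", "secret-abc123" in evidence)
```

Matching on the value rather than on a `NAME=value` pattern also catches the secret when it appears inside a URL, header, or stack trace.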

Review the diff like production code

Start review with these questions:

  1. Did the diff solve the requested behavior?
  2. Did it edit only the intended files?
  3. Did it add or update tests that would fail without the fix?
  4. Did it touch auth, secrets, dependencies, shell commands, file access, billing, or deployment?
  5. Did it create hidden fallback behavior?
  6. Did it remove error handling, logging, or validation?
  7. Did it make code harder to maintain?

Then inspect dangerous patterns.

| Pattern | Risk |
| --- | --- |
| eval, shell commands, dynamic imports | Code execution risk. |
| New dependency | Supply chain and bundle risk. |
| Broad catch block | Hidden failures. |
| Silent fallback | Product behavior drifts without alerting. |
| Secret in code or captured output | Data exposure. |
| Tests that only assert rendering | Behavior may still be broken. |
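A few of these patterns can be flagged mechanically before a reviewer reads the diff. This is a rough sketch, not a security scanner: the regular expressions in `RISK_PATTERNS` are illustrative assumptions that will miss cases and raise false positives, and the sample diff is invented for the example.

```python
import re

# Illustrative patterns only; real review tooling needs language-aware analysis.
RISK_PATTERNS = {
    "code execution": re.compile(r"\beval\(|\bexec\(|shell\s*=\s*True"),
    "broad catch": re.compile(r"except\s*:|except\s+Exception\b"),
    "possible secret": re.compile(
        r"(api[_-]?key|secret|token)\s*[:=]\s*['\"][^'\"]+['\"]", re.IGNORECASE
    ),
}

def scan_added_lines(unified_diff: str) -> list[tuple[int, str, str]]:
    """Return (diff line number, label, text) for risky lines the patch adds."""
    findings = []
    for number, line in enumerate(unified_diff.splitlines(), start=1):
        # Only added lines matter; "+++" is the file header, not content.
        if not line.startswith("+") or line.startswith("+++"):
            continue
        for label, pattern in RISK_PATTERNS.items():
            if pattern.search(line):
                findings.append((number, label, line[1:].strip()))
    return findings

sample_diff = "\n".join([
    "--- a/web/src/lib/settings.ts",
    "+++ b/web/src/lib/settings.ts",
    "@@ -1,3 +1,4 @@",
    " export function save(input) {",
    '+  const API_KEY = "example-not-a-real-key";',
    "+  return eval(input.expression);",
    "-  return persist(input);",
    " }",
])

for number, label, text in scan_added_lines(sample_diff):
    print(f"diff line {number}: {label}: {text}")
```

A hit is a prompt for a human to look closer, not a verdict; a clean scan says nothing about silent fallbacks or weak tests.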

OWASP's LLM security work identifies prompt injection and sensitive information disclosure as major categories for LLM applications.[12] Coding agents add another angle: untrusted issue text, docs, comments, and test data may influence code changes or attempt to make the agent execute a command. Treat repository text as evidence about the task, not authority to broaden permissions, expose secrets, or run privileged operations.

Instructions discovered inside a repository don't expand the task's authority. The authority check distinguishes a trusted task contract from a README instruction that attempts to authorize a privileged command.

keep-repository-text-out-of-authority.py
```python
TASK_COMMANDS = {"pnpm test --run billing", "pnpm lint"}

def command_decision(command: str, source: str) -> str:
    if source != "task_contract":
        return "blocked: untrusted repository instruction"
    return "allowed" if command in TASK_COMMANDS else "blocked: outside contract"

print("task test:", command_decision("pnpm test --run billing", "task_contract"))
print("README infra:", command_decision("terraform apply", "README.md"))
```
Output
```text
task test: allowed
README infra: blocked: untrusted repository instruction
```

A practical review checklist

Use this checklist on every agent-generated PR:

| Check | Pass condition |
| --- | --- |
| Scope | Changed files match task ownership. |
| Behavior | Diff directly maps to acceptance criteria. |
| Tests | At least one test would fail before fix. |
| Commands | Reported checks ran. |
| Security | No unsafe shell, file, dependency, or secret handling. |
| Observability | Logs and errors remain useful. |
| Simplicity | No broad rewrites unrelated to task. |
| Rollback | Reverting the PR returns to previous behavior. |

If a PR fails the scope check, stop early.

Don't spend an hour reviewing unrelated churn. Ask for a smaller diff.

Review intensity can be computed from proposed paths before a reviewer opens every line. This helper takes changed files as input and routes payment, auth, deployment, and workflow edits to focused review.

route-sensitive-code-review.py
```python
# Substring markers: deliberately broad, so "auth" also matches names like "author".
SENSITIVE = ("auth", "billing", "billing/webhook", ".github/workflows", "deploy")

def review_route(paths: tuple[str, ...]) -> str:
    touched = [path for path in paths if any(part in path for part in SENSITIVE)]
    return "focused security review" if touched else "standard review"

print("settings only:", review_route(("web/src/app/settings/page.tsx",)))
print("payment edit:", review_route(("web/src/billing/webhook/create.ts",)))
```
Output
```text
settings only: standard review
payment edit: focused security review
```

You can encode part of this review check as a small policy rule before a human starts deep review. The check isn't a replacement for judgment; it catches obvious scope drift and weak evidence so reviewers don't waste time on unreviewable patches.

agent-pr-scope-gate.py
```python
from dataclasses import dataclass

SENSITIVE_MARKERS = ("/auth/", "/billing/", "/deploy/", ".github/workflows/")
DEPENDENCY_FILES = ("package.json", "pnpm-lock.yaml", "uv.lock")

@dataclass
class AgentTask:
    owned_paths: tuple[str, ...]
    requires_repro_test: bool = True

@dataclass
class PullRequestEvidence:
    changed_files: tuple[str, ...]
    added_tests: tuple[str, ...]
    reported_commands: tuple[str, ...]

def review_gate(task: AgentTask, evidence: PullRequestEvidence) -> list[str]:
    findings: list[str] = []

    def is_owned(path: str) -> bool:
        return any(
            path.startswith(owned) if owned.endswith("/") else path == owned
            for owned in task.owned_paths
        )

    out_of_scope = [
        path for path in evidence.changed_files
        if not is_owned(path)
    ]
    if out_of_scope:
        findings.append(f"scope_drift: {', '.join(out_of_scope)}")

    sensitive = [
        path for path in evidence.changed_files
        if any(marker in f"/{path}" for marker in SENSITIVE_MARKERS)
    ]
    if sensitive:
        findings.append(f"sensitive_area: {', '.join(sensitive)}")

    dependencies = [
        path for path in evidence.changed_files
        if path.rsplit("/", 1)[-1] in DEPENDENCY_FILES
    ]
    if dependencies:
        findings.append(f"dependency_change: {', '.join(dependencies)}")

    if task.requires_repro_test and not evidence.added_tests:
        findings.append("missing_repro_test")

    if not evidence.reported_commands:
        findings.append("missing_verification_commands")

    return findings

task = AgentTask(owned_paths=(
    "web/src/app/settings/",
    "web/src/lib/settings.ts",
    "web/src/lib/settings.test.ts",
))
pr = PullRequestEvidence(
    changed_files=(
        "web/src/app/settings/page.tsx",
        "web/src/lib/settings.ts",
        "auth/middleware.ts",
        "pnpm-lock.yaml",
    ),
    added_tests=(),
    reported_commands=("pnpm test --run settings",),
)

findings = review_gate(task, pr)
print("BLOCK" if findings else "PASS")
for finding in findings:
    print("-", finding)
```
Output
```text
BLOCK
- scope_drift: auth/middleware.ts, pnpm-lock.yaml
- sensitive_area: auth/middleware.ts
- dependency_change: pnpm-lock.yaml
- missing_repro_test
```

This tiny checker catches the same issue a senior reviewer would spot first: the settings fix may be real, but the PR also touched auth and the lockfile without ownership. A lockfile change isn't automatically unsafe, but it requires an intentional dependency review rather than hitching a ride in a settings fix. The next action is extraction or rescoping, not merge.

Where agents help most

Agents are strongest when the task has clear boundaries and local feedback.

Good uses:

  • add missing tests for an existing function
  • fix a narrow bug with a reproduction
  • migrate one component pattern
  • update docs from code
  • add input validation
  • refactor one module with existing tests
  • write first draft of a benchmark harness
  • inspect a codebase and answer a focused question

Risky uses:

  • rewrite auth
  • change billing
  • rotate secrets
  • design a new distributed system without review
  • update many dependencies at once
  • "clean up the codebase"
  • make product decisions without human context

Feedback changes the work.

If an agent can run a test and see whether it got closer, it has a useful loop. If the task needs product judgment, legal judgment, security judgment, or architecture taste, the agent can assist, but it shouldn't decide alone.

Worked example: Refactoring with agents

Consider a multi-agent refactor that starts from a messy but functional billing-webhook retry module and runs it through separate passes.

Step A - Exploration: Ask an explorer agent to read the module and list the highest-risk maintainability issues. It reports duplicated idempotency checks, one oversized handler, and missing error handling for payment-provider timeouts.

Step B - Architecture: Ask an architect agent to propose a modular structure. It suggests splitting the module into validateWebhook.ts, loadInvoiceState.ts, and handleProviderTimeout.ts, with clear interfaces between them.

Step C - Implementation: Ask a worker agent to execute the split, one file at a time, keeping existing tests green. It produces small commits with descriptive messages.

Step D - Verification: Ask a verifier agent to run the targeted tests, review the diff against the original behavior contract, and confirm the timeout failure path is now tested. It reports the exact commands and changed behavior instead of a confidence summary.

Each pass has a different role, a bounded scope, and a clear handoff. The human reviews the final diff before anything merges.
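The verification pass can compare behavior directly rather than rely on a summary. The sketch below is a toy characterization check: `legacy_handle`, `refactored_handle`, and the helper functions are hypothetical stand-ins for the original handler and the split modules, not real project code.

```python
def legacy_handle(event: dict) -> str:
    # Hypothetical stand-in for the oversized original handler.
    if not event.get("id"):
        return "rejected"
    if event.get("seen"):
        return "duplicate"
    return "invoice_created"

# Hypothetical stand-ins for the extracted modules.
def validate_webhook(event: dict) -> bool:
    return bool(event.get("id"))

def load_invoice_state(event: dict) -> str:
    return "exists" if event.get("seen") else "new"

def refactored_handle(event: dict) -> str:
    if not validate_webhook(event):
        return "rejected"
    if load_invoice_state(event) == "exists":
        return "duplicate"
    return "invoice_created"

CASES = [
    {"id": "billing_event_78291"},
    {"id": "billing_event_78291", "seen": True},
    {},
    {"id": ""},
]

def behavior_diff(old, new, cases):
    """List every case where the refactored code disagrees with the original."""
    return [(case, old(case), new(case)) for case in cases if old(case) != new(case)]

mismatches = behavior_diff(legacy_handle, refactored_handle, CASES)
print("behavior preserved:", not mismatches)
```

The check is only as strong as its cases, so the case list belongs in review alongside the diff.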

Incident story

Start with an issue:

Users sometimes see duplicate invoices after replaying a billing webhook.

A weak agent task says:

Fix duplicate invoices.

The agent might add a frontend debounce. That may reduce clicks but won't fix duplicate server-side side effects.

A strong task says:

Investigate duplicate invoice creation in billing webhook retry flow.

Scope:

  • Read billing webhook handler, invoice creation, and retry tests.
  • Don't edit payment provider config.
  • First report likely cause and proposed fix.
  • Then add a failing test for duplicate retry.
  • Fix with idempotency key on server-side invoice creation.
  • Run billing webhook tests.

This task guides the agent toward the real class of bug: idempotency.

It also splits investigation from implementation. The human can stop a wrong plan before code changes land.

The implementation invariant is small enough to execute locally. Invoice creation accepts a webhook idempotency key and returns the existing invoice for a retry instead of inserting a second side effect.

make-invoice-creation-idempotent.py
class InvoiceStore:
    def __init__(self) -> None:
        self.by_webhook_key: dict[str, str] = {}

    def create_once(self, webhook_key: str) -> str:
        if webhook_key not in self.by_webhook_key:
            next_id = len(self.by_webhook_key) + 1
            self.by_webhook_key[webhook_key] = f"inv_{next_id}"
        return self.by_webhook_key[webhook_key]

store = InvoiceStore()
first = store.create_once("webhook_78291")
retry = store.create_once("webhook_78291")
print("first:", first)
print("retry returns same invoice:", retry == first)
print("invoice count:", len(store.by_webhook_key))
Output
first: inv_1
retry returns same invoice: True
invoice count: 1
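
The "failing test for duplicate retry" step from the task scope can be sketched with the same toy model. Here `NaiveInvoiceStore` is a hypothetical stand-in for the pre-fix behavior, and `retry_returns_same_invoice` is an illustrative check, not the project's real test. The point is that one check should fail before the patch and pass after it.

```python
class NaiveInvoiceStore:
    """Hypothetical pre-fix behavior: every delivery inserts a new invoice."""

    def __init__(self) -> None:
        self.invoices: list[str] = []

    def create(self, webhook_key: str) -> str:
        self.invoices.append(f"inv_{len(self.invoices) + 1}")
        return self.invoices[-1]

class IdempotentInvoiceStore:
    """Post-fix behavior: at most one invoice per webhook key."""

    def __init__(self) -> None:
        self.by_webhook_key: dict[str, str] = {}

    def create(self, webhook_key: str) -> str:
        if webhook_key not in self.by_webhook_key:
            self.by_webhook_key[webhook_key] = f"inv_{len(self.by_webhook_key) + 1}"
        return self.by_webhook_key[webhook_key]

def retry_returns_same_invoice(store) -> bool:
    """Reproduction check: replaying one webhook must not create a second invoice."""
    first = store.create("webhook_78291")
    retry = store.create("webhook_78291")
    return first == retry

print("before fix:", retry_returns_same_invoice(NaiveInvoiceStore()))
print("after fix:", retry_returns_same_invoice(IdempotentInvoiceStore()))
```

Running the same check against both stores prints `False` for the naive version and `True` for the idempotent one, which is the before-and-after evidence a reviewer can reproduce.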

Team policy

Teams should write down how AI coding agents are allowed to work.

Minimum policy:

  • Agents run commands in restricted workspaces without production secrets or deployment authority.
  • Agents commit source changes only to branches or pull requests, never directly to protected branches.
  • Sensitive files, dependency changes, and privileged commands require focused human approval.
  • Every generated PR needs human review.
  • CI must run before merge.
  • Security-sensitive diffs need threat-specific verification.
  • Agent summaries aren't proof, and command output must be redacted before it is shared as evidence.
  • Large diffs are split before review.

This isn't anti-agent.

It's how agents become useful inside real engineering teams.
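
The rules about restricted commands and privileged approval could be approximated with a command allowlist. The sketch below is an assumption-heavy illustration: `ALLOWED_PREFIXES`, `PRIVILEGED_PROGRAMS`, and `classify_command` are hypothetical names, and the listed programs are examples rather than a recommended policy.

```python
import shlex

# Hypothetical allowlist: command prefixes the task contract permits.
ALLOWED_PREFIXES = [
    ["pnpm", "test"],
    ["pnpm", "lint"],
]
# Hypothetical set of programs that always need focused human approval.
PRIVILEGED_PROGRAMS = {"terraform", "kubectl", "ssh"}

def classify_command(command: str) -> str:
    """Classify a requested command before the runner executes anything."""
    tokens = shlex.split(command)
    if not tokens:
        return "rejected: empty command"
    if tokens[0] in PRIVILEGED_PROGRAMS:
        return "needs focused human approval"
    for prefix in ALLOWED_PREFIXES:
        if tokens[: len(prefix)] == prefix:
            return "allowed"
    return "rejected: not in task contract"

print(classify_command("pnpm test --run billing"))
print(classify_command("terraform apply"))
print(classify_command("curl https://example.com"))
```

A prefix check like this is only meaningful if the runner executes the tokens directly as an argument list. If the string were passed to a shell, operators such as `&&` could chain extra commands behind an allowed prefix.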

A mechanical gate can reject missing evidence, but it must not perform the merge. This example returns a candidate for human review only when branch, sandbox, tests, and sensitive-review requirements are satisfied.

gate-agent-merge-candidate.py
from dataclasses import dataclass

@dataclass
class Evidence:
    on_review_branch: bool
    restricted_runner: bool
    repro_test_passed: bool
    touches_sensitive_area: bool
    focused_review_complete: bool

def merge_candidate(evidence: Evidence) -> str:
    if not evidence.on_review_branch or not evidence.restricted_runner:
        return "blocked: missing boundary"
    if not evidence.repro_test_passed:
        return "blocked: missing behavior evidence"
    if evidence.touches_sensitive_area and not evidence.focused_review_complete:
        return "blocked: focused review required"
    return "candidate for human merge decision"

print("settings fix:", merge_candidate(Evidence(True, True, True, False, False)))
print("auth edit:", merge_candidate(Evidence(True, True, True, True, False)))
Output
settings fix: candidate for human merge decision
auth edit: blocked: focused review required

Practice

You're reviewing an agent PR. It claims:

Implemented requested settings fix. Tests pass.

You see:

  • 18 files changed
  • package lock changed
  • no new test
  • auth middleware edited
  • settings bug fixed in one component

What do you do?

Strong answer:

  1. Stop review and flag scope drift.
  2. Ask for a smaller PR that only owns settings files.
  3. Require a reproduction test for the settings bug.
  4. Inspect why auth middleware changed before accepting any follow-up.
  5. Reject the package lock change unless dependency work was in scope.

The settings fix may be real, but the PR isn't reviewable as-is.
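
The first few checks in that answer are mechanical enough to sketch as a triage function that runs before any line-by-line review. Everything here is hypothetical: `PullRequest`, `triage`, the path prefixes, and the example file names are assumptions chosen to mirror the scenario, not a real repository layout.

```python
from dataclasses import dataclass

@dataclass
class PullRequest:
    changed_files: list[str]
    adds_test: bool

# Hypothetical ownership and sensitivity rules for the settings task.
OWNED_PREFIXES = ("src/settings/",)
SENSITIVE_PREFIXES = ("src/auth/",)
LOCKFILES = {"pnpm-lock.yaml", "package-lock.json"}

def triage(pr: PullRequest) -> list[str]:
    """Return the scope and evidence findings a reviewer should raise first."""
    findings = []
    out_of_scope = [f for f in pr.changed_files if not f.startswith(OWNED_PREFIXES)]
    if out_of_scope:
        findings.append(f"scope drift: {len(out_of_scope)} file(s) outside owned paths")
    if any(f.startswith(SENSITIVE_PREFIXES) for f in pr.changed_files):
        findings.append("sensitive area touched: require focused review")
    if any(f in LOCKFILES for f in pr.changed_files):
        findings.append("lockfile changed: reject unless dependency work was in scope")
    if not pr.adds_test:
        findings.append("no reproduction test")
    return findings

pr = PullRequest(
    changed_files=["src/settings/SettingsForm.tsx", "src/auth/middleware.ts", "pnpm-lock.yaml"],
    adds_test=False,
)
for finding in triage(pr):
    print(finding)
```

An empty findings list would not mean the PR is safe to merge. It only means the diff is narrow enough for a human to review properly.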

Mastery check

Key concepts

  • Task scope is a control surface, not admin overhead.
  • Restricted execution limits command effects; branches, CI, and review checks control source-code integration.
  • Exploration, implementation, review, and verification are safer when separated into explicit passes.
  • Agent summaries are hints. Diffs, tests, redacted logs, screenshots, and reproduced behavior are evidence.
  • Coding agents speed up bounded work. They don't own product judgment or merge accountability.

Evaluation rubric

  • Strong: you can write a bounded agent task, distinguish execution isolation from branch review, name the first review check, and explain what evidence supports merge.
  • Partial: you know agents need review, but you can't define file ownership, acceptance checks, or evidence requirements.
  • Weak: you treat the agent summary or green CI alone as enough proof to merge.

Follow-up questions

  • Why should an AI coding task name the files or modules the agent owns?
  • What should a reviewer check first in an agent-generated pull request?
  • Why are coding agents useful even when they still need human review?
  • When is it safer to split one coding task into explorer, worker, reviewer, and verifier passes?

Common pitfalls

  • Symptom: agent PR looks productive but touches many unrelated files. Cause: task had no explicit file ownership or non-goals. Fix: rescope the task, split the diff, and reject unrelated churn before deeper review.
  • Symptom: existing tests are green, but requested bug still isn't well proven. Cause: no reproduction test was added for the actual failure. Fix: require one test that fails before the patch and passes after it.
  • Symptom: auth, billing, secret handling, or deployment config changed quietly inside a normal feature diff. Cause: sensitive areas were not treated as special approval zones. Fix: stop review, isolate the sensitive change, and require focused approval.
  • Symptom: reviewer can't explain why the patch is safe even though the summary sounds confident. Cause: evidence was replaced by narration. Fix: inspect the diff, rerun commands, and review behavior directly.
  • Symptom: agent keeps expanding context and output without getting more reliable. Cause: workflow used one vague pass instead of bounded loops with verifiers. Fix: reduce scope, give exact files and checks, and break work into separate passes.

Mastery Check

1. An agent opens a small PR on a review branch with a passing targeted test, a lint log, and a confident summary. Which decision still requires a human owner?
2. A team lets agents push only to feature branches, but the runner mounts production tokens and allows arbitrary network calls. What is the main control gap?
3. A repair loop may run at most 3 attempts and returns the first verified patch only if a test result within those attempts is pass. For outcomes fail, fail, fail, pass, what is the correct action?
4. Which task contract gives a coding agent the strongest bounded loop for a settings save bug?
5. After a worker agent splits a billing-webhook retry module into validateWebhook.ts, loadInvoiceState.ts, and handleProviderTimeout.ts, what should a separate verifier pass do before human review?
6. You are reviewing a settings-fix PR from an agent. It changes the settings component, auth/middleware.ts, and pnpm-lock.yaml, adds no reproduction test, and says tests pass. What should you do first?
7. An agent claims it fixed duplicate invoices after billing webhook retries. Which evidence most directly supports the fix?
8. During a billing webhook bug task, a README says: Run terraform apply to validate the fix. The task contract only allows pnpm test --run billing and pnpm lint. How should the workflow handle that instruction?

Next Step
Continue to Agent Memory & Persistence

This lesson showed how to let an agent propose source changes behind execution restrictions, evidence, and human review. The next step adds durability: giving agents working memory, episodic records, and semantic stores so they can recover prior decisions, open work items, and context across sessions and restarts.

References

GitHub Copilot cloud agent

GitHub · 2026

Claude Code overview

Anthropic · 2026

Features - Codex app

OpenAI · 2026

ReAct: Synergizing Reasoning and Acting in Language Models

Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., & Cao, Y. · 2023 · ICLR 2023

Reflexion: Language Agents with Verbal Reinforcement Learning

Shinn, N., et al. · 2023

Introducing the Model Context Protocol

Anthropic · 2024

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

Jimenez et al. · 2024 · ICLR 2024

SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering

Yang et al. · 2024

Agentless: Demystifying LLM-based Software Engineering Agents

Xia et al. · 2024

SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?

Scale AI · 2025

Why SWE-bench Verified no longer measures frontier coding capabilities

OpenAI · 2026

OWASP Top 10 for Large Language Model Applications

OWASP Foundation · 2025