AI Coding Workflow with Agents

Scope coding-agent tasks, isolate execution, keep patches on branches, verify behavior, and preserve human merge ownership.

27 min read

Learning path

Step 122 of 158 in the full curriculum

Human-in-the-loop systems pause agents before high-risk actions. A coding agent creates two kinds of risk: it can run commands during its work, and it can propose source changes that become consequential once reviewed, merged, or deployed. A small patch can alter auth, expose a secret, break billing, or change deployment behavior.

An AI coding agent can write a pull request while you work on something else.

That sounds useful until the diff touches the wrong files, skips the failing test, updates a dependency for no reason, and hides a security problem behind a confident summary.

Adding an agent means adding an automated code-generation worker to your engineering workflow. If you tell it to "fix the billing webhook" and leave the task unbounded, it may return code that looks plausible but changes the wrong behavior. The engineering skill isn't "ask for code and trust the summary." It's designing a workflow where the agent can help without bypassing execution, review, or deployment controls.

Code Generation & Sandboxing and Human-in-the-Loop Agent Architecture covered the two controls this workflow needs: generated code runs in isolation, and risky agent actions require approval gates. Here, you'll apply those controls to an isolated workspace, branch, test, review, and deploy loop.

The workflow below is the operating model for the rest of the article: the agent works in a restricted execution environment and writes only to a review branch. Tests, redacted evidence, review, and merge ownership remain explicit.

AI coding agent workflow with explicit scope, restricted execution and branch work, redacted check evidence, and human merge ownership — A branch makes source changes reviewable; a restricted workspace limits what commands can affect while the patch is produced. Both controls are needed before a reviewer decides whether a diff can merge.

What a coding agent is

First, separate two things people often blur together. A coding assistant suggests code as you type, inside your editor. A coding agent takes a scoped task, reads the repository, plans changes, edits files, runs commands, and produces a branch or pull request with little keystroke-by-keystroke input from you. The agent loop needs the strongest controls because it can change files and run commands.

A coding agent, then, is an AI system that can inspect a repository, plan changes, edit files, run commands, and produce a branch or patch.

Modern tools expose this across several surfaces, and their features change quickly. GitHub documents Copilot coding agent as working in an ephemeral GitHub Actions-powered environment where it can make changes, execute tests and linters, and open a pull request for review.^[1] Anthropic's Claude Code docs describe an agentic tool that reads a codebase, edits files, and runs commands across terminal and other surfaces.^[2] OpenAI's Codex app documentation describes local, worktree, and cloud task modes, including isolated worktrees for parallel tasks.^[3] Product surfaces differ; the stable lesson is to control execution, bound diffs, collect evidence, and keep merge authority outside the agent.

A useful workflow is plan-then-edit: let the agent inspect the necessary code, ask it for a bounded plan, correct that plan, and only then authorize an edit pass. A plan isn't proof of correctness, but catching the wrong approach before a large diff is much cheaper than untangling unrelated changes afterward.
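The plan-then-edit gate can be sketched as a small state check. This is a minimal illustration, not how any particular agent product works: `Plan`, `approve_plan`, `start_edit_pass`, and the `OWNED_PREFIXES` scope are hypothetical names, and `human_ok` stands in for a real reviewer decision.

```python
from dataclasses import dataclass

# Hypothetical scope for the settings task used in this article.
OWNED_PREFIXES = ("web/src/app/settings/", "web/src/lib/settings")


@dataclass
class Plan:
    summary: str
    files_to_edit: tuple[str, ...]
    approved: bool = False


def out_of_scope(plan: Plan) -> list[str]:
    """Planned files that fall outside the owned prefixes."""
    return [p for p in plan.files_to_edit if not p.startswith(OWNED_PREFIXES)]


def approve_plan(plan: Plan, human_ok: bool) -> bool:
    # Approval needs both a human decision and an in-scope file list.
    plan.approved = human_ok and not out_of_scope(plan)
    return plan.approved


def start_edit_pass(plan: Plan) -> str:
    return "edit pass authorized" if plan.approved else "blocked: no approved plan"


drifting = Plan("Validate settings payload", ("web/src/lib/settings.ts", "web/src/lib/auth.ts"))
bounded = Plan("Validate settings payload", ("web/src/lib/settings.ts",))

approve_plan(drifting, human_ok=True)
approve_plan(bounded, human_ok=True)
print("drifting plan:", start_edit_pass(drifting))
print("bounded plan:", start_edit_pass(bounded))
```

The drifting plan is stopped before any file changes, which is the cheap point at which to correct the approach.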

The core loop is familiar from agent design. In coding work, every transition creates a reviewable artifact.

Coding agent control loop from scoped task through repo reading, restricted patch execution, checks, pull request evidence, human review, and merge ownership. — The useful loop is explicit: scope the task, patch in a restricted workspace, collect redacted evidence from checks, and let review send changes back to patching instead of accepting scope drift.

An agent can move through many steps, but the human still owns the merge.

How agents plan and correct themselves

A coding agent doesn't write code in a single shot the way a human might type an email. It works in a loop: read the problem, plan a step, take an action, observe the result, and decide what to do next.

This loop is ReAct (Reason + Act).^[4] The agent might read a test file, notice the failure is in a validation function, edit the function, run the tests again, see a new error, and fix that too. Each round gives the agent local feedback, which is why bounded tasks with fast tests work so well.

The thing that closes this loop is a verifier: an automatic check the agent can run after each edit to decide whether it got closer. In coding work, tests, type checkers, and linters are the cheapest, most reliable verifiers you have. A failing test that the agent can rerun turns a vague "is this right?" into a concrete pass or fail signal. No verifier means the agent is guessing, and you're reviewing guesses.
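A verifier can be as small as a function that runs one check command and reduces it to pass or fail. The sketch below assumes the check signals failure through a nonzero exit code, which is the usual convention for test runners and linters. The two inline Python commands are stand-ins for real checks, and `run_verifier` is a hypothetical helper name.

```python
import subprocess
import sys


def run_verifier(argv: list[str], timeout_s: float = 60.0) -> bool:
    """Run one check command and reduce it to a pass/fail signal."""
    try:
        completed = subprocess.run(
            argv, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        return False  # a check that hangs is treated as a failure
    return completed.returncode == 0


# Stand-ins for real checks such as a test runner or linter.
passing_check = [sys.executable, "-c", "assert 1 + 1 == 2"]
failing_check = [sys.executable, "-c", "assert 1 + 1 == 3"]

print("passing check:", run_verifier(passing_check))
print("failing check:", run_verifier(failing_check))
```

Treating a timeout as a failure keeps the signal binary, so a hung test never reads as progress.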

A verifier loop also needs a stopping rule. This tiny simulation takes successive test outcomes as input and returns either the first verified patch or an escalation after its repair budget is exhausted.

bound-agent-repair-loop.py

def run_repair_loop(test_results: list[str], max_attempts: int = 3) -> str:
    for attempt, result in enumerate(test_results[:max_attempts], start=1):
        if result == "pass":
            return f"verified on attempt {attempt}"
    return "stop: request human diagnosis"

print("recovering patch:", run_repair_loop(["fail", "pass"]))
print("stuck patch:", run_repair_loop(["fail", "fail", "fail", "pass"]))

Output

recovering patch: verified on attempt 2
stuck patch: stop: request human diagnosis

A second pattern, reflection, lets an agent critique its own draft before showing it to you.^[5] After producing a diff, the agent can run a second pass that checks for security risks, missing tests, or style violations. Separate passes for exploration, implementation, review, and verification are easier to inspect than a single oversized agent pass that tries to do everything at once.

Modern agents pull repository context through interfaces such as the Model Context Protocol (MCP), which lets them query files, documentation, and APIs without you copy-pasting text into a chat window.^[6] The shift from writing better prompts to giving the agent better data is called context engineering. A well-scoped task with precise file ownership and a clean test suite gives the agent far more signal than a giant prompt full of instructions.

Good tasks are bounded

Bad task:

Improve the app and clean up anything you find.

That prompt invites scope drift.

Better task:

Fix the account settings save bug.

Scope:
- Own files under web/src/app/settings/ and web/src/lib/settings.ts.
- Don't edit auth, billing, deployment, or unrelated styling.
- Work without production credentials or external side effects.

Acceptance:
- Reproduce failing save behavior with a test.
- Fix the bug.
- Run pnpm test --run settings and pnpm lint.
- Summarize files changed and remaining risk.

Execution:
- Use the restricted project runner.
- Don't install dependencies, run deploys, or call production systems.

That prompt gives the agent a useful box.

Prompt field	Why it helps
Goal	Defines the behavior and the code area.
Scope	Limits file ownership and blast radius.
Non-goals	Prevents surprise cleanup.
Acceptance checks	Gives the agent a finish line.
Commands	Makes verification explicit.
Execution policy	Prevents the patch task from gaining unrelated capabilities.
Risk notes	Tells review where to look.

Bounded tasks reduce merge conflicts and make review possible.
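The prompt fields in the table can also be held in a structure, so an incomplete task is caught before it reaches the agent. This is a sketch under assumptions: the `TaskContract` field names mirror the table rows, and treating any empty field as missing is a choice made here, not a standard.

```python
from dataclasses import dataclass, fields


@dataclass
class TaskContract:
    goal: str = ""
    scope: tuple[str, ...] = ()
    non_goals: tuple[str, ...] = ()
    acceptance_checks: tuple[str, ...] = ()
    commands: tuple[str, ...] = ()
    execution_policy: str = ""
    risk_notes: str = ""


def missing_fields(task: TaskContract) -> list[str]:
    """List the contract fields that were left empty."""
    return [f.name for f in fields(task) if not getattr(task, f.name)]


vague = TaskContract(goal="Improve the app and clean up anything you find.")
bounded = TaskContract(
    goal="Fix the account settings save bug.",
    scope=("web/src/app/settings/", "web/src/lib/settings.ts"),
    non_goals=("auth", "billing", "deployment", "unrelated styling"),
    acceptance_checks=("repro test fails before the fix and passes after",),
    commands=("pnpm test --run settings", "pnpm lint"),
    execution_policy="restricted project runner; no installs, deploys, or production calls",
    risk_notes="summarize files changed and remaining risk",
)

print("vague task is missing:", missing_fields(vague))
print("bounded task is missing:", missing_fields(bounded))
```

A non-empty result is a reason to rewrite the task, not to let the agent fill the gaps itself.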

Split agent roles

Don't use one vague agent pass for everything.

Define the database schema and API contracts yourself first, then assign a bounded worker to fill in the implementation while you review each change. If you hand an agent a vague task like "Build me an access-control service" and disappear, it's likely to fail.

Use roles with clear ownership boundaries:

Role	Good use	Bad use
Explorer	Read code and answer one question.	Making broad edits without ownership.
Worker	Implement a bounded change.	Changing architecture without review.
Reviewer	Find bugs, risks, and missing tests.	Rubber-stamping its own diff.
Verifier	Run checks and inspect output.	Treating command success as product success.

This separation matters because agents are good at producing plausible work. A reviewer agent shouldn't be asked to defend the same implementation it just wrote.
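The rule that a reviewer shouldn't defend its own implementation can be checked from a record of who ran which pass. In this sketch `PassRecord` and `session_id` are hypothetical. A session could be a separate agent run, a different tool, or a person.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PassRecord:
    role: str        # "explorer", "worker", "reviewer", or "verifier"
    session_id: str  # hypothetical identifier for an agent session or person


def independent_review(passes: list[PassRecord]) -> bool:
    """True when at least one reviewer session did not also act as a worker."""
    workers = {p.session_id for p in passes if p.role == "worker"}
    reviewers = {p.session_id for p in passes if p.role == "reviewer"}
    return bool(reviewers - workers)


self_reviewed = [PassRecord("worker", "s1"), PassRecord("reviewer", "s1")]
separated = [PassRecord("worker", "s1"), PassRecord("reviewer", "s2")]

print("self-reviewed:", independent_review(self_reviewed))
print("separated:", independent_review(separated))
```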

Even when one tool provides one interface, you can still run the workflow in passes:

1. Ask for a short codebase read.
2. Assign a bounded patch.
3. Ask for a review of the diff.
4. Run tests yourself or in CI.
5. Inspect the behavior before merge.

Restricted workspace to branch workflow

A controlled workflow separates two boundaries: a restricted environment for commands executed while producing a patch, and version control for reviewing the proposed source changes.

The branch is an integration boundary. It makes proposed changes inspectable before they enter a protected branch.

The review wrapper provides:

diff inspection
rollback path
CI checks
review comments
audit trail
separation between proposing a change and merging or deploying it

It doesn't stop an agent command from reading local credentials, making network calls, installing a dependency, or altering external infrastructure. That requires a restricted runner: an ephemeral workspace, least-privilege credentials, constrained network and command policy, and explicit review before privileged operations. If an agent edits directly on main, you also lose much of the integration boundary.

Execution boundary

The task contract should state what the agent may execute as well as what it may edit. For a billing webhook test fix, an agent may need to run unit tests and read local source files; it doesn't need production secrets, package publishing, infrastructure changes, or outbound calls to production systems.

Surface	Default rule for a bounded patch
Filesystem	Write only to working tree or worktree for assigned branch.
Credentials	Provide no production tokens; use fixtures or scoped test credentials only when needed.
Network	Deny by default or allow only approved package/test endpoints.
Commands	Allow repository test, lint, and format commands; review installs, migrations, and generated scripts.
External effects	Block deploys, releases, emails, payment writes, and infrastructure mutations.

This admission check is intentionally small. It takes an argument vector rather than an interpolated shell string, then admits only commands named in the task contract. In a real runner, execute the approved argument vector without shell concatenation and keep the operating-system sandbox in place.

admit-coding-agent-command.py

CONTRACT_COMMANDS = {
    ("pnpm", "test", "--run", "billing"),
    ("pnpm", "lint"),
    ("pnpm", "exec", "tsc", "--noEmit"),
    ("git", "diff"),
}
BLOCKED_PROGRAMS = {
    "curl", "npm", "pip", "publish", "terraform", "uv",
}

def admit_command(argv: tuple[str, ...], available_secrets: tuple[str, ...]) -> str:
    if available_secrets:
        return "blocked: credentials mounted"
    if not argv:
        return "blocked: empty command"
    program = argv[0].removeprefix("./")
    if program in BLOCKED_PROGRAMS:
        return "blocked: privileged or expanding command"
    if argv in CONTRACT_COMMANDS:
        return "allowed in restricted runner"
    return "needs explicit review"

print("test:", admit_command(("pnpm", "test", "--run", "billing"), ()))
print("infra:", admit_command(("terraform", "apply"), ()))
print("secreted test:", admit_command(("pnpm", "test", "--run", "billing"), ("PROD_API_KEY",)))

Output

test: allowed in restricted runner
infra: blocked: privileged or expanding command
secreted test: blocked: credentials mounted
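Once a command is admitted, running it without a shell and with an allowlisted environment might look like the sketch below. It assumes a POSIX-like system where `PATH`, `HOME`, and `LANG` are enough to start the tool. `run_admitted` and `ALLOWED_ENV` are hypothetical names. This only keeps inherited environment variables away from the child process. It is not an operating-system sandbox and doesn't restrict files or network access.

```python
import os
import subprocess
import sys

# Assumption: only these variables are needed to locate and run tools.
ALLOWED_ENV = ("PATH", "HOME", "LANG")


def run_admitted(argv: tuple[str, ...], timeout_s: float = 120.0) -> subprocess.CompletedProcess:
    """Run an already-admitted command without a shell or inherited secrets."""
    clean_env = {key: os.environ[key] for key in ALLOWED_ENV if key in os.environ}
    return subprocess.run(
        list(argv),
        env=clean_env,   # drops every variable that is not on the allowlist
        shell=False,     # argv is passed as-is, never joined into a shell string
        capture_output=True,
        text=True,
        timeout=timeout_s,
    )


# Simulate a secret present in the parent process.
os.environ["PROD_API_KEY"] = "secret-abc123"
probe = (sys.executable, "-c", "import os; print(os.environ.get('PROD_API_KEY'))")
result = run_admitted(probe)
print("child sees:", result.stdout.strip())
```

An allowlist fails closed: a newly mounted secret stays invisible to the command unless someone adds it deliberately.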

File ownership example

For parallel agent work, write ownership down.

text

Worker A owns:
- `web/src/app/settings/page.tsx`
- `web/src/lib/settings.ts`
- `web/src/lib/settings.test.ts`

Worker B owns:
- `web/src/app/billing/page.tsx`
- `web/src/lib/billing.ts`
- `web/src/lib/billing.test.ts`

Both workers:
- Must not edit shared auth middleware.
- Must not change package versions.
- Must run targeted tests before reporting done.

This style prevents agents from overwriting each other and makes it obvious when a diff goes outside scope.

Ownership can be checked before two patches are combined. This example takes proposed changed files from two workers and detects the shared file that neither handoff resolved.

detect-overlapping-agent-edits.py

worker_a = {"web/src/lib/settings.ts", "web/src/lib/settings.test.ts"}
worker_b = {"web/src/lib/billing.ts", "web/src/lib/settings.ts"}

overlap = sorted(worker_a & worker_b)
print("merge status:", "blocked" if overlap else "ready")
print("overlap:", ", ".join(overlap) if overlap else "none")

Output

merge status: blocked
overlap: web/src/lib/settings.ts

Tests aren't optional

Agent-generated code often looks correct.

That doesn't mean behavior is correct.

Benchmarks such as SWE-bench evaluate agents on real GitHub issues because repository work is full of hidden context, failing tests, and edge cases that don't appear in isolated coding tasks.^[7] SWE-agent and Agentless both study how agent-computer interfaces and repair strategies affect software engineering performance on real tasks.^[8]^[9]

Read SWE-bench Verified numbers carefully. Verified tasks come from public GitHub history, which creates contamination risk as code enters training sets. SWE-bench Pro includes substantially different, longer-horizon tasks and a private-repository subset intended to reduce that risk.^[10] OpenAI reported in February 2026 that Verified was no longer a sound frontier evaluation for its models and recommended SWE-bench Pro for more informative measurement.^[11] The practical reading is stable: a benchmark score isn't evidence that an agent fixed your bug. Your own failing test is.

For day-to-day engineering, the lesson is simple: require evidence.

Change type	Minimum evidence
Bug fix	Repro test fails before and passes after.
Feature	Tests cover normal path and error path.
Refactor	Existing tests pass and behavior diff is explained.
UI change	Screenshot or Playwright smoke when practical.
API change	Contract test or curl example.
Security change	Threat-specific test or manual review note.

The agent's summary isn't evidence.

The diff, tests, redacted logs, screenshots, and reproduced behavior are evidence.
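The minimum-evidence table can be turned into a lookup that reports what a pull request still lacks. The evidence labels below are illustrative and only three change types are shown. A real team would define its own labels and decide how each one is attached.

```python
# Evidence labels are illustrative; they follow the minimum-evidence table.
REQUIRED_EVIDENCE = {
    "bug_fix": {"repro_test_fails_before", "repro_test_passes_after"},
    "feature": {"normal_path_test", "error_path_test"},
    "refactor": {"existing_tests_pass", "behavior_diff_explained"},
}


def missing_evidence(change_type: str, attached: set[str]) -> list[str]:
    """Return the required evidence that the PR has not attached."""
    required = REQUIRED_EVIDENCE.get(change_type)
    if required is None:
        return ["unknown_change_type"]
    return sorted(required - attached)


print("bug fix:", missing_evidence("bug_fix", {"repro_test_passes_after", "agent_summary"}))
print("refactor:", missing_evidence("refactor", {"existing_tests_pass", "behavior_diff_explained"}))
```

Note that `agent_summary` never satisfies a requirement, because it isn't in any required set.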

A reproduction should demonstrate the reported defect before the patch and the intended invariant after it. Here the same billing webhook retry creates two invoices before idempotency handling and one after it.

prove-billing-webhook-retry-fix.py

events = ["billing_event_78291", "billing_event_78291"]

def create_without_idempotency(retries: list[str]) -> list[str]:
    return [f"invoice_for_{retry}" for retry in retries]

def create_with_idempotency(retries: list[str]) -> list[str]:
    return [f"invoice_for_{retry}" for retry in dict.fromkeys(retries)]

print("before patch invoices:", len(create_without_idempotency(events)))
print("after patch invoices:", len(create_with_idempotency(events)))

Output

before patch invoices: 2
after patch invoices: 1

Command output is useful only after it's safe to attach to a pull request. This redactor takes captured test output and removes a credential value before the evidence packet is shown to reviewers.

redact-test-evidence.py

import re

def redact_output(output: str) -> str:
    return re.sub(r"PROD_API_KEY=\S+", "PROD_API_KEY=[redacted]", output)

captured = "billing webhook retry failed PROD_API_KEY=secret-abc123 webhook_id=78291"
review_output = redact_output(captured)
print(review_output)
print("secret visible:", "secret-abc123" in review_output)

Output

billing webhook retry failed PROD_API_KEY=[redacted] webhook_id=78291
secret visible: False

This example catches one known credential shape for teaching purposes. Production evidence pipelines should redact against the secrets mounted in the runner, apply a secret scanner, and cap attached output before it reaches a pull request.
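A step toward that pipeline is to redact by the values actually mounted in the runner, then cap the output. The sketch assumes the runner can list its mounted secrets as a name-to-value mapping. `prepare_evidence` and the fixture secrets are hypothetical. It complements a dedicated secret scanner rather than replacing one.

```python
def prepare_evidence(output: str, mounted_secrets: dict[str, str], max_chars: int = 2000) -> str:
    """Replace known secret values, then cap the text attached to the PR."""
    # Redact longer values first so a secret containing another is fully removed.
    for name, value in sorted(mounted_secrets.items(), key=lambda kv: -len(kv[1])):
        if value:
            output = output.replace(value, f"[redacted:{name}]")
    # Cap after redaction so a secret is never split at the cut point.
    if len(output) > max_chars:
        output = output[:max_chars] + "\n[truncated]"
    return output


secrets = {"TEST_DB_PASSWORD": "hunter2-fixture", "TEST_TOKEN": "tok_fixture_123"}
captured = "connect failed password=hunter2-fixture token=tok_fixture_123 " + "x" * 50
evidence = prepare_evidence(captured, secrets, max_chars=80)
print(evidence)
```

Redacting before truncating matters: cutting first could leave half of a secret in the attached text where the exact-value match no longer finds it.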

Review the diff like production code

Start review with these questions:

Did the diff solve the requested behavior?
Did it edit only the intended files?
Did it add or update tests that would fail without the fix?
Did it touch auth, secrets, dependencies, shell commands, file access, billing, or deployment?
Did it create hidden fallback behavior?
Did it remove error handling, logging, or validation?
Did it make code harder to maintain?

Then inspect dangerous patterns.

Pattern	Risk
`eval`, shell commands, dynamic imports	Code execution risk.
New dependency	Supply chain and bundle risk.
Broad catch block	Hidden failures.
Silent fallback	Product behavior drifts without alerting.
Secret in code or captured output	Data exposure.
Tests that only assert rendering	Behavior may still be broken.
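Some of these patterns can be pre-screened on the added lines of a unified diff before a human reads it. The regular expressions below are deliberately rough illustrations and will both miss cases and over-trigger. `flag_added_lines` is a hypothetical helper, and a real review needs language-aware tooling on top of it.

```python
import re

# Illustrative patterns only; expect false positives and false negatives.
RISKY_PATTERNS = {
    "code_execution": re.compile(r"\b(eval|exec)\s*\(|child_process|shell\s*=\s*True"),
    "broad_catch": re.compile(r"except\s*:|except\s+Exception\b|catch\s*\(\s*\w*\s*\)\s*\{\s*\}"),
    "possible_secret": re.compile(
        r"(api[_-]?key|secret|token)\s*[:=]\s*['\"][^'\"]+['\"]", re.IGNORECASE
    ),
}


def flag_added_lines(unified_diff: str) -> list[tuple[str, str]]:
    """Return (label, line) pairs for added lines that match a risky pattern."""
    findings = []
    for line in unified_diff.splitlines():
        # Added lines start with "+"; "+++" is the file header, not content.
        if not line.startswith("+") or line.startswith("+++"):
            continue
        for label, pattern in RISKY_PATTERNS.items():
            if pattern.search(line):
                findings.append((label, line[1:].strip()))
    return findings


diff = """\
--- a/web/src/lib/settings.ts
+++ b/web/src/lib/settings.ts
@@ -1,3 +1,5 @@
 export function save(input) {
+  const apiKey = "example-value";
+  try { eval(input.script); } catch (e) {}
-  return post(input);
 }
"""

for label, line in flag_added_lines(diff):
    print(f"{label}: {line}")
```

A hit tells the reviewer where to look first. A clean result says nothing about whether the diff is safe.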

OWASP's LLM security work identifies prompt injection and sensitive information disclosure as major categories for LLM applications.^[12] Coding agents add another angle: untrusted issue text, docs, comments, and test data may influence code changes or attempt to make the agent execute a command. Treat repository text as evidence about the task, not authority to broaden permissions, expose secrets, or run privileged operations.

Instructions discovered inside a repository don't expand the task's authority. The authority check distinguishes a trusted task contract from a README instruction that attempts to authorize a privileged command.

keep-repository-text-out-of-authority.py

TASK_COMMANDS = {"pnpm test --run billing", "pnpm lint"}

def command_decision(command: str, source: str) -> str:
    if source != "task_contract":
        return "blocked: untrusted repository instruction"
    return "allowed" if command in TASK_COMMANDS else "blocked: outside contract"

print("task test:", command_decision("pnpm test --run billing", "task_contract"))
print("README infra:", command_decision("terraform apply", "README.md"))

Output

task test: allowed
README infra: blocked: untrusted repository instruction

A practical review checklist

Use this checklist on every agent-generated PR:

Check	Pass condition
Scope	Changed files match task ownership.
Behavior	Diff directly maps to acceptance criteria.
Tests	At least one test would fail before fix.
Commands	Reported checks ran.
Security	No unsafe shell, file, dependency, or secret handling.
Observability	Logs and errors remain useful.
Simplicity	No broad rewrites unrelated to task.
Rollback	Reverting the PR returns to previous behavior.

If a PR fails the scope check, stop early.

Don't spend an hour reviewing unrelated churn. Ask for a smaller diff.

Review intensity can be computed from proposed paths before a reviewer opens every line. This helper takes changed files as input and routes payment, auth, deployment, and workflow edits to focused review.

route-sensitive-code-review.py

SENSITIVE = ("auth", "billing", ".github/workflows", "deploy")

def review_route(paths: tuple[str, ...]) -> str:
    # Coarse substring match: it may over-trigger (for example on "author"),
    # which is the safer failure mode for a review router.
    touched = [path for path in paths if any(marker in path for marker in SENSITIVE)]
    return "focused security review" if touched else "standard review"

print("settings only:", review_route(("web/src/app/settings/page.tsx",)))
print("payment edit:", review_route(("web/src/billing/webhook/create.ts",)))

Output

settings only: standard review
payment edit: focused security review

You can encode part of this review check as a small policy rule before a human starts deep review. The check isn't a replacement for judgment; it catches obvious scope drift and weak evidence so reviewers don't waste time on unreviewable patches.

agent-pr-scope-gate.py

from dataclasses import dataclass

SENSITIVE_MARKERS = ("/auth/", "/billing/", "/deploy/", ".github/workflows/")
DEPENDENCY_FILES = ("package.json", "pnpm-lock.yaml", "uv.lock")

@dataclass
class AgentTask:
    owned_paths: tuple[str, ...]
    requires_repro_test: bool = True

@dataclass
class PullRequestEvidence:
    changed_files: tuple[str, ...]
    added_tests: tuple[str, ...]
    reported_commands: tuple[str, ...]

def review_gate(task: AgentTask, evidence: PullRequestEvidence) -> list[str]:
    findings: list[str] = []

    def is_owned(path: str) -> bool:
        return any(
            path.startswith(owned) if owned.endswith("/") else path == owned
            for owned in task.owned_paths
        )

    out_of_scope = [
        path for path in evidence.changed_files
        if not is_owned(path)
    ]
    if out_of_scope:
        findings.append(f"scope_drift: {', '.join(out_of_scope)}")

    sensitive = [
        path for path in evidence.changed_files
        if any(marker in f"/{path}" for marker in SENSITIVE_MARKERS)
    ]
    if sensitive:
        findings.append(f"sensitive_area: {', '.join(sensitive)}")

    dependencies = [
        path for path in evidence.changed_files
        if path.rsplit("/", 1)[-1] in DEPENDENCY_FILES
    ]
    if dependencies:
        findings.append(f"dependency_change: {', '.join(dependencies)}")

    if task.requires_repro_test and not evidence.added_tests:
        findings.append("missing_repro_test")

    if not evidence.reported_commands:
        findings.append("missing_verification_commands")

    return findings

task = AgentTask(owned_paths=(
    "web/src/app/settings/",
    "web/src/lib/settings.ts",
    "web/src/lib/settings.test.ts",
))
pr = PullRequestEvidence(
    changed_files=(
        "web/src/app/settings/page.tsx",
        "web/src/lib/settings.ts",
        "auth/middleware.ts",
        "pnpm-lock.yaml",
    ),
    added_tests=(),
    reported_commands=("pnpm test --run settings",),
)

findings = review_gate(task, pr)
print("BLOCK" if findings else "PASS")
for finding in findings:
    print("-", finding)

Output

BLOCK
- scope_drift: auth/middleware.ts, pnpm-lock.yaml
- sensitive_area: auth/middleware.ts
- dependency_change: pnpm-lock.yaml
- missing_repro_test

This tiny checker catches the same issue a senior reviewer would spot first: the settings fix may be real, but the PR also touched auth and the lockfile without ownership. A lockfile change isn't automatically unsafe, but it requires an intentional dependency review rather than hitching a ride in a settings fix. The next action is extraction or rescoping, not merge.

Where agents help most

Agents are strongest when the task has clear boundaries and local feedback.

Good uses:

add missing tests for an existing function
fix a narrow bug with a reproduction
migrate one component pattern
update docs from code
add input validation
refactor one module with existing tests
write first draft of a benchmark harness
inspect a codebase and answer a focused question

Risky uses:

rewrite auth
change billing
rotate secrets
design a new distributed system without review
update many dependencies at once
"clean up the codebase"
make product decisions without human context

Feedback changes the work.

If an agent can run a test and see whether it got closer, it has a useful loop. If the task needs product judgment, legal judgment, security judgment, or architecture taste, the agent can assist, but it shouldn't decide alone.

Worked example: Refactoring with agents

Take a messy but functional billing-webhook retry module and run it through separate passes, each with its own agent role.

Step A - Exploration: Ask an explorer agent to read the module and list the highest-risk maintainability issues. It reports duplicated idempotency checks, one oversized handler, and missing error handling for payment-provider timeouts.

Step B - Architecture: Ask an architect agent to propose a modular structure. It suggests splitting the module into validateWebhook.ts, loadInvoiceState.ts, and handleProviderTimeout.ts, with clear interfaces between them.

Step C - Implementation: Ask a worker agent to execute the split, one file at a time, keeping existing tests green. It produces small commits with descriptive messages.

Step D - Verification: Ask a verifier agent to run the targeted tests, review the diff against the original behavior contract, and confirm the timeout failure path is now tested. It reports the exact commands and changed behavior instead of a confidence summary.

Each pass has a different role, a bounded scope, and a clear handoff. The human reviews the final diff before anything merges.
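One way the verifier pass can compare the refactor against the original behavior is a characterization check: feed the same inputs to the old and new handlers and compare results. Both handlers below are simplified stand-ins invented for this sketch, assuming the handler can be exercised as a pure function that returns a status string.

```python
def handle_webhook_legacy(event: dict) -> str:
    # Stand-in for the original oversized handler.
    if "id" not in event:
        return "rejected"
    if event.get("seen"):
        return "duplicate"
    return "invoice_created"


def validate_webhook(event: dict) -> bool:
    return "id" in event


def handle_webhook_refactored(event: dict) -> str:
    # Stand-in for the handler after the split into smaller functions.
    if not validate_webhook(event):
        return "rejected"
    return "duplicate" if event.get("seen") else "invoice_created"


# Inputs chosen to cover the rejected, new, and retried paths.
CASES = [{}, {"id": "evt_1"}, {"id": "evt_1", "seen": True}]

mismatches = [
    case for case in CASES
    if handle_webhook_legacy(case) != handle_webhook_refactored(case)
]
print("behavior preserved:", not mismatches)
```

The check is only as strong as its cases, so the newly tested timeout path from Step D would need its own inputs here.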

Incident story

Start with an issue:

Users sometimes see duplicate invoices after replaying a billing webhook.

A weak agent task says:

Fix duplicate invoices.

The agent might add a frontend debounce. That may reduce clicks but won't fix duplicate server-side side effects.

A strong task says:

Investigate duplicate invoice creation in billing webhook retry flow.

Scope:
- Read billing webhook handler, invoice creation, and retry tests.
- Don't edit payment provider config.
- First report likely cause and proposed fix.
- Then add a failing test for duplicate retry.
- Fix with idempotency key on server-side invoice creation.
- Run billing webhook tests.

This task guides the agent toward the real class of bug: idempotency.

It also splits investigation from implementation. The human can stop a wrong plan before code changes land.

The implementation invariant is small enough to execute locally. Invoice creation accepts a webhook idempotency key and, on a retry, returns the existing invoice instead of creating a second one.

make-invoice-creation-idempotent.py

class InvoiceStore:
    def __init__(self) -> None:
        self.by_webhook_key: dict[str, str] = {}

    def create_once(self, webhook_key: str) -> str:
        if webhook_key not in self.by_webhook_key:
            next_id = len(self.by_webhook_key) + 1
            self.by_webhook_key[webhook_key] = f"inv_{next_id}"
        return self.by_webhook_key[webhook_key]

store = InvoiceStore()
first = store.create_once("webhook_78291")
retry = store.create_once("webhook_78291")
print("first:", first)
print("retry returns same invoice:", retry == first)
print("invoice count:", len(store.by_webhook_key))

Output

first: inv_1
retry returns same invoice: True
invoice count: 1

Team policy

Teams should write down how AI coding agents are allowed to work.

Minimum policy:

Agents run commands in restricted workspaces without production secrets or deployment authority.
Agents commit source changes only to branches or pull requests, never directly to protected branches.
Sensitive files, dependency changes, and privileged commands require focused human approval.
Every generated PR needs human review.
CI must run before merge.
Security-sensitive diffs need threat-specific verification.
Agent summaries and unredacted command output aren't proof.
Large diffs are split before review.

This isn't anti-agent.

It's how agents become useful inside real engineering teams.

A mechanical gate can reject missing evidence, but it must not perform the merge. This example returns a candidate for human review only when branch, sandbox, tests, and sensitive-review requirements are satisfied.

gate-agent-merge-candidate.py

from dataclasses import dataclass

@dataclass
class Evidence:
    on_review_branch: bool
    restricted_runner: bool
    repro_test_passed: bool
    touches_sensitive_area: bool
    focused_review_complete: bool

def merge_candidate(evidence: Evidence) -> str:
    if not evidence.on_review_branch or not evidence.restricted_runner:
        return "blocked: missing boundary"
    if not evidence.repro_test_passed:
        return "blocked: missing behavior evidence"
    if evidence.touches_sensitive_area and not evidence.focused_review_complete:
        return "blocked: focused review required"
    return "candidate for human merge decision"

print("settings fix:", merge_candidate(Evidence(True, True, True, False, False)))
print("auth edit:", merge_candidate(Evidence(True, True, True, True, False)))

Output

settings fix: candidate for human merge decision
auth edit: blocked: focused review required

Practice

You're reviewing an agent PR. It claims:

Implemented requested settings fix. Tests pass.

You see:

18 files changed
package lock changed
no new test
auth middleware edited
settings bug fixed in one component

What do you do?

Strong answer:

Stop review and flag scope drift.
Ask for a smaller PR that only owns settings files.
Require a reproduction test for the settings bug.
Inspect why auth middleware changed before accepting any follow-up.
Reject the package lock change unless dependency work was in scope.

The settings fix may be real, but the PR isn't reviewable as-is.

Mastery check

Key concepts

Task scope is a control surface, not admin overhead.
Restricted execution limits command effects; branches, CI, and review checks control source-code integration.
Exploration, implementation, review, and verification are safer when separated into explicit passes.
Agent summaries are hints. Diffs, tests, redacted logs, screenshots, and reproduced behavior are evidence.
Coding agents speed up bounded work. They don't own product judgment or merge accountability.

Evaluation rubric

Strong: you can write a bounded agent task, distinguish execution isolation from branch review, name the first review check, and explain what evidence supports merge.
Partial: you know agents need review, but you can't define file ownership, acceptance checks, or evidence requirements.
Weak: you treat the agent summary or green CI alone as enough proof to merge.

Follow-up questions

Why should an AI coding task name the files or modules the agent owns?
What should a reviewer check first in an agent-generated pull request?
Why are coding agents useful even when they still need human review?
When is it safer to split one coding task into explorer, worker, reviewer, and verifier passes?

Common pitfalls

Symptom: agent PR looks productive but touches many unrelated files. Cause: task had no explicit file ownership or non-goals. Fix: rescope the task, split the diff, and reject unrelated churn before deeper review.
Symptom: existing tests are green, but requested bug still isn't well proven. Cause: no reproduction test was added for the actual failure. Fix: require one test that fails before the patch and passes after it.
Symptom: auth, billing, secret handling, or deployment config changed quietly inside a normal feature diff. Cause: sensitive areas were not treated as special approval zones. Fix: stop review, isolate the sensitive change, and require focused approval.
Symptom: reviewer can't explain why the patch is safe even though the summary sounds confident. Cause: evidence was replaced by narration. Fix: inspect the diff, rerun commands, and review behavior directly.
Symptom: agent keeps expanding context and output without getting more reliable. Cause: workflow used one vague pass instead of bounded loops with verifiers. Fix: reduce scope, give exact files and checks, and break work into separate passes.

Next Step

Continue to Agent Memory & Persistence

This article showed how to let an agent propose source changes behind execution restrictions, evidence, and human review. The next step adds durability: giving agents working memory, episodic records, and semantic stores so they can recover prior decisions, open work items, and context across sessions and restarts.

References

GitHub Copilot cloud agent

GitHub · 2026

Claude Code overview

Anthropic · 2026

Features - Codex app

OpenAI · 2026

ReAct: Synergizing Reasoning and Acting in Language Models.

Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., & Cao, Y. · 2023 · ICLR 2023

Reflexion: Language Agents with Verbal Reinforcement Learning.

Shinn, N., et al. · 2023

Introducing the Model Context Protocol

Anthropic · 2024

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

Jimenez et al. · 2024 · ICLR 2024

SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering

Yang et al. · 2024

Agentless: Demystifying LLM-based Software Engineering Agents

Xia et al. · 2024

SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?

Scale AI · 2025

Why SWE-bench Verified no longer measures frontier coding capabilities

OpenAI · 2026

OWASP Top 10 for Large Language Model Applications

OWASP Foundation · 2025

What a coding agent is

A coding agent is an AI system that can inspect a repository, plan changes, edit files, run commands, and produce a branch or patch.

The core loop is familiar from agent design. In coding work, every transition creates a reviewable artifact.

An agent can move through many steps, but the human still owns the merge.
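One way to picture the loop is as a sequence of steps that each leave something a reviewer can open later. The sketch below is illustrative only: the step names, the artifact labels, and the `AgentRun` class are assumptions made for this example, not the interface of any particular agent tool.

```python
from dataclasses import dataclass, field

# Hypothetical step names and artifact labels, chosen for illustration only.
STEP_ARTIFACTS = {
    "inspect": "notes on the files that were read",
    "plan": "proposed change list",
    "edit": "diff on a review branch",
    "run": "captured command output",
}

@dataclass
class AgentRun:
    artifacts: list[tuple[str, str]] = field(default_factory=list)

    def record(self, step: str) -> None:
        # Every transition leaves an artifact that can be inspected afterwards.
        self.artifacts.append((step, STEP_ARTIFACTS[step]))

    def merge_owner(self) -> str:
        # However many steps the agent takes, merge ownership stays with a person.
        return "human"

run = AgentRun()
for step in ("inspect", "plan", "edit", "run"):
    run.record(step)

for step, artifact in run.artifacts:
    print(f"{step}: {artifact}")
print("merge owner:", run.merge_owner())
```

The point of the structure is that nothing the agent does is invisible: each step maps to something a reviewer can read before deciding on the merge.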

How agents plan and correct themselves

bound-agent-repair-loop.py

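def run_repair_loop(test_results: list[str], max_attempts: int = 3) -> str:
    # Each entry stands in for one test run after a repair attempt.
    for attempt, result in enumerate(test_results[:max_attempts], start=1):
        if result == "pass":
            return f"verified on attempt {attempt}"
    # Attempt budget exhausted: stop retrying and hand off to a person.
    return "stop: request human diagnosis"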

print("recovering patch:", run_repair_loop(["fail", "pass"]))
print("stuck patch:", run_repair_loop(["fail", "fail", "fail", "pass"]))

Output

recovering patch: verified on attempt 2
stuck patch: stop: request human diagnosis

Good tasks are bounded

Bad task:

Improve the app and clean up anything you find.

That prompt invites scope drift.

Better task:

Fix the account settings save bug.

Scope:

Own files under web/src/app/settings/ and web/src/lib/settings.ts.

Don't edit auth, billing, deployment, or unrelated styling.

Work without production credentials or external side effects.

Acceptance:

Reproduce failing save behavior with a test.

Fix the bug.

Run pnpm test --run settings and pnpm lint.

Summarize files changed and remaining risk.

Execution:

Use the restricted project runner.

Don't install dependencies, run deploys, or call production systems.

That prompt gives the agent a useful box.

Prompt field	Why it helps
Goal	Defines the behavior and the code area.
Scope	Limits file ownership and blast radius.
Non-goals	Prevents surprise cleanup.
Acceptance checks	Gives the agent a finish line.
Commands	Makes verification explicit.
Execution policy	Prevents the patch task from gaining unrelated capabilities.
Risk notes	Tells review where to look.

Bounded tasks reduce merge conflicts and make review possible.
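The prompt fields in the table can be treated as a contract that is checked before the task is handed to an agent. This is a minimal sketch, assuming the contract is held in a hypothetical `TaskContract` dataclass whose field names mirror the table. It only detects empty fields and says nothing about whether the content of a field is good.

```python
from dataclasses import dataclass

@dataclass
class TaskContract:
    # Field names are assumptions that mirror the prompt-field table.
    goal: str
    owned_paths: tuple[str, ...]
    non_goals: tuple[str, ...]
    acceptance_checks: tuple[str, ...]
    commands: tuple[str, ...]
    execution_policy: str
    risk_notes: str

def missing_fields(task: TaskContract) -> list[str]:
    # Empty strings and empty tuples both count as "not provided".
    return [name for name, value in vars(task).items() if not value]

vague = TaskContract(
    goal="Improve the app and clean up anything you find.",
    owned_paths=(), non_goals=(), acceptance_checks=(), commands=(),
    execution_policy="", risk_notes="",
)
bounded = TaskContract(
    goal="Fix the account settings save bug.",
    owned_paths=("web/src/app/settings/", "web/src/lib/settings.ts"),
    non_goals=("auth", "billing", "deployment", "unrelated styling"),
    acceptance_checks=("Reproduce failing save behavior with a test.", "Fix the bug."),
    commands=("pnpm test --run settings", "pnpm lint"),
    execution_policy="restricted project runner; no installs, deploys, or production calls",
    risk_notes="Summarize files changed and remaining risk.",
)

print("vague task missing:", ", ".join(missing_fields(vague)) or "none")
print("bounded task missing:", ", ".join(missing_fields(bounded)) or "none")
```

The vague prompt has a goal and nothing else, so every other field is reported as missing. The bounded prompt fills every field and passes the check.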

Split agent roles

Don't use one vague agent pass for everything.

Use roles with clear ownership boundaries:

Role	Good use	Bad use
Explorer	Read code and answer one question.	Making broad edits without ownership.
Worker	Implement a bounded change.	Changing architecture without review.
Reviewer	Find bugs, risks, and missing tests.	Rubber-stamping its own diff.
Verifier	Run checks and inspect output.	Treating command success as product success.

This separation matters because agents are good at producing plausible work. A reviewer agent shouldn't be asked to defend the same implementation it just wrote.

Even when one tool provides one interface, you can still run the workflow in passes:

Ask for a short codebase read.
Assign a bounded patch.
Ask for a review of the diff.
Run tests yourself or in CI.
Inspect the behavior before merge.
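The separation of passes can be checked mechanically before a diff reaches human review. The sketch below assumes each pass is recorded with a role and an actor identifier. Both the `AgentPass` record and the actor names are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AgentPass:
    role: str   # "explorer", "worker", "reviewer", or "verifier"
    actor: str  # hypothetical identifier for an agent session, CI job, or person

def check_handoffs(passes: list[AgentPass]) -> list[str]:
    findings: list[str] = []
    workers = {p.actor for p in passes if p.role == "worker"}
    for p in passes:
        # A reviewer that also wrote the patch is defending its own diff.
        if p.role == "reviewer" and p.actor in workers:
            findings.append(f"{p.actor} reviews its own diff")
    roles = {p.role for p in passes}
    for required in ("worker", "reviewer", "verifier"):
        if required not in roles:
            findings.append(f"missing {required} pass")
    return findings

single_pass = [AgentPass("worker", "agent-1"), AgentPass("reviewer", "agent-1")]
split_passes = [
    AgentPass("explorer", "agent-1"),
    AgentPass("worker", "agent-2"),
    AgentPass("reviewer", "agent-3"),
    AgentPass("verifier", "ci"),
]

print("single pass:", check_handoffs(single_pass) or "ok")
print("split passes:", check_handoffs(split_passes) or "ok")
```

The explorer pass is treated as optional here because a well-understood change may not need one. That choice is an assumption of the sketch, not a rule from the workflow.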

Restricted workspace to branch workflow

A controlled workflow separates two boundaries: a restricted environment for commands executed while producing a patch, and version control for reviewing the proposed source changes.

The branch is an integration boundary. It makes proposed changes inspectable before they enter a protected branch.

The review wrapper provides:

diff inspection
rollback path
CI checks
review comments
audit trail
privileged command separation
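The integration boundary can be expressed as a small push rule. This is a sketch only: the protected branch names and the `agent/` prefix are assumed conventions, and in practice the rule would normally be enforced by the hosting platform's branch protection rather than by a script.

```python
PROTECTED_BRANCHES = {"main", "release"}  # assumed names for this example
AGENT_BRANCH_PREFIX = "agent/"            # hypothetical naming convention

def agent_push_decision(branch: str) -> str:
    # Agents never write to a protected branch directly.
    if branch in PROTECTED_BRANCHES:
        return "blocked: protected branch"
    # Anything outside the agent namespace gets a second look.
    if not branch.startswith(AGENT_BRANCH_PREFIX):
        return "needs review: branch outside agent namespace"
    return "allowed: review branch"

print("main:", agent_push_decision("main"))
print("feature:", agent_push_decision("feature/settings"))
print("agent branch:", agent_push_decision("agent/settings-save-fix"))
```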

Execution boundary

Surface	Default rule for a bounded patch
Filesystem	Write only to working tree or worktree for assigned branch.
Credentials	Provide no production tokens; use fixtures or scoped test credentials only when needed.
Network	Deny by default or allow only approved package/test endpoints.
Commands	Allow repository test, lint, and format commands; review installs, migrations, and generated scripts.
External effects	Block deploys, releases, emails, payment writes, and infrastructure mutations.

admit-coding-agent-command.py

CONTRACT_COMMANDS = {
    ("pnpm", "test", "--run", "billing"),
    ("pnpm", "lint"),
    ("pnpm", "exec", "tsc", "--noEmit"),
    ("git", "diff"),
}
BLOCKED_PROGRAMS = {
    "curl", "npm", "pip", "publish", "terraform", "uv",
}

def admit_command(argv: tuple[str, ...], available_secrets: tuple[str, ...]) -> str:
    if available_secrets:
        return "blocked: credentials mounted"
    if not argv:
        return "blocked: empty command"
    program = argv[0].removeprefix("./")
    if program in BLOCKED_PROGRAMS:
        return "blocked: privileged or expanding command"
    if argv in CONTRACT_COMMANDS:
        return "allowed in restricted runner"
    return "needs explicit review"

print("test:", admit_command(("pnpm", "test", "--run", "billing"), ()))
print("infra:", admit_command(("terraform", "apply"), ()))
print("secreted test:", admit_command(("pnpm", "test", "--run", "billing"), ("PROD_API_KEY",)))

Output

test: allowed in restricted runner
infra: blocked: privileged or expanding command
secreted test: blocked: credentials mounted

File ownership example

For parallel agent work, write ownership down.

Worker A owns:
- `web/src/app/settings/page.tsx`
- `web/src/lib/settings.ts`
- `web/src/lib/settings.test.ts`

Worker B owns:
- `web/src/app/billing/page.tsx`
- `web/src/lib/billing.ts`
- `web/src/lib/billing.test.ts`

Both workers:
- Must not edit shared auth middleware.
- Must not change package versions.
- Must run targeted tests before reporting done.

This style prevents agents from overwriting each other and makes it obvious when a diff goes outside scope.

Ownership can be checked before two patches are combined. This example takes proposed changed files from two workers and detects the shared file that neither handoff resolved.

detect-overlapping-agent-edits.py

worker_a = {"web/src/lib/settings.ts", "web/src/lib/settings.test.ts"}
worker_b = {"web/src/lib/billing.ts", "web/src/lib/settings.ts"}

overlap = sorted(worker_a & worker_b)
print("merge status:", "blocked" if overlap else "ready")
print("overlap:", ", ".join(overlap) if overlap else "none")

Output

merge status: blocked
overlap: web/src/lib/settings.ts

Tests aren't optional

Agent-generated code often looks correct.

That doesn't mean behavior is correct.

For day-to-day engineering, the lesson is simple: require evidence.

Change type	Minimum evidence
Bug fix	Repro test fails before and passes after.
Feature	Tests cover normal path and error path.
Refactor	Existing tests pass and behavior diff is explained.
UI change	Screenshot or Playwright smoke when practical.
API change	Contract test or curl example.
Security change	Threat-specific test or manual review note.

The agent's summary isn't evidence.

The diff, tests, redacted logs, screenshots, and reproduced behavior are evidence.
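The evidence table can be turned into a lookup that reports what a pull request still lacks. In this sketch the evidence labels are hypothetical names invented for the example. The only rule taken from the text is that the agent's summary never counts.

```python
# Hypothetical evidence labels, one set per row of the evidence table.
MINIMUM_EVIDENCE = {
    "bug_fix": {"repro_test_fails_before", "repro_test_passes_after"},
    "feature": {"normal_path_test", "error_path_test"},
    "refactor": {"existing_tests_pass", "behavior_diff_explained"},
    "ui_change": {"screenshot_or_smoke_test"},
    "api_change": {"contract_test_or_curl_example"},
    "security_change": {"threat_specific_test_or_review_note"},
}

def missing_evidence(change_type: str, provided: set[str]) -> list[str]:
    # The agent's own summary is discarded before anything is compared.
    usable = provided - {"agent_summary"}
    return sorted(MINIMUM_EVIDENCE[change_type] - usable)

print("bug fix:", missing_evidence("bug_fix", {"agent_summary", "repro_test_passes_after"}))
print("refactor:", missing_evidence("refactor", {"existing_tests_pass", "behavior_diff_explained"}))
```

A bug fix that only shows a passing test is still missing the "fails before" half of the proof, which is the half that shows the test exercises the bug at all.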

prove-billing-webhook-retry-fix.py

events = ["billing_event_78291", "billing_event_78291"]

def create_without_idempotency(retries: list[str]) -> list[str]:
    return [f"invoice_for_{retry}" for retry in retries]

def create_with_idempotency(retries: list[str]) -> list[str]:
    return [f"invoice_for_{retry}" for retry in dict.fromkeys(retries)]

print("before patch invoices:", len(create_without_idempotency(events)))
print("after patch invoices:", len(create_with_idempotency(events)))

Output

before patch invoices: 2
after patch invoices: 1

Command output is useful only after it's safe to attach to a pull request. This redactor takes captured test output and removes a credential value before the evidence packet is shown to reviewers.

redact-test-evidence.py

import re

def redact_output(output: str) -> str:
    return re.sub(r"PROD_API_KEY=\S+", "PROD_API_KEY=[redacted]", output)

captured = "billing webhook retry failed PROD_API_KEY=secret-abc123 webhook_id=78291"
review_output = redact_output(captured)
print(review_output)
print("secret visible:", "secret-abc123" in review_output)

Output

billing webhook retry failed PROD_API_KEY=[redacted] webhook_id=78291
secret visible: False
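A redactor that matches one variable name will miss every other credential. A slightly broader sketch follows, assuming secrets show up as upper-case `NAME=value` pairs whose names end in `KEY`, `TOKEN`, `SECRET`, or `PASSWORD`. That naming assumption and the `STRIPE_TOKEN` sample are hypothetical, and a pattern like this is a backstop, not a substitute for keeping secrets out of the runner in the first place.

```python
import re

# Assumed convention: upper-case variable names ending in a secret-like suffix.
SECRET_ASSIGNMENT = re.compile(r"\b([A-Z0-9_]*(?:KEY|TOKEN|SECRET|PASSWORD))=(\S+)")

def redact_assignments(output: str) -> str:
    # Keep the variable name so reviewers can see which credential leaked.
    return SECRET_ASSIGNMENT.sub(r"\1=[redacted]", output)

captured = "retry failed PROD_API_KEY=secret-abc123 STRIPE_TOKEN=tok_live_9 webhook_id=78291"
review_output = redact_assignments(captured)
print(review_output)
print("any secret visible:", "secret-abc123" in review_output or "tok_live_9" in review_output)
```

Non-secret fields such as `webhook_id` are left intact, so the evidence stays useful for debugging.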

Review the diff like production code

Start review with these questions:

Did the diff solve the requested behavior?
Did it edit only the intended files?
Did it add or update tests that would fail without the fix?
Did it touch auth, secrets, dependencies, shell commands, file access, billing, or deployment?
Did it create hidden fallback behavior?
Did it remove error handling, logging, or validation?
Did it make code harder to maintain?

Then inspect dangerous patterns.

Pattern	Risk
`eval`, shell commands, dynamic imports	Code execution risk.
New dependency	Supply chain and bundle risk.
Broad catch block	Hidden failures.
Silent fallback	Product behavior drifts without alerting.
Secret in code or captured output	Data exposure.
Tests that only assert rendering	Behavior may still be broken.
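Some of the patterns in the table can be flagged automatically on the lines a patch adds. The sketch below scans unified-diff text with a few regular expressions. The patterns are illustrative assumptions that will produce both false positives and misses, so they can point a reviewer at lines worth reading but can't clear a diff.

```python
import re

# Illustrative patterns only; they are not a complete or reliable detector.
RISKY_PATTERNS = {
    "code execution": re.compile(r"\beval\(|\bexec\(|child_process|subprocess"),
    "broad catch": re.compile(r"catch\s*(\(\w*\))?\s*\{\s*\}|except\s*:"),
    "possible secret": re.compile(r"(API_KEY|SECRET|TOKEN)\s*[:=]\s*\S+"),
}

def flag_added_lines(diff_text: str) -> list[str]:
    findings: list[str] = []
    for number, line in enumerate(diff_text.splitlines(), start=1):
        # Only inspect lines the patch adds; skip the "+++ b/file" header.
        if not line.startswith("+") or line.startswith("+++"):
            continue
        for label, pattern in RISKY_PATTERNS.items():
            if pattern.search(line):
                findings.append(f"line {number}: {label}")
    return findings

sample_diff = """\
+++ b/web/src/lib/settings.ts
+const parsed = eval(rawInput)
+try { save(parsed) } catch (e) {}
-const API_KEY = "old"
+const retries = 3
"""

for finding in flag_added_lines(sample_diff):
    print(finding)
```

The removed `API_KEY` line is ignored on purpose: the scan looks at what the patch introduces, not at what it deletes.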

keep-repository-text-out-of-authority.py

TASK_COMMANDS = {"pnpm test --run billing", "pnpm lint"}

def command_decision(command: str, source: str) -> str:
    if source != "task_contract":
        return "blocked: untrusted repository instruction"
    return "allowed" if command in TASK_COMMANDS else "blocked: outside contract"

print("task test:", command_decision("pnpm test --run billing", "task_contract"))
print("README infra:", command_decision("terraform apply", "README.md"))

Output

task test: allowed
README infra: blocked: untrusted repository instruction

A practical review checklist

Use this checklist on every agent-generated PR:

Check	Pass condition
Scope	Changed files match task ownership.
Behavior	Diff directly maps to acceptance criteria.
Tests	At least one test would fail before fix.
Commands	Reported checks ran.
Security	No unsafe shell, file, dependency, or secret handling.
Observability	Logs and errors remain useful.
Simplicity	No broad rewrites unrelated to task.
Rollback	Reverting the PR returns to previous behavior.

If a PR fails the scope check, stop early.

Don't spend an hour reviewing unrelated churn. Ask for a smaller diff.

route-sensitive-code-review.py

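# Substring markers for paths that need focused review. "billing" already
# matches "billing/webhook", so the nested entry is not listed separately.
SENSITIVE = ("auth", "billing", ".github/workflows", "deploy")

def review_route(paths: tuple[str, ...]) -> str:
    # Substring matching is broad: it can over-match (for example, a path
    # containing "author"), which costs an extra look rather than a missed edit.
    touched = [path for path in paths if any(marker in path for marker in SENSITIVE)]
    return "focused security review" if touched else "standard review"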

print("settings only:", review_route(("web/src/app/settings/page.tsx",)))
print("payment edit:", review_route(("web/src/billing/webhook/create.ts",)))

Output

settings only: standard review
payment edit: focused security review

agent-pr-scope-gate.py

from dataclasses import dataclass

SENSITIVE_MARKERS = ("/auth/", "/billing/", "/deploy/", ".github/workflows/")
DEPENDENCY_FILES = ("package.json", "pnpm-lock.yaml", "uv.lock")

@dataclass
class AgentTask:
    owned_paths: tuple[str, ...]
    requires_repro_test: bool = True

@dataclass
class PullRequestEvidence:
    changed_files: tuple[str, ...]
    added_tests: tuple[str, ...]
    reported_commands: tuple[str, ...]

def review_gate(task: AgentTask, evidence: PullRequestEvidence) -> list[str]:
    findings: list[str] = []

    def is_owned(path: str) -> bool:
        return any(
            path.startswith(owned) if owned.endswith("/") else path == owned
            for owned in task.owned_paths
        )

    out_of_scope = [
        path for path in evidence.changed_files
        if not is_owned(path)
    ]
    if out_of_scope:
        findings.append(f"scope_drift: {', '.join(out_of_scope)}")

    sensitive = [
        path for path in evidence.changed_files
        if any(marker in f"/{path}" for marker in SENSITIVE_MARKERS)
    ]
    if sensitive:
        findings.append(f"sensitive_area: {', '.join(sensitive)}")

    dependencies = [
        path for path in evidence.changed_files
        if path.rsplit("/", 1)[-1] in DEPENDENCY_FILES
    ]
    if dependencies:
        findings.append(f"dependency_change: {', '.join(dependencies)}")

    if task.requires_repro_test and not evidence.added_tests:
        findings.append("missing_repro_test")

    if not evidence.reported_commands:
        findings.append("missing_verification_commands")

    return findings

task = AgentTask(owned_paths=(
    "web/src/app/settings/",
    "web/src/lib/settings.ts",
    "web/src/lib/settings.test.ts",
))
pr = PullRequestEvidence(
    changed_files=(
        "web/src/app/settings/page.tsx",
        "web/src/lib/settings.ts",
        "auth/middleware.ts",
        "pnpm-lock.yaml",
    ),
    added_tests=(),
    reported_commands=("pnpm test --run settings",),
)

findings = review_gate(task, pr)
print("BLOCK" if findings else "PASS")
for finding in findings:
    print("-", finding)

Output

BLOCK
- scope_drift: auth/middleware.ts, pnpm-lock.yaml
- sensitive_area: auth/middleware.ts
- dependency_change: pnpm-lock.yaml
- missing_repro_test

Where agents help most

Agents are strongest when the task has clear boundaries and local feedback.

Good uses:

add missing tests for an existing function
fix a narrow bug with a reproduction
migrate one component pattern
update docs from code
add input validation
refactor one module with existing tests
write first draft of a benchmark harness
inspect a codebase and answer a focused question

Risky uses:

rewrite auth
change billing
rotate secrets
design a new distributed system without review
update many dependencies at once
"clean up the codebase"
make product decisions without human context

Feedback changes the work.

Worked example: Refactoring with agents

A multi-agent refactor should start with a messy but functional billing-webhook retry module, then run it through separate passes.

Step C - Implementation: Ask a worker agent to execute the split, one file at a time, keeping existing tests green. It produces small commits with descriptive messages.

Each pass has a different role, a bounded scope, and a clear handoff. The human reviews the final diff before anything merges.

Incident story

Start with an issue:

Users sometimes see duplicate invoices after replaying a billing webhook.

A weak agent task says:

Fix duplicate invoices.

The agent might add a frontend debounce. That may reduce clicks but won't fix duplicate server-side side effects.

A strong task says:

Investigate duplicate invoice creation in billing webhook retry flow.

Scope:

Read billing webhook handler, invoice creation, and retry tests.

Don't edit payment provider config.

First report likely cause and proposed fix.

Then add a failing test for duplicate retry.

Fix with idempotency key on server-side invoice creation.

Run billing webhook tests.

This task guides the agent toward the real class of bug: idempotency.

It also splits investigation from implementation. The human can stop a wrong plan before code changes land.

make-invoice-creation-idempotent.py

class InvoiceStore:
    def __init__(self) -> None:
        self.by_webhook_key: dict[str, str] = {}

    def create_once(self, webhook_key: str) -> str:
        if webhook_key not in self.by_webhook_key:
            next_id = len(self.by_webhook_key) + 1
            self.by_webhook_key[webhook_key] = f"inv_{next_id}"
        return self.by_webhook_key[webhook_key]

store = InvoiceStore()
first = store.create_once("webhook_78291")
retry = store.create_once("webhook_78291")
print("first:", first)
print("retry returns same invoice:", retry == first)
print("invoice count:", len(store.by_webhook_key))

Output

first: inv_1
retry returns same invoice: True
invoice count: 1

Team policy

Teams should write down how AI coding agents are allowed to work.

Minimum policy:

Agents run commands in restricted workspaces without production secrets or deployment authority.
Agents commit source changes only to branches or pull requests, never directly to protected branches.
Sensitive files, dependency changes, and privileged commands require focused human approval.
Every generated PR needs human review.
CI must run before merge.
Security-sensitive diffs need threat-specific verification.
Agent summaries and unredacted command output aren't proof.
Large diffs are split before review.

This isn't anti-agent.

It's how agents become useful inside real engineering teams.

gate-agent-merge-candidate.py

from dataclasses import dataclass

@dataclass
class Evidence:
    on_review_branch: bool
    restricted_runner: bool
    repro_test_passed: bool
    touches_sensitive_area: bool
    focused_review_complete: bool

def merge_candidate(evidence: Evidence) -> str:
    if not evidence.on_review_branch or not evidence.restricted_runner:
        return "blocked: missing boundary"
    if not evidence.repro_test_passed:
        return "blocked: missing behavior evidence"
    if evidence.touches_sensitive_area and not evidence.focused_review_complete:
        return "blocked: focused review required"
    return "candidate for human merge decision"

print("settings fix:", merge_candidate(Evidence(True, True, True, False, False)))
print("auth edit:", merge_candidate(Evidence(True, True, True, True, False)))

Output

settings fix: candidate for human merge decision
auth edit: blocked: focused review required

Practice

You're reviewing an agent PR. It claims:

Implemented requested settings fix. Tests pass.

You see:

18 files changed
package lock changed
no new test
auth middleware edited
settings bug fixed in one component

What do you do?

Strong answer:

Stop review and flag scope drift.
Ask for a smaller PR that only owns settings files.
Require a reproduction test for the settings bug.
Inspect why auth middleware changed before accepting any follow-up.
Reject the package lock change unless dependency work was in scope.

The settings fix may be real, but the PR isn't reviewable as-is.

Mastery check

Key concepts

Task scope is a control surface, not admin overhead.
Restricted execution limits command effects; branches, CI, and review checks control source-code integration.
Exploration, implementation, review, and verification are safer when separated into explicit passes.
Agent summaries are hints. Diffs, tests, redacted logs, screenshots, and reproduced behavior are evidence.
Coding agents speed up bounded work. They don't own product judgment or merge accountability.

Evaluation rubric

Strong: you can write a bounded agent task, distinguish execution isolation from branch review, name the first review check, and explain what evidence supports merge.
Partial: you know agents need review, but you can't define file ownership, acceptance checks, or evidence requirements.
Weak: you treat the agent summary or green CI alone as enough proof to merge.

Follow-up questions

Why should an AI coding task name the files or modules the agent owns?
What should a reviewer check first in an agent-generated pull request?
Why are coding agents useful even when they still need human review?
When is it safer to split one coding task into explorer, worker, reviewer, and verifier passes?

Common pitfalls

Symptom: agent PR looks productive but touches many unrelated files. Cause: task had no explicit file ownership or non-goals. Fix: rescope the task, split the diff, and reject unrelated churn before deeper review.
Symptom: existing tests are green, but the requested bug fix still isn't proven. Cause: no reproduction test was added for the actual failure. Fix: require one test that fails before the patch and passes after it.
Symptom: auth, billing, secret handling, or deployment config changed quietly inside a normal feature diff. Cause: sensitive areas were not treated as special approval zones. Fix: stop review, isolate the sensitive change, and require focused approval.
Symptom: reviewer can't explain why the patch is safe even though the summary sounds confident. Cause: evidence was replaced by narration. Fix: inspect the diff, rerun commands, and review behavior directly.
Symptom: agent keeps expanding context and output without getting more reliable. Cause: workflow used one vague pass instead of bounded loops with verifiers. Fix: reduce scope, give exact files and checks, and break work into separate passes.

AI Coding Workflow with Agents

What a coding agent is

If a coding agent can read the repo, edit files, and run tests, why should a human still own the merge?

How agents plan and correct themselves

What makes context engineering different from writing a longer prompt?

Good tasks are bounded

What are the two most important boundaries in a good coding-agent task?

Split agent roles

Why is it risky to let the same agent explore, implement, review, and verify in one vague pass?

Restricted workspace to branch workflow

Execution boundary

File ownership example

In parallel agent work, what should happen if Worker A edits a file owned by Worker B?

Tests aren't optional

Why is "tests pass" weaker than "this reproduction test failed before and passes after"?

Review the diff like production code

Why can repository text be dangerous input for a coding agent?

A practical review checklist

What is the first review check for an agent-generated PR?

Where agents help most

What kind of coding task gives an agent the strongest feedback loop?

Worked example: Refactoring with agents

In the refactor example, why does verification come after implementation instead of being merged into the worker role?

Incident story

Why is a frontend debounce a weak fix for duplicate invoices?

Team policy

Practice

Why should you reject a PR that contains a real fix plus unrelated auth and dependency changes?

Mastery check

Key concepts

Evaluation rubric

Follow-up questions

Common pitfalls

Mastery Check
