Multi-Agent Orchestration

Master multi-agent orchestration with LangGraph, AutoGen teams, and OpenAI handoffs. Learn DAG-style routing, typed shared state, protocol boundaries, and human-in-the-loop controls for reliable AI systems.

Agent failure handling gives one worker a recovery plan. Multi-agent orchestration starts when one agent has too many responsibilities and the job needs typed nodes, explicit edges, persistent state, and approval points.

Multi-agent orchestration splits a hard workflow into specialized steps with explicit dependencies. A graph makes control flow clearer when one model call shouldn't plan, retrieve, decide, and act alone. It becomes safer only when it also limits capabilities, validates merged evidence, and binds approvals to exact side effects.

One release engineer shouldn't inspect eval results, check canary metrics, verify rollout policy, and write the incident update while other releases wait. Strong platform teams assign bounded roles: one worker reads evals, another checks service health, a third validates policy, and a coordinator owns the handoff.

Long-running agent workflows face a similar coordination problem. As one transcript accumulates instructions, tool results, and intermediate decisions inside a limited context window, relevant state becomes harder to isolate. Multi-agent orchestration splits the work into specialized agents connected by an explicit workflow graph. Instead of one prompt that tries to do everything, you build a small pipeline where a triage agent classifies the request, a deployment-state agent queries the release store, and a response formatter composes the reply.

You have already practiced the pieces this pattern needs: structured outputs, ReAct loops, human-in-the-loop (HITL) approval gates, memory, and failure recovery. If those feel fuzzy, review the prerequisite lessons first. Here, the next step is wiring multiple specialized agents into a graph so they collaborate on a single job without hiding state in a transcript.

AutoGen [1] and MetaGPT [2] are useful early examples of role-based coordination. AutoGen evaluated multi-agent conversation patterns across coding, math, and question answering. MetaGPT organized software agents around explicit standard operating procedures (SOPs). Together, they made the systems-design questions concrete: who owns state, which worker acts next, and how does the runtime stop?

[1] AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation. https://arxiv.org/abs/2308.08155
[2] MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework. https://arxiv.org/abs/2308.00352

The honest case: when multi-agent helps and when it hurts

Before you reach for a graph of agents, take the cost seriously. First-party engineering reports now make the boundary clearer: parallel exploration can pay off, while coupled decisions and unbounded writes make multi-agent systems harder to control.

Cognition, the team behind Devin [3], published a widely read argument titled "Don't Build Multi-Agents" [4]. Their position: when you split work across parallel subagents, each subagent acts on its own slice of context. Because every action carries implicit decisions, those decisions conflict when the subagents can't see each other's full traces. The system can look parallel while producing inconsistent, hard-to-reconcile output. Their recommended default is a single agent with strong context engineering, and where you must split work, share the complete trace rather than isolated messages.

[3] Introducing Devin, the first AI software engineer. https://docs.devin.ai/get-started/devin-intro
[4] Don't Build Multi-Agents. https://cognition.ai/blog/dont-build-multi-agents

In a 2026 follow-up, Cognition updated that position with a narrower pattern they said was working in practice: multiple agents may contribute intelligence, but writes stay single-threaded [5]. That refinement is central to production design. Parallel research, retrieval, and proposal branches don't justify parallel traffic shifts, data deletion, or incident-publication writes.

[5] Multi-Agents: What's Actually Working. https://cognition.ai/blog/multi-agents-working

One day after Cognition's original June 2025 post, Anthropic published "How we built our multi-agent research system" [6], describing a different choice for their Research feature: a lead agent that spawns 3 to 5 subagents, each exploring a different part of the question in its own context window. On their internal research eval, a Claude Opus 4 lead agent with Claude Sonnet 4 subagents outperformed single-agent Claude Opus 4 by 90.2%. The catch they state plainly: multi-agent systems use about 15 times more tokens than chats, so the task value has to justify the larger token budget. They also name where multi-agent is a poor fit: domains where every agent must share the same context, or where dependencies between agents are heavy. They cite coding as exactly that kind of poor fit, because subtasks are tightly coupled and agents struggle to coordinate in real time.

[6] How we built our multi-agent research system. https://www.anthropic.com/engineering/multi-agent-research-system

Read together, Cognition's warning and refinement plus Anthropic's research system describe the same trade-off from different task shapes:

Signal	Favors a single strong agent	Favors multiple agents
Subtask coupling	Tight, decisions depend on each other	Loose, subtasks explore independent slices
Context sharing	Every step needs the full trace	Each branch needs only its own slice
Parallelism	Little, the work is sequential	High, branches run at the same time
Context size	Fits one window with good engineering	Exceeds one window, needs compression
Task value	Routine, cost-sensitive	High value, can pay a much larger token budget
External writes	One bounded writer	Parallel evidence only, then one controlled writer

The practical rule for 2026: reach for multi-agent only when subtasks are genuinely separable and parallel, and when the task value justifies the token and failure-surface cost. Otherwise a single agent with disciplined context engineering is cheaper and easier to debug. The rest of the design assumes you have cleared that bar and now need to wire the graph correctly.

A small admission check keeps that decision concrete: parallel branches win only when work is independent, worth the extra execution, and able to merge through a clear contract.

admit-multi-agent-work.py

from dataclasses import dataclass

@dataclass(frozen=True)
class TaskShape:
    independent_branches: int
    needs_shared_trace: bool
    value_justifies_extra_cost: bool
    merge_contract_defined: bool

def choose_orchestration(task: TaskShape) -> str:
    if (
        task.independent_branches >= 2
        and not task.needs_shared_trace
        and task.value_justifies_extra_cost
        and task.merge_contract_defined
    ):
        return "parallel graph"
    return "single scoped agent"

research = TaskShape(4, False, True, True)
code_edit = TaskShape(3, True, True, False)

print("research:", choose_orchestration(research))
print("coupled code edit:", choose_orchestration(code_edit))

Output

research: parallel graph
coupled code edit: single scoped agent

Why linear agent chains break

Figure: Multi-agent DAG orchestration diagram showing an operator request routed into three read-only branches for retrieval, policy, and drafting, then merged through an evidence gate before one controlled release write. The request fans out only into read-only branches. Retrieval, policy, and drafting can overlap, but the merge gate must collect receipts before the single release write.

In the simplest case, an agent pipeline is a straight line: the user asks a question, agent A looks it up, agent B analyzes it, and agent C replies. For a narrow task like "What does this runbook say?", a chain works fine. But production-release questions are messy. An on-call engineer might ask, "Can we promote reranker-v17, and did the canary burn error budget?" That touches evals, deployment state, and policy at the same time. A rigid assembly line forces you to process these in sequence even when they could run in parallel.

The problem with linear chains

A naive pipeline processes steps sequentially, handing output directly from one task to the next. The visual below compares that rigid path with a DAG route that can branch by intent and merge results before the final answer.

Figure: Serial versus DAG routing comparison showing sequential waits and parallel evidence branches. Compare the critical path: serial routing waits through every step, while the DAG overlaps independent read branches and still merges before publishing one answer.

Linear chains break the moment you need:

Conditional routing: A promotion request needs the policy agent, not the log analyst.
Parallel execution: Checking eval results, service health, and rollout policy can happen at the same time.
Cycles with conditions: If the release ID returns nothing, go back and search by service name instead.
State-dependent routing: The next edge depends on verified fields or a recorded classification.
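The parallel-execution case can be sketched with plain asyncio, independent of any agent framework. The three check functions and their return values are hypothetical stand-ins for real eval, metrics, and policy reads. The sketch only shows that independent read-only branches can be awaited together and merged afterward.

```python
import asyncio

# Hypothetical read-only checks; real versions would call an eval store,
# a metrics API, and a policy service.
async def check_evals(release_id: str) -> tuple[str, str]:
    await asyncio.sleep(0.01)  # simulated I/O wait
    return ("evals", f"{release_id}: regression suite passed")

async def check_health(release_id: str) -> tuple[str, str]:
    await asyncio.sleep(0.01)
    return ("health", f"{release_id}: error budget ok")

async def check_policy(release_id: str) -> tuple[str, str]:
    await asyncio.sleep(0.01)
    return ("policy", f"{release_id}: canary stage allowed")

async def gather_evidence(release_id: str) -> dict[str, str]:
    # gather() runs the branches concurrently and returns results
    # in the order the awaitables were passed in.
    results = await asyncio.gather(
        check_evals(release_id),
        check_health(release_id),
        check_policy(release_id),
    )
    return dict(results)

evidence = asyncio.run(gather_evidence("rel-7421"))
for branch, finding in evidence.items():
    print(f"{branch}: {finding}")
```

A serial chain would await each check in turn, so its wall-clock time approaches the sum of the three waits. The gathered version approaches the slowest single wait.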

What a DAG gives you

A Directed Acyclic Graph (DAG) models your agent workflow as nodes (agents or functions) and edges (dependencies). It defines routing rules: "If the Triage Agent says 'deployment health', go to the Health Analyst. If it says 'promotion request', go to the Policy Agent." A DAG gives you branching and parallelism without back-edges.

Once you add retry loops, review loops, or approval loops, you're no longer dealing with a strict DAG. You're dealing with a state machine or a more general graph. LangGraph supports both. In the DAG route above, the triage node can fan out to specialists and every successful branch still converges on the formatter without cycling.
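One way to see the difference between a strict DAG and a general graph is to check for cycles directly. The sketch below uses Python's standard-library graphlib and is not part of LangGraph. The node names mirror the release-health example, and the retry back-edge is a hypothetical addition for illustration.

```python
from graphlib import CycleError, TopologicalSorter

# Each node maps to the set of nodes that must finish before it runs.
dag = {
    "triage": set(),
    "lookup_agent": {"triage"},
    "policy_agent": {"triage"},
    "faq_agent": {"triage"},
    "health_analyst": {"lookup_agent"},
    "response_formatter": {"health_analyst", "policy_agent", "faq_agent"},
}

def is_dag(graph: dict[str, set[str]]) -> bool:
    """A graph is a DAG exactly when a topological order exists."""
    try:
        tuple(TopologicalSorter(graph).static_order())
        return True
    except CycleError:
        return False

# Hypothetical retry loop: a failed health check sends work back to lookup.
with_retry = {**dag, "lookup_agent": {"triage", "health_analyst"}}

print("original graph is a DAG:", is_dag(dag))
print("graph with retry edge is a DAG:", is_dag(with_retry))
```

Once the second check returns False, termination is no longer guaranteed by graph shape alone, so the loop needs an explicit exit condition such as a retry budget.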

Key property

The control flow is explicit. Given the same state and routing decisions, you can inspect why execution moved from one node to the next. That's much easier to debug than an open-ended conversation loop where an LLM silently decides the next speaker.

Explicit routing isn't authorization. Keep read-only evidence branches separate from any node that can shift traffic, roll back a service, or send an external notification. A write-capable node must still require policy checks, a bound approval when needed, and an idempotency key.
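The write-node requirements above can be sketched in plain Python. This is a minimal illustration under stated assumptions: WriteGate, Approval, and action_digest are hypothetical names, and the policy check is omitted for brevity. The approval carries a hash of the exact action the approver reviewed, and the idempotency key makes a retried write a no-op.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Optional

def action_digest(action: dict[str, str]) -> str:
    """Hash a canonical form of the action so an approval names exactly one side effect."""
    canonical = json.dumps(action, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

@dataclass(frozen=True)
class Approval:
    approver: str
    digest: str  # digest of the action the approver actually reviewed

class WriteGate:
    """Hypothetical single writer: checks the approval, then deduplicates by key."""

    def __init__(self) -> None:
        self._applied_keys: set[str] = set()

    def execute(
        self, action: dict[str, str], approval: Optional[Approval], idempotency_key: str
    ) -> str:
        if approval is None or approval.digest != action_digest(action):
            return "rejected"   # approval missing or bound to a different action
        if idempotency_key in self._applied_keys:
            return "duplicate"  # a retry of a write that already happened
        self._applied_keys.add(idempotency_key)
        return "applied"

reviewed = {"op": "shift_traffic", "release": "reranker-v17", "percent": "10"}
approval = Approval("oncall-lead", action_digest(reviewed))
gate = WriteGate()

first = gate.execute(reviewed, approval, "promo-rel17-10pct")
retry = gate.execute(reviewed, approval, "promo-rel17-10pct")
altered = gate.execute({**reviewed, "percent": "50"}, approval, "promo-rel17-50pct")
print(first, retry, altered)
```

Changing any field of the action changes the digest, so an approval for a 10% shift cannot be reused for a 50% shift.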

Building a workflow graph in LangGraph

LangGraph is a graph-based framework for multi-agent orchestration. Its core primitives are a state schema, node functions, and edges [7].

[7] LangGraph Graph API. https://docs.langchain.com/oss/python/langgraph/graph-api

State: the shared whiteboard

A shared TypedDict (a Python typing construct that declares the expected keys and value types of a dictionary) flows through the graph as the memory every agent reads and updates. The schema keeps the release-health graph compact: message history, the current routing step, retrieved deployment data, and an error count. In production, sensitive or large release records should remain in an access-controlled store; graph state should carry a scoped reference plus the verified fields downstream nodes need.

state-the-shared-whiteboard.py

from typing import Annotated
from typing_extensions import TypedDict
from langchain_core.messages import AIMessage, AnyMessage
from langgraph.graph.message import add_messages

DeploymentData = dict[str, str]

class AgentState(TypedDict):
    # Messages accumulate using the add_messages reducer
    messages: Annotated[list[AnyMessage], add_messages]
    current_step: str
    deployment_data: DeploymentData | None
    service_health: str | None
    final_answer: str | None
    error_count: int
    next_worker: str | None

Message reducer: The Annotated[list[AnyMessage], add_messages] pattern tells LangGraph to merge updates by appending rather than overwriting (a message whose ID matches an existing one replaces it instead of being appended). If two parallel branches both append messages, you get both sets in the final list.

Nodes: the workers

Nodes are standard Python functions that transform the graph's state. The functions below implement a deployment lookup and a health analysis step. Each takes the current AgentState, performs external read calls (like a release-store query or metrics API call), and returns a dictionary of state updates. LangGraph merges those updates into the global state automatically.

In the snippet, deployment_db and metrics_api are application adapters. The complete local smoke test after the routing section uses in-memory adapters so you can run the graph without API keys.

nodes-the-workers.py

def lookup_node(state: AgentState) -> dict[str, object]:
    """Deployment lookup agent: queries the release store."""
    query = state["messages"][-1].content
    # Example: using a search tool or database wrapper
    results = deployment_db.search(query)
    return {
        "deployment_data": results,
        "current_step": "health_check",
        # Appends to messages list due to Annotated[list, add_messages]
        "messages": [AIMessage(content=f"Found release {results['release_id']}")],
    }

def health_node(state: AgentState) -> dict[str, object]:
    """Health analyst: checks service metrics for the release."""
    deployment = state["deployment_data"]
    if deployment is None:
        raise RuntimeError("Missing deployment data")
    health = metrics_api.health(deployment["release_id"])
    return {
        "service_health": health.status,
        "final_answer": f"Release {deployment['release_id']} health is {health.status}.",
        "current_step": "format",
        "messages": [AIMessage(content=f"Service health: {health.status}")],
    }

Notice that each node returns only the fields it changes. LangGraph patches those into the existing state rather than replacing the whole object. That means the deployment_data produced by lookup_node is still visible when health_node runs, even though health_node doesn't return deployment_data again.
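The patch-and-merge behavior can be modeled in a few lines of plain Python. This is a simplified mental model, not LangGraph's implementation: apply_update is a hypothetical helper, and list concatenation stands in for the add_messages reducer.

```python
import operator
from typing import Any, Callable

Reducer = Callable[[Any, Any], Any]

def apply_update(
    state: dict[str, Any], update: dict[str, Any], reducers: dict[str, Reducer]
) -> dict[str, Any]:
    """Return a new state: keys with a reducer are combined, other keys are overwritten."""
    merged = dict(state)
    for key, value in update.items():
        if key in reducers:
            merged[key] = reducers[key](merged[key], value)
        else:
            merged[key] = value
    return merged

reducers = {"messages": operator.add}  # list concatenation stands in for add_messages
state = {
    "messages": ["user: Can we release search-api?"],
    "deployment_data": None,
    "service_health": None,
}

# Update returned by a lookup-style node: only the fields it changed.
state = apply_update(
    state,
    {"deployment_data": {"release_id": "rel-7421"}, "messages": ["Found release rel-7421"]},
    reducers,
)
# Update returned by a health-style node: deployment_data is not mentioned.
state = apply_update(
    state,
    {"service_health": "error_budget_ok", "messages": ["Service health: error_budget_ok"]},
    reducers,
)

print(state["deployment_data"])
print(len(state["messages"]))
```

The second update never mentions deployment_data, yet the value written by the first update is still present, and the messages list has grown instead of being replaced.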

Edges: the routing logic

Conditional edges define the graph's control flow based on the current state. A triage node can write a routing decision into shared state, and a small router function can read that decision and return the next node name. That keeps the classification logic inside a node while keeping the edge function deterministic and easy to inspect.

edges-the-routing-logic.py

from typing import Literal
from langchain_core.messages import AIMessage
from langgraph.graph import StateGraph, START, END

Route = Literal["lookup_agent", "faq_agent", "policy_agent"]

def triage_node(state: AgentState) -> dict[str, object]:
    user_query = state["messages"][-1].content.lower()

    if any(token in user_query for token in ("promote", "rollback", "policy")):
        route: Route = "policy_agent"
    elif any(token in user_query for token in ("deploy", "release", "canary")):
        route = "lookup_agent"
    else:
        route = "faq_agent"

    return {"next_worker": route}

def faq_node(state: AgentState) -> dict[str, object]:
    # llm is assumed to be an application-provided chat model client
    answer = llm.invoke(
        f"Answer briefly from the FAQ: {state['messages'][-1].content}"
    )
    return {"final_answer": answer.content, "current_step": "format"}

def policy_node(state: AgentState) -> dict[str, object]:
    answer = llm.invoke(
        f"Explain the rollout policy for: {state['messages'][-1].content}"
    )
    return {"final_answer": answer.content, "current_step": "format"}

def format_node(state: AgentState) -> dict[str, object]:
    return {
        "current_step": "complete",
        "messages": [AIMessage(content=state["final_answer"] or "No answer generated yet.")],
    }

def router(state: AgentState) -> Route:
    return state["next_worker"] or "faq_agent"

# Build the graph
builder = StateGraph(AgentState)
builder.add_node("triage", triage_node)
builder.add_node("lookup_agent", lookup_node)
builder.add_node("faq_agent", faq_node)
builder.add_node("policy_agent", policy_node)
builder.add_node("health_analyst", health_node)
builder.add_node("response_formatter", format_node)

builder.add_edge(START, "triage")

# Conditional edges handle dynamic routing
builder.add_conditional_edges("triage", router)

# Standard edges define fixed transitions and the final fan-in
builder.add_edge("lookup_agent", "health_analyst")
builder.add_edge("health_analyst", "response_formatter")
builder.add_edge("faq_agent", "response_formatter")
builder.add_edge("policy_agent", "response_formatter")
builder.add_edge("response_formatter", END)

app = builder.compile()

If you invoke this graph with the query "Can we release search-api?", the trace looks like this:

triage_node reads the query, writes "lookup_agent" into next_worker.
The router returns "lookup_agent", so LangGraph calls lookup_node.
lookup_node fetches release data, writes deployment_data, and sets current_step to "health_check".
The fixed edge lookup_agent -> health_analyst triggers health_node.
health_node queries metrics, writes service_health, stores final_answer, and sets current_step to "format".
The fixed edge health_analyst -> response_formatter triggers format_node.
format_node reads final_answer and appends the final message.
The fixed edge response_formatter -> END terminates the run.

Production tip: Keep router functions deterministic. If the routing decision depends on an LLM call, perform that call inside a node (like triage_node), write the result into state, and let the edge function read it. That way you can log the exact routing decision without re-running the LLM during debugging.

Routing also needs a capability boundary. A request that mentions promotion may route to a policy lookup and a review proposal, but the graph must not shift production traffic merely because a specialist node exists.

constrain-route-capabilities.py

ROUTES = {
    "release": ("eval_read", "deployment_read", "draft_summary"),
    "promotion": ("eval_read", "policy_read", "propose_promotion", "human_review"),
}

def route_for(message: str) -> tuple[str, ...]:
    intent = "promotion" if "promote" in message.lower() else "release"
    return ROUTES[intent]

path = route_for("Please promote reranker-v17")
assert "shift_traffic" not in path
print("route:", " -> ".join(path))
print("external write included:", "shift_traffic" in path)

Output

route: eval_read -> policy_read -> propose_promotion -> human_review
external write included: False

The same graph can run as a complete local smoke test. It uses fake release and health data, but the LangGraph state, nodes, conditional edge, fixed edges, and invocation are real:

edges-the-routing-logic-2.py

from typing import Annotated, Literal
from typing_extensions import TypedDict

from langchain_core.messages import AIMessage, AnyMessage, HumanMessage
from langgraph.graph import END, START, StateGraph
from langgraph.graph.message import add_messages

DeploymentData = dict[str, str]

class AgentState(TypedDict):
    messages: Annotated[list[AnyMessage], add_messages]
    current_step: str
    deployment_data: DeploymentData | None
    service_health: str | None
    final_answer: str | None
    error_count: int
    next_worker: str | None

DEPLOYMENTS = {"search-api": {"service_id": "search-api", "release_id": "rel-7421"}}
HEALTH = {"rel-7421": "error_budget_ok"}
Route = Literal["lookup_agent", "faq_agent", "policy_agent"]

def triage_node(state: AgentState) -> dict[str, object]:
    user_query = state["messages"][-1].content.lower()
    if any(token in user_query for token in ("promote", "rollback", "policy")):
        route: Route = "policy_agent"
    elif any(token in user_query for token in ("deploy", "release", "canary")):
        route = "lookup_agent"
    else:
        route = "faq_agent"
    return {"next_worker": route}

def router(state: AgentState) -> Route:
    return state["next_worker"] or "faq_agent"

def lookup_node(state: AgentState) -> dict[str, object]:
    query = state["messages"][-1].content
    service_id = "search-api" if "search" in query else "unknown"
    deployment = DEPLOYMENTS.get(service_id)
    if deployment is None:
        # Fail with a clear message instead of a bare KeyError
        raise LookupError(f"No deployment record for service {service_id!r}")
    return {
        "deployment_data": deployment,
        "current_step": "health_check",
        "messages": [AIMessage(content=f"Found release {deployment['release_id']}")],
    }

def health_node(state: AgentState) -> dict[str, object]:
    deployment = state["deployment_data"]
    if deployment is None:
        raise ValueError("deployment_data must be populated before health_node runs")
    health = HEALTH[deployment["release_id"]]
    return {
        "service_health": health,
        "final_answer": f"Release {deployment['release_id']} health is {health}.",
        "current_step": "format",
        "messages": [AIMessage(content=f"Service health: {health}")],
    }

def faq_node(state: AgentState) -> dict[str, object]:
    return {"final_answer": "FAQ answer", "current_step": "format"}

def policy_node(state: AgentState) -> dict[str, object]:
    return {"final_answer": "Policy answer", "current_step": "format"}

def format_node(state: AgentState) -> dict[str, object]:
    return {
        "current_step": "complete",
        "messages": [
            AIMessage(content=state["final_answer"] or "No answer generated yet.")
        ],
    }

builder = StateGraph(AgentState)
builder.add_node("triage", triage_node)
builder.add_node("lookup_agent", lookup_node)
builder.add_node("faq_agent", faq_node)
builder.add_node("policy_agent", policy_node)
builder.add_node("health_analyst", health_node)
builder.add_node("response_formatter", format_node)

builder.add_edge(START, "triage")
builder.add_conditional_edges("triage", router)
builder.add_edge("lookup_agent", "health_analyst")
builder.add_edge("health_analyst", "response_formatter")
builder.add_edge("faq_agent", "response_formatter")
builder.add_edge("policy_agent", "response_formatter")
builder.add_edge("response_formatter", END)

app = builder.compile()
result = app.invoke({
    "messages": [HumanMessage(content="Can we release search-api?")],
    "current_step": "start",
    "deployment_data": None,
    "service_health": None,
    "final_answer": None,
    "error_count": 0,
    "next_worker": None,
})

print("service_health:", result["service_health"])
print("final message:", result["messages"][-1].content)
print("route completed:", result["service_health"] == "error_budget_ok")

Output

service_health: error_budget_ok
final message: Release rel-7421 health is error_budget_ok.
route completed: True

Four ways to organize a team

Once you can build a single graph, the next question is how to arrange agents when the problem grows. Start with dependency shape, then choose how much routing authority stays centralized.

Figure: Team-topology selector. Independent work maps to map-reduce, centralized routing to supervisor, and larger domains to hierarchy or swarm. Choose topology from coordination need. Use map-reduce for independent work, supervisor when one router should own delegation, and hierarchy or swarm when routing spans larger domains or live peer handoffs.

Supervisor agent

One LLM acts as a manager that delegates tasks to specialized workers. That pattern works when the sub-steps aren't known in advance. In our release system, a supervisor might look at the conversation history and decide whether to call the deployment lookup agent, the policy agent, or a human escalation path.

supervisor-agent.py

from langchain_core.messages import SystemMessage

def supervisor_node(state: AgentState) -> dict[str, object]:
    """Supervisor decides which worker to invoke next."""
    response = llm.invoke([
        SystemMessage(content="""You are a supervisor managing these workers:
        - deployment_lookup: Queries the release store
        - health_analyst: Checks service health metrics
        - policy_agent: Explains rollout and rollback rules
        - response_formatter: Writes the final operator reply

        Based on the conversation, decide which worker to invoke next,
        or respond with FINISH if the task is complete."""),
        *state["messages"]
    ])
    return {"messages": [response], "next_worker": response.content}

When it helps

Supervisors offer simple, centralized control. When a promotion request turns out to need both deployment lookup and policy explanation, the supervisor can sequence them explicitly rather than hard-coding every combination in the graph edges.

When it hurts

Supervisors become a bottleneck and a single point of failure. Every step costs an extra LLM call, which adds latency and token cost. Supervisors also rely on the LLM outputting structured routing instructions (like "deployment_lookup"), which can be brittle with smaller models.
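One way to reduce that brittleness is to validate the supervisor's text against an allowlist before it reaches next_worker. The sketch below is a minimal illustration: parse_next_worker and the human_escalation fallback are hypothetical names, and the worker names are taken from the supervisor prompt.

```python
WORKERS = {"deployment_lookup", "health_analyst", "policy_agent", "response_formatter"}
FINISH = "FINISH"
FALLBACK = "human_escalation"  # hypothetical safe route for unparseable output

def parse_next_worker(raw: str) -> str:
    """Map free-form supervisor text onto a known route or a safe fallback."""
    # Drop surrounding whitespace, quotes, backticks, and trailing periods.
    token = raw.strip().strip("\"'`.").strip()
    if token.upper() == FINISH:
        return FINISH
    normalized = token.lower().replace("-", "_").replace(" ", "_")
    if normalized in WORKERS:
        return normalized
    return FALLBACK

for raw in ["deployment_lookup", " Policy Agent.\n", "finish", "Maybe ask the health analyst?"]:
    print(repr(raw), "->", parse_next_worker(raw))
```

Anything that does not normalize to a known worker goes to the fallback route, so a chatty or malformed reply can't select an arbitrary node. Asking the model for structured output is a complementary option when the model supports it.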

Hierarchical teams

For complex enterprise workflows, one supervisor can become a wide routing bottleneck. A hierarchy introduces supervisors of supervisors. In a large AI platform, a "Release Manager" agent might manage an "Evaluation Team" (a sub-supervisor coordinating benchmark and regression checks) and an "Operations Team" (a sub-supervisor coordinating canary health and rollback readiness).

When it helps

Hierarchical structures scale to large tasks and keep concerns clearly separated. No single manager has to remember every detail of every sub-team.

When it hurts

Deep hierarchies add latency. Handoff dilution can occur when the original operator intent loses detail as it passes through multiple manager layers. A request that needs only one release-store lookup might now take three LLM calls just to decide who does the lookup.
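The routing overhead is easy to estimate before building anything. The sketch below counts how many supervisors must each make a routing decision before a worker runs. The team tree and worker names are hypothetical, loosely following the Release Manager example.

```python
# Hypothetical team tree: each supervisor maps to its direct reports.
TEAM = {
    "release_manager": ["evaluation_team", "operations_team"],
    "evaluation_team": ["benchmark_check", "regression_check"],
    "operations_team": ["canary_health", "rollback_readiness"],
}

def routing_path(root: str, worker: str) -> list[str]:
    """Return the supervisors that each make a routing decision to reach worker."""
    for child in TEAM.get(root, []):
        if child == worker:
            return [root]
        below = routing_path(child, worker)
        if below:
            return [root, *below]
    return []  # worker is not reachable from this root

path = routing_path("release_manager", "canary_health")
print(" -> ".join(path), "=", len(path), "routing calls")
```

In this two-level tree, a single canary-health lookup costs two routing calls before any real work starts. Each extra management layer adds one more.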

Swarm (peer-to-peer handoff)

Agents hand off to each other directly without a long-lived supervisor. OpenAI's experimental Swarm repo [8] made this pattern easy to study, and the current OpenAI Agents SDK exposes the same idea through first-class handoffs [9][10]. Handoffs are represented to the model as transfer tools, and input filters can control what history the receiving agent sees [10]. In the example below, the triage agent doesn't solve the task itself. It selects the specialist that should take over.

[8] Swarm: Educational Framework for Multi-Agent Orchestration. https://github.com/openai/swarm
[9] OpenAI Agents SDK. https://github.com/openai/openai-agents-python
[10] OpenAI Agents SDK Handoffs. https://openai.github.io/openai-agents-python/handoffs/

swarm-peer-to-peer-handoff.py

import asyncio
from agents import Agent, Runner

release_agent = Agent(
    name="Release Agent",
    handoff_description="Handles model-promotion questions and rollout plans.",
    instructions="You explain release status clearly and concisely.",
)

health_agent = Agent(
    name="Health Agent",
    handoff_description="Handles service metrics and canary-health updates.",
    instructions="You look up deployment health and explain current signals.",
)

triage_agent = Agent(
    name="Triage Agent",
    instructions="Route the operator to one specialist.",
    handoffs=[release_agent, health_agent],
)

async def main():
    result = await Runner.run(
        triage_agent,
        "Can we promote reranker-v17 from shadow to canary?",
    )
    print(result.final_output)
    print(f"Answered by: {result.last_agent.name}")

asyncio.run(main())

If the orchestrator should keep ownership of the conversation and only call specialists for sub-tasks, use specialists as tools instead of handoffs.

When it helps

There's no supervisor bottleneck, agents are loosely coupled, and it's easy to add new specialists.

When it hurts

It's harder to debug because the flow is less explicit. Circular handoffs (A -> B -> A) are possible, so peer-to-peer flows need trace logging and hard turn limits.
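A hard turn limit with trace logging can be sketched without any framework. In the sketch below, the handoff tables and run_handoffs are hypothetical; a real runtime would call a model at each step instead of reading a dictionary.

```python
from typing import Optional

# Hypothetical handoff table: each agent names the peer it transfers to.
# An agent that is absent from the table answers directly.
CYCLIC = {"triage": "release", "release": "health", "health": "release"}
HEALTHY = {"triage": "release"}

def run_handoffs(
    start: str, handoffs: dict[str, str], max_turns: int = 6
) -> tuple[str, list[str]]:
    """Follow handoffs until an agent answers or the turn budget runs out."""
    trace = [start]  # record every agent that held the conversation
    current = start
    for _ in range(max_turns):
        nxt: Optional[str] = handoffs.get(current)
        if nxt is None:
            return ("answered", trace)
        trace.append(nxt)
        current = nxt
    return ("turn_limit_exceeded", trace)

print(run_handoffs("triage", HEALTHY))
print(run_handoffs("triage", CYCLIC))
```

The cyclic table never reaches an answering agent, so the run stops at the turn budget, and the trace shows the release and health agents passing the conversation back and forth.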

Map-reduce agents

This pattern uses a parallel fan-out for "embarrassingly parallel" tasks, followed by an aggregation step. It's ideal for tasks like checking a release from multiple perspectives (offline evals, canary metrics, rollback readiness). In the topology visual above, this is the fan-out shape: one router emits independent analyst tasks, and a reducer merges their findings into one release recommendation.

To implement a true map-reduce pattern in current LangGraph, the cleanest option is the Send API. It lets one node emit a variable number of parallel tasks, each with its own payload, and then merge the branch outputs through reducers [7].

map-reduce-agents.py

import operator
from typing import Annotated
from typing_extensions import TypedDict
from langgraph.graph import StateGraph, START, END
from langgraph.types import Send

class MapReduceState(TypedDict):
    items: list[str]
    findings: Annotated[list[str], operator.add]
    combined_report: str

class AnalystTask(TypedDict):
    item: str

def fan_out_node(state: MapReduceState) -> dict[str, object]:
    return {"items": ["eval_gate", "canary_health", "rollback_plan"]}

def analyst_node(state: AnalystTask) -> dict[str, object]:
    item = state["item"]
    return {"findings": [f"{item}: checked and ready"]}

def route_to_analysts(state: MapReduceState) -> list[Send]:
    return [Send("analyst", {"item": item}) for item in state["items"]]

def reducer_node(state: MapReduceState) -> dict[str, object]:
    return {"combined_report": "\n".join(state["findings"])}

builder = StateGraph(MapReduceState)
builder.add_node("fan_out", fan_out_node)
builder.add_node("analyst", analyst_node)
builder.add_node("reducer", reducer_node)

builder.add_edge(START, "fan_out")
builder.add_conditional_edges("fan_out", route_to_analysts, ["analyst"])
builder.add_edge("analyst", "reducer")
builder.add_edge("reducer", END)

graph = builder.compile()

result = graph.invoke({"items": [], "findings": [], "combined_report": ""})
print(result["combined_report"])
print("finding count:", len(result["findings"]))
print("contains canary health:", "canary_health" in result["combined_report"])

Output

eval_gate: checked and ready
canary_health: checked and ready
rollback_plan: checked and ready
finding count: 3
contains canary health: True

One practical detail: parallel branches should either write disjoint state keys or use reducers for shared keys such as Annotated[list[str], operator.add]. Otherwise the runtime has no safe way to merge their updates.

The reducer is also a trust boundary. It should admit only findings for the current tenant and request, with recorded evidence, before any formatter or decision node consumes them.

validate-fan-in-evidence.py

from dataclasses import dataclass

@dataclass(frozen=True)
class Finding:
    tenant_id: str
    claim: str
    source_ref: str
    verified: bool

def reduce_findings(findings: list[Finding], tenant_id: str) -> list[str]:
    accepted: list[str] = []
    for finding in findings:
        if finding.tenant_id != tenant_id:
            raise ValueError("tenant mismatch")
        if not finding.verified or not finding.source_ref:
            raise ValueError("missing verified evidence")
        accepted.append(finding.claim)
    return accepted

safe = [Finding("tenant-7", "canary error budget ok", "metrics:evt-81", True)]
foreign = [Finding("tenant-8", "promotion eligible", "evals:92", True)]

print("accepted:", reduce_findings(safe, "tenant-7"))
try:
    reduce_findings(safe + foreign, "tenant-7")
except ValueError as exc:
    print("blocked merge:", exc)

Output

accepted: ['canary error budget ok']
blocked merge: tenant mismatch

Reducers must also reject incompatible scalar writes. Two branches may append independent findings; they shouldn't silently choose different traffic targets or rollback decisions.

reject-conflicting-parallel-writes.py

def merge_updates(updates: list[dict[str, object]]) -> dict[str, object]:
    merged_findings: list[str] = []
    proposed_traffic: str | None = None
    for update in updates:
        merged_findings.extend(update.get("findings", []))
        if "proposed_traffic" in update:
            candidate = str(update["proposed_traffic"])
            if proposed_traffic is not None and candidate != proposed_traffic:
                raise ValueError("conflicting traffic proposals")
            proposed_traffic = candidate
    return {"findings": merged_findings, "proposed_traffic": proposed_traffic}

parallel_reads = [{"findings": ["eval verified"]}, {"findings": ["policy found"]}]
print("findings:", merge_updates(parallel_reads)["findings"])
try:
    merge_updates([{"proposed_traffic": "10%"}, {"proposed_traffic": "25%"}])
except ValueError as exc:
    print("blocked write merge:", exc)

Output

findings: ['eval verified', 'policy found']
blocked write merge: conflicting traffic proposals

Saving progress when agents crash

Checkpointing and persistence

In production, agents crash, APIs time out, and servers restart, so you can't rely on in-memory state. LangGraph's InMemorySaver is convenient for local development, but production deployments usually use a durable checkpointer backed by a database such as Postgres. The current docs use PostgresSaver.from_conn_string(...); one common pattern is wrapping it in a context manager so the runtime can persist each step and resume by thread_id after a crash [11].

checkpointing-and-persistence.py

from langgraph.checkpoint.postgres import PostgresSaver

DB_URI = "postgresql://postgres:postgres@localhost:5442/postgres?sslmode=disable"

with PostgresSaver.from_conn_string(DB_URI) as checkpointer:
    # Run once when initializing the checkpoint tables
    # checkpointer.setup()

    app = graph.compile(checkpointer=checkpointer)

    config = {"configurable": {"thread_id": "user-123-conversation-456"}}
    result = app.invoke(initial_state, config=config)

Production tip: Use a durable checkpointer for workflows that must survive restarts or pause for approval. Reusing the same thread_id lets the runtime resume from the latest checkpoint and inspect prior state snapshots during debugging.
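To make resuming by thread_id concrete, here is a minimal pure-Python sketch. It is not LangGraph's implementation: ToyCheckpointer, run_steps, and the two step functions are hypothetical names, and an in-memory dict stands in for a durable store such as Postgres.

```python
from typing import Callable

State = dict[str, object]
Step = Callable[[State], State]


class ToyCheckpointer:
    """Hypothetical stand-in for a durable store: thread_id -> (next step index, state)."""

    def __init__(self) -> None:
        self._saved: dict[str, tuple[int, State]] = {}

    def load(self, thread_id: str) -> tuple[int, State]:
        return self._saved.get(thread_id, (0, {}))

    def save(self, thread_id: str, next_index: int, state: State) -> None:
        self._saved[thread_id] = (next_index, dict(state))


def run_steps(steps: list[Step], checkpointer: ToyCheckpointer, thread_id: str) -> State:
    # Resume from the last completed step for this thread, not from the beginning.
    index, saved = checkpointer.load(thread_id)
    state = dict(saved)
    for i in range(index, len(steps)):
        state = {**state, **steps[i](state)}
        checkpointer.save(thread_id, i + 1, state)  # checkpoint after every completed step
    return state


calls: list[str] = []
crash_once = {"armed": True}


def fetch_release(state: State) -> State:
    calls.append("fetch_release")
    return {"release_ref": "releases/tenant-7/reranker-v17"}


def check_canary(state: State) -> State:
    calls.append("check_canary")
    if crash_once["armed"]:
        crash_once["armed"] = False
        raise TimeoutError("metrics API timed out")  # simulated crash
    return {"canary": "clean"}


steps = [fetch_release, check_canary]
store = ToyCheckpointer()
thread = "user-123-conversation-456"

try:
    run_steps(steps, store, thread)
except TimeoutError as exc:
    print("crashed:", exc)

final_state = run_steps(steps, store, thread)  # same thread_id resumes at check_canary
print("calls:", calls)
print("final state:", final_state)
```

The second run skips fetch_release because its result was already checkpointed, but check_canary starts again from its first line. That is why a step interrupted mid-way has to tolerate being run twice.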

One more durability rule matters in production: resumed runs replay from a checkpoint boundary instead of jumping back to the exact Python line where execution paused. That means side effects and non-deterministic work should either live in LangGraph tasks or be made idempotent [11].

The storage boundary, not the model, must enforce idempotency. If the execution node is replayed after a timeout, the same key retrieves the first result rather than applying a second traffic shift.

deduplicate-replayed-effects.py

processed_promotions: dict[str, str] = {}

def promote_release(idempotency_key: str, traffic_percent: int) -> str:
    if idempotency_key in processed_promotions:
        return f"reused {processed_promotions[idempotency_key]}"
    promotion_id = f"promotion-{len(processed_promotions) + 1}"
    processed_promotions[idempotency_key] = promotion_id
    return f"created {promotion_id} for {traffic_percent}% traffic"

key = "promotion:reranker-v17:25pct:v3"
print("first run:", promote_release(key, 25))
print("resumed run:", promote_release(key, 25))
print("promotions applied:", len(processed_promotions))

Output

first run: created promotion-1 for 25% traffic
resumed run: reused promotion-1
promotions applied: 1
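A replayed node can only reuse the first result if it computes the same key again. One option, sketched here under the assumption that the key layout matches the example above, is to derive the key from the fields that define the effect. The promotion_key helper is hypothetical; a real deployment service may define its own key format.

```python
def promotion_key(model_id: str, traffic_percent: int, release_version: int) -> str:
    # Same inputs always yield the same key, so a replay maps onto the first write.
    return f"promotion:{model_id}:{traffic_percent}pct:v{release_version}"


first = promotion_key("reranker-v17", 25, 3)
replayed = promotion_key("reranker-v17", 25, 3)
changed = promotion_key("reranker-v17", 50, 3)

print(first)
print("replay matches:", first == replayed)
print("changed traffic matches:", first == changed)
```

A different traffic percentage produces a different key, so a changed proposal is treated as a new effect rather than silently reusing the earlier promotion.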

Getting human approval before critical steps

Autonomous agents are rarely fully trusted in production. You need breakpoints where a human can approve, modify, or reject an agent's plan. A large traffic shift, safety-filter change, or rollback that affects a critical endpoint are all cases where the graph should pause and wait. Approval must apply to one exact proposed action, not to the conversation in general.

LangGraph supports this natively through interrupt(), backed by a checkpointer [12]. Static interrupt_before / interrupt_after breakpoints still exist, but the current docs position them as debugging aids rather than production approval flows. In a production HITL flow, the node emits a structured approval request, the caller sees that payload under __interrupt__, and the graph resumes with Command(resume=...).

getting-human-approval-before-critical-steps.py

import hashlib
import json
from typing import Literal
from langgraph.types import Command, interrupt

def action_digest(action: dict[str, object]) -> str:
    encoded = json.dumps(action, sort_keys=True, separators=(",", ":")).encode()
    return hashlib.sha256(encoded).hexdigest()

def approval_node(state: dict[str, object]) -> Command[Literal["execute", "cancel"]]:
    action = state["proposed_action"]
    digest = action_digest(action)
    decision = interrupt({
        "question": "Approve this promotion?",
        "action": action,
        "action_digest": digest,
        "release_version": state["release_version"],
        "idempotency_key": state["idempotency_key"],
        "evidence_refs": state["evidence_refs"],
    })
    approved = (
        decision.get("approved") is True
        and decision.get("action_digest") == digest
        and decision.get("release_version") == state["release_version"]
    )
    return Command(goto="execute" if approved else "cancel")

app = graph.compile(checkpointer=checkpointer)
config = {"configurable": {"thread_id": "promotion-review-reranker-v17"}}

# First call pauses and exposes the interrupt payload to the caller
paused = app.invoke(inputs, config=config)
print(paused["__interrupt__"])

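# The UI resumes with the exact digest and release version it displayed.
# __interrupt__ holds Interrupt objects; .value is the payload passed to interrupt().
request = paused["__interrupt__"][0].value
decision = {
    "approved": True,
    "action_digest": request["action_digest"],
    "release_version": request["release_version"],
}
app.invoke(Command(resume=decision), config=config)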

The downstream execute node must re-read the authoritative release version immediately before writing, require another review if it changed, and send the same idempotency key to the deployment service. One subtle runtime detail matters a lot in production: after resuming, LangGraph restarts the node from the top rather than continuing from the exact line of the interrupt(). Any side effects before the pause need to be idempotent or moved after the interrupt.

bind-approval-to-one-effect.py

import hashlib
import json

def proposal(traffic_percent: int, release_version: int) -> dict[str, object]:
    action = {
        "kind": "promotion",
        "model_id": "reranker-v17",
        "traffic_percent": traffic_percent,
    }
    digest = hashlib.sha256(json.dumps(action, sort_keys=True).encode()).hexdigest()
    return {"action": action, "digest": digest, "release_version": release_version}

def can_execute(packet: dict[str, object], decision: dict[str, object], live_version: int) -> bool:
    return (
        decision.get("approved") is True
        and decision.get("digest") == packet["digest"]
        and decision.get("release_version") == packet["release_version"] == live_version
    )

reviewed = proposal(25, 3)
decision = {"approved": True, "digest": reviewed["digest"], "release_version": 3}
changed_traffic = proposal(50, 3)

print("reviewed action admitted:", can_execute(reviewed, decision, live_version=3))
print("changed traffic admitted:", can_execute(changed_traffic, decision, live_version=3))
print("stale release admitted:", can_execute(reviewed, decision, live_version=4))

Output

reviewed action admitted: True
changed traffic admitted: False
stale release admitted: False

How agents share memory

Figure: Three multi-agent communication patterns side by side: a shared ledger where agents read and append state, a handoff chain where ownership moves agent by agent, and a supervisor hub that routes search, code, and review workers. Pick the communication pattern by ownership: shared ledger for state traceability, handoff chain for one current task owner, and supervisor hub for routing specialist workers.

When orchestrating multiple agents, the choice of coordination contract shapes the whole architecture. In practice, three common patterns appear repeatedly: typed shared state, message-oriented coordination, and supervisor-based routing.

In a shared state model (used by LangGraph), a single typed object acts as a whiteboard. Every node reads from and writes to this schema. That makes intermediate variables explicit: if an early node fetches a release record, a node deep in the graph can still access it without repackaging it into a chat message. The downside is that the schema needs discipline. As the graph grows, the state can become bloated, and careless updates can overwrite important fields.

In a message-oriented model (common in AutoGen AgentChat teams), the conversation transcript becomes the main contract between agents. Agents coordinate by reading prior messages and producing new ones, which feels natural for delegation and review loops. The trade-offs are weaker typing, a transcript that can grow quickly, and downstream agents that know only what the conversation history includes.

In a supervisor model, a central router or manager decides which specialist agent should act next and tracks overall progress. Use this when you want explicit control over delegation and completion criteria, but expect a central bottleneck and another policy layer to debug.

Approach	Framework	Pros	Cons
Shared State	LangGraph	Typed contracts, easy inspection of intermediates, clean checkpointing.	Schema discipline required, state can bloat over time.
Message-Oriented Teams	AutoGen	Natural delegation loops, flexible roles, transcript is easy to inspect.	Weaker typing, prompt history grows, context only exists if the transcript carries it.
Supervisor Routing	CrewAI / custom orchestrators	Clear delegation control, centralized progress tracking, easy policy insertion.	Coordinator can become a latency bottleneck and single point of failure.

Talking to agents outside your system

Shared state and message passing solve coordination inside one runtime. Production systems often need something else: interoperability between independent systems. Two standards matter here, but they solve different problems.

MCP: tools and context access

MCP (Model Context Protocol) [13], introduced by Anthropic in late 2024, standardizes how an assistant connects to tools, resources, and prompts exposed by an MCP server. It's best thought of as a tool and context protocol, not a full remote-agent delegation protocol.
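For a sense of what a tool call looks like on the wire, the sketch below builds and parses the JSON-RPC 2.0 messages that MCP uses for tools/call. No transport or SDK is involved. The tool name get_release_status, its arguments, and the sample response are made up for illustration, and the message shape should be checked against the current specification before relying on it.

```python
import json


def build_tool_call(request_id: int, tool_name: str, arguments: dict[str, object]) -> str:
    # MCP messages are JSON-RPC 2.0; "tools/call" invokes a tool by name.
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    })


def text_from_result(raw_response: str) -> str:
    # Tool results carry a list of content items; this sketch reads only text items.
    result = json.loads(raw_response)["result"]
    if result.get("isError"):
        raise RuntimeError("tool reported an error")
    return "".join(item["text"] for item in result["content"] if item["type"] == "text")


request = build_tool_call(1, "get_release_status", {"release_ref": "rel-7421"})

# Hand-written sample response; a real server's reply depends on the tool.
sample_response = json.dumps({
    "jsonrpc": "2.0",
    "id": 1,
    "result": {"content": [{"type": "text", "text": "canary_clean"}], "isError": False},
})

print(json.loads(request)["method"])
print(text_from_result(sample_response))
```

The exchange is one request and one result, which suits tool and context access but says nothing about long-running work or status updates.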

A2A: remote agent delegation

A2A (Agent2Agent Protocol) [14][15], announced by Google in April 2025 and now developed as an open standard by a broader protocol community, enables direct communication between autonomous agents from different providers. A2A focuses on task delegation, status updates, and result sharing between independent agent systems. This lets one vendor's planner agent delegate work to another vendor's specialist agent without flattening that specialist into a single tool call.

Why the distinction matters

The two protocols solve different layers of the agent interoperability problem. MCP is a context and capability protocol: it lets one host discover and invoke tools, read resources, and reuse prompts from compliant local or remote servers. A2A is a remote delegation protocol: it lets one autonomous agent hand off a complete sub-task to another independent agent (possibly from a different company), poll for progress, and receive structured results or intermediate state.

Without this distinction, teams often end up wrapping every remote capability as an ad-hoc tool call and then wonder why long-running workflows, approvals, and status updates become awkward to implement.

Concern	MCP (tool/context protocol)	A2A (agent delegation protocol)
Tool or data access	Strong fit (native)	Not the primary goal
Remote specialist agent	Usually wrapped as a tool call	Strong fit (first-class task handoff)
Long-running task status	Tool-specific custom polling logic	Built into the protocol (task objects + updates)
Cross-vendor interoperability	Tool and data format interoperability	Full agent-to-agent task delegation
Human-in-the-loop integration	Handled by the calling host	Can be modeled as explicit review tasks

In practice, many production systems use both layers together: MCP to give each agent detailed local tools and context, and A2A when one agent's planner needs to delegate work to a separate specialist agent running in another environment or organization. The references at the end link to the current protocol specs and community discussions.

AutoGen's conversation teams

Microsoft's AutoGen [1] popularized a conversation-based multi-agent pattern where agents communicate through messages. That design space overlaps with work like ChatDev [16], where agents role-play as software developers in a collaborative environment. In current AgentChat releases, chat-style teams such as SelectorGroupChat, RoundRobinGroupChat, and Swarm coexist with directed workflows through GraphFlow [17].

For these team abstractions, the main contract is still the shared conversation history plus team policies such as speaker selection and termination conditions. The snippet below uses SelectorGroupChat, which is the clearest example of message-oriented coordination.

This snippet requires autogen-agentchat, autogen-ext[openai], and OpenAI credentials. It uses GPT-5.6 Luna as a cost-sensitive selector model. AutoGen accepts explicit model_info when its installed release doesn't yet recognize a new OpenAI model ID, which keeps capability checks separate from framework release cadence [18][19]. It constructs a team; running it calls the configured model.

autogens-conversation-teams.py

from autogen_agentchat.agents import AssistantAgent
from autogen_agentchat.conditions import MaxMessageTermination, TextMentionTermination
from autogen_agentchat.teams import SelectorGroupChat
from autogen_ext.models.openai import OpenAIChatCompletionClient

model_client = OpenAIChatCompletionClient(
    model="gpt-5.6-luna",
    model_info={
        "vision": True,
        "function_calling": True,
        "json_output": True,
        "family": "gpt-5",
        "structured_output": True,
    },
)

# Define specialized agents
researcher = AssistantAgent(
    "Researcher",
    description="Finds relevant facts and sources",
    model_client=model_client,
    system_message="You research topics using web search.",
)

analyst = AssistantAgent(
    "Analyst",
    description="Synthesizes research into a concise answer",
    model_client=model_client,
    system_message="You analyze research findings and end with TERMINATE when done.",
)

termination = TextMentionTermination("TERMINATE") | MaxMessageTermination(max_messages=12)

# SelectorGroupChat uses the model to pick the next speaker
team = SelectorGroupChat(
    [researcher, analyst],
    model_client=model_client,
    termination_condition=termination,
)

If you need explicit graph routing inside AutoGen instead of model-selected turn-taking, use GraphFlow rather than a selector-based team.

Picking the right framework

The right orchestration framework depends on the use case. LangGraph, AutoGen, and MetaGPT all support multi-agent patterns, but they prioritize different approaches: LangGraph enforces strict graph-based execution, AutoGen spans conversation-centric teams and GraphFlow workflows, and MetaGPT structures agents around rigid software engineering roles.

Feature	LangGraph	AutoGen	MetaGPT
Control Flow	Explicit Graph / State Machine	Conversation Teams or directed `GraphFlow`	SOPs (Standard Operating Procedures)
State	Typed Shared State	Shared Conversation History, optionally constrained by flow edges	Shared Environment / Artifacts
Best For	Structured, engineering-heavy workflows	Conversation teams, handoffs, or graph workflows inside AgentChat	Software dev, strictly defined roles
Complexity	High (Write code for every node)	Medium to High (team presets plus GraphFlow)	High (Class-based definitions)

Recommendation

Choose LangGraph when you need strict control over routing, retries, checkpoints, and approvals. AutoGen fits conversation-centric team abstractions, especially when GraphFlow gives you enough structure without dropping to a lower-level state machine. MetaGPT fits rigid software roles and SOP-driven artifact generation.

Where these systems show up in production

Multi-agent orchestration is useful when it separates independent reads and keeps external writes behind one controlled execution boundary. These are three plausible production designs, not claims about deployed performance.

Model-release gate

A coordinator routes independent eval, canary-health, and rollback-readiness checks in parallel, then merges their source-backed findings. A permitted deployment service, not a free-form specialist, shifts traffic only after evidence and policy checks satisfy release rules. The graph can draft a release note or send an exception to review without giving every worker write authority.
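A rough sketch of that shape, using only the standard library: read-only checks fan out in parallel, and a single gate function decides whether the one write boundary may be called. The check functions, their hard-coded results, and the source_ref values are placeholders; in a real system each would query its own store.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

Finding = dict[str, object]


def eval_gate() -> Finding:
    return {"check": "eval_gate", "passed": True, "source_ref": "evals:92"}


def canary_health() -> Finding:
    return {"check": "canary_health", "passed": True, "source_ref": "metrics:evt-81"}


def rollback_plan() -> Finding:
    return {"check": "rollback_plan", "passed": False, "source_ref": "runbook:rb-3"}


def run_checks(checks: list[Callable[[], Finding]]) -> list[Finding]:
    # Independent read-only checks run in parallel; map() preserves input order.
    with ThreadPoolExecutor(max_workers=len(checks)) as pool:
        return list(pool.map(lambda check: check(), checks))


def gate(findings: list[Finding]) -> str:
    # Only this function decides whether the single write boundary may be called.
    if any(not finding["source_ref"] for finding in findings):
        return "reject: missing evidence"
    failed = [str(finding["check"]) for finding in findings if not finding["passed"]]
    if failed:
        return "send to review: " + ", ".join(failed)
    return "call deployment service"


print(gate(run_checks([eval_gate, canary_health])))
print(gate(run_checks([eval_gate, canary_health, rollback_plan])))
```

None of the workers can shift traffic. They only return findings, and the gate output names the next step rather than performing it.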

Incident triage

When an alert fires, independent branches can retrieve recent deploys, error-budget burn, and runbook guidance scoped to the service and tenant. A reducer validates sources and produces a proposed action. If policy requires approval, the reviewer sees the exact action, evidence references, and action digest; an incident service publishes the update only after revalidation with an idempotency key.

Evaluation regression handling

When a nightly eval reports a regression, a triage node classifies the failure. Metric analysis can produce a rollback proposal, while a communication branch drafts the stakeholder update. Rolling back or disabling a route remains an authorized action bound to verified deployment state. After a bounded number of failed proposals, the graph escalates with evidence instead of continuing indefinitely.

What to check before moving on

Check that you can defend these design choices in a system review or interview:

Skill	What good looks like
When to go multi-agent	Argue, with the Cognition and Anthropic positions, why multi-agent fits separable parallel work but hurts tightly coupled tasks given the much larger token budget.
Chain limits	Explain why linear agent chains fail for complex workflows with conditional routing, parallel work, or cycles.
Graph shape	Distinguish DAG-style workflows from cyclic state-machine orchestration.
LangGraph implementation	Build a `StateGraph` with typed state, nodes, reducers, fixed edges, and conditional edges.
Team topology	Compare supervisor, hierarchical, swarm, and map-reduce patterns by control, latency, and debuggability.
Coordination contract	Explain typed shared state versus message-oriented coordination.
Protocol boundary	Distinguish MCP tool/context access from A2A remote-agent delegation.
Production control	Add checkpointing, termination guards, evidence validation, and approvals bound to exact risky actions.

Follow-up questions

How do you handle infinite loops in a cyclic agent graph? Once you allow retries or review loops, you're no longer dealing with a strict DAG. LangGraph can model cyclic graphs and state machines, but production systems need hard termination through recursion_limit, step_count, max_turns, max retries, deadlines, or a forced transition to failure handling or human review.

When would you choose a supervisor pattern over a flat swarm pattern? Choose a supervisor when you need centralized control, strict process adherence, policy insertion, or arbitration between workers with overlapping capabilities. A swarm or peer-to-peer handoff pattern fits loosely coupled specialists, but it's harder to debug because routing decisions are decentralized.

Why is typed shared state often easier than message passing for structured workflows? It isn't universally better, but it gives downstream nodes reliable fields such as deployment_data, approval_status, policy_result, and error_count without asking them to infer state from a growing transcript. Message-oriented teams can work well, but they need stronger serialization, summaries, and termination policies as history grows.

How does human-in-the-loop approval fit into graph architecture? Model it as an explicit approval node. In LangGraph, that node can call interrupt() to emit a structured review payload and pause execution through a checkpointer. The payload should contain the proposed action digest, evidence, source version, and idempotency key. After a human decision resumes the workflow with Command(resume=...), the execution node still revalidates authoritative state before writing.

Common pitfalls

Symptom	Cause	Fix
Multi-agent prototype is slower and worse than one good prompt.	Graph was split into too many tiny agents with no real parallelism or separation of concerns.	Collapse adjacent steps, keep one agent for tightly coupled reasoning, and add workers only where roles or tools truly differ.
Routing burns tokens before useful work starts.	Every handoff or branch decision uses a large model, even for rules you can write down.	Use deterministic routing for explicit rules, and reserve model-based routing for real language ambiguity.
Retry loop never exits cleanly.	Cyclic graph has no hard guard such as `recursion_limit`, retry budget, or deadline.	Add termination counters in state and force failure handling or human review when the limit is hit.
State grows until every node becomes expensive.	Large records, raw tool payloads, or full transcripts are copied into shared state.	Store big objects by ID, summarize long histories, and keep only fields downstream nodes read.
Downstream worker breaks after an upstream wording change.	Agents are handing off text instead of typed state fields or schema-constrained outputs.	Put structured contracts between workers: typed state keys, validated JSON, or reducer-backed message formats.
Approved promotion changes before execution.	Approval was recorded as a Boolean instead of being bound to an action digest and source version.	Revalidate the exact proposal at execution time and issue the write through an idempotency key.
Remote integrations are awkward and brittle.	MCP tool access and A2A delegation were treated as interchangeable.	Use MCP for local tool and context access, and A2A when one independent agent system must delegate to another.

Infinite loops and how to stop them

If you allow cyclic graphs (for example, "if the release lookup fails, try again with a different query"), you must enforce termination. Common guards include:

A recursion_limit in the graph invoke or stream config.
A step_count or error_count field in state that increments each retry.
A max_turns guard in conversation loops that forces a transition to failure handling or human review.

Without one of these, two agents can pass a task back and forth forever.

bound-cyclic-retries.py

def lookup_with_budget(statuses: list[str], max_attempts: int) -> str:
    for attempt, status in enumerate(statuses[:max_attempts], start=1):
        print(f"attempt {attempt}: {status}")
        if status == "verified":
            return "continue to formatter"
    return "escalate for review"

outcome = lookup_with_budget(["not_found", "timeout", "not_found"], max_attempts=2)
print("outcome:", outcome)

Output

attempt 1: not_found
attempt 2: timeout
outcome: escalate for review
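The same budget can live in graph state instead of a loop variable, which matches the error_count guard listed above. This is a framework-free sketch: LookupState, the node names, and MAX_ERRORS are assumptions, and in LangGraph the routing function would typically be attached as a conditional edge.

```python
from typing import TypedDict


class LookupState(TypedDict):
    status: str
    error_count: int


MAX_ERRORS = 2  # assumed retry budget


def record_failure(state: LookupState) -> LookupState:
    # Each failed attempt increments the counter carried in state.
    return {"status": state["status"], "error_count": state["error_count"] + 1}


def route_after_lookup(state: LookupState) -> str:
    if state["status"] == "verified":
        return "formatter"
    if state["error_count"] >= MAX_ERRORS:
        return "human_review"  # forced exit from the cycle
    return "retry_lookup"


state: LookupState = {"status": "not_found", "error_count": 0}
route = route_after_lookup(state)
hops = [route]
while route == "retry_lookup":
    state = record_failure(state)
    route = route_after_lookup(state)
    hops.append(route)

print(hops)
```

Because the counter is part of state, it survives checkpoints and resumes, so a crash in the middle of the cycle doesn't reset the budget.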

Context bloat and how to avoid it

A frequent failure mode is passing the entire message history to every agent. If the conversation has twenty turns, every node sees all twenty even when it only needs the latest release ID. Fixes include:

Using a summary node that compresses long histories into a short context string before routing.
Storing large objects (release records, metric snapshots) by reference ID rather than embedding the full JSON in state.
Keeping the messages list for the transcript but adding dedicated state keys (like deployment_data) for data that downstream nodes need directly.

The state can expose a minimal verified projection while keeping sensitive record content behind the application authorization boundary:

keep-sensitive-records-out-of-state.py

release_store = {
    "releases/tenant-7/reranker-v17": {
        "owner_email": "[email protected]",
        "status": "canary_clean",
        "release_id": "rel-7421",
    }
}

def project_for_graph(release_ref: str, tenant_id: str) -> dict[str, str]:
    if not release_ref.startswith(f"releases/{tenant_id}/"):
        raise PermissionError("cross-tenant reference")
    record = release_store[release_ref]
    return {"release_ref": release_ref, "verified_status": record["status"]}

state_update = project_for_graph("releases/tenant-7/reranker-v17", "tenant-7")
print("state update:", state_update)
print("contains owner email:", "owner_email" in state_update)

Output

state update: {'release_ref': 'releases/tenant-7/reranker-v17', 'verified_status': 'canary_clean'}
contains owner email: False

Non-deterministic routing

Using an LLM to decide the next step when a simple if/else or regex would suffice increases cost and reduces reliability. Use deterministic routing for rules you can write down, and use LLM-based routing only when the classification genuinely requires language understanding.
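One way to apply this is a rules-first router that falls through to a model only when no rule matches. In the sketch below the route names, the rel-<digits> ID pattern, and fake_classifier are illustrative assumptions; fake_classifier stands in for a model call so the example runs offline.

```python
import re
from typing import Callable

RELEASE_ID = re.compile(r"\brel-\d+\b")


def route(request: str, classify_with_model: Callable[[str], str]) -> str:
    text = request.lower()
    # Rules that can be written down run first and cost no tokens.
    if RELEASE_ID.search(text):
        return "release_lookup"
    if "rollback" in text or "roll back" in text:
        return "rollback_review"
    # Only ambiguous requests reach the model-backed classifier.
    return classify_with_model(request)


model_calls: list[str] = []


def fake_classifier(request: str) -> str:
    # Stand-in for a model call; records how often the fallback is used.
    model_calls.append(request)
    return "general_triage"


print(route("What is the status of rel-7421?", fake_classifier))
print(route("Should we roll back the reranker?", fake_classifier))
print(route("Something feels off with search quality", fake_classifier))
print("model calls:", len(model_calls))
```

Only the third request reaches the classifier, so two of the three routing decisions are free and fully reproducible.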

Try it yourself

To check your understanding, try this small design exercise:

Scenario: An on-call engineer asks, "Can we promote reranker-v17 to 25% traffic, and did the canary burn error budget in the last hour?"

Task: Sketch a LangGraph StateGraph with at least three nodes and one conditional edge. Define the TypedDict state, name the nodes, and write the routing logic. Then answer these questions:

Which parts of this request can run in parallel, and which must run in sequence?
Where would you place an interrupt() if the traffic shift exceeds 10%?
What state keys would you need so the response formatter can compose a complete reply without copying the whole release record into graph state?
What must the deployment boundary verify before it executes an approved promotion?

Solution sketch

Parallel: Eval verification and canary-health lookup can run at the same time.
Sequence: The response formatter must run after both parallel branches finish, because it composes the reply from the eval result and the canary error-budget finding.
Interrupt: Place it after an action proposal contains the traffic percentage and evidence, but before any deployment write. Bind approval to an action digest and source version.
State keys: release_ref, minimal verified deployment fields, policy_result, operator_request, proposed_action, approval_status, and final_answer.
Execution check: Re-read authoritative deployment state, reject stale or changed proposals, and use an idempotency key for the traffic shift.
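The state and routing parts of this solution might look like the following framework-free sketch. PromotionState, the boolean evidence fields, and the node names are one possible choice rather than the only valid answer; in LangGraph the routing function would back the conditional edge and the approval node would call interrupt().

```python
from typing import TypedDict


class PromotionState(TypedDict, total=False):
    operator_request: str
    release_ref: str            # reference only, not the full release record
    eval_passed: bool
    canary_budget_ok: bool
    policy_result: str
    proposed_action: dict[str, object]
    approval_status: str
    final_answer: str


APPROVAL_THRESHOLD_PERCENT = 10  # from the exercise: shifts above 10% need review


def route_after_proposal(state: PromotionState) -> str:
    # Conditional edge: evidence first, then the approval threshold.
    if not (state.get("eval_passed") and state.get("canary_budget_ok")):
        return "formatter"      # explain why promotion is blocked
    traffic = int(state["proposed_action"]["traffic_percent"])
    if traffic > APPROVAL_THRESHOLD_PERCENT:
        return "approval"       # this node would pause for a human decision
    return "execute"


base: PromotionState = {
    "release_ref": "releases/tenant-7/reranker-v17",
    "eval_passed": True,
    "canary_budget_ok": True,
    "proposed_action": {"kind": "promotion", "traffic_percent": 25},
}
small_shift: PromotionState = {**base, "proposed_action": {"kind": "promotion", "traffic_percent": 5}}
burned_budget: PromotionState = {**base, "canary_budget_ok": False}

print(route_after_proposal(base))
print(route_after_proposal(small_shift))
print(route_after_proposal(burned_budget))
```

The 25% request from the scenario routes to approval, a 5% shift would go straight to execute, and missing evidence skips both and goes to the formatter.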

Orchestration admission rules

Multi-agent orchestration is a cost trade, not a default. Use it only when subtasks are separable and parallel and the task value justifies a much larger token budget. Anthropic reported about 15x chat-token usage for its Research system; don't treat that measurement as a universal multiplier. For tightly coupled work, a single agent with strong context engineering wins.
Explicit graphs are the primitive. DAGs handle one-way workflows, while state machines add controlled cycles for retries and approvals.
State is the API. In LangGraph, the State schema defines the contract between agents. Keep it strict, typed, tenant-scoped, and limited to necessary evidence or references.
Topology follows workflow shape. Use Supervisor for centralized routing, Swarm for peer-to-peer handoff, and Map-Reduce for parallel branches with a merge contract.
Use durable checkpointers for failures and approvals, and make replayed external writes idempotent.
Prefer deterministic structural nodes where rules are explicit; reserve model calls for branches that need language understanding.
MCP exposes tools and context into a runtime; A2A connects independent agent systems across vendor boundaries.

The Advanced Agents and Retrieval section ends with a concrete orchestration stack: specialized agents wired into explicit graphs, workflow-shaped topology, typed shared state, durable progress, bounded approvals for risky actions, and a clear admission check for when not to build a multi-agent system at all.

Next Step

Continue to Capstone: Production Agent

You can now choose between one bounded agent, recursive sub-calls, and coordinated workers. The capstone combines those architecture decisions with cited evidence, classifier admission, approval-gated actions, durable state, recovery, and trajectory release tests.

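References

[1] Wu, Q., et al. AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation. 2023. https://arxiv.org/abs/2308.08155
[2] Hong, S., et al. MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework. 2023.
[3] Cognition Labs. Introducing Devin, the first AI software engineer. 2024.
[4] Yan, W. (Cognition). Don't Build Multi-Agents. 2025.
[5] Yan, W. (Cognition). Multi-Agents: What's Actually Working. 2026.
[6] Hadfield, J., Zhang, B., Lien, K., et al. (Anthropic). How we built our multi-agent research system. 2025.
[7] LangChain. LangGraph Graph API. 2026.
[8] OpenAI. Swarm: Educational Framework for Multi-Agent Orchestration. 2024.
[9] OpenAI. OpenAI Agents SDK. 2025.
[10] OpenAI. OpenAI Agents SDK Handoffs. 2026.
[11] LangChain. LangGraph Persistence. 2026. https://docs.langchain.com/oss/python/langgraph/persistence
[12] LangChain. LangGraph Interrupts. 2024. https://docs.langchain.com/oss/python/langgraph/interrupts
[13] Anthropic. Introducing the Model Context Protocol. 2024. https://www.anthropic.com/news/model-context-protocol
[14] A2A Project. A2A Protocol Specification. 2025. https://a2a-protocol.org/latest/specification/
[15] A2A Protocol Community. A2A Protocol Ships v1.0: Production-Ready Standard for Agent-to-Agent Communication. 2026. https://a2a-protocol.org/latest/announcing-1.0/
[16] Qian, C., et al. Communicative Agents for Software Development. 2023. https://arxiv.org/abs/2307.07924
[17] Microsoft AutoGen. GraphFlow (Workflows). 2026. https://microsoft.github.io/autogen/stable/user-guide/agentchat-user-guide/graph-flow.html
[18] OpenAI. GPT-5.6 Luna Model. 2026. https://developers.openai.com/api/docs/models/gpt-5.6-luna
[19] Microsoft AutoGen. Model Clients. 2026. https://microsoft.github.io/autogen/stable/user-guide/core-user-guide/components/model-clients.html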
