Build browser and desktop agents whose proposed clicks and keystrokes remain behind host policy, approval, verification, and sandbox controls.
Code-generation agents need sandboxes because generated commands can affect files, networks, and secrets. Computer-use agents add a different risk surface: they can propose interactions with a visible browser or desktop. Extend sandboxing discipline to CI dashboards, internal applications, and workflows where a model sees pixels and proposes mouse or keyboard actions.
You don't need to study image encoders in detail before starting. The practical requirement is simpler: understand that the agent grounds actions from screenshots, text, and accessibility metadata, then let the host wrap that perception loop in policy and sandbox controls.
An infrastructure team receives an incident: "Why did deploy run RUN-78432 fail after the test stage?" Assume the legacy CI dashboard exposes the failing step, screenshot, and redacted log excerpt in a browser portal but doesn't provide a supported API for that evidence. A read-only agent pilot could open the permitted portal session, locate the failing job and traceback, and prepare evidence for a reviewer.
If stable API or selector-based automation works, use it first. Computer use becomes relevant when an approved workflow depends on visual evidence or a legacy interface that deterministic automation can't reliably address. Even then, the model should propose steps while a host process controls credentials, action policy, and any record mutation.
Computer-use tools let models receive screenshots, and sometimes supporting browser state or accessibility metadata, then propose input events such as moving a virtual mouse, clicking, typing a run ID, scrolling, or requesting another screenshot.[1] The host decides whether and where a proposed event executes.
This interface can cover permitted legacy tools or visual portal steps with no suitable API. It also creates risk: a mis-grounded click, an overbroad session, or a leaked credential can change external state. Treat computer use as delegated access, not as a convenience wrapper around screenshots.
Ordinary function calling gives a model a curated set of typed operations that the developer can separately authorize. Typed doesn't automatically mean safe, but it creates a narrow action surface. Computer use exposes a broader action surface: from a screenshot, sometimes augmented with an accessibility tree or optical character recognition (OCR) output, the model proposes primitive input events.
The shift changes the threat model. A function-calling agent can only call the tools you exposed. A computer-use agent can, in principle, open any application visible on the virtual desktop, click any pixel, and type into any field. That reach makes the pattern useful for the long tail of workflows that have no API, and explains why every production deployment must enforce a permission boundary with multiple independent layers of control.
Computer-use agents follow a sense-plan-act cycle. The host captures the screen, the model proposes the next low-level action, the host validates and possibly executes that action, and the next observation returns to the model.
The loop repeats until the host verifies completion, a limit expires, or a safety boundary intervenes. Each additional observation and model decision consumes latency and model usage, which is why bounded trajectories matter.
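The stop conditions above can be sketched as a host-side loop with explicit budgets. This is a minimal illustration, not a provider integration: `Budget`, `run_bounded`, and the fixture action names are hypothetical, and appending to a list stands in for validation, execution, and re-observation.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterator

@dataclass(frozen=True)
class Budget:
    max_steps: int
    max_seconds: float

def run_bounded(
    proposals: Iterator[str],
    goal_verified: Callable[[list[str]], bool],
    budget: Budget,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Run a propose/execute loop until the host verifies the goal or a limit expires."""
    started = clock()
    executed: list[str] = []
    for step, action in enumerate(proposals, start=1):
        if step > budget.max_steps:
            return f"stopped: step limit after {len(executed)} actions"
        if clock() - started > budget.max_seconds:
            return f"stopped: time limit after {len(executed)} actions"
        executed.append(action)  # stand-in for validate + execute + re-observe
        if goal_verified(executed):  # the host, not the model, decides completion
            return f"done: host verified goal after {len(executed)} actions"
    return "stopped: model ran out of proposals before verification"

# Fixture proposals; a real host would receive these from the model one at a time.
fixture = ["open_run_page", "expand_failed_step", "capture_traceback_reference"]
verified = lambda done: "capture_traceback_reference" in done

print(run_bounded(iter(fixture), verified, Budget(max_steps=5, max_seconds=30.0)))
print(run_bounded(iter(fixture), verified, Budget(max_steps=2, max_seconds=30.0)))
```

With these fixtures the first call reports a verified goal after three actions and the second stops at the step limit after two. Passing the clock as a parameter keeps the time limit testable without sleeping.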
The main illustration below shows the architecture for the running CI incident example. An incident enters at the top, then the agent captures an observation, the model proposes an action, and the host validates and executes only permitted steps inside a restricted session. Policy, approval, audit evidence, and the runtime boundary surround the loop.
The CI dashboard runs inside a restricted browser session because the task depends on visual evidence rather than an available API. A policy engine sits between the model's proposal and execution. Any proposed incident update or deploy action is routed to a reviewer with the relevant observation and intended change. Audit records should capture policy evidence without copying secrets or more log data than retention policy permits.
A host can make that split concrete before integrating a model. In this miniature policy, reading a permitted CI page can execute, while mutating incident state pauses for approval.
```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Action:
    domain: str
    kind: str
    target: str

ALLOWED_DOMAINS = {"ci.example", "incidents.example"}
WRITE_ACTIONS = {"update_incident", "rerun_deploy", "submit_form"}

def route(action: Action) -> str:
    if action.domain not in ALLOWED_DOMAINS:
        return "blocked: domain outside task scope"
    if action.kind in WRITE_ACTIONS:
        return "pause: reviewer approval required"
    return "execute read-only step"

for proposal in [
    Action("ci.example", "view_log", "RUN-78432"),
    Action("incidents.example", "update_incident", "INC-2048"),
    Action("mail.example", "open", "inbox"),
]:
    print(route(proposal))
```

Output:

```
execute read-only step
pause: reviewer approval required
blocked: domain outside task scope
```

The same pattern becomes easier to reason about when you strip away the story details and look at the reusable production architecture.
To make the loop tangible, walk through the CI failure inspection task step by step using proposed GUI actions.
Step 1. The host opens a fresh restricted browser profile and establishes an authorized CI session through a credential broker. The model doesn't receive the password or propose typing it. Once authenticated, the host captures the permitted dashboard view and relevant grounding metadata.
Step 2. After the login succeeds, the host captures the next screenshot showing the run search form. The model outputs type("RUN-78432") into the run field and a left_click on the search button.
Step 3. The results page loads with failing test output visible. After seeing the log region, the model outputs a mouse_move over the failed step, a left_click to expand it, and a scroll action to bring the traceback into view. The host captures the new screenshot and returns it.
Step 4. The model reads visible evidence and proposes preparing a draft incident note with the failing command, traceback line, and artifact link. The host policy permits extracting that evidence but doesn't allow the model to rerun deploys or edit incident severity from the page alone.
Step 5. The host presents the gathered evidence to a reviewer. If the reviewer approves a specified incident update, a separately authorized write records the note and status. The host verifies the intended incident changed and that unrelated incidents did not change before tearing down the session.
At each step the policy engine inspected the proposal. Navigation and evidence collection stayed read-only; the incident write required approval. The audit record stores action proposals, policy outcomes, relevant observation identifiers, approval, and final verification, with secrets and unnecessary raw log content excluded.
Don't represent authentication as text the model should type. Represent it as a host-owned setup step that returns only a session outcome.
```python
def establish_session(task: str, credential_handle: str) -> dict[str, str]:
    assert credential_handle.startswith("broker://")
    return {"task": task, "session": "authenticated", "visible_secret": "none"}

model_context = {"task": "inspect CI failure evidence", "run_id": "RUN-78432"}
session = establish_session(model_context["task"], "broker://ci-readonly")

print("model fields:", sorted(model_context))
print("session:", session["session"])
print("password exposed:", session["visible_secret"] != "none")
```

Output:

```
model fields: ['run_id', 'task']
session: authenticated
password exposed: False
```

The hardest part of the observe step is turning raw pixels into reliable intent. A screenshot is only a grid of colored pixels. The model must map regions of that grid to UI elements, understand which element corresponds to "the failed test details for this run," and decide the next physical action that will advance the goal.
Pure coordinate-based clicking is brittle. If the CI dashboard runs an A/B test that moves the "Failed tests" tab twenty pixels down, a different viewport collapses the sidebar, or a responsive layout kicks in on a slightly narrower window, coordinates that worked on a previous run become invalid. The model then clicks empty space and the loop stalls.
That's why a browser-agent host can augment screenshots with an accessibility tree, OCR, or element metadata. The model can then refer to role, name, or visible text instead of only fragile (x, y) coordinates. These signals improve grounding, but page content is still untrusted and can't authorize a side effect.
Grounding metadata is useful only when it selects one target. If two buttons share the same label, the host should request a new observation or human help instead of clicking a guess.
```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Element:
    role: str
    name: str
    element_id: str

def resolve(elements: list[Element], role: str, name: str) -> str:
    matches = [element.element_id for element in elements if element.role == role and element.name == name]
    if len(matches) != 1:
        return f"blocked: expected one target, found {len(matches)}"
    return f"click element {matches[0]}"

page = [
    Element("button", "Failed tests", "failures-1"),
    Element("button", "Failed tests", "failures-2"),
]
print(resolve(page[:1], "button", "Failed tests"))
print(resolve(page, "button", "Failed tests"))
```

Output:

```
click element failures-1
blocked: expected one target, found 2
```

Anthropic documents computer use as a versioned tool in the Messages API. Its documentation defines tool versions, supported model pairings, an agent loop, and actions such as mouse movement, clicking, typing, key presses, scrolling, waiting, and requesting screenshots.[2] Check its current compatibility table when implementing rather than hard-coding a tool identifier into architectural guidance.
Your client code is responsible for executing an allowed primitive inside the controlled environment, capturing the next observation, and returning the result in the next tool_result message.
OpenAI documents computer use through the Responses API. Its guide shows computer_call items that contain proposed actions, and the host answers with computer_call_output plus the next screenshot. The host still owns browser isolation, step execution, and confirmation before risky actions.[3]
Google documents computer use through the Gemini API. The model analyzes the user request plus screenshots, returns UI-action function calls, and can attach a safety_decision; your client still decides whether to execute that step, captures the result, and sends the next observation back.[4]
Across all three provider patterns, the developer owns the execution environment. The model proposes actions; application code enforces boundaries and decides when to stop or escalate to a human.
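One way to keep that ownership explicit is to convert whatever a provider returns into a host-owned record before any policy check runs. The sketch below assumes the provider payload has already been parsed into a plain dictionary. The `action` key, the primitive names, and `HostAction` are hypothetical host-side conventions, not any provider's schema.

```python
from dataclasses import dataclass

# Hypothetical host-side vocabulary; real provider payloads use their own field names.
ALLOWED_PRIMITIVES = {"screenshot", "mouse_move", "left_click", "type", "scroll", "wait"}

@dataclass(frozen=True)
class HostAction:
    primitive: str
    args: dict

def normalize(raw: dict) -> HostAction:
    """Convert an already-parsed proposal into a host-owned record, or raise."""
    primitive = str(raw.get("action", "")).strip().lower()
    if primitive not in ALLOWED_PRIMITIVES:
        raise ValueError(f"unsupported primitive: {primitive!r}")
    args = {key: value for key, value in raw.items() if key != "action"}
    return HostAction(primitive, args)

def handle(raw: dict) -> str:
    """Reject unknown primitives before they reach the policy engine or executor."""
    try:
        action = normalize(raw)
    except ValueError as error:
        return f"rejected: {error}"
    return f"queued for policy check: {action.primitive}"

print(handle({"action": "left_click", "x": 412, "y": 96}))
print(handle({"action": "run_shell", "command": "curl example.invalid"}))
```

Because the executor only ever sees `HostAction` values, a provider-specific adapter can change without touching the policy or audit code.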
Many real production workloads stay inside the browser. When that's true, you have lighter options.
Use Playwright or Selenium directly when an approved target exposes stable selectors and deterministic automation satisfies the workflow. Browser-agent harnesses and evaluation environments can add observation/action adapters for model-driven experiments. Prefer the narrowest control surface that meets the task rather than moving to pixel-level control by default.
Consider full computer-use primitives only when authorized tasks require visual interpretation, browser UI interaction, or a native desktop application and no narrower supported integration is suitable. Don't use computer use to evade a site's access controls or anti-bot policy.
Full desktop computer use, in the style of OSWorld, becomes necessary when the job spans a browser window plus a thick client that has no web equivalent. The sandbox requirements grow because you must now present a believable desktop to both the browser and the native application while still keeping every input observable and revocable by the host.
The key engineering decision isn't only the prompt you give the model. It's the environment in which approved actions run.
For a privileged engineering workflow, begin each task in a fresh restricted browser profile or disposable runtime appropriate to the threat model. Expose only the applications and domains needed for the task. Don't expose personal email, password managers, general-purpose terminals, or package installation to a dashboard-inspection session.
Route network traffic through policy that permits only task-required domains. Establish credentials host-side rather than placing passwords in model-visible text. An action validator should inspect each proposed target, action class, active domain, and requested side effect before execution.
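A domain check of this kind should compare the parsed hostname, not search the URL string, because a task domain can appear inside an attacker-controlled URL. The following sketch uses the standard library's `urllib.parse.urlsplit`; the allow-list and the HTTPS-only rule are assumptions for the running example.

```python
from urllib.parse import urlsplit

ALLOWED_HOSTS = {"ci.example", "incidents.example"}  # assumed task scope

def check_navigation(url: str) -> str:
    """Allow navigation only to exact allow-listed hosts over HTTPS."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if parts.scheme != "https":
        return "blocked: non-HTTPS scheme"
    if host not in ALLOWED_HOSTS:
        return f"blocked: host {host!r} outside task scope"
    return "allowed"

print(check_navigation("https://ci.example/runs/RUN-78432"))
print(check_navigation("https://ci.example.attacker.test/runs/RUN-78432"))
print(check_navigation("https://ci.example@attacker.test/"))
print(check_navigation("http://ci.example/runs/RUN-78432"))
```

Only the first URL is allowed. The second and third embed the permitted name in a different host, and the fourth uses an unencrypted scheme. An application-level check like this complements network-level egress policy and does not replace it.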
High-risk actions, including deploy reruns, production mutations, or incident-record writes, require explicit human approval before execution. Show the reviewer the relevant redacted observation and intended change. For audits and debugging, retain action proposals, policy outcomes, approval decisions, and verifier evidence according to a retention policy; don't assume storing every raw screenshot or keystroke is appropriate.
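An approval is only meaningful if it covers the exact change that later executes. One possible approach, sketched here under the assumption that the intended change can be serialized as JSON, is to fingerprint the change when the reviewer sees it and compare fingerprints again before execution. The field names and the `obs-17-redacted` reference are hypothetical.

```python
import hashlib
import json
from typing import Optional

def fingerprint(change: dict) -> str:
    """Hash a canonical JSON form so key order does not affect the result."""
    canonical = json.dumps(change, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def request_approval(change: dict, observation_ref: str) -> dict:
    """Bundle what the reviewer sees: the change, a redacted observation reference, and its hash."""
    return {"change": change, "observation_ref": observation_ref, "fingerprint": fingerprint(change)}

def may_execute(change: dict, approved_fingerprint: Optional[str]) -> bool:
    """Permit execution only when the change matches what was approved."""
    return approved_fingerprint is not None and fingerprint(change) == approved_fingerprint

intended = {"record": "INC-2048", "field": "note", "action": "append"}
ticket = request_approval(intended, observation_ref="obs-17-redacted")
approved = ticket["fingerprint"]  # stand-in for a reviewer approving this ticket

altered = {**intended, "field": "severity"}  # proposal changed after approval
print(may_execute(intended, approved))
print(may_execute(altered, approved))
print(may_execute(intended, None))
```

Only the first call returns `True`: the altered change and the unapproved change are both refused.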
The session manifest should be inspectable before the browser starts. This example admits a read-only CI lookup and rejects a session that requests an unrelated domain or write capability.
```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SessionManifest:
    domains: tuple[str, ...]
    allowed_effects: tuple[str, ...]
    browser_profile: str

TASK_DOMAINS = {"ci.example", "incidents.example"}
READ_ONLY_EFFECTS = {"browse_url", "read_visible_text", "capture_reference"}

def admit(manifest: SessionManifest) -> str:
    if manifest.browser_profile != "fresh":
        return "blocked: reused browser profile"
    if not set(manifest.domains) <= TASK_DOMAINS:
        return "blocked: unrelated domain"
    if not set(manifest.allowed_effects) <= READ_ONLY_EFFECTS:
        return "blocked: write capability requested"
    return "admitted: read-only session"

safe = SessionManifest(("ci.example",), ("browse_url", "capture_reference"), "fresh")
reused_profile = SessionManifest(("ci.example",), ("browse_url",), "reused")
unrelated_domain = SessionManifest(("ci.example", "mail.example"), ("browse_url",), "fresh")
write_capability = SessionManifest(("ci.example",), ("rerun_deploy",), "fresh")
print(admit(safe))
print(admit(reused_profile))
print(admit(unrelated_domain))
print(admit(write_capability))
```

Output:

```
admitted: read-only session
blocked: reused browser profile
blocked: unrelated domain
blocked: write capability requested
```

Audit data must also be minimized. The log needs evidence of the policy decision, not copied private email addresses or secret fields.
```python
import re

def audit_event(note: str, action: str, decision: str) -> dict[str, str]:
    # Replace email addresses and password values before the note is stored.
    redacted = re.sub(r"[\w.+-]+@[\w.-]+", "[EMAIL]", note)
    redacted = re.sub(r"password=[^\s]+", "password=[REDACTED]", redacted)
    return {"action": action, "decision": decision, "note": redacted}

event = audit_event(
    "notify reviewer@example.com password=portal-secret",  # placeholder address for illustration
    action="capture_reference",
    decision="allowed_read_only",
)
print(event["note"])
print("secret logged:", "portal-secret" in str(event))
```

Output:

```
notify [EMAIL] password=[REDACTED]
secret logged: False
```

Common pitfalls include reusing a browser profile across unrelated tasks, letting the model alone decide when the task is finished, and running write-enabled sessions without rate limits or a kill switch.
GUI-agent isolation adds browser-profile state, visible credential surfaces, timing-dependent interfaces, and clicks that can mutate external application state. Those risks require controls beyond filesystem and process limits: domain policy, credential brokering, action gating, final-state verification, and cleanup of session state.
Final-task accuracy alone isn't enough. Evaluate both functional success and operational safety.
Task success rate asks whether the final observable state of the system matches the goal. WebArena and OSWorld supply programmatic validators for this. Step efficiency measures how many actions and model calls were required; long trajectories burn tokens and raise the probability of eventual failure. Safety violation rate counts how often the agent attempted a blocked action or required human intervention. Grounding accuracy compares proposed click coordinates against ground-truth element locations. For open-ended tasks, blind human raters compare two agent trajectories on the same goal and express preference.
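These metrics can be computed from per-episode records that the host already keeps. The sketch below is illustrative only: the `Episode` fields, the sample values, and the bounding-box convention are assumptions, and the violation metric counts episodes with at least one blocked attempt, not individual attempts.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Episode:
    succeeded: bool        # final state matched the goal according to a validator
    steps: int             # actions executed in the trajectory
    blocked_attempts: int  # proposals the policy engine refused

def summarize(episodes: list[Episode]) -> dict[str, float]:
    """Aggregate functional and safety metrics over a batch of episodes."""
    total = len(episodes)
    if total == 0:
        raise ValueError("need at least one episode")
    successes = [episode for episode in episodes if episode.succeeded]
    mean_steps = sum(e.steps for e in successes) / len(successes) if successes else 0.0
    return {
        "task_success_rate": len(successes) / total,
        "mean_steps_on_success": mean_steps,
        "episodes_with_violation": sum(1 for e in episodes if e.blocked_attempts > 0) / total,
    }

Box = tuple[int, int, int, int]  # left, top, right, bottom in pixels

def click_hits(x: int, y: int, box: Box) -> bool:
    """Grounding check: does a proposed click land inside the ground-truth element?"""
    left, top, right, bottom = box
    return left <= x <= right and top <= y <= bottom

episodes = [Episode(True, 6, 0), Episode(True, 10, 1), Episode(False, 25, 3), Episode(False, 25, 0)]
print(summarize(episodes))
print(click_hits(412, 96, (400, 80, 480, 110)))
```

Reporting step counts only for successful episodes keeps trajectories that hit the step limit from dominating the efficiency figure; whether that is the right choice depends on what the team wants the metric to expose.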
Useful evaluation suites include WebArena [5], OSWorld [6], Mind2Web [7], and GAIA [8].
Public benchmark numbers change as systems and harnesses change. Treat WebArena, OSWorld, and related suites as regression harnesses and failure-mode maps, not as a full production readiness certificate. A score under a documented harness doesn't determine whether your CI dashboard, incident workflow, and reviewer queue satisfy policy.
Computer-use agents fail in characteristic ways that pure API agents rarely encounter.
UI drift occurs when an A/B test, font change, or responsive layout moves an element by forty pixels; the model clicks empty space. Visual ambiguity appears when two buttons look identical at the screenshot resolution the model received; it chooses the wrong one because downscaling or compression hid the label. Timing and animation problems cause clicks before a menu finishes expanding or before a loading spinner disappears. CAPTCHA and anti-bot systems are deliberately designed to defeat this class of agent and should route to human handling, not automated bypass. Credential leakage happens when the model is asked to log in and types a real password into a visible field that ends up in the audit log. Scope creep occurs when the model decides "while I am here I should also clean up the user's profile" and edits fields that were never part of the original task.
Indirect prompt injection is a central web-agent failure. A computer-use agent reads screen content as input, and a page can contain attacker-written text or images that instruct it to take an unrelated action. AgentDojo studies prompt-injection attacks against tool-using agents.[9] Provider guidance likewise places responsibility on the host: Anthropic recommends isolated environments and confirmation for consequential actions, OpenAI documents confirmation and isolation practices, and Gemini can return a per-step safety_decision requiring confirmation.[2][3][4] Treat page content as untrusted evidence, not authority for writes or production mutations.
A policy can carry the taint of an untrusted observation into the next proposed action. Here, seeing a suspicious dashboard banner doesn't prevent continued reading, but it blocks a deploy-rerun proposal until a reviewer approves it.
```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Observation:
    contains_untrusted_instruction: bool

@dataclass(frozen=True)
class Proposal:
    action: str
    mutates_production: bool = False

def authorize(observation: Observation, proposal: Proposal) -> str:
    if observation.contains_untrusted_instruction and proposal.mutates_production:
        return "blocked: untrusted page cannot authorize deploy action"
    return "allowed: read-only continuation"

banner = Observation(contains_untrusted_instruction=True)
print(authorize(banner, Proposal("read_failed_step")))
print(authorize(banner, Proposal("rerun_deploy", mutates_production=True)))
```

Output:

```
allowed: read-only continuation
blocked: untrusted page cannot authorize deploy action
```

Treat each computer-use run as a security-sensitive operation. Retain enough redacted evidence to explain policy outcomes and support incident response, and maintain an operator stop control for active sessions.
A useful production pattern is hybrid, not pure computer use.
Use ordinary tool calling, Model Context Protocol (MCP) servers, direct APIs, or deterministic browser automation whenever they satisfy the authorized workflow. Invoke computer use only for the narrow portion that requires visual reasoning or legacy GUI access. Bound sessions by step count and wall-clock time. Require approval for actions whose blast radius includes secrets, incident state, or production infrastructure.
Use the same engineering discipline as for AI coding agents. Speed and generality matter only when they remain inside explicit boundaries and human accountability.
Make that escalation rule explicit in routing rather than leaving it to a model's preference.
```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Workflow:
    name: str
    api_available: bool
    stable_selector_flow: bool
    visual_evidence_required: bool

def control_surface(workflow: Workflow) -> str:
    if workflow.api_available:
        return "direct API"
    if workflow.stable_selector_flow and not workflow.visual_evidence_required:
        return "deterministic browser automation"
    return "computer use with host policy"

workflows = [
    Workflow("fetch run status", True, False, False),
    Workflow("download known artifact", False, True, False),
    Workflow("inspect failed test screenshot", False, False, True),
]
for workflow in workflows:
    print(workflow.name, "->", control_surface(workflow))
```

Output:

```
fetch run status -> direct API
download known artifact -> deterministic browser automation
inspect failed test screenshot -> computer use with host policy
```

Start with the control logic before connecting a real browser driver. The model's proposals are fixture data; the host policy refuses an unapproved write and accepts completion only after observable state matches the read-only goal.
```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Proposal:
    action: str
    writes_record: bool = False

proposals = [
    Proposal("open_run_page"),
    Proposal("capture_traceback_reference"),
    Proposal("update_incident_status", writes_record=True),
]
evidence_collected = False

for step, proposal in enumerate(proposals, start=1):
    if proposal.writes_record:
        print(f"step {step}: paused for approval before {proposal.action}")
        break
    print(f"step {step}: executed {proposal.action}")
    evidence_collected = evidence_collected or proposal.action == "capture_traceback_reference"

print("read-only goal verified:", evidence_collected)
print("session teardown: requested")
```

Output:

```
step 1: executed open_run_page
step 2: executed capture_traceback_reference
step 3: paused for approval before update_incident_status
read-only goal verified: True
session teardown: requested
```

The host, not the model, owns the stop condition, the policy checks, the human gate, the audit log, and the final verification.
In the CI dashboard and incident-management example, not every action carries the same blast radius. A production policy engine classifies proposed actions into risk tiers before any execution happens.
| Risk Tier | Example Actions in CI Dashboard | Example Actions in Incident UI | Human Gate? | Typical Policy Response |
|---|---|---|---|---|
| Low | View run page, scroll, read failing-step metadata | View incident record, read prior notes | No | Execute immediately, log only |
| Medium | Expand log panel, copy artifact reference | Stage a draft note without submitting it | Policy-dependent | Permit only if retention and data policy allow it |
| High | Download full logs, open deploy controls, rerun production job | Submit note, update severity, page an owner, close incident | Yes | Pause, present redacted observation + intended change, wait for approval |
The table makes defense in depth concrete. Low-risk navigation and reading stay fast. High-risk production or incident mutation steps pause for a human who can see the model's observation and the proposed state change.
This tiny policy router makes that escalation rule concrete:
```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ActionProposal:
    target: str
    kind: str
    writes_incident: bool = False
    mutates_production: bool = False

def classify(proposal: ActionProposal) -> tuple[str, bool]:
    if proposal.mutates_production or proposal.writes_incident or proposal.target == "deploy_controls":
        return "high", True
    if proposal.kind in {"expand_log", "copy_reference", "prepare_note"}:
        return "medium", False
    return "low", False

samples = [
    ActionProposal(target="run_page", kind="scroll"),
    ActionProposal(target="log_panel", kind="copy_reference"),
    ActionProposal(target="deploy_controls", kind="submit", mutates_production=True),
]

for sample in samples:
    tier, needs_human = classify(sample)
    print(f"{sample.target}: tier={tier}, human_gate={needs_human}")
```

Output:

```
run_page: tier=low, human_gate=False
log_panel: tier=medium, human_gate=False
deploy_controls: tier=high, human_gate=True
```

Approval is necessary but not sufficient. After an approved write, the host must check both the intended record and the absence of unrelated changes.
```python
before = {
    "INC-2048": "traceback_collected",
    "INC-9911": "open",
}
expected_after = {
    "INC-2048": "review_approved",
    "INC-9911": "open",
}

def verify(after: dict[str, str]) -> str:
    return "verified" if after == expected_after else "blocked: unexpected final state"

approved_write = {"INC-2048": "review_approved", "INC-9911": "open"}
wrong_scope = {"INC-2048": "review_approved", "INC-9911": "closed"}

print("approved write:", verify(approved_write))
print("wrong scope:", verify(wrong_scope))
```

Output:

```
approved write: verified
wrong scope: blocked: unexpected final state
```

A direct API request can often answer a structured lookup in one operation. A computer-use trajectory may require multiple observations, model calls, action executions, retries, and possibly human review. Measure these costs on your own workload and selected model rather than carrying over generic per-task estimates.
Reserve computer use for cases that require it, such as permitted visual evidence inspection or a legacy CI dashboard without a supported integration. Route structured lookups and authorized writes through narrower typed surfaces when available.
Begin with a controlled portal workflow and limit initial scope to read-only evidence lookup. Expand to an approved write only after the team defines acceptance criteria, measures failures, red-teams page injection, and confirms reviewer and rollback procedures.
Run each session in a fresh restricted environment appropriate to its risk. Retain redacted action evidence, policy outcomes, and final host-verification results according to privacy and incident-response requirements. Review failures with the owning engineering team before expanding the action space or adding another workflow.
When a browser driver supplies observations inside a restricted runtime, keep viewport size, device scale, fonts, and browser extensions deterministic for the evaluation or workflow configuration. Don't rely on a generic networkidle event as proof that a dynamic application is ready; wait for task-specific visible state and re-observe after each action. Layout drift can otherwise move a target between proposal and execution.
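The readiness and drift checks described above can be expressed as a guard that runs on a fresh observation immediately before execution. This is a sketch with hypothetical types: `Snapshot` stands for whatever the host measures just before acting, and `ClickProposal.observed_box` is the element position the model saw when it made its proposal.

```python
from dataclasses import dataclass

Box = tuple[int, int, int, int]  # left, top, right, bottom in pixels

@dataclass(frozen=True)
class Snapshot:
    ready_marker_visible: bool  # task-specific state, e.g. the failed-step panel has rendered
    target_box: Box             # where the target element is right now

@dataclass(frozen=True)
class ClickProposal:
    observed_box: Box           # where the target was when the model proposed the click

def pre_execute_check(current: Snapshot, proposal: ClickProposal) -> str:
    """Decide whether a proposed click is still valid against a fresh snapshot."""
    if not current.ready_marker_visible:
        return "wait: task-specific ready state not visible"
    if proposal.observed_box != current.target_box:
        return "re-observe: target moved since proposal"
    return "execute"

proposal = ClickProposal(observed_box=(400, 80, 480, 110))
print(pre_execute_check(Snapshot(False, (400, 80, 480, 110)), proposal))
print(pre_execute_check(Snapshot(True, (400, 120, 480, 150)), proposal))
print(pre_execute_check(Snapshot(True, (400, 80, 480, 110)), proposal))
```

The three calls return the wait, re-observe, and execute outcomes in that order. An exact box comparison is deliberately strict; a real host might tolerate small offsets, at the cost of a more complicated rule.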
A verified computer-use session can be useful evaluation or improvement data when retention, privacy, and authorization permit it. Store only the data required for the stated purpose, with approval decisions and final-verifier results attached; a completed session isn't automatically a representative training demonstration.
Your platform team needs an agent that can open the legacy CI dashboard for a given run ID, locate the failed test step, capture the relevant traceback and artifact reference, update the incident note, and route the investigation to the owning team.
The CI dashboard has no API and lacks reliable selector-based automation.
Complete these tasks:
Write the policy and checklist as documents another engineer could hand to an incident reviewer on day one.
(x, y) coordinates without accessibility, OCR, or element signals.
Computer-use agents propose actions on visible interfaces that can expose privileged state. Treat that ability like delegated access: least privilege, redacted evidence, restricted sessions, host verification, and approval for irreversible effects.
[1] Anthropic. Introducing computer use, a new Claude 3.5 Sonnet capability. 2024.
[2] Anthropic. Computer Use Tool. 2026.
[3] OpenAI. Computer Use. 2026.
[4] Google. Computer Use. 2026.
[5] Zhou, S., et al. WebArena: A Realistic Web Environment for Building Autonomous Agents. arXiv preprint, 2023.
[6] Xie, T., et al. OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments. NeurIPS 2024, 2024.
[7] Deng, X., et al. Mind2Web: Towards a Generalist Agent for the Web. arXiv preprint, 2023.
[8] Mialon, G., et al. GAIA: A Benchmark for General AI Assistants. arXiv preprint, 2023.
[9] Debenedetti, E., et al. AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents. arXiv preprint, 2024.