How Computer-Use Agents See and Act
In the last lesson we met the perceive, decide, act loop. Now we open the hood. If you understand how a computer-use agent takes in a page and chooses what to do, you can predict with good accuracy where it will succeed and where it will fall on its face. That intuition is worth more than memorizing any single tool's menu.
You do not need to write a line of code for this lesson. The goal is a clear mental model.
What You'll Learn
- How an agent "sees" a page: pixels, page text, and structure
- Why the same loop that makes agents flexible also makes them slow and fragile
- The role of checkpoints and human approval
- Practical signs a task is a good or bad fit for an agent
Two Ways to "See" a Page
A web page exists in two forms at once, and agents can use either or both:
- The visual form (pixels). Literally a screenshot of what you see. A vision model looks at the image and reasons about it: "there is a search box near the top, a blue button labeled Search to its right."
- The structural form (the page's underlying text and elements). The same page also exists as structured markup that names each element: this is a link, this is a text field, this is a button with the label "Search."
Screenshot-based perception is the most general because it works even when the underlying structure is messy or deliberately hidden, which is why Anthropic's and OpenAI's computer-use tools lean on it. Reading the structured page text is faster and more precise when it is available. Many AI browsers blend the two. The important takeaway for you as a user: the agent is interpreting the page, not magically "knowing" it, and interpretation can be wrong.
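For readers who find code clearer than prose, here is a minimal sketch of the structural form. It uses Python's standard-library HTML parser to turn a tiny page into a list of named elements. The sample markup, the `ElementLister` class, and the `list_elements` helper are all invented for illustration; real agents work from far richer page data, and this is not how any particular product does it.

```python
from html.parser import HTMLParser

# A made-up three-element page, standing in for a real search page.
PAGE = """
<a href="/deals">Deals</a>
<input type="text" placeholder="Search products">
<button>Search</button>
"""

class ElementLister(HTMLParser):
    """Collects links, text fields, and buttons along with their labels."""

    KINDS = {"a": "link", "input": "text field", "button": "button"}

    def __init__(self):
        super().__init__()
        self.elements = []    # one dict per interactive element found
        self._current = None  # element whose visible text we are collecting

    def handle_starttag(self, tag, attrs):
        if tag not in self.KINDS:
            return
        attrs = dict(attrs)
        # Text fields are labeled by their placeholder; links and buttons
        # are labeled by the text between their opening and closing tags.
        element = {"kind": self.KINDS[tag], "label": attrs.get("placeholder", "")}
        self.elements.append(element)
        if tag != "input":
            self._current = element

    def handle_data(self, data):
        if self._current is not None:
            self._current["label"] += data.strip()

    def handle_endtag(self, tag):
        if tag in ("a", "button"):
            self._current = None

def list_elements(html):
    parser = ElementLister()
    parser.feed(html)
    parser.close()
    return [(e["kind"], e["label"]) for e in parser.elements]

print(list_elements(PAGE))
# [('link', 'Deals'), ('text field', 'Search products'), ('button', 'Search')]
```

The structural view hands over names and labels directly, which is why it is fast and precise when the markup is clean. A screenshot-based agent has to infer the same facts from an image.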
The Loop, Step by Step
Let us walk one real cycle for the instruction "add the cheapest 1TB SSD to my cart."
- Perceive: Screenshot + page text of the results
- Reason: Which row is cheapest 1TB?
- Plan action: Click that product
- Act: Move cursor, click
- Observe: New page loaded?
- Repeat: Until in cart
Notice that the agent commits to one action at a time and then looks again. It does not have a guaranteed script; it is improvising each step from what it currently sees. This is why:
- It is slow. Every step is a full perceive-reason-act round trip, often several seconds each.
- Small changes derail it. A cookie banner, a pop-up, a relocated button, or a page that loads slowly can make the agent misjudge the next click.
- Errors compound. A wrong click early can send the agent down a path it never fully recovers from, because each step assumes the last one worked.
None of this means agents are useless. It means they are best on tasks where the pages are reasonably clean and where an occasional wrong turn is cheap to catch.
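The one-action-at-a-time cycle described in this section can be sketched as a toy simulation. Everything here is an assumption made for illustration: `FakeShop` stands in for a browser, the product list is invented, and `decide` is a few hard-coded rules where a real agent would call a model. The point is only the shape of the loop.

```python
# Invented search results. Note the 2TB drive is cheapest overall
# but does not match the "1TB" instruction.
PRODUCTS = [
    {"name": "FastDisk 1TB SSD", "capacity_tb": 1, "price": 89.99},
    {"name": "BudgetDrive 1TB SSD", "capacity_tb": 1, "price": 64.50},
    {"name": "MegaStore 2TB SSD", "capacity_tb": 2, "price": 59.00},
]

class FakeShop:
    """Stand-in for a browser: holds the current page and applies clicks."""

    def __init__(self):
        self.page = "results"
        self.selected = None
        self.cart = []

    def perceive(self):
        # A real agent would receive a screenshot and/or page text here.
        return {"page": self.page, "products": PRODUCTS}

    def click(self, target):
        if self.page == "results":
            self.selected = target
            self.page = "product"
        elif self.page == "product" and target == "Add to cart":
            self.cart.append(self.selected)
            self.page = "cart"

def decide(observation):
    """Choose exactly one next action from what is currently visible."""
    if observation["page"] == "results":
        one_tb = [p for p in observation["products"] if p["capacity_tb"] == 1]
        return min(one_tb, key=lambda p: p["price"])["name"]
    if observation["page"] == "product":
        return "Add to cart"
    return None  # item is in the cart, nothing left to do

def run_agent(shop, max_steps=10):
    steps = 0
    while steps < max_steps:
        observation = shop.perceive()  # perceive
        action = decide(observation)   # reason and plan
        if action is None:
            break
        shop.click(action)             # act, then loop back and look again
        steps += 1
    return steps

shop = FakeShop()
steps = run_agent(shop)
print(steps, shop.cart)  # 2 ['BudgetDrive 1TB SSD']
```

Each pass through the `while` loop starts from a fresh observation, which is where both the flexibility and the fragility come from: if `perceive` returned a cookie banner instead of the results, `decide` would be reasoning about the wrong page.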
Why Agents Pause: Checkpoints and Approval
Because acting on a live, logged-in browser is consequential, well-designed agents build in checkpoints, where they stop and ask you before doing something sensitive. This is not a limitation to work around; it is the safety model.
For example, OpenAI has said agent mode in its Atlas browser will pause and make sure you are watching before it acts on especially sensitive sites such as financial institutions, and Google has described Chrome's agentic "Auto Browse" as requiring user approval for sensitive steps like purchases. The pattern across the industry is the same: the agent handles the tedious middle of a task and hands the risky decisions back to you.
You will get the most out of these tools by treating checkpoints as your job, not the agent's inconvenience. When it pauses, actually read what it is about to do.
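As a rough sketch of the checkpoint idea, the snippet below runs routine actions directly and stops to ask before anything sensitive. The action kinds in `SENSITIVE_KINDS`, the `plan`, and the `approve` callback are hypothetical; vendors do not publish their exact rules, so treat this as a picture of the pattern rather than any product's behavior.

```python
# Hypothetical action kinds that should never run without a human saying yes.
SENSITIVE_KINDS = {"purchase", "send_message", "enter_password"}

def run_with_checkpoints(actions, approve):
    """Perform routine actions directly; pause on sensitive ones.

    `approve` is a callback that shows the action to the user and
    returns True only if they explicitly confirm it.
    """
    log = []
    for action in actions:
        if action["kind"] in SENSITIVE_KINDS and not approve(action):
            log.append(("stopped", action["description"]))
            break  # hand control back to the human instead of guessing
        log.append(("performed", action["description"]))
    return log

plan = [
    {"kind": "click", "description": "open product page"},
    {"kind": "click", "description": "add to cart"},
    {"kind": "purchase", "description": "pay $64.50 with saved card"},
]

def cautious_user(action):
    # Stands in for a person who reads the prompt and declines.
    return False

result = run_with_checkpoints(plan, approve=cautious_user)
print(result)
```

With `cautious_user`, the two clicks go through and the run stops at the payment. The agent did the tedious middle; the risky decision came back to the human.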
Guardrails Built Into the Sandbox
Beyond pausing, browser agents run inside deliberate limits. Using Atlas's agent mode as a documented example, OpenAI states it cannot run code in the browser, cannot download files, cannot install extensions, and cannot reach other apps on your computer or your file system. Those walls exist precisely because the perceive-act loop can be fooled, a theme we develop fully in the security lesson.
A useful way to picture it: the agent is a capable temp worker you have given a very specific, narrow desk. It can do a lot at that desk, but it physically cannot walk into other rooms.
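One way to picture the narrow desk in code is an allowlist: anything not explicitly permitted is refused, no matter how the request is phrased. The action names and the `SandboxViolation` exception below are invented for illustration and say nothing about how Atlas or any other browser enforces its limits internally.

```python
# Hypothetical set of things the agent is allowed to do inside the browser tab.
ALLOWED_ACTIONS = {"click", "type_text", "scroll", "navigate"}

class SandboxViolation(Exception):
    """Raised when the agent asks for something outside its desk."""

def dispatch(action_name):
    # Deny by default: the check is against what IS allowed,
    # not a list of known-bad actions that could be incomplete.
    if action_name not in ALLOWED_ACTIONS:
        raise SandboxViolation(f"'{action_name}' is outside the sandbox")
    return f"ok: {action_name}"

print(dispatch("click"))
try:
    dispatch("download_file")
except SandboxViolation as err:
    print("refused:", err)
```

Unlike a checkpoint, there is no approval prompt here. A fooled agent can ask for a download as many times as it likes and the answer stays the same.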
Is My Task a Good Fit? A Quick Test
Before you hand a task to an agent, run it through this decision:
What kind of task is it?
- Reading, summarizing, comparing pages (no clicking required): Great fit. Use assistant mode, low risk.
- Repetitive, structured, low-stakes clicking (e.g. gathering listings, filling a known form): Good fit for agent mode, but supervise.
- High-stakes or irreversible (payments, sending messages, legal): Only with careful human approval at each step. Never fully hands-off.
- Needs a login you would not want exposed: Reconsider (prompt-injection risk). See the security lesson first.
The pattern: the more mechanical and reversible a task is, the better an agent handles it. The more judgment or consequence it carries, the more you stay in the driver's seat.
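The quick test can also be written as a small function. This is only a sketch: the function name, the three yes/no inputs, and especially the order in which the checks run are assumptions (the lesson does not rank the branches, so the most cautious answer wins here).

```python
def recommend(needs_clicking, high_stakes, exposes_sensitive_login):
    """Map a task's traits to the lesson's four recommendations.

    Checks run from most to least cautious, so a task that matches
    several branches gets the strictest advice. That ordering is an
    assumption of this sketch.
    """
    if exposes_sensitive_login:
        return "Reconsider. See the security lesson first."
    if high_stakes:
        return "Only with careful human approval at each step."
    if needs_clicking:
        return "Good fit for agent mode, but supervise."
    return "Great fit. Use assistant mode, low risk."

# Summarizing three review pages: no clicking, nothing at stake.
print(recommend(needs_clicking=False, high_stakes=False, exposes_sensitive_login=False))

# Paying an invoice: clicking is required and the action is irreversible.
print(recommend(needs_clicking=True, high_stakes=True, exposes_sensitive_login=False))
```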
Key Takeaways
- Agents perceive pages as pixels (screenshots), structured page text, or a blend, and they are always interpreting, which can go wrong.
- The one-action-at-a-time loop makes agents flexible but slow, fragile to page changes, and prone to compounding errors.
- Checkpoints and approval prompts are the core safety model; reading them is your responsibility, not a nuisance.
- Browser agents run inside deliberate limits (Atlas's agent mode, for example, cannot download files, reach your file system, or touch other apps) because the loop can be fooled.
- Match the task to the tool: summarizing is a great fit, irreversible high-stakes actions demand step-by-step human approval.

