How Computer-Use Agents See and Act

In the last lesson we met the perceive, decide, act loop. Now we open the hood. If you understand how a computer-use agent takes in a page and chooses what to do, you can predict with good accuracy where it will succeed and where it will fall on its face. That intuition is worth more than memorizing any single tool's menu.

You do not need to write a line of code for this lesson. The goal is a clear mental model.

What You'll Learn

How an agent "sees" a page: as pixels, as structured page text, or both
Why the same loop that makes agents flexible also makes them slow and fragile
The role of checkpoints and human approval
Practical signs a task is a good or bad fit for an agent

Two Ways to "See" a Page

A web page exists in two forms at once, and agents can use either or both:

The visual form (pixels). Literally a screenshot of what you see. A vision model looks at the image and reasons about it: "there is a search box near the top, a blue button labeled Search to its right."
The structural form (the page's underlying text and elements). The same page also exists as structured markup that names each element: this is a link, this is a text field, this is a button with the label "Search."

Screenshot-based perception is the most general because it works even when the underlying structure is messy or deliberately hidden, which is why Anthropic's and OpenAI's computer-use tools lean on it. Reading the structured page text is faster and more precise when it is available. Many AI browsers blend the two. The important takeaway for you as a user: the agent is interpreting the page, not magically "knowing" it, and interpretation can be wrong.
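For readers who think in code, the structural form can be pictured as a list of named elements. The sketch below is illustrative only: the `Element` class, its fields, and `find_element` are hypothetical names, not any real browser's or agent's API. It shows why structured lookup is precise when labels exist, and why it comes back empty on an unlabeled element, which is where screenshot-based perception has to take over.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Element:
    role: str   # what the element is, e.g. "button", "textbox", "link"
    label: str  # the name the markup gives it; may be empty on messy pages
    x: int      # where it sits on screen (what a screenshot would show)
    y: int


def find_element(elements: List[Element], role: str, label: str) -> Optional[Element]:
    """Return the first element whose role and label match, or None."""
    for el in elements:
        if el.role == role and el.label.strip().lower() == label.lower():
            return el
    return None


# A made-up page: a search box, a Search button, and an icon-only cart button.
page = [
    Element("textbox", "Search", 400, 40),
    Element("button", "Search", 620, 40),
    Element("button", "", 900, 40),  # cart icon with no label in the markup
]

hit = find_element(page, "button", "Search")   # found by name, no guessing
miss = find_element(page, "button", "Cart")    # structure alone cannot find it
print(hit)
print(miss)
```

The second lookup returns `None` even though a person would spot the cart icon immediately. An agent in that position has to fall back on interpreting the pixels, and interpretation can be wrong.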

The Loop, Step by Step

Let us walk through one cycle for the instruction "add the cheapest 1TB SSD to my cart."

Perceive: Screenshot + page text of the results
Reason: Which row is cheapest 1TB?
Plan action: Click that product
Act: Move cursor, click
Observe: New page loaded?
Repeat: Until in cart

Notice that the agent commits to one action at a time and then looks again. It does not have a guaranteed script; it is improvising each step from what it currently sees. This is why:

It is slow. Every step is a full perceive-reason-act round trip, and each round trip often takes several seconds.
Small changes derail it. A cookie banner, a pop-up, a relocated button, or a page that loads slowly can make the agent misjudge the next click.
Errors compound. A wrong click early can send the agent down a path it never fully recovers from, because each step assumes the last one worked.

None of this means agents are useless. It means they are best on tasks where the pages are reasonably clean and where an occasional wrong turn is cheap to catch.
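The loop can also be sketched as a toy simulation. Everything here is an assumption made for illustration: `FakeBrowser`, `decide`, and `run_agent` are hypothetical, the "pages" are plain strings, and the decision rule is hard-coded where a real agent would consult a model. The point is the shape: observe, pick one action, act, then observe again.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FakeBrowser:
    """Hypothetical stand-in for a real browser; pages are just strings."""
    page: str = "results"
    popup_open: bool = True  # a cookie banner covering the page
    cart: List[str] = field(default_factory=list)

    def observe(self) -> dict:
        # Perceive: report what is currently on screen.
        return {"page": self.page, "popup": self.popup_open}

    def act(self, action: str) -> None:
        if action == "dismiss_popup":
            self.popup_open = False
        elif self.popup_open:
            return  # the click lands on the pop-up, so nothing changes
        elif action == "open_cheapest" and self.page == "results":
            self.page = "product"
        elif action == "add_to_cart" and self.page == "product":
            self.cart.append("1TB SSD")
            self.page = "cart"


def decide(observation: dict) -> str:
    # Reason + plan: choose exactly one next action from the current view.
    if observation["popup"]:
        return "dismiss_popup"
    if observation["page"] == "results":
        return "open_cheapest"
    if observation["page"] == "product":
        return "add_to_cart"
    return "done"


def run_agent(browser: FakeBrowser, max_steps: int = 10) -> List[str]:
    history = []
    for _ in range(max_steps):
        action = decide(browser.observe())  # look again before every step
        if action == "done":
            break
        browser.act(action)
        history.append(action)
    return history


browser = FakeBrowser()
history = run_agent(browser)
print(history)       # the actions taken, one per round trip
print(browser.cart)  # what ended up in the cart
```

In this run the pop-up costs one extra round trip. If `decide` did not check for it, every click would land on the pop-up and the loop would use up all its steps without making progress, a small version of the fragility described above.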

Why Agents Pause: Checkpoints and Approval

Because acting on a live, logged-in browser is consequential, well-designed agents build in checkpoints, where they stop and ask you before doing something sensitive. This is not a limitation to work around; it is the safety model.

For example, OpenAI has said agent mode in its Atlas browser will pause and make sure you are watching before it acts on especially sensitive sites such as financial institutions, and Google has described Chrome's agentic "Auto Browse" as requiring user approval for sensitive steps like purchases. The pattern across the industry is the same: the agent handles the tedious middle of a task and hands the risky decisions back to you.

You will get the most out of these tools by treating checkpoints as your job, not the agent's inconvenience. When it pauses, actually read what it is about to do.
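A checkpoint can be pictured as a gate in front of certain actions. The sketch below is a simplification under stated assumptions: the categories in `SENSITIVE_KINDS` and the function `run_with_checkpoint` are hypothetical, and real products decide for themselves what counts as sensitive. The `approve` callback stands in for you reading the prompt and answering yes or no.

```python
from typing import Callable

# Hypothetical categories; real tools define their own rules.
SENSITIVE_KINDS = {"purchase", "send_message", "payment"}


def run_with_checkpoint(kind: str, description: str,
                        approve: Callable[[str], bool]) -> str:
    """Run an action, pausing for human approval first if it is sensitive."""
    if kind in SENSITIVE_KINDS and not approve(description):
        return "skipped"   # the human said no, so the agent does not act
    return "executed"


results = [
    # Routine action: no pause, so the approval callback is never consulted.
    run_with_checkpoint("scroll", "Scroll down the results page", lambda d: False),
    # Sensitive action, human declines.
    run_with_checkpoint("purchase", "Buy 1TB SSD for $59", lambda d: False),
    # Sensitive action, human approves after reading the description.
    run_with_checkpoint("purchase", "Buy 1TB SSD for $59", lambda d: True),
]
print(results)
```

The gate only helps if the person behind `approve` actually reads `description`. An approval clicked without reading behaves exactly like having no checkpoint at all.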

Guardrails Built Into the Sandbox

Beyond pausing, browser agents run inside deliberate limits. Take Atlas's agent mode as a documented example: OpenAI states that the agent cannot run code in the browser, download files, install extensions, or reach other apps on your computer or your file system. Those walls exist precisely because the perceive-act loop can be fooled, a theme we develop fully in the security lesson.

A useful way to picture it: the agent is a capable temp worker you have given a very specific, narrow desk. It can do a lot at that desk, but it physically cannot walk into other rooms.
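The "narrow desk" can be sketched as an allow-list: the agent may perform only the actions on the list, and anything else is refused no matter how it was asked. This is an illustration of the idea, not a description of how Atlas or any other product is built; the action names, `perform`, and `try_action` are all hypothetical.

```python
# Hypothetical set of actions the agent is permitted to take at its "desk".
ALLOWED_ACTIONS = frozenset({"click", "type_text", "scroll", "navigate"})


def perform(action: str) -> str:
    """Carry out an action only if it is on the allow-list."""
    if action not in ALLOWED_ACTIONS:
        raise PermissionError(f"'{action}' is outside the sandbox")
    return f"{action}: ok"


def try_action(action: str) -> str:
    """Attempt an action and report whether the sandbox blocked it."""
    try:
        return perform(action)
    except PermissionError as err:
        return f"blocked: {err}"


for name in ["click", "download_file", "install_extension"]:
    print(try_action(name))
```

An allow-list is used here rather than a list of forbidden actions because it fails safe: an action nobody thought of in advance is blocked by default.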

Is My Task a Good Fit? A Quick Test

Before you hand a task to an agent, run it through this decision:

What kind of task is it?

If reading, summarizing, or comparing pages (no clicking required): Great fit. Use assistant mode, low risk.
If repetitive, structured, low-stakes clicking (e.g. gathering listings, filling a known form): Good fit for agent mode, but supervise.
If high-stakes or irreversible (payments, sending messages, legal): Only with careful human approval at each step. Never fully hands-off.
If it needs a login you would not want exposed: Reconsider because of prompt-injection risk. See the security lesson first.

The pattern: the more mechanical and reversible a task is, the better an agent handles it. The more judgment or consequence it carries, the more you stay in the driver's seat.
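The quick test above can be written as a small function. This is a sketch, and it makes one assumption the lesson does not state: the order in which the checks are applied. Here the most cautious answer wins, so a sensitive login outranks everything else. The function name `recommend` and its parameters are hypothetical.

```python
def recommend(needs_clicking: bool, high_stakes: bool, sensitive_login: bool) -> str:
    """Map the answers to the quick test onto a recommendation."""
    # Assumed precedence: the riskiest condition is checked first.
    if sensitive_login:
        return "Reconsider; see the security lesson first"
    if high_stakes:
        return "Only with human approval at each step"
    if needs_clicking:
        return "Good fit for agent mode, but supervise"
    return "Great fit; use assistant mode"


# Summarizing a few articles: no clicking, nothing at stake.
print(recommend(needs_clicking=False, high_stakes=False, sensitive_login=False))
# Filling a known form with low stakes.
print(recommend(needs_clicking=True, high_stakes=False, sensitive_login=False))
# Making a payment.
print(recommend(needs_clicking=True, high_stakes=True, sensitive_login=False))
```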

Key Takeaways

Agents perceive pages as pixels (screenshots), structured page text, or a blend, and they are always interpreting, which can go wrong.
The one-action-at-a-time loop makes agents flexible but slow, fragile to page changes, and prone to compounding errors.
Checkpoints and approval prompts are the core safety model; reading them is your responsibility, not a nuisance.
Browser agents run in a sandbox with hard limits (no downloading, no file-system access, no other apps) because the loop can be fooled.
Match the task to the tool: summarizing is a great fit, while irreversible high-stakes actions demand step-by-step human approval.
