Capstone: Run and Evaluate a Real Task

Time to put it all together. In this capstone you will pick a tool, run a real multi-step task in an AI browser, and then evaluate how it did using a simple scorecard. The goal is not a flawless run; it is to practice the full loop of briefing, supervising, and judging an agent so that these habits become second nature. By the end you will have a repeatable method for deciding whether any agentic task is worth automating.

What You'll Learn

How to choose a safe, meaningful capstone task
A step-by-step run you can complete today
A scorecard for evaluating agent output honestly
How to decide whether to keep automating a task or do it yourself

Step 1: Set Up Safely

Before anything else, apply the security lesson:

Use a browser profile that is logged out of your sensitive accounts (banking, primary email, health).
Pick a task with no irreversible consequences: nothing that spends money without your confirmation, sends messages, or submits anything binding.
Have your steering template ready (goal, rules, checkpoints, stop, if-stuck).

Step 2: Choose Your Task

Pick one of these, or design your own along the same lines. Each is multi-step, useful, and safe to stop before any consequential action.

Comparison shopping: find the lowest total price for a specific product across three named retailers, stopping before any cart or purchase.
Research digest: gather five reputable articles on a topic you care about and produce a one-page summary that flags where they disagree.
Listings roundup: collect job, apartment, or event listings that match specific criteria into a single comparison table.
Data extraction: turn a long directory, agenda, or pricing page into a clean spreadsheet-ready table.

Decision: Is my chosen task a safe capstone?

If it stops before spending or sending: Good. Proceed. It is reversible by design.
If it only reads and summarizes: Even safer. Assistant mode is enough, and this is great for a first run.
If it would move money or post publicly: Redesign it to stop before that step. Never automate the irreversible part.
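The decision above can also be written down as a tiny check you run against a task idea before briefing the agent. This is only a sketch: the `TaskPlan` fields and the `capstone_safety` function are hypothetical names invented for illustration, and the flags are your own honest answers about the task, not something a tool detects for you.

```python
from dataclasses import dataclass


@dataclass
class TaskPlan:
    """Hypothetical description of a candidate capstone task."""
    name: str
    moves_money: bool = False      # would the agent pay, transfer, or purchase?
    sends_or_posts: bool = False   # would it send messages or post publicly?
    submits_binding: bool = False  # would it submit anything binding?
    read_only: bool = False        # does it only read and summarize?


def capstone_safety(plan: TaskPlan) -> str:
    """Return advice mirroring the three branches of the decision above."""
    if plan.moves_money or plan.sends_or_posts or plan.submits_binding:
        return "redesign: stop before the irreversible step"
    if plan.read_only:
        return "proceed: assistant mode is enough"
    return "proceed: reversible by design"


digest = TaskPlan(name="research digest", read_only=True)
shopping = TaskPlan(name="comparison shopping, stopping before the cart")
checkout = TaskPlan(name="buy the cheapest option", moves_money=True)

for plan in (digest, shopping, checkout):
    print(f"{plan.name}: {capstone_safety(plan)}")
```

The irreversible check comes first on purpose: a task that both reads and pays is still a task that pays.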

Step 3: Write the Brief

Adapt the steering template to your task. Here is a worked example for the comparison-shopping option:

GOAL: Find the lowest total price (item + shipping to 10001) for a
[specific product + model number], in stock, from Amazon, Best Buy,
or B&H only.
RULES: Official retailer listings only, no marketplace resellers or
ads. Ignore refurbished units. Stay factual about stock and price.
CHECKPOINTS: Show me each retailer's price as you find it.
STOP: Do NOT add to cart or buy. End with a 3-row comparison table
and the direct links, then stop.
IF STUCK: If a site needs a login or CAPTCHA, pause and ask me.
When done, tell me what you did and where you stopped.
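If you plan to reuse briefs, it can help to keep the five parts of the steering template as named fields so that none is forgotten. The sketch below assumes a hypothetical `Brief` class (not part of any AI browser) and fills it in for the research-digest option; the rule and checkpoint wording is illustrative, so adapt it to your own task.

```python
from dataclasses import dataclass, fields


@dataclass
class Brief:
    """The five-part steering template: goal, rules, checkpoints, stop, if-stuck."""
    goal: str
    rules: str
    checkpoints: str
    stop: str
    if_stuck: str

    def render(self) -> str:
        """Return the brief as labeled lines; refuse to render if a part is blank."""
        lines = []
        for field in fields(self):
            text = getattr(self, field.name).strip()
            if not text:
                raise ValueError(f"Brief is missing its {field.name} section")
            label = field.name.replace("_", " ").upper()  # if_stuck -> IF STUCK
            lines.append(f"{label}: {text}")
        return "\n".join(lines)


# Illustrative wording only; replace [topic] and the rules with your own.
digest_brief = Brief(
    goal="Gather five reputable articles on [topic] and write a one-page "
         "summary that flags where they disagree.",
    rules="Link directly to each article. Do not sign up for anything.",
    checkpoints="Show me the five articles before you summarize.",
    stop="Stop after the summary. Do NOT post, email, or share it.",
    if_stuck="If a site needs a login or CAPTCHA, pause and ask me.",
)

print(digest_brief.render())
```

Paste the rendered text into the agent as your brief. The `ValueError` is the point of the structure: a brief without a stop section never reaches the agent.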

Step 4: Run It and Supervise

Launch the task in agent mode (or assistant mode for a reading task) and watch the entire first run. As it works, note:

Where it hesitated or got confused.
Whether it respected your rules and stop point.
Every time it paused for approval, and whether you understood the request.

Resist the urge to walk away. The observation is the learning.

Step 5: Score the Result

Now evaluate honestly with this scorecard. Rate each criterion from 1 (poor) to 5 (excellent).

Score each run on five dimensions to judge it honestly rather than by vibe.
Criteria	What a 5 looks like
Correctness	The facts and prices are accurate when you spot-check them
Completeness	It did the whole task, not a partial version
Followed rules	It respected every guardrail and the stop point
Safety	It paused appropriately and never overreached
Time saved	It was genuinely faster than doing it yourself
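To keep scores comparable from run to run, you can record them in a small structure. The following is a minimal sketch, assuming equal weight for all five criteria; the names `CRITERIA` and `summarize` and the sample scores are made up for illustration.

```python
CRITERIA = ("correctness", "completeness", "followed_rules", "safety", "time_saved")


def summarize(scores: dict) -> dict:
    """Validate a 1-to-5 scorecard and report its total and weakest criterion."""
    missing = [c for c in CRITERIA if c not in scores]
    if missing:
        raise ValueError(f"Missing scores for: {', '.join(missing)}")
    for criterion in CRITERIA:
        if not 1 <= scores[criterion] <= 5:
            raise ValueError(f"{criterion} must be between 1 and 5")
    total = sum(scores[c] for c in CRITERIA)
    # On a tie, min() returns the criterion listed first in CRITERIA.
    weakest = min(CRITERIA, key=lambda c: scores[c])
    return {"total": total, "out_of": 5 * len(CRITERIA), "weakest": weakest}


# Sample scores for one imaginary run.
run = {
    "correctness": 4,
    "completeness": 5,
    "followed_rules": 5,
    "safety": 5,
    "time_saved": 3,
}

print(summarize(run))  # {'total': 22, 'out_of': 25, 'weakest': 'time_saved'}
```

The weakest criterion is often more informative than the total: a run that scores 22 out of 25 but is weak on safety deserves a different response than one that is merely slow.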

Crucially, verify before you score correctness. Open one or two of the links the agent returned and confirm the prices and stock match what it reported. This is the single habit that separates people who use these tools safely from those who get burned: you trust, but you check the facts you will act on.
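For the comparison-shopping task, the spot-check comes down to simple arithmetic: total price is item plus shipping, and the agent's totals should match what you see when you open the links yourself. Here is a hedged sketch with invented retailer names and prices; `total_price` and `spot_check` are hypothetical helpers, and `Decimal` is used so that cents compare exactly.

```python
from decimal import Decimal


def total_price(item: str, shipping: str) -> Decimal:
    """Total cost is the item price plus shipping."""
    return Decimal(item) + Decimal(shipping)


def spot_check(reported: dict, observed: dict) -> list:
    """Return the retailers whose observed total differs from the agent's report."""
    return [name for name, total in observed.items() if reported.get(name) != total]


# Invented numbers, for illustration only.
reported = {  # what the agent put in its comparison table
    "Retailer A": total_price("199.99", "0.00"),
    "Retailer B": total_price("189.99", "12.50"),
    "Retailer C": total_price("194.00", "4.99"),
}
observed = {  # what you saw when you opened two of the links yourself
    "Retailer A": total_price("199.99", "0.00"),
    "Retailer B": total_price("189.99", "14.50"),
}

cheapest = min(reported, key=reported.get)
print(cheapest, reported[cheapest])    # Retailer C 198.99
print(spot_check(reported, observed))  # ['Retailer B']
```

Note that the lowest sticker price (Retailer B) is not the lowest total once shipping is added, and that a single mismatch is enough reason to lower the correctness score and check the remaining links.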

Step 6: Decide the Verdict

Add up what you learned and make a call:

Decision: Should I keep automating this task?

If accurate, complete, and faster: Keep it. Save the brief as a reusable template. You found a real workflow.
If useful but needed heavy supervision: Keep for now, but always supervise. It is a net positive, not hands-off.
If slower or error-prone: Do this one yourself. Not every task suits an agent.

Both "keep it" and "do it yourself" are successful outcomes for this capstone. The win is that you now know from evidence instead of guessing. Most people never actually measure whether the agent helped; you just did.
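The three verdicts can be expressed as a short function, which is handy if you log several runs and want to apply the same rule each time. This is a sketch under stated assumptions: the `verdict` function and its yes/no inputs are hypothetical, and treating an accurate but incomplete run as "supervise" is a choice made here, since the decision above does not cover that case.

```python
def verdict(accurate: bool, complete: bool, faster: bool, heavy_supervision: bool) -> str:
    """Map what you observed in a run to one of the three verdicts."""
    # Slower or error-prone: the agent is not helping with this task.
    if not accurate or not faster:
        return "do it yourself"
    # Useful, but not hands-off. Incomplete runs land here by assumption.
    if heavy_supervision or not complete:
        return "keep for now, but always supervise"
    # Accurate, complete, and faster.
    return "keep it and save the brief as a template"


print(verdict(accurate=True, complete=True, faster=True, heavy_supervision=False))
print(verdict(accurate=True, complete=True, faster=True, heavy_supervision=True))
print(verdict(accurate=False, complete=True, faster=True, heavy_supervision=False))
```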

Where to Go From Here

You now have the full toolkit: you understand how agentic browsers perceive and act, you know the 2026 landscape and how to choose a tool from it, you can run research and automation workflows, you can steer an agent with goals and guardrails, and, most importantly, you know how to do all of this safely.

To go deeper, two directions are natural. For the builder's side of agents, explore Get Started with OpenClaw and Building Professional AI Agents with Node.js & TypeScript. To keep sharpening the safety instincts, revisit Prompt Injection Attacks Explained. The agentic web is arriving whether we are ready or not; you now are.

Key Takeaways

Set up safely first: a logged-out profile and a task with no irreversible consequences.
Choose a multi-step but reversible task, and write a full brief with a clear stop point.
Watch the entire first run; the observation is where the learning happens.
Score every run on correctness, completeness, rule-following, safety, and time saved, and verify facts before trusting them.
The verdict, whether "keep automating" or "do it yourself," is a win either way because it is now based on evidence, not guesswork.

