Capstone: Run and Evaluate a Real Task
Time to put it all together. In this capstone you will pick a tool, run a real multi-step task in an AI browser, and then evaluate how it did using a simple scorecard. The goal is not a flawless run; it is to practice the full loop of briefing, supervising, and judging an agent until these habits become second nature. By the end you will have a repeatable method for deciding whether any agentic task is worth automating.
What You'll Learn
- How to choose a safe, meaningful capstone task
- A step-by-step run you can complete today
- A scorecard for evaluating agent output honestly
- How to decide whether to keep automating a task or do it yourself
Step 1: Set Up Safely
Before anything else, apply the security lesson:
- Use a browser profile that is logged out of your sensitive accounts (banking, primary email, health).
- Pick a task with no irreversible consequences: nothing that spends money without your confirmation, sends messages, or submits anything binding.
- Have your steering template ready (goal, rules, checkpoints, stop, if-stuck).
Step 2: Choose Your Task
Pick one of these, or design your own along the same lines. Each is multi-step, useful, and safe to stop before any consequential action.
- Comparison shopping: find the lowest total price for a specific product across three named retailers, stopping before any cart or purchase.
- Research digest: gather five reputable articles on a topic you care about and produce a one-page summary that flags where they disagree.
- Listings roundup: collect job, apartment, or event listings that match specific criteria into a single comparison table.
- Data extraction: turn a long directory, agenda, or pricing page into a clean spreadsheet-ready table.
Decision: Is my chosen task a safe capstone?
- If it stops before spending or sending: Good. Proceed. (Reversible by design.)
- If it only reads and summarizes: Even safer. Assistant mode is enough. (Great for a first run.)
- If it would move money or post publicly: Redesign it to stop before that step. (Never automate the irreversible part.)
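The safety check above can also be written down as a tiny function, which is handy if you want to screen several candidate tasks the same way. This is only a sketch: `TaskProfile`, its two flags, and `capstone_safety` are hypothetical names invented for illustration, not part of any browser or agent tool.

```python
from dataclasses import dataclass


@dataclass
class TaskProfile:
    # Hypothetical flags describing a candidate task (illustrative only).
    reaches_irreversible_step: bool  # would move money, send, or post publicly
    read_only: bool                  # only reads and summarizes


def capstone_safety(task: TaskProfile) -> str:
    """Mirror the three branches of the safety decision."""
    if task.reaches_irreversible_step:
        return "redesign: stop before the irreversible step"
    if task.read_only:
        return "proceed: assistant mode is enough"
    return "proceed: stops before spending or sending"


# Example: comparison shopping that ends with a table, not a cart.
shopping = TaskProfile(reaches_irreversible_step=False, read_only=False)
print(capstone_safety(shopping))
```

The irreversible check comes first on purpose: a task that reads a lot but still ends in a purchase should be redesigned, not waved through.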
Step 3: Write the Brief
Adapt the steering template to your task. Here is a worked example for the comparison-shopping option:
GOAL: Find the lowest total price (item + shipping to 10001) for a
[specific product + model number], in stock, from Amazon, Best Buy,
or B&H only.
RULES: Official retailer listings only, no marketplace resellers or
ads. Ignore refurbished units. Stay factual about stock and price.
CHECKPOINTS: Show me each retailer's price as you find it.
STOP: Do NOT add to cart or buy. End with a 3-row comparison table
and the direct links, then stop.
IF STUCK: If a site needs a login or CAPTCHA, pause and ask me.
When done, tell me what you did and where you stopped.
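If you expect to reuse the steering template, one option is to keep its five parts as structured fields and render the brief from them, so that a missing stop point is caught before you launch anything. The sketch below assumes a hypothetical `Brief` class; the field names simply follow the template labels, and the example values are placeholders.

```python
from dataclasses import dataclass


@dataclass
class Brief:
    # Field names follow the steering template: goal, rules, checkpoints, stop, if-stuck.
    goal: str
    rules: str
    checkpoints: str
    stop: str
    if_stuck: str

    def render(self) -> str:
        """Return the brief as labeled lines, refusing to render if a part is empty."""
        sections = [
            ("GOAL", self.goal),
            ("RULES", self.rules),
            ("CHECKPOINTS", self.checkpoints),
            ("STOP", self.stop),
            ("IF STUCK", self.if_stuck),
        ]
        missing = [label for label, text in sections if not text.strip()]
        if missing:
            raise ValueError("Brief is missing: " + ", ".join(missing))
        return "\n".join(f"{label}: {text.strip()}" for label, text in sections)


# Placeholder values; substitute your own product and retailers.
brief = Brief(
    goal="Find the lowest total price (item + shipping) for [specific product].",
    rules="Official retailer listings only. Ignore refurbished units.",
    checkpoints="Show me each retailer's price as you find it.",
    stop="Do NOT add to cart or buy. End with a comparison table, then stop.",
    if_stuck="If a site needs a login or CAPTCHA, pause and ask me.",
)
text = brief.render()
print(text)
```

Paste the rendered text into the agent as your brief. The validation is deliberately strict, because a brief without a stop point is the one you least want to send.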
Step 4: Run It and Supervise
Launch the task in agent mode (or assistant mode for a reading task) and watch the entire first run. As it works, note:
- Where it hesitated or got confused.
- Whether it respected your rules and stop point.
- Every time it paused for approval, and whether you understood the request.
Resist the urge to walk away. The observation is the learning.
Step 5: Score the Result
Now score the run on five dimensions, rating each from 1 (poor) to 5 (excellent), so that you judge it on evidence rather than by vibe.
| Criterion | What a 5 looks like |
|---|---|
| Correctness | The facts and prices are accurate when you spot-check them |
| Completeness | It did the whole task, not a partial version |
| Followed rules | It respected every guardrail and the stop point |
| Safety | It paused appropriately and never overreached |
| Time saved | It was genuinely faster than doing it yourself |
Crucially, verify before you score correctness. Open one or two of the links the agent returned and confirm the prices and stock match what it reported. This is the single habit that separates people who use these tools safely from those who get burned: you trust, but you check the facts you will act on.
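To compare runs over time, it helps to record each scorecard in the same shape. The following is a minimal sketch, assuming a hypothetical `Scorecard` class with one integer field per criterion; the total and the "weakest criterion" helper are illustrative conveniences, not part of the scorecard as defined in this lesson.

```python
from dataclasses import dataclass, asdict


@dataclass
class Scorecard:
    # One 1-5 rating per criterion from the scorecard.
    correctness: int
    completeness: int
    followed_rules: int
    safety: int
    time_saved: int

    def __post_init__(self) -> None:
        # Reject ratings outside the 1 (poor) to 5 (excellent) scale.
        for name, value in asdict(self).items():
            if not 1 <= value <= 5:
                raise ValueError(f"{name} must be between 1 and 5, got {value}")

    def total(self) -> int:
        """Sum of all five ratings (5 to 25)."""
        return sum(asdict(self).values())

    def weakest(self) -> list:
        """Names of the criteria that received the lowest rating."""
        scores = asdict(self)
        low = min(scores.values())
        return [name for name, value in scores.items() if value == low]


# Example run: accurate and safe, but barely faster than doing it by hand.
run = Scorecard(correctness=4, completeness=5, followed_rules=5, safety=5, time_saved=2)
print(run.total(), run.weakest())
```

The total is a convenience only. A run that scores 1 on safety is a problem no matter how high the other four are, so look at the weakest criterion before the sum.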
Step 6: Decide the Verdict
Add up what you learned and make a call:
Decision: Should I keep automating this task?
- If it was accurate, complete, and faster: Keep it. Save the brief as a reusable template. (You found a real workflow.)
- If it was useful but needed heavy supervision: Keep for now, but always supervise. (Net positive, not hands-off.)
- If it was slower or error-prone: Do this one yourself. (Not every task suits an agent.)
Both "keep it" and "do it yourself" are successful outcomes for this capstone. The win is that you now know from evidence instead of guessing. Most people never actually measure whether the agent helped; you just did.
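If you want to apply the same verdict logic across several runs, the decision can be sketched as a small function. The `verdict` function and its four yes/no inputs are hypothetical, and the ordering of the checks is one reasonable reading of the three outcomes rather than a fixed rule; in particular, treating an incomplete run as "supervise" is an assumption.

```python
def verdict(accurate: bool, complete: bool, faster: bool, heavy_supervision: bool) -> str:
    """Map what you observed in a run to one of the three capstone verdicts."""
    # Slower or error-prone: not worth automating.
    if not accurate or not faster:
        return "do it yourself"
    # Useful, but not hands-off (assumption: incomplete runs land here too).
    if heavy_supervision or not complete:
        return "keep, but always supervise"
    # Accurate, complete, and faster.
    return "keep it and save the brief as a template"


print(verdict(accurate=True, complete=True, faster=True, heavy_supervision=False))
print(verdict(accurate=True, complete=True, faster=True, heavy_supervision=True))
print(verdict(accurate=False, complete=True, faster=True, heavy_supervision=False))
```

Answer the four questions from your notes and spot-checks, not from memory of how the run felt.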
Where to Go From Here
You now have the full toolkit: you understand how agentic browsers perceive and act, you know the 2026 landscape and how to choose a tool from it, you can run research and automation workflows, you can steer an agent with goals and guardrails, and, most importantly, you know how to do all of this safely.
To go deeper, two directions are natural. For the builder's side of agents, explore Get Started with OpenClaw and Building Professional AI Agents with Node.js & TypeScript. To keep sharpening the safety instincts, revisit Prompt Injection Attacks Explained. The agentic web is arriving whether we are ready or not; you now are.
Key Takeaways
- Set up safely first: a logged-out profile and a task with no irreversible consequences.
- Choose a multi-step but reversible task, and write a full brief with a clear stop point.
- Watch the entire first run; the observation is where the learning happens.
- Score every run on correctness, completeness, rule-following, safety, and time saved, and verify facts before trusting them.
- The verdict, whether "keep automating" or "do it yourself," is a win either way because it is now based on evidence, not guesswork.

