Building a Production Agent System
The techniques from previous lessons -- subagents, hooks, MCP servers, and headless mode -- combine into something greater than the sum of their parts: a production agent system. This lesson walks through designing, building, and operating an automated system where Claude Code agents handle real work on a schedule, with monitoring, cost controls, and failure recovery.
What You Will Learn
- Architecture of a Claude Code agent system
- Scheduler, heartbeat, and notification agents
- Content generation pipelines for blogs, news, and courses
- SEO audit automation
- Cost monitoring and token management
Agent System Architecture
A production agent system consists of several layers:
┌─────────────────────────────────────────────┐
│                  Scheduler                  │
│      (cron / systemd / GitHub Actions)      │
├─────────────────────────────────────────────┤
│             Orchestrator Agent              │
│   Reads task queue, delegates to workers    │
├──────────┬──────────┬──────────┬────────────┤
│ Content  │   SEO    │   Test   │   Deploy   │
│  Agent   │  Agent   │  Agent   │   Agent    │
├──────────┴──────────┴──────────┴────────────┤
│            Shared Infrastructure            │
│ Git repos, MCP servers, notification hooks  │
└─────────────────────────────────────────────┘
The scheduler triggers runs at defined intervals. The orchestrator reads a task configuration and spawns specialized worker agents. Each worker has its own tools, model, and constraints. Shared infrastructure provides git access, MCP connections, and notification channels.
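The task configuration the orchestrator reads can be a plain JSON file. A minimal sketch of what it might look like (the file name, field names, and schema here are assumptions for illustration, not a Claude Code convention):

```json
{
  "tasks": [
    {
      "name": "content-agent",
      "prompt": "Generate today's blog post about AI news.",
      "model": "claude-sonnet-4-5",
      "maxTurns": 25
    },
    {
      "name": "seo-agent",
      "prompt": "Audit the 10 most recently modified pages for SEO compliance.",
      "model": "claude-haiku-3-5",
      "maxTurns": 15
    }
  ]
}
```

Keeping tasks in data rather than hardcoding them in the orchestrator script makes it easy to add, disable, or re-budget agents without touching the scheduling logic.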
The Orchestrator Pattern
The orchestrator is a script that launches Claude Code in headless mode with a structured task:
#!/bin/bash
# orchestrator.sh - Run daily agent tasks
TIMESTAMP=$(date +%Y-%m-%d)
LOG_DIR="logs/agents/$TIMESTAMP"
mkdir -p "$LOG_DIR"
# Task 1: Content generation
claude --dangerously-skip-permissions \
-p "You are the content agent. Generate today's blog post about AI news.
Use WebSearch to find current news. Save to src/content/blog/.
Follow the project's MDX format and SEO guidelines." \
--model claude-sonnet-4-5 \
--max-turns 25 \
> "$LOG_DIR/content-agent.log" 2>&1
# Task 2: SEO audit
claude --dangerously-skip-permissions \
-p "You are the SEO agent. Audit the 10 most recently modified pages
for SEO compliance. Output a JSON report to reports/seo-audit.json." \
--model claude-haiku-3-5 \
--max-turns 15 \
> "$LOG_DIR/seo-agent.log" 2>&1
# Task 3: Test health check
claude --dangerously-skip-permissions \
-p "Run the full test suite. If any tests fail, create a GitHub issue
with the failure details and assign it to the on-call developer." \
--model claude-sonnet-4-5 \
--max-turns 10 \
> "$LOG_DIR/test-agent.log" 2>&1
# Send summary notification
node scripts/notify-slack.js "$LOG_DIR"
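The notify-slack.js helper is referenced but not shown. An equivalent can be sketched directly in shell with curl against a Slack incoming webhook (SLACK_WEBHOOK_URL is an assumed environment variable, and the function names are made up):

```shell
# Sketch of a Slack notifier; assumes an incoming-webhook URL in SLACK_WEBHOOK_URL.
# slack_payload is split out so the JSON body can be inspected or tested alone.
# Note: this does not escape quotes inside the message text.
slack_payload() {
  # Build a minimal Slack message payload from the first argument
  printf '{"text": "%s"}' "$1"
}

notify_slack() {
  slack_payload "$1" | curl -s -X POST \
    -H 'Content-Type: application/json' \
    -d @- "$SLACK_WEBHOOK_URL"
}
```

Usage from the orchestrator would be something like `notify_slack "Daily agent run finished; logs in $LOG_DIR"`.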
Content Generation Pipelines
Content pipelines are one of the most practical applications of agent systems. Here is a complete pipeline for blog post generation:
Pipeline Steps
- Research: Agent searches the web for current topics
- Deduplication: Agent checks existing content for overlapping topics
- Writing: Agent creates the post following project templates
- Validation: Script validates MDX compilation and SEO metadata
- Review: Changes go to a PR for human review
- Publication: After approval, the PR is merged and deployed
Implementation
#!/bin/bash
# content-pipeline.sh
set -e
BRANCH="content/auto-$(date +%Y%m%d)"
# Create a feature branch
git checkout -b "$BRANCH" main
# Step 1-3: Research, dedup, and write
claude --dangerously-skip-permissions \
-p "Create a blog post about the latest developments in AI coding assistants.
Before writing:
1. Search the web for news from the past week
2. Check existing posts in src/content/blog/ to avoid duplicating topics
3. If a similar post exists from the last 30 days, choose a different angle
Writing rules:
- Follow the MDX format used by existing posts
- Include proper frontmatter (title, date, description, keywords)
- Create structured-data.json for SEO
- Target 1500-2000 words
- Include at least 3 external source links" \
--model claude-sonnet-4-5 \
--max-turns 30
# Step 4: Validate
node scripts/validate-mdx-compilation.js src/content/blog/
npm test -- blog.test.ts
# Step 5: Create PR for review
git add src/content/blog/
git commit -m "feat: add auto-generated blog post for $(date +%Y-%m-%d)"
git push -u origin "$BRANCH"
gh pr create \
--title "Auto-generated blog post: $(date +%Y-%m-%d)" \
--body "Automated content pipeline. Please review before merging."
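Step 4 can also be backstopped with a cheap shell check before the heavier MDX validation. A sketch that greps each post for the frontmatter keys the prompt requires (the key list mirrors the prompt above; the function name is made up):

```shell
# Sketch: verify each required frontmatter key appears in a post.
# Key list matches the writing rules above: title, date, description, keywords.
check_frontmatter() {
  file="$1"
  for key in title date description keywords; do
    if ! grep -q "^$key:" "$file"; then
      echo "$file: missing frontmatter key '$key'" >&2
      return 1
    fi
  done
}
```

A loop like `for f in src/content/blog/*.mdx; do check_frontmatter "$f" || exit 1; done` would then fail the pipeline early, before running the full test suite.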
Course Generation Pipeline
For a learning platform, course generation follows a similar pattern but with more structure:
claude --dangerously-skip-permissions \
-p "Create a new course about Docker fundamentals.
Follow the exact structure in existing courses:
1. Create course-structure.json with 8 lessons in 3 modules
2. Create MDX lesson files (no frontmatter, escape curly braces)
3. Create quiz JSON files for each lesson (4-5 questions each)
4. Create final-exam-questions.json (22+ questions)
5. Update courses.json with the new course entry
6. Run: node scripts/validate-mdx-compilation.js on the course directory
7. Run: npm test -- courses.test.ts to verify" \
--model claude-opus-4-6 \
--max-turns 50
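The shape of course-structure.json is not shown in this lesson. A hypothetical sketch of what the prompt's "8 lessons in 3 modules" requirement might produce (all field names are assumptions):

```json
{
  "id": "docker-fundamentals",
  "title": "Docker Fundamentals",
  "modules": [
    { "title": "Containers 101", "lessons": ["what-is-docker", "images-vs-containers", "running-containers"] },
    { "title": "Building Images", "lessons": ["dockerfiles", "layers-and-caching", "multi-stage-builds"] },
    { "title": "Operating Containers", "lessons": ["volumes-and-networking", "compose-basics"] }
  ]
}
```

Whatever the real schema is, giving the agent an existing course to copy from (as the prompt does) is more reliable than describing the schema in prose.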
SEO Audit Automation
An automated SEO audit agent can run weekly and report issues:
#!/bin/bash
# seo-audit.sh
REPORT="reports/seo-audit-$(date +%Y-%m-%d).json"
claude --dangerously-skip-permissions \
-p "Perform a comprehensive SEO audit:
1. Check all page components in src/app/ for:
- Meta title length (target: 50-60 characters)
- Meta description length (target: 150-160 characters)
- Open Graph tags (og:title, og:description, og:image)
- Structured data / JSON-LD
- Heading hierarchy (single H1, logical H2-H6 order)
- Image alt text presence
2. Check src/content/ for:
- Missing SEO metadata in frontmatter
- Duplicate titles or descriptions
- Missing keywords
3. Output results as JSON to $REPORT with this structure:
{
  \"timestamp\": \"...\",
  \"totalPages\": N,
  \"issues\": [
    {
      \"file\": \"path\",
      \"severity\": \"critical|warning|info\",
      \"issue\": \"description\",
      \"fix\": \"suggested fix\"
    }
  ]
}" \
--model claude-sonnet-4-5 \
--max-turns 20
# The agent writes $REPORT itself, so do not also redirect the CLI's
# JSON envelope to the same path -- that would overwrite the report.
# Alert if critical issues found
CRITICAL_COUNT=$(node -e "
const r = require('./$REPORT');
console.log((r.issues || []).filter(i => i.severity === 'critical').length);
")
if [ "$CRITICAL_COUNT" -gt 0 ]; then
echo "ALERT: $CRITICAL_COUNT critical SEO issues found"
# Send notification
node scripts/notify-slack.js "SEO Audit: $CRITICAL_COUNT critical issues" "$REPORT"
fi
Cost Monitoring and Token Management
Token costs can escalate quickly with automated agents. Implement these controls:
Per-Run Budget Controls
# Set max-turns to limit token consumption
claude --dangerously-skip-permissions \
-p "..." \
--max-turns 15 # Hard limit on conversation turns
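max-turns caps conversation length but not wall-clock time. Wrapping the call in coreutils `timeout` adds a second ceiling; a sketch (run_with_budget and AGENT_TIMEOUT are made-up names):

```shell
# Sketch: cap wall-clock time on top of --max-turns.
# Uses coreutils `timeout`; AGENT_TIMEOUT is seconds, defaulting to 15 minutes.
run_with_budget() {
  timeout "${AGENT_TIMEOUT:-900}" "$@"
}
```

Usage: `run_with_budget claude --dangerously-skip-permissions -p "..." --max-turns 15`. If the agent overruns, `timeout` kills it and returns exit status 124, which the retry logic below can treat like any other failure.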
Cost Tracking Script
#!/bin/bash
# track-costs.sh - Log per-run metadata for cost tracking
AGENT_NAME=$1
START_TIME=$(date +%s)
mkdir -p logs
# Run agent and capture its output
OUTPUT=$(claude --dangerously-skip-permissions \
  -p "$2" \
  --model "$3" \
  --max-turns "$4" \
  --output-format json 2>&1)
EXIT_CODE=$?
END_TIME=$(date +%s)
DURATION=$((END_TIME - START_TIME))
STATUS="completed"
[ "$EXIT_CODE" -ne 0 ] && STATUS="failed"
# Keep the raw output for debugging
echo "$OUTPUT" > "logs/$AGENT_NAME-last-run.json"
# Append one JSON object per line so the file stays valid JSONL
echo "{\"agent\": \"$AGENT_NAME\", \"timestamp\": \"$(date -u +%Y-%m-%dT%H:%M:%SZ)\", \"model\": \"$3\", \"max_turns\": $4, \"duration_seconds\": $DURATION, \"status\": \"$STATUS\"}" >> logs/agent-costs.jsonl
Cost Optimization Strategies
- Use the cheapest model that works: Haiku for simple tasks, Sonnet for most tasks, Opus only when needed
- Set aggressive max-turns: Most tasks complete in 10-20 turns. Set limits to prevent runaway sessions
- Cache common operations: If multiple agents read the same files, pre-read them and pass as context
- Run at off-peak times: Some providers offer lower rates during off-peak hours
- Monitor weekly trends: Track cost per agent type and investigate spikes
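The weekly-trend point above can start as a one-liner over the cost log. A sketch that counts runs per agent, assuming the one-object-per-line JSONL format written by track-costs.sh (cost_summary is a made-up helper):

```shell
# Sketch: count runs per agent from the JSONL cost log.
# Assumes one JSON object per line, as written by track-costs.sh.
cost_summary() {
  log="${1:-logs/agent-costs.jsonl}"
  # Pull out the "agent" field, then count occurrences of each name
  grep -o '"agent": *"[^"]*"' "$log" \
    | sed 's/.*"\([^"]*\)"$/\1/' \
    | sort | uniq -c | sort -rn
}
```

Running `cost_summary logs/agent-costs.jsonl` prints a run count per agent, most active first. Spikes in run counts or durations are the first signal that an agent needs a tighter max-turns or a cheaper model.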
Failure Recovery
Agents can fail. Build resilience into your system:
#!/bin/bash
# resilient-agent.sh
MAX_RETRIES=3
RETRY_COUNT=0
while [ $RETRY_COUNT -lt $MAX_RETRIES ]; do
claude --dangerously-skip-permissions \
-p "$1" \
--model claude-sonnet-4-5 \
--max-turns 15
if [ $? -eq 0 ]; then
echo "Agent completed successfully"
exit 0
fi
RETRY_COUNT=$((RETRY_COUNT + 1))
echo "Attempt $RETRY_COUNT failed. Retrying..."
sleep 10
done
echo "Agent failed after $MAX_RETRIES attempts"
node scripts/notify-slack.js "Agent failure: $1"
exit 1
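The fixed 10-second sleep above can be generalized to exponential backoff, which is gentler on rate limits. A sketch that retries any command (retry and RETRY_DELAY are made-up names):

```shell
# Sketch: retry any command with exponential backoff.
# RETRY_DELAY sets the initial delay in seconds; it doubles after each failure.
retry() {
  max="$1"; shift
  delay="${RETRY_DELAY:-5}"
  n=0
  while [ "$n" -lt "$max" ]; do
    "$@" && return 0
    n=$((n + 1))
    echo "Attempt $n failed; sleeping ${delay}s" >&2
    sleep "$delay"
    delay=$((delay * 2))
  done
  return 1
}
```

Usage: `retry 3 claude --dangerously-skip-permissions -p "$1" --model claude-sonnet-4-5 --max-turns 15`, followed by the same notification-on-failure step as in resilient-agent.sh.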
Putting It All Together
A production agent system for a content platform might run like this:
Daily (6 AM):
- Content agent: Generate 1 blog post ā PR for review
- Test agent: Run full test suite ā Issue if failures
Weekly (Monday 9 AM):
- SEO agent: Full site audit ā Report + Slack alert
- Dependency agent: Check for updates ā PR if safe updates available
- Analytics agent: Generate weekly metrics summary ā Email to team
Monthly (1st, 9 AM):
- Course agent: Propose new course topics based on search trends
- Link checker: Verify all external links still work
- Performance agent: Run Lighthouse audits on key pages
Each agent runs in a container, logs its actions, and communicates results through PRs, issues, and Slack notifications. The human team reviews outputs and maintains the agent configurations.
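The schedule above maps directly onto crontab entries. A sketch (script paths and log locations are hypothetical):

```
# m h dom mon dow  command
0 6 * * *  /opt/agents/orchestrator.sh  >> /var/log/agents/daily.log 2>&1
0 9 * * 1  /opt/agents/weekly.sh       >> /var/log/agents/weekly.log 2>&1
0 9 1 * *  /opt/agents/monthly.sh      >> /var/log/agents/monthly.log 2>&1
```

Redirecting stdout and stderr matters in cron: without it, failures disappear silently instead of landing in a log the heartbeat or notification agents can inspect.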
Key Takeaways
- A production agent system uses a scheduler, orchestrator, and specialized worker agents
- Content pipelines combine research, deduplication, writing, validation, and PR-based review
- SEO audit agents can run on a schedule and alert on critical issues
- Cost control requires max-turns limits, model selection strategy, and usage tracking
- Build failure recovery with retry logic and notification hooks
- Always route automated changes through PRs for human review before merging to main
- Start small with one agent, prove the pattern, then expand to a full system

