Building Production Apps with Claude API
Moving from interactive prompting to production applications requires thinking about prompts differently. You are no longer crafting one-off queries — you are designing systems that run thousands or millions of times, with real cost implications, latency requirements, and failure modes to handle. This lesson covers the architectural decisions that separate prototype-quality Claude integrations from production-ready ones.
System vs. User Messages
The Claude API distinguishes between system prompts and user messages. This distinction matters more than it might appear.
System prompts are ideal for:
- Persona and role definition that applies to every request
- Standing instructions about output format
- Context that never changes (product description, user permissions, business rules)
- Safety constraints and scope limitations
User messages are ideal for:
- The specific task or question for this request
- Dynamic context that changes per request (user input, retrieved data, current state)
- Conversation history
A common mistake is packing everything into the user message. This leaves the system prompt empty and forces you to repeat stable context on every call. The better pattern:
System: You are a customer support assistant for Acme Corp.
You help users with billing, account management, and technical issues.
You do not discuss competitor products or pricing.
Always respond in the same language the user writes in.
Output format: Plain prose. No markdown unless asked.
User: [Dynamic content: user's actual message + relevant account context]
This separation also enables prompt caching, which we cover next.
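To make the split concrete, here is a minimal sketch of how a request could be assembled, assuming the Anthropic Python SDK's `messages.create` keyword arguments; `build_request` and `SUPPORT_SYSTEM_PROMPT` are illustrative names, not part of any SDK:

```python
SUPPORT_SYSTEM_PROMPT = (
    "You are a customer support assistant for Acme Corp.\n"
    "You help users with billing, account management, and technical issues.\n"
    "You do not discuss competitor products or pricing.\n"
    "Always respond in the same language the user writes in.\n"
    "Output format: Plain prose. No markdown unless asked."
)

def build_request(user_message: str, account_context: str) -> dict:
    """Assemble kwargs for client.messages.create: stable instructions live
    in `system`; only per-request content goes in the user message."""
    return {
        "model": "claude-sonnet-4-5",
        "max_tokens": 1024,
        "system": SUPPORT_SYSTEM_PROMPT,
        "messages": [
            {
                "role": "user",
                "content": f"{account_context}\n\nCustomer message: {user_message}",
            }
        ],
    }
```

The call site then becomes `client.messages.create(**build_request(msg, ctx))`, and the stable system text never has to be repeated inline.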
Prompt Caching
Prompt caching is one of the highest-leverage optimizations available in production Claude applications. When you mark a portion of your prompt with cache_control, Anthropic stores the processed version of that text. Subsequent requests that hit the cache skip the processing cost and latency for that section.
What to Cache
Cache long, stable content that appears in every request:
- System prompts — especially long ones with detailed instructions, personas, or business rules
- Large reference documents — a product catalog, a knowledge base article, a code file being reviewed
- Few-shot examples — if you include 5-10 examples in every prompt, cache them
Do not cache content that changes per request (the user's actual input, retrieved context specific to the user).
Cache Control Breakpoints
The cache_control parameter is set on a content block and applies to everything before it. Think of it as marking a checkpoint: "cache everything up to and including this point."
Caching the system prompt and reference document means you pay the full token cost only on the first request (or after the cache expires, typically 5 minutes). Subsequent requests pay only for the dynamic portion — typically a 60-90% cost reduction for prompt-heavy applications.
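In block form, this might look like the following sketch. The block structure and the `{"type": "ephemeral"}` value follow the Messages API's prompt-caching format; `build_cached_system` is an illustrative helper, not an SDK function:

```python
def build_cached_system(instructions: str, reference_doc: str) -> list[dict]:
    """Express the system prompt as content blocks. cache_control on the
    last stable block caches everything up to and including that block."""
    return [
        {"type": "text", "text": instructions},
        {
            "type": "text",
            "text": reference_doc,
            "cache_control": {"type": "ephemeral"},
        },
    ]
```

The dynamic user message stays outside the cached prefix, so only the stable instructions and reference document are stored and reused.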
Batch API
The Batch API lets you submit large volumes of requests asynchronously at roughly 50% of the standard API cost. Results are available within 24 hours.
When to Use Batch Processing
Batch is the right choice when:
- You need to process hundreds or thousands of items (document classification, data extraction, content generation at scale)
- Latency is not critical — results arriving in minutes or hours are acceptable
- You want to minimize cost on large workloads
Batch is the wrong choice when:
- A user is waiting for a response
- You need results in under a minute
- The task depends on previous results (sequential workflows)
Structuring Batch Requests
# Each item in a batch is a standalone request
batch_requests = [
    {
        "custom_id": f"item-{i}",  # Your ID for matching results
        "params": {
            "model": "claude-opus-4-6",
            "max_tokens": 512,
            "messages": [{"role": "user", "content": item_prompt}]
        }
    }
    for i, item_prompt in enumerate(items_to_process)
]
The custom_id field is critical. Batch results can be returned out of order — your custom_id is how you match each result back to the original input.
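Since ordering is not guaranteed, a small helper can pair each result back with its input. This is a sketch: it assumes each result dict carries the `custom_id` it was submitted with, and that IDs were assigned as `item-{i}` in input order, as in the snippet above:

```python
def match_results(items: list[str], results: list[dict]) -> list[tuple[str, dict]]:
    """Pair each input with its result, regardless of arrival order."""
    by_id = {r["custom_id"]: r for r in results}
    return [(item, by_id[f"item-{i}"]) for i, item in enumerate(items)]
```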
Cost Optimization Strategies
Model Routing
Not every task needs the most powerful model. A practical routing strategy:
| Task Type | Recommended Model |
|---|---|
| Simple classification, short Q&A | claude-haiku-3-5 |
| Standard content generation, analysis | claude-sonnet-4-5 |
| Complex reasoning, nuanced analysis | claude-opus-4-6 |
Build model routing into your application logic based on task complexity signals: prompt length, task type classification, or explicit user tier.
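A routing function based on the table above might look like this sketch; the task-type labels and the 20,000-character threshold are illustrative assumptions, not fixed recommendations:

```python
def route_model(task_type: str, prompt: str) -> str:
    """Pick a model tier from simple complexity signals."""
    if task_type in {"classification", "short_qa"}:
        return "claude-haiku-3-5"
    # Very long prompts or explicitly hard tasks go to the top tier
    if task_type == "complex_reasoning" or len(prompt) > 20_000:
        return "claude-opus-4-6"
    return "claude-sonnet-4-5"
```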
Prompt Length Optimization
Every token costs money and adds latency. Audit your prompts for:
- Repeated context that should be in the system prompt (and cached)
- Verbose instructions that can be condensed
- Examples that are longer than necessary to illustrate the point
- Retrieved context that includes irrelevant chunks
A 2,000-token prompt with precise instructions often outperforms a 5,000-token prompt padded with caveats.
Production Prompt Templates: Versioning and Monitoring
In production, your prompts are code. Treat them as such:
Version your prompts in your codebase with semantic versioning:
INVOICE_EXTRACTOR_PROMPT_V2_1 = """..."""
Log prompt versions with each API call so you can correlate output quality changes with prompt changes.
A/B test prompt changes before full rollout: route a percentage of traffic to the new prompt version and compare output quality metrics.
Monitor key metrics per prompt version: accuracy (if ground truth is available), refusal rate, average output length, error rate, latency, and cost per call.
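One way to capture these metrics is to flatten each call into a log record keyed by prompt version. A sketch, assuming the response exposes `model`, `usage`, and `stop_reason` as in the Messages API; `PROMPT_VERSION` and `build_call_record` are hypothetical names:

```python
import time

PROMPT_VERSION = "invoice_extractor/2.1"  # hypothetical version identifier

def build_call_record(response: dict, latency_s: float,
                      prompt_version: str = PROMPT_VERSION) -> dict:
    """Flatten the per-call fields worth logging and correlating."""
    return {
        "prompt_version": prompt_version,
        "model": response["model"],
        "input_tokens": response["usage"]["input_tokens"],
        "output_tokens": response["usage"]["output_tokens"],
        "stop_reason": response["stop_reason"],
        "latency_s": round(latency_s, 3),
        "ts": time.time(),
    }
```

Emitting this record to your logging pipeline on every call is what makes the A/B comparisons and version correlation above possible.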
Error Handling and Fallback Strategies
Production applications must handle Claude API failures gracefully.
Rate limits (429 errors): Implement exponential backoff with jitter. Start at 1 second, double each retry, add random jitter to avoid thundering herd. Cap at 3-5 retries.
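The backoff loop can be sketched as follows. `RateLimitError` here is a stand-in for the SDK's 429 exception (the Anthropic Python SDK raises its own `RateLimitError`); the full-jitter variant shown is one common choice:

```python
import random
import time

class RateLimitError(Exception):
    """Stand-in for the SDK's 429 rate-limit exception."""

def call_with_backoff(fn, max_retries: int = 4, base_delay: float = 1.0):
    """Retry fn on rate limits: delays of 1s, 2s, 4s, ... plus random
    jitter, raising after max_retries retries."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_retries:
                raise
            delay = base_delay * (2 ** attempt)
            time.sleep(delay + random.uniform(0, delay))
```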
Timeouts: Set explicit timeouts (typically 30-60 seconds for streaming, longer for batch). Have a fallback path — a cached response, a simpler model, or a graceful degradation to a manual workflow.
Unexpected output format: If you expect JSON and get prose, do not crash. Build a validation layer that catches format errors and either retries with a clarifying prompt or routes to a fallback handler.
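For the JSON case, a minimal validation layer might look like this sketch. The salvage heuristic of grabbing the outermost braces is an assumption that works when the model wraps a single JSON object in prose, not a general parser:

```python
import json

def parse_json_output(text: str):
    """Return the parsed object, or None to signal a retry or fallback."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # Salvage attempt: the model may have wrapped JSON in prose
        start, end = text.find("{"), text.rfind("}")
        if start != -1 and end > start:
            try:
                return json.loads(text[start:end + 1])
            except json.JSONDecodeError:
                pass
        return None  # caller retries with a clarifying prompt or falls back
```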
Context length exceeded: Split long inputs and aggregate results, or use a summarization step to compress context before the main request.
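The splitting step can be sketched as a character-based window with overlap, so no sentence is cut off entirely at a boundary; the sizes are illustrative, and a token-based splitter would be more precise:

```python
def chunk_text(text: str, max_chars: int = 8000, overlap: int = 200) -> list[str]:
    """Split an over-long input into overlapping windows; each chunk is
    processed separately and the results aggregated."""
    if len(text) <= max_chars:
        return [text]
    chunks, start = [], 0
    step = max_chars - overlap
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        start += step
    return chunks
```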
Summary
Production Claude applications succeed through architectural discipline: using system prompts for stable context (and caching them), keeping user messages focused on dynamic input, routing to the right model tier by task complexity, using the Batch API for high-volume async workloads, and building robust error handling for every failure mode. The prompts themselves matter, but at production scale the surrounding infrastructure — caching, versioning, monitoring, and fallbacks — is what separates an MVP from a reliable product.