Building Production Apps with Claude API
Moving from interactive prompting to production applications requires thinking about prompts differently. You are no longer crafting one-off queries — you are designing systems that run thousands or millions of times, with real cost implications, latency requirements, and failure modes to handle. This lesson covers the architectural decisions that separate prototype-quality Claude integrations from production-ready ones.
System vs. User Messages
The Claude API distinguishes between system prompts and user messages. This distinction matters more than it might appear.
System prompts are ideal for:
- Persona and role definition that applies to every request
- Standing instructions about output format
- Context that never changes (product description, user permissions, business rules)
- Safety constraints and scope limitations
User messages are ideal for:
- The specific task or question for this request
- Dynamic context that changes per request (user input, retrieved data, current state)
- Conversation history
A common mistake is packing everything into the user message. This leaves the system prompt empty and forces you to repeat stable context on every call. The better pattern:
System: You are a customer support assistant for Acme Corp.
You help users with billing, account management, and technical issues.
You do not discuss competitor products or pricing.
Always respond in the same language the user writes in.
Output format: Plain prose. No markdown unless asked.
User: [Dynamic content: user's actual message + relevant account context]
This separation also enables prompt caching, which we cover next.
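To make the split concrete, here is a minimal sketch of how a request could be assembled, assuming the Anthropic Python SDK's `messages.create` keyword arguments; `build_request` and `SUPPORT_SYSTEM_PROMPT` are illustrative names, not part of any SDK:

```python
SUPPORT_SYSTEM_PROMPT = (
    "You are a customer support assistant for Acme Corp.\n"
    "You help users with billing, account management, and technical issues.\n"
    "You do not discuss competitor products or pricing.\n"
    "Always respond in the same language the user writes in.\n"
    "Output format: Plain prose. No markdown unless asked."
)

def build_request(user_message: str, account_context: str) -> dict:
    """Assemble kwargs for client.messages.create: stable instructions live
    in `system`; only per-request content goes in the user message."""
    return {
        "model": "claude-sonnet-4-5",
        "max_tokens": 1024,
        "system": SUPPORT_SYSTEM_PROMPT,
        "messages": [
            {
                "role": "user",
                "content": f"{account_context}\n\nCustomer message: {user_message}",
            }
        ],
    }
```

The call site then becomes `client.messages.create(**build_request(msg, ctx))`, and the stable system text never has to be repeated inline.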
Prompt Caching
Prompt caching is one of the highest-leverage optimizations available in production Claude applications. When you mark a portion of your prompt with cache_control, Anthropic stores the processed version of that text. Subsequent requests that hit the cache skip the processing cost and latency for that section.
What to Cache
Cache long, stable content that appears in every request:
- System prompts — especially long ones with detailed instructions, personas, or business rules
- Large reference documents — a product catalog, a knowledge base article, a code file being reviewed
- Few-shot examples — if you include 5-10 examples in every prompt, cache them
Do not cache content that changes per request (the user's actual input, retrieved context specific to the user).
Cache Control Breakpoints
The cache_control parameter is set on a content block and applies to everything before it. Think of it as marking a checkpoint: "cache everything up to and including this point."
Caching the system prompt and reference document means you pay the full token cost only on the first request (or after the cache expires, typically 5 minutes). Subsequent requests pay only for the dynamic portion — typically a 60-90% cost reduction for prompt-heavy applications.
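In block form, this might look like the following sketch. The block structure and the `{"type": "ephemeral"}` value follow the Messages API's prompt-caching format; `build_cached_system` is an illustrative helper, not an SDK function:

```python
def build_cached_system(instructions: str, reference_doc: str) -> list[dict]:
    """Express the system prompt as content blocks. cache_control on the
    last stable block caches everything up to and including that block."""
    return [
        {"type": "text", "text": instructions},
        {
            "type": "text",
            "text": reference_doc,
            "cache_control": {"type": "ephemeral"},
        },
    ]
```

The dynamic user message stays outside the cached prefix, so only the stable instructions and reference document are stored and reused.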
Batch API
The Batch API lets you submit large volumes of requests asynchronously at roughly 50% of the standard API cost. Results are available within 24 hours.
When to Use Batch Processing
Batch is the right choice when:
- You need to process hundreds or thousands of items (document classification, data extraction, content generation at scale)
- Latency is not critical — results arriving in minutes or hours are acceptable
- You want to minimize cost on large workloads
Batch is the wrong choice when:
- A user is waiting for a response
- You need results in under a minute
- The task depends on previous results (sequential workflows)
Structuring Batch Requests
# Each item in a batch is a standalone request
batch_requests = [
    {
        "custom_id": f"item-{i}",  # Your ID for matching results
        "params": {
            "model": "claude-opus-4-6",
            "max_tokens": 512,
            "messages": [{"role": "user", "content": item_prompt}]
        }
    }
    for i, item_prompt in enumerate(items_to_process)
]
The custom_id field is critical. Batch results can be returned out of order — your custom_id is how you match each result back to the original input.
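Since ordering is not guaranteed, a small helper can pair each result back with its input. This is a sketch: it assumes each result dict carries the `custom_id` it was submitted with, and that IDs were assigned as `item-{i}` in input order, as in the snippet above:

```python
def match_results(items: list[str], results: list[dict]) -> list[tuple[str, dict]]:
    """Pair each input with its result, regardless of arrival order."""
    by_id = {r["custom_id"]: r for r in results}
    return [(item, by_id[f"item-{i}"]) for i, item in enumerate(items)]
```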
Cost Optimization Strategies
Model Routing
Not every task needs the most powerful model. A practical routing strategy:
| Task Type | Recommended Model |
|---|---|
| Simple classification, short Q&A | claude-haiku-3-5 |
| Standard content generation, analysis | claude-sonnet-4-5 |
| Complex reasoning, nuanced analysis | claude-opus-4-6 |
Build model routing into your application logic based on task complexity signals: prompt length, task type classification, or explicit user tier.
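A routing function based on the table above might look like this sketch; the task-type labels and the 20,000-character threshold are illustrative assumptions, not fixed recommendations:

```python
def route_model(task_type: str, prompt: str) -> str:
    """Pick a model tier from simple complexity signals."""
    if task_type in {"classification", "short_qa"}:
        return "claude-haiku-3-5"
    # Very long prompts or explicitly hard tasks go to the top tier
    if task_type == "complex_reasoning" or len(prompt) > 20_000:
        return "claude-opus-4-6"
    return "claude-sonnet-4-5"
```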
Prompt Length Optimization
Every token costs money and adds latency. Audit your prompts for:
- Repeated context that should be in the system prompt (and cached)
- Verbose instructions that can be condensed
- Examples that are longer than necessary to illustrate the point
- Retrieved context that includes irrelevant chunks
A 2,000-token prompt with precise instructions often outperforms a 5,000-token prompt padded with caveats.
Production Prompt Templates: Versioning and Monitoring
In production, your prompts are code. Treat them as such:
Version your prompts in your codebase with semantic versioning:
INVOICE_EXTRACTOR_PROMPT_V2_1 = """..."""
Log prompt versions with each API call so you can correlate output quality changes with prompt changes.
A/B test prompt changes before full rollout: route a percentage of traffic to the new prompt version and compare output quality metrics.
Monitor key metrics per prompt version: accuracy (if ground truth is available), refusal rate, average output length, error rate, latency, and cost per call.
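One way to capture these metrics is to flatten each call into a log record keyed by prompt version. A sketch, assuming the response exposes `model`, `usage`, and `stop_reason` as in the Messages API; `PROMPT_VERSION` and `build_call_record` are hypothetical names:

```python
import time

PROMPT_VERSION = "invoice_extractor/2.1"  # hypothetical version identifier

def build_call_record(response: dict, latency_s: float,
                      prompt_version: str = PROMPT_VERSION) -> dict:
    """Flatten the per-call fields worth logging and correlating."""
    return {
        "prompt_version": prompt_version,
        "model": response["model"],
        "input_tokens": response["usage"]["input_tokens"],
        "output_tokens": response["usage"]["output_tokens"],
        "stop_reason": response["stop_reason"],
        "latency_s": round(latency_s, 3),
        "ts": time.time(),
    }
```

Emitting this record to your logging pipeline on every call is what makes the A/B comparisons and version correlation above possible.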
Error Handling and Fallback Strategies
Production applications must handle Claude API failures gracefully.
Rate limits (429 errors): Implement exponential backoff with jitter. Start at 1 second, double each retry, add random jitter to avoid thundering herd. Cap at 3-5 retries.
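The backoff loop can be sketched as follows. `RateLimitError` here is a stand-in for the SDK's 429 exception (the Anthropic Python SDK raises its own `RateLimitError`); the full-jitter variant shown is one common choice:

```python
import random
import time

class RateLimitError(Exception):
    """Stand-in for the SDK's 429 rate-limit exception."""

def call_with_backoff(fn, max_retries: int = 4, base_delay: float = 1.0):
    """Retry fn on rate limits: delays of 1s, 2s, 4s, ... plus random
    jitter, raising after max_retries retries."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_retries:
                raise
            delay = base_delay * (2 ** attempt)
            time.sleep(delay + random.uniform(0, delay))
```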
Timeouts: Set explicit timeouts (typically 30-60 seconds for streaming, longer for batch). Have a fallback path — a cached response, a simpler model, or a graceful degradation to a manual workflow.
Unexpected output format: If you expect JSON and get prose, do not crash. Build a validation layer that catches format errors and either retries with a clarifying prompt or routes to a fallback handler.
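For the JSON case, a minimal validation layer might look like this sketch. The salvage heuristic of grabbing the outermost braces is an assumption that works when the model wraps a single JSON object in prose, not a general parser:

```python
import json

def parse_json_output(text: str):
    """Return the parsed object, or None to signal a retry or fallback."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # Salvage attempt: the model may have wrapped JSON in prose
        start, end = text.find("{"), text.rfind("}")
        if start != -1 and end > start:
            try:
                return json.loads(text[start:end + 1])
            except json.JSONDecodeError:
                pass
        return None  # caller retries with a clarifying prompt or falls back
```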
Context length exceeded: Split long inputs and aggregate results, or use a summarization step to compress context before the main request.
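The splitting step can be sketched as a character-based window with overlap, so no sentence is cut off entirely at a boundary; the sizes are illustrative, and a token-based splitter would be more precise:

```python
def chunk_text(text: str, max_chars: int = 8000, overlap: int = 200) -> list[str]:
    """Split an over-long input into overlapping windows; each chunk is
    processed separately and the results aggregated."""
    if len(text) <= max_chars:
        return [text]
    chunks, start = [], 0
    step = max_chars - overlap
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        start += step
    return chunks
```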
Summary
Production Claude applications succeed through architectural discipline: using system prompts for stable context (and caching them), keeping user messages focused on dynamic input, routing to the right model tier by task complexity, using the Batch API for high-volume async workloads, and building robust error handling for every failure mode. The prompts themselves matter, but at production scale the surrounding infrastructure — caching, versioning, monitoring, and fallbacks — is what separates an MVP from a reliable product.