
Cost Optimization

Master strategies to reduce AI agent costs while maintaining performance quality

Caching & Request Batching

Caching stores responses to avoid redundant API calls. Batching groups multiple requests into one. Both reduce request volume, the third-biggest cost driver after model selection and token usage. A 40% cache hit rate means 40% fewer API calls, which translates almost directly into 40% lower API spend.
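To make that arithmetic concrete, here is a minimal sketch. The request volume and per-request price are hypothetical, and it assumes cache hits cost nothing, which ignores cache infrastructure and any embedding lookups:

```python
# Hypothetical numbers for illustration only.
requests_per_day = 10_000
cost_per_request = 0.002          # dollars per API call
cache_hit_rate = 0.40             # 40% of requests answered from cache

baseline_cost = requests_per_day * cost_per_request
cached_cost = requests_per_day * (1 - cache_hit_rate) * cost_per_request

print(f"Baseline: ${baseline_cost:.2f}/day")
print(f"With cache: ${cached_cost:.2f}/day ({1 - cached_cost / baseline_cost:.0%} saved)")
```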

Caching Strategies

Semantic Cache: 40-60% savings (typical hit rate ~40%)
Cache similar queries, so "What's 2+2?" and "Calculate two plus two" share a cache entry.

Exact Match Cache: 25-40% savings (typical hit rate ~25%)
Cache identical queries only. Simple but effective for repeated questions.

Prompt Prefix Cache: 50-70% savings (typical hit rate ~50%)
Cache common prompt prefixes (system prompts, instructions) and reuse them across queries.

Time-Based Cache (TTL): 30-50% savings (typical hit rate ~35%)
Cache responses with an expiration. Good for slowly changing data (prices, weather).
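A minimal in-process sketch of the exact-match and TTL strategies is below. The class and method names are illustrative; a production setup would more likely sit behind a shared store such as Redis, and a semantic cache would replace the hash lookup with an embedding similarity search:

```python
import hashlib
import time

class ResponseCache:
    """Exact-match response cache with optional per-entry TTL."""

    def __init__(self):
        self._store = {}  # key -> (response, expires_at or None)

    @staticmethod
    def _key(prompt: str) -> str:
        # Light normalization so trivial case/whitespace differences still hit.
        return hashlib.sha256(prompt.strip().lower().encode()).hexdigest()

    def get(self, prompt: str):
        key = self._key(prompt)
        entry = self._store.get(key)
        if entry is None:
            return None                      # miss: caller falls through to the API
        response, expires_at = entry
        if expires_at is not None and time.time() > expires_at:
            del self._store[key]             # stale: evict and treat as a miss
            return None
        return response

    def set(self, prompt: str, response: str, ttl_seconds=None):
        expires_at = time.time() + ttl_seconds if ttl_seconds else None
        self._store[self._key(prompt)] = (response, expires_at)


cache = ResponseCache()
cache.set("What is the capital of France?", "Paris")                         # static fact: no TTL
cache.set("What's the weather in Paris?", "18°C, cloudy", ttl_seconds=1800)  # 30-minute TTL
```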


Batching Strategies

• Time-based batching: collect requests for a short window (e.g., 100 ms), then send them as a single batch
• Size-based batching: batch 10 requests together and process them simultaneously (a sketch combining the time- and size-based triggers follows this list)
• Parallel processing: send multiple independent queries in one API call
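Below is a minimal asyncio sketch of a micro-batcher that flushes either when 10 requests are queued or when 100 ms have elapsed, whichever comes first. The `call_llm_batch` function is a hypothetical stand-in for whatever batched or parallel call your provider supports:

```python
import asyncio

class MicroBatcher:
    """Flushes queued requests when max_size is reached or max_wait elapses."""

    def __init__(self, process_batch, max_size=10, max_wait=0.1):
        self.process_batch = process_batch   # async fn: list of prompts -> list of responses
        self.max_size = max_size
        self.max_wait = max_wait             # seconds (0.1 = 100 ms)
        self._queue = []                     # list of (prompt, Future) pairs
        self._timer = None

    async def submit(self, prompt: str) -> str:
        fut = asyncio.get_running_loop().create_future()
        self._queue.append((prompt, fut))
        if len(self._queue) >= self.max_size:
            await self._flush()                                      # size-based trigger
        elif self._timer is None:
            self._timer = asyncio.create_task(self._flush_later())   # time-based trigger
        return await fut

    async def _flush_later(self):
        await asyncio.sleep(self.max_wait)
        self._timer = None
        await self._flush()

    async def _flush(self):
        if self._timer is not None:
            self._timer.cancel()             # size trigger won the race; drop the timer
            self._timer = None
        batch, self._queue = self._queue, []
        if not batch:
            return
        responses = await self.process_batch([prompt for prompt, _ in batch])
        for (_, fut), response in zip(batch, responses):
            fut.set_result(response)


async def call_llm_batch(prompts):
    # Hypothetical placeholder for a single API call that handles many prompts at once.
    return [f"answer to: {p}" for p in prompts]

async def main():
    batcher = MicroBatcher(call_llm_batch)
    answers = await asyncio.gather(*(batcher.submit(f"question {i}") for i in range(25)))
    print(len(answers), "answers")

asyncio.run(main())
```

With 25 requests, this drains in roughly three batched calls instead of 25 individual ones, at the cost of up to 100 ms of added latency per request.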
💡 Implement Smart Cache Invalidation

Cache responses with appropriate TTLs based on data freshness requirements. Static facts (historical dates, math) can be cached indefinitely. Dynamic data (stock prices, weather) needs shorter TTLs (5-60 minutes). Use cache tags to invalidate related entries when source data changes.
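Building on the `ResponseCache` sketch above, tag-based invalidation can look something like this. The tag names and the `invalidate` method are illustrative, not any particular library's API:

```python
from collections import defaultdict

class TaggedCache(ResponseCache):
    """Adds tag-based invalidation to the earlier exact-match/TTL sketch."""

    def __init__(self):
        super().__init__()
        self._tags = defaultdict(set)    # tag -> set of cache keys

    def set(self, prompt, response, ttl_seconds=None, tags=()):
        super().set(prompt, response, ttl_seconds)
        for tag in tags:
            self._tags[tag].add(self._key(prompt))

    def invalidate(self, tag):
        # Drop every cached response derived from the changed data source.
        for key in self._tags.pop(tag, set()):
            self._store.pop(key, None)


cache = TaggedCache()
cache.set("Summarize our pricing page", "...", ttl_seconds=3600, tags=["pricing"])
cache.invalidate("pricing")   # call this whenever the pricing page changes
```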
