Latency & Performance

Master strategies to optimize response times and deliver fast, responsive AI agents

Streaming & Async Patterns

Streaming and async execution transform user experience even when actual latency stays the same. Streaming shows tokens as they are generated, so users see progress immediately instead of waiting for the complete response. Async patterns prevent blocking: while the LLM processes a request, your UI stays responsive and can handle other tasks.
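
As a minimal sketch, here is a streaming consumer built on fetch and the Web Streams API (available in modern browsers and Node 18+). The endpoint URL and request body are placeholders, not a real provider API, and onUpdate is whatever your UI exposes for rendering partial text:

```typescript
// Minimal sketch: consume a streamed response with fetch and the Web Streams API.
// The endpoint URL and request body below are placeholders, not a real provider API.
async function streamCompletion(prompt: string, onUpdate: (text: string) => void) {
  const start = performance.now();
  let firstTokenMs: number | null = null;
  let text = "";

  const response = await fetch("https://example.com/v1/complete", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ prompt, stream: true }),
  });
  if (!response.ok || !response.body) throw new Error(`HTTP ${response.status}`);

  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    if (firstTokenMs === null) firstTokenMs = performance.now() - start; // time to first token
    text += decoder.decode(value, { stream: true });
    onUpdate(text); // render partial output as it arrives instead of waiting for the end
  }

  console.log(`TTFT ${firstTokenMs?.toFixed(0)}ms, total ${(performance.now() - start).toFixed(0)}ms`);
  return text;
}
```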

Interactive: Streaming vs Batch Comparison

Experience the difference between batch (waiting for the full response) and streaming (tokens rendered as they arrive). The demo reports time to first token and total time for each mode.

Streaming Best Practices

  • Optimize TTFT: Time to first token is critical—target <300ms
  • Stream UI updates: Show tokens as they arrive, not in large chunks
  • Handle errors gracefully: Streams can fail mid-response; keep and show the partial content (see the sketch after this list)
  • Add visual feedback: Cursor animation, "typing..." indicator
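
A small error-handling sketch, assuming a hypothetical streamTokens async iterable supplied by your LLM client: if the stream dies partway through, the partial text is kept and reported rather than discarded.

```typescript
// Sketch: stream tokens to the UI and keep whatever arrived if the stream dies.
// `streamTokens` is a hypothetical async iterable of text chunks from your LLM client.
async function streamWithFallback(
  streamTokens: () => AsyncIterable<string>,
  onUpdate: (text: string) => void,
): Promise<{ text: string; complete: boolean }> {
  let text = "";
  try {
    for await (const token of streamTokens()) {
      text += token;
      onUpdate(text); // push each chunk to the UI as it arrives
    }
    return { text, complete: true };
  } catch (err) {
    // The stream failed mid-response: surface the partial content with a retry
    // affordance instead of discarding what the user has already read.
    console.warn("stream interrupted:", err);
    return { text, complete: false };
  }
}
```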

Async Execution Patterns

Non-blocking operations

Use async/await for I/O operations. Don't block the main thread waiting for LLM responses.
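
A sketch of the non-blocking shape, with a hypothetical callLLM stub that simulates a 500ms request; swap in your real client call:

```typescript
// Sketch of the non-blocking shape. `callLLM` is a stand-in that simulates a
// 500ms LLM request; swap in your real client call.
function callLLM(prompt: string): Promise<string> {
  return new Promise((resolve) => setTimeout(() => resolve(`echo: ${prompt}`), 500));
}

function handleUserMessage(prompt: string, render: (text: string) => void) {
  render("Thinking...");                 // immediate feedback, nothing blocks
  callLLM(prompt)
    .then((answer) => render(answer))    // update the UI when the response arrives
    .catch((err) => render(`Request failed: ${err}`));
  // Control returns here right away; the event loop stays free for other work.
}
```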

Parallel requests

Execute independent LLM calls concurrently with Promise.all(). Three sequential 500ms calls take about 1500ms; the same three calls in parallel finish in roughly 500ms, bounded by the slowest one.
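
A sketch of the parallel pattern, using the same kind of hypothetical callLLM stub; each simulated call takes ~500ms, so the three together finish in roughly the time of the slowest one:

```typescript
// Sketch: three independent calls run concurrently, so total time is roughly
// the slowest call (~500ms) rather than the sum (~1500ms). `callLLM` is a stub.
const callLLM = (prompt: string) =>
  new Promise<string>((resolve) => setTimeout(() => resolve(`echo: ${prompt}`), 500));

async function analyzeMessage(message: string) {
  const [sentiment, topics, urgency] = await Promise.all([
    callLLM(`Classify the sentiment of: ${message}`),
    callLLM(`List the topics in: ${message}`),
    callLLM(`Rate the urgency of: ${message}`),
  ]);
  return { sentiment, topics, urgency };
}
```

Note that Promise.all() rejects as soon as any call fails; Promise.allSettled() is the safer choice when the results are independent and a partial answer is still useful.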

Background processing

Offload non-critical tasks (logging, analytics, embeddings) to queues or background workers. Don't make users wait.
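
A fire-and-forget sketch with hypothetical logInteraction and storeEmbedding helpers; in production these tasks would typically be handed to a real queue or worker pool rather than run in-process like this:

```typescript
// Sketch: respond first, defer non-critical work. `logInteraction` and
// `storeEmbedding` are hypothetical helpers standing in for analytics and
// embedding jobs that would normally go to a queue or background worker.
const callLLM = (prompt: string) =>
  new Promise<string>((resolve) => setTimeout(() => resolve(`echo: ${prompt}`), 500));
const logInteraction = async (prompt: string, answer: string) => {
  console.debug("logged", { prompt, answer });      // stand-in for analytics
};
const storeEmbedding = async (text: string) => {
  console.debug("embedded", text.length, "chars");  // stand-in for an embedding job
};

async function answerUser(prompt: string): Promise<string> {
  const answer = await callLLM(prompt); // the only await on the user's critical path

  // Fire-and-forget: start the background tasks without awaiting them so they
  // never add to the user's latency; log failures instead of surfacing them.
  void Promise.allSettled([logInteraction(prompt, answer), storeEmbedding(answer)]).then(
    (results) =>
      results
        .filter((r): r is PromiseRejectedResult => r.status === "rejected")
        .forEach((r) => console.warn("background task failed:", r.reason)),
  );

  return answer;
}
```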

Progressive enhancement

Show cached/fast results immediately, then enhance with LLM results when ready. Example: show rule-based response, then improve with LLM reasoning.
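
A progressive-enhancement sketch, assuming a hypothetical cheap ruleBasedAnswer heuristic and a slower callLLM stub: the rough answer renders instantly and is replaced in place when the better one arrives.

```typescript
// Sketch: render a cheap rule-based answer immediately, then replace it when the
// slower LLM answer arrives. `ruleBasedAnswer` and `callLLM` are hypothetical stubs.
const callLLM = (prompt: string) =>
  new Promise<string>((resolve) => setTimeout(() => resolve(`LLM answer to: ${prompt}`), 800));

function ruleBasedAnswer(prompt: string): string {
  // Cheap heuristic or cache lookup that returns instantly.
  return prompt.toLowerCase().includes("refund")
    ? "You can request a refund within 30 days of purchase."
    : "Let me look into that for you...";
}

async function progressiveAnswer(prompt: string, render: (text: string) => void) {
  render(ruleBasedAnswer(prompt)); // instant, possibly rough answer
  try {
    render(await callLLM(prompt)); // upgrade in place once the better answer is ready
  } catch {
    // If the LLM call fails, the user still has the rule-based answer on screen.
  }
}
```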

💡 Perceived Speed Matters Most

Users perceive streaming as 40-60% faster than batch even with identical total latency. Show progress, keep the UI responsive, and provide immediate feedback. A 1-second response that streams feels faster than an 800ms batch response with no feedback. Optimize for user experience, not just raw milliseconds.

Optimization Strategies