
Managing Context Windows

Master how AI agents manage limited context windows to maintain coherent, efficient conversations

Context Compression Techniques

When context exceeds token limits, compression reduces verbosity while preserving essential information. Instead of dropping entire messages, agents condense them into compact summaries.

Interactive: Compression Simulator

At a 50% compression setting, the sample exchange shrinks from 182 tokens to 91 while quality stays high:

User: Python data structures help? Assistant: Lists (mutable, ordered), tuples (immutable), dicts (key-value), sets (unique). Lists use [], support append/insert/remove/slice [start:end], dynamic sizing.

🔧 Compression Methods

📝 LLM Summarization

Use the LLM itself to summarize old context. Prompt: "Summarize this conversation preserving key facts and decisions."

✅ Best quality, ❌ Costs API calls
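A minimal sketch of this method, assuming `llm` is any callable that maps a prompt string to a completion string (a stand-in for a real API client call; the function name and message format are illustrative):

```python
def summarize_old_turns(messages, llm):
    """Condense older messages into a single summary message.

    `messages` is a list of {"role": ..., "content": ...} dicts;
    `llm` is a hypothetical callable: prompt string -> completion string.
    """
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in messages)
    prompt = ("Summarize this conversation, preserving key facts, "
              "decisions, and user preferences:\n\n" + transcript)
    summary = llm(prompt)
    # The returned message replaces the old turns in the context window
    return {"role": "system", "content": "Summary of earlier turns: " + summary}
```

The old messages are then dropped and this single summary message takes their place, trading one extra API call for a much smaller context.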
✂️ Extractive Summarization

Select key sentences using TF-IDF, TextRank, or embedding similarity. Extract most important lines without rewriting.

✅ Fast, deterministic, ❌ Less fluent
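The TF-IDF variant can be sketched without any dependencies: score each sentence by the average TF-IDF weight of its words and keep the top-k in their original order. (TextRank and embedding-similarity selection follow the same select-don't-rewrite pattern.)

```python
import math
import re
from collections import Counter

def extract_key_sentences(text, k=2):
    """Keep the k sentences with the highest average TF-IDF weight."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    docs = [re.findall(r"\w+", s.lower()) for s in sentences]
    n = len(docs)
    df = Counter(w for d in docs for w in set(d))  # document frequency

    def score(words):
        if not words:
            return 0.0
        tf = Counter(words)
        total = len(words)
        # average TF-IDF over the sentence's distinct words
        return sum(tf[w] / total * math.log(n / df[w]) for w in tf) / len(tf)

    top = sorted(range(n), key=lambda i: score(docs[i]), reverse=True)[:k]
    return " ".join(sentences[i] for i in sorted(top))  # original order
```

Because it only selects existing sentences, the output is deterministic and factually faithful, but transitions between kept sentences can read abruptly.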
🗜️ Prompt Compression

Remove filler words, redundant phrases, formatting. "Could you please help me understand" → "Explain"

✅ Very fast, ❌ Loses nuance
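A minimal filler-stripping sketch; the pattern list below is a tiny illustrative sample, and production systems use much larger, tuned lists (or learned compressors):

```python
import re

# A few common filler patterns -- illustrative, not exhaustive
FILLER_PATTERNS = [
    r"\bcould you please\b",
    r"\bi was wondering if\b",
    r"\bplease\b",
    r"\bbasically\b",
    r"\bactually\b",
    r"\bjust\b",
]

def compress_prompt(text):
    """Strip filler phrases and collapse whitespace; fast but lossy."""
    for pattern in FILLER_PATTERNS:
        text = re.sub(pattern, "", text, flags=re.IGNORECASE)
    return re.sub(r"\s+", " ", text).strip()
```

This runs in microseconds with no model calls, but politeness and emphasis carry real signal in some conversations, which is the nuance this method loses.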
🔗 Entity-Focused Compression

Extract entities (names, dates, facts) and relationships. Store as structured data instead of full text.

✅ Very compact, ❌ Requires NER
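A toy sketch of the idea using regexes in place of real NER (a production system would use an NER model such as spaCy, or an LLM, to pull names, dates, and relationships):

```python
import json
import re

def compress_to_entities(text):
    """Reduce text to structured facts. Regexes here are stand-ins
    for a proper NER pipeline."""
    facts = {
        "dates":   re.findall(r"\b\d{4}-\d{2}-\d{2}\b", text),
        "names":   re.findall(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b", text),
        "amounts": re.findall(r"\$\d[\d,]*(?:\.\d\d)?", text),
    }
    # Store only the populated fields as compact structured data
    return json.dumps({k: v for k, v in facts.items() if v})
```

The structured record is far smaller than the original text and can be re-injected into the prompt as a fact list, at the cost of discarding everything the extractor did not recognize.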

💡 Compression Best Practices

Compress Old First: Recent messages are more relevant. Compress or remove messages older than N turns (e.g., > 10 turns ago).
Preserve Critical Facts: Never compress: user preferences, system instructions, current task goals, tool results.
Balance Cost vs Quality: Summarization costs tokens. Only compress when approaching limits (e.g., > 80% full).
Test Compression Ratios: 50% compression usually preserves quality. Above 70% risks losing important context.
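The practices above can be combined into one trigger function. This is a sketch under stated assumptions: all names are illustrative, `count_tokens` is any token-counting callable, and the placeholder summary would be produced by one of the methods above in a real agent.

```python
def maybe_compress(messages, count_tokens, limit,
                   keep_recent=10, threshold=0.8):
    """Compress only once context passes `threshold` of the token limit.

    System messages and the last `keep_recent` turns stay verbatim;
    older turns collapse into one placeholder summary message.
    """
    total = sum(count_tokens(m["content"]) for m in messages)
    if total <= threshold * limit:
        return messages  # enough headroom: compressing now wastes cost

    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    old, recent = rest[:-keep_recent], rest[-keep_recent:]
    if not old:
        return messages  # nothing safe to compress

    # Naive truncation as a placeholder -- swap in LLM or extractive
    # summarization from the methods above
    summary = {"role": "system",
               "content": "Summary of earlier turns: "
                          + " | ".join(m["content"][:60] for m in old)}
    return system + [summary] + recent
```

Note how the function encodes all four practices: it compresses the oldest turns first, never touches system messages or the most recent turns, and does nothing until the 80% threshold is crossed.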