AI With Goldfish Memory? How to Master the Context Window and Cut Costs by Up to 90%
The $2,000 I Burned in One Week
Last month, my API bill came in $2,000 above expected. When I investigated, I discovered that 70% of the cost was the system sending the same 10,000-token system prompt on every call — hundreds of times a day, without caching.
I knew the context window was important. What I didn’t know was how much management (or lack thereof) directly impacts the wallet. And when I researched deeply, I discovered this is the most expensive and least understood problem in AI engineering in 2026.
If you’ve ever tried pasting a giant document into an AI chat and got an error, or noticed that after a long conversation the AI started “forgetting” what you discussed earlier — welcome to the Context Window limit. And more importantly: welcome to the guide that’ll teach you not to burn money on it.
What Is the Context Window (No Jargon)
Imagine the Context Window as AI’s “desk.” Everything it needs to process at once must fit on this desk: system instructions (system prompts), conversation history, attached documents, tool outputs, and the response it’s generating.
As of April 2026, sizes vary: Gemini supports up to 1 million tokens, GPT-5 up to 400K (1M on Pro $200), Claude up to 200K. Sounds like a lot. But in practice, usable space is dramatically smaller than advertised.
If you try to put more than the desk can hold, the system crashes or — worse — starts throwing away the oldest information without warning.
The “Bigger Is Better” Myth (Context Rot)
This is the insight that surprised me most and changed how I design systems.
You might think: “Why not use a model with a 1-million-token window and forget this problem?” Careful. In 2026, developers and researchers documented a phenomenon called Context Rot.
A Chroma study tested 18 frontier models and found: all degrade with long context, no exceptions. Accuracy drops over 30% in middle-context positions (the “lost in the middle” phenomenon). Attention concentrates at the beginning and end of input — middle information gets less reliable processing.
Research on Maximum Effective Context Window (MECW) was even more impactful: effective context can drop up to 99% below the advertised limit on complex tasks. A model advertising 200K tokens may work reliably only up to 120-140K. Beyond that, hallucinations, ignored instructions, and contradictory answers increase — even with technical space remaining.
Having a giant window is expensive (LLMs charge per token, and filling 1M tokens costs $2+ in input alone) and often less efficient than a good retrieval system that fetches only what’s relevant.
The 3 Classic Management Strategies
To keep your AI from “losing the thread,” there are three paths every engineer should know:
1. Sliding Window. Keeps only the last N messages (usually 5-10). Fast and simple. But completely loses old history. Ideal for streaming data and tasks where recent context is all that matters — support chatbots with independent questions, for example.
2. Summarization. Summarizes old messages into a short paragraph kept as context. Preserves the general sense of conversation without proportional space. But consumes extra tokens to generate the summary, and summarization itself can lose important details. The leaked Claude Code analysis revealed a 5-layer compression pipeline for exactly this — and even so, 1,279 sessions had 50+ consecutive compaction failures.
3. Retrieval (RAG). Stores everything outside the AI (in a vector database or document index) and pulls only what’s relevant for the current question. Extremely scalable — doesn’t matter if you have 100 pages or 100,000. But more complex to implement correctly (as I discussed in “The Confident Lie” — chunking, embeddings, thresholds, re-ranking).
In practice, production systems combine all three: RAG for long-term memory, summarization for conversation history, and sliding window for recent interactions. The layered combination is what works — no single approach solves it alone.
The Secret Weapon: Prompt Caching
If the three techniques above are the basics, Prompt Caching is what separates professionals from elite — and what would have saved me those $2,000.
The concept: often your system prompt (instructions for how AI should behave) is identical across all calls. Without caching, you pay for the AI to “re-read” those same instructions from scratch every time. With caching, the model saves the processing of repetitive instructions and reuses it on subsequent calls.
A January 2026 arXiv paper — “Don’t Break the Cache” — tested 3 caching strategies across 4 frontier models (OpenAI, Anthropic, Google) with 500 agent sessions. Results:
Cost savings: 41% to 80% depending on provider and caching strategy. All four models showed statistically significant reductions.
Anthropic offers prompt caching with a 90% discount on cached tokens — tokens that normally cost $3/M drop to $0.30/M when cached. OpenAI offers 50% discount on cached tokens. Google discounts up to 75%.
The paper identified a real risk: cache invalidation. If you modify the system prompt frequently, the cache invalidates and you lose the benefit. The recommendation: keep the system prompt stable and put dynamic content (tool results, user data) in separate blocks that don’t invalidate the base instruction cache.
Semantic Caching: The Next Level
For those who want to go deeper: semantic caching goes beyond prompt caching. Instead of caching identical prompts, it recognizes when different queries mean the same thing and reuses previous responses without calling the LLM again.
Redis LangCache implements this using vector embeddings: stores previous responses as vectors and, when a new query is semantically similar to a previous one, returns the cached result — bypassing the LLM entirely. Savings of 50-80% for use cases with repetitive queries (customer support, FAQs, documentation lookups).
This is especially relevant for combating context rot: queries that a frustrated user rephrases in different ways (semantically similar but differently worded) don’t need reprocessing — each reprocessing costs money and, with context rot, delivers worse results.
How Much Ignoring This Costs
To contextualize the financial impact:
A single call filling GPT-5’s 1M-token context window costs $2+ in input alone — before any output. If your application makes hundreds of such calls per user session, costs escalate fast. Claude charges $6/$22.50 per million tokens above 200K (vs $3/$15 standard) — long context has a surcharge.
And the cost isn’t just financial: latency increases linearly with context size. More tokens = more processing time = worse user experience.
The combination of RAG (fetch only the relevant) + Prompt Caching (don’t reprocess instructions) + Semantic Caching (don’t reprocess similar queries) can reduce costs by 70-90% compared to the naive approach of “dump everything into context.”
What I Changed After the $2,000
Three concrete changes I implemented:
Stable system prompt + separate dynamic content. My system prompt is now fixed and cached. All changing content (user data, RAG results, tool outputs) goes in separate blocks that don’t invalidate the base cache.
Context threshold. Before each call, I check how full the context window is. If it exceeds 60%, I trigger automatic history summarization. I never approach the limit anymore.
Per-call cost monitoring. I implemented logs recording input and output tokens for every call. I know exactly how much each feature costs — and can identify when something is burning tokens unnecessarily. The $2,000 bug would now be caught in hours, not weeks.
Conclusion: Intelligence Is Management
Mastering AI isn’t just about the smartest model. It’s about how you manage that model’s “attention.” The context window is a finite resource — treat it as critical infrastructure, not infinite space.
Using RAG for memory, summarization for history, sliding window for recency, prompt caching for savings, and semantic caching for repetitive queries — this stack is what makes an AI project viable in the real world. Without it, you’re burning money and getting worse answers at the same time.
And if after reading all this you’re thinking “this seems too complicated” — I thought so too. Until the bill arrived.
Share if this saved your budget:
- Email: fodra@fodra.com.br
- LinkedIn: linkedin.com/in/mauriciofodra
AI doesn’t have goldfish memory. It has a fixed-size desk. Your job is deciding what goes on that desk — and how much to pay for the space.
Read Also
- Beyond the Prompt: Why ‘Context’ Is the Magic Word of AI in 2026 — The context window is the resource context engineering manages. This post is the “how.” That one is the “why.”
- The Confident Lie: Demo vs Production — Poorly implemented RAG burns context with irrelevant chunks. The 6-layer playbook starts here.
- Fine-Tuning vs. RAG: The Definitive Guide — RAG is about managing knowledge outside the window. Fine-tuning is about behavior inside it.