Context Engineering. Source: Rathish Kumar's Blog
Building Durable AI Systems · Part 2 of 5 · builds on the canonical overview and Part 1.
The same prompt, sent to the same model, can produce a sound recommendation one day and a confidently wrong one the next. The prompt did not change. The context around it did.
Context engineering is deciding what the model gets to see before it answers. That choice often matters more than the wording of the prompt. A lot of behavior that looks like model inconsistency is really context inconsistency.
Every answer is built from two piles of knowledge. One is what the model learned during training: broad, frozen at a cutoff date, and the same for everyone. The other is what you put into the context window for this request: current, private, and under your control. The skill is knowing when the model's memory is enough and when you need to bring the evidence with you.
This is also why Google's recent paper on the new software lifecycle frames an agent as more than a model. It is a model plus the harness around it: instructions, tools, memory, retrieval, guardrails, orchestration, and observability. [5] Context engineering is the part of that harness that decides what evidence and rules the model can reason from right now.
The model is only one input. The context is the system deciding what the model is allowed to know.
Choose The Knowledge Source First
Before you build the context window, decide where the knowledge should come from. These sources are not interchangeable.
Pretrained knowledge is what the model already has. It is instant and free per call, and it works for stable, general topics. Web search adds current public information when the answer changes faster than a training cutoff. Retrieval from your own documents, the RAG path, brings in knowledge the model could never have learned: your decision records, runbooks, and incident history. [1] Fine-tuning changes the model's weights, so use it for stable patterns and style, not facts that will change next quarter.
Internal protocols, narrow standards, and proprietary frameworks are where the model's memory is weakest and its confidence is most dangerous. Bring the source text. Add a short glossary for overloaded terms, for example: 'circuit-id: the unique identifier for a leased-line segment in our provisioning system, not a general networking term.' Give one or two worked examples. Then check every claim against the text you supplied.
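The steps above (source text, glossary, worked examples) can be sketched as a small assembly helper. This is a minimal illustration, not a prescribed format: the function name `build_domain_context`, the section labels, and the placeholder source and example strings are all assumptions made for this sketch. Only the circuit-id definition comes from the text above.

```python
def build_domain_context(source_text: str, glossary: dict[str, str], examples: list[str]) -> str:
    """Assemble a domain block: authoritative text first, then glossary, then examples."""
    parts = ["Source text (authoritative):", source_text.strip()]
    if glossary:
        parts.append("Glossary:")
        parts.extend(f"- {term}: {definition}" for term, definition in glossary.items())
    if examples:
        parts.append("Worked examples:")
        parts.extend(f"{i}. {example}" for i, example in enumerate(examples, start=1))
    return "\n".join(parts)


domain_block = build_domain_context(
    source_text="<paste the relevant section of the internal standard here>",  # placeholder
    glossary={
        "circuit-id": "the unique identifier for a leased-line segment in our "
                      "provisioning system, not a general networking term",
    },
    examples=["<one short worked example taken from your own documents>"],  # placeholder
)
```

Keeping the source text in its own labelled section also makes the final step easier: each claim in the answer can be checked against that section.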
The trade-off is simple: freshness and specificity cost latency and complexity. Pretrained knowledge is cheap but generic. Retrieval and search add useful evidence, but the extra step can fail. The mistake is using retrieval when memory was enough, or trusting memory when the answer lives in your private data.
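One way to make that choice explicit is a small routing function. The sketch below is an illustration under stated assumptions: the `Question` fields and the `choose_source` name are hypothetical, and a real router would need better signals than two booleans. Fine-tuning is left out because it is a build-time decision, not a per-request one.

```python
from dataclasses import dataclass
from enum import Enum


class Source(Enum):
    PRETRAINED = "pretrained"
    WEB_SEARCH = "web_search"
    RETRIEVAL = "retrieval"


@dataclass
class Question:
    needs_private_data: bool          # answer lives in decision records, runbooks, incidents
    changes_faster_than_cutoff: bool  # public facts that move after the training cutoff


def choose_source(q: Question) -> Source:
    # Private knowledge can never come from the model's memory, so check it first.
    if q.needs_private_data:
        return Source.RETRIEVAL
    if q.changes_faster_than_cutoff:
        return Source.WEB_SEARCH
    # Stable, general topics: memory is enough and costs no extra step.
    return Source.PRETRAINED
```

The ordering encodes the two mistakes named above: the first branch guards against trusting memory for private data, and the final fallback avoids paying for retrieval when memory was enough.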
The Window Is Built, Not Given
Once you know the sources, you still have to assemble the request. In a real system the context window is a stack of layers, each with its own owner and failure mode.
In most systems, retrieved context is where quality breaks first. It is the least controlled layer, the most variable layer, and the one most often misdiagnosed as a model problem.
The Architecture Review Assistant makes this concrete. Its system context fixes the reviewer's rules. The user asks about a subscription-billing event store at roughly five hundred events a second, leaning toward a database the team already knows. Retrieval adds the decision record for approved databases and an incident report from a single-instance failure on the payment path. The final recommendation is shaped as much by those two passages as by the user's question. If retrieval gets those wrong, the model will be wrong for a perfectly understandable reason.
System context:
You are a senior architecture reviewer.
- Always flag single points of failure
- Ask for RTO/RPO before recommending storage
- Reference internal ADRs by ID when relevant

Retrieved decision record:
Approved DBs: PostgreSQL (primary), Redis (cache).
MongoDB: not approved, requires review-board exception.
Last reviewed: 18 months ago. Owner: platform-infra.

Retrieved incident report:
Cause: single RDS instance in payments-service.
Failover took 4m22s during an AZ outage. SLA breach.
Remediation: Multi-AZ required for payment-path services.

User request:
New event store for subscription billing, ~500/sec at peak. Thinking MongoDB since the team knows it.

The answer is shaped as much by the two retrieved panes as by the question.
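The same window can be sketched in code as an ordered list of named layers. This is a minimal illustration: the `Layer` type, the layer names, and the bracketed label format are assumptions made here, while the layer contents are the example text from the article.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    name: str  # hypothetical label, useful later when reading a trace
    text: str


def assemble(layers: list[Layer]) -> str:
    """Join layers in order, labelling each so the assembled window stays readable."""
    return "\n\n".join(f"[{layer.name}]\n{layer.text}" for layer in layers)


layers = [
    Layer("system", "You are a senior architecture reviewer.\n"
                    "- Always flag single points of failure\n"
                    "- Ask for RTO/RPO before recommending storage\n"
                    "- Reference internal ADRs by ID when relevant"),
    Layer("retrieved:decision-record", "Approved DBs: PostgreSQL (primary), Redis (cache).\n"
                                       "MongoDB: not approved, requires review-board exception.\n"
                                       "Last reviewed: 18 months ago. Owner: platform-infra."),
    Layer("retrieved:incident-report", "Cause: single RDS instance in payments-service.\n"
                                       "Failover took 4m22s during an AZ outage. SLA breach.\n"
                                       "Remediation: Multi-AZ required for payment-path services."),
    Layer("user", "New event store for subscription billing, ~500/sec at peak. "
                  "Thinking MongoDB since the team knows it."),
]
window = assemble(layers)
```

Labelling the layers costs a few tokens, but it makes it obvious which part of the window came from retrieval when an answer goes wrong.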
Static Context Costs Money. Dynamic Context Costs Judgment.
A practical question follows: what belongs in every request, and what should be loaded only when the task needs it? Static context is always present, so it is reliable and expensive. Dynamic context is pulled in on demand, so it is cheaper per turn, but only as good as the routing and retrieval that selected it.
Figure: One useful way to visualize the static/dynamic split across instructions, knowledge, memory, examples, tools, and guardrails. Image credit: Addy Osmani, from The New Software Lifecycle. [6]
This is where context engineering becomes an operating-cost decision, not just a quality decision. If a rule is business-critical and every request must obey it, static context may be worth the token cost. If a document matters only for one product area, loading it every time burns tokens and buries signal. A durable system keeps stable constraints close, retrieves task-specific evidence late, and measures whether the dynamic layer brought back the right material.
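The cost side of that decision is simple arithmetic. The sketch below compares loading a document on every request with loading it only when needed. All numbers (a 2,000-token document, 10,000 requests a day, 5% of requests needing it) are hypothetical, and the function names are invented for illustration. It deliberately ignores the cost of the retrieval step itself.

```python
def daily_static_tokens(doc_tokens: int, requests_per_day: int) -> int:
    """Tokens spent per day if the document is included in every request."""
    return doc_tokens * requests_per_day


def daily_dynamic_tokens(doc_tokens: int, requests_per_day: int, percent_needing_doc: int) -> int:
    """Tokens spent per day if the document is loaded only for requests that need it."""
    return doc_tokens * requests_per_day * percent_needing_doc // 100


static_cost = daily_static_tokens(2_000, 10_000)        # 20,000,000 tokens per day
dynamic_cost = daily_dynamic_tokens(2_000, 10_000, 5)   # 1,000,000 tokens per day
```

Under these assumed numbers the static choice costs twenty times more tokens. The dynamic choice is only cheaper in practice if routing selects the right 5% of requests.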
Retrieval has its own cost curve too. A single lookup may be cheap, but production retrieval often includes query rewriting, embedding search, metadata filters, permission checks, reranking, freshness checks, and sometimes another model call before the answer is generated. Each step can improve relevance, but each step also adds latency, operational surface area, and another place for failure. Use retrieval for knowledge that is current, private, narrow, or too large to keep in static context, not as a reflex for every request.
Context Has An Order And A Budget
Assembly is not only about which layers go in. It is also about order and size. Each layer competes for a finite window, and position matters. Do not bury the evidence the model needs between low-value filler. A useful default is to start with stable, high-authority rules, add the most relevant retrieved passages, put the user's actual request where it cannot be missed, and trim anything that has not earned its tokens.
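That default can be sketched as a budgeted assembly step. This is a minimal illustration under stated assumptions: `count_tokens` is a crude whitespace count standing in for a real tokenizer, relevance scores are assumed to come from your retriever, and the names are invented for this example.

```python
def count_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: counts whitespace-separated words.
    return len(text.split())


def fit_to_budget(rules: str, passages: list[tuple[float, str]], request: str, budget: int) -> str:
    """Order: rules first, best passages next, user request last. Drop passages that do not fit."""
    used = count_tokens(rules) + count_tokens(request)
    if used > budget:
        raise ValueError("rules and request alone exceed the budget")
    kept = []
    # Highest relevance score first, so weak passages are the ones that get trimmed.
    for _score, text in sorted(passages, key=lambda p: p[0], reverse=True):
        cost = count_tokens(text)
        if used + cost <= budget:
            kept.append(text)
            used += cost
    return "\n\n".join([rules, *kept, request])
```

Rules and the request are treated as non-negotiable, so only retrieved passages compete for the remaining budget.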
Conversation history is the layer that quietly breaks this. Prior turns accumulate until they crowd out system rules and retrieved evidence by volume alone. The fix is to manage history instead of appending forever: summarize older turns into compact state, keep the last few turns verbatim, and drop what no longer matters. That compression is a design choice, because a summary can lose a detail the user mentioned twenty turns ago. Test it directly. Run your eval set against simulated 15- to 20-turn conversations, not only single-turn prompts. A failure that appears at turn 18 will not show up in a standard eval run.
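The summarize-and-keep-recent approach can be sketched as follows. This is an illustration, not a recommended implementation: `naive_summarize` simply truncates each older turn, where a real system would more likely use a model call or structured state, and the default of keeping four turns is an arbitrary assumption.

```python
from typing import Callable


def naive_summarize(turns: list[str]) -> str:
    # Placeholder summarizer: keeps the first 40 characters of each older turn.
    return " | ".join(turn[:40] for turn in turns)


def compact_history(
    turns: list[str],
    keep_last: int = 4,
    summarize: Callable[[list[str]], str] = naive_summarize,
) -> list[str]:
    """Replace older turns with one summary entry and keep the last few turns verbatim."""
    cut = max(0, len(turns) - keep_last)
    older, recent = turns[:cut], turns[cut:]
    if not older:
        return list(recent)
    return [f"Summary of earlier turns: {summarize(older)}"] + list(recent)
```

Because the summarizer is a parameter, the same eval set can be run against different compression strategies to see which one loses the detail from turn 2 by turn 18.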
Retrieval Pollution: The Characteristic Failure
Retrieved context has a failure mode that deserves its own name, because it is misdiagnosed constantly. Retrieval pollution is when you fetch a document that is on-topic but not actually relevant. The model has no way to judge relevance beyond what is in the window, so it reasons confidently from the polluting document, the output is wrong, and the user concludes the model hallucinated. The real fault was the retrieval step.
This is not only a theoretical failure. RAG research often distinguishes between relevant passages, random noise, and distracting passages: documents that are semantically close to the query but do not contain the answer. Those distracting passages are especially dangerous because they look useful to the retriever and plausible to the model. Recent work has shown that irrelevant retrieved passages can reduce answer accuracy, and that stronger retrievers may surface harder distractors because they are better at finding topically similar material. [7]
Two related failures sit alongside it. Stale context is a retrieved document that is months out of date and produces a confident, wrong answer keyed to old information. Context overload is dumping everything that looks relevant into the window and burying the signal under volume. All three are retrieval problems, not model problems, and you cannot fix a retrieval problem by changing the prompt.
Retrieval Is A Pipeline, Not A Lookup
The reason retrieval fails in so many ways is that it is a multi-stage pipeline, and each stage has its own failure surface. Treating it as a single "search the docs" step hides where the quality is actually lost.
The decision record the assistant retrieves is a concrete example: it was last reviewed eighteen months ago, and surfacing that date is what lets the reviewer flag that the "approved databases" list may itself be due for review, rather than citing it as current gospel.
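Surfacing the review date can be as simple as prefixing each retrieved passage with its age. The sketch below assumes each document carries a `last_reviewed` date in its metadata. The 365-day threshold, the header format, and the function name are hypothetical choices for illustration.

```python
from datetime import date


def annotate_freshness(text: str, last_reviewed: date, today: date, max_age_days: int = 365) -> str:
    """Prefix a retrieved passage with its review date, and flag it when it is old."""
    age_days = (today - last_reviewed).days
    header = f"[last reviewed: {last_reviewed.isoformat()}, {age_days} days ago]"
    if age_days > max_age_days:
        header += " [STALE: may be due for review]"
    return f"{header}\n{text}"


passage = annotate_freshness(
    "Approved DBs: PostgreSQL (primary), Redis (cache).",
    last_reviewed=date(2024, 1, 1),  # hypothetical date, roughly eighteen months before `today`
    today=date(2025, 7, 1),
)
```

The annotation does not decide anything on its own. It puts the date in the window so the model, and the human reading the trace, can weigh the record accordingly.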
More Context Can Make The Answer Worse
It is tempting to think that if relevant context helps, more context is safer. The evidence does not support that. Model performance can degrade as the input grows long, even when the answer is still inside the window, an effect now studied as context rot. [2] A related pattern is that models use information at the start and end of a long context more reliably than information buried in the middle. [3]
The useful mental model is working memory, not storage. A two-hundred-thousand-token window does not make two hundred thousand tokens free or neutral. Treat context as scarce and ordered: put the important material where the model attends best, keep only what the task needs, and resist padding the window just because there is room. [4] Vendors will keep raising limits. The budget discipline will still matter, because capacity is not the same thing as attention.
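One way to act on the start-and-end pattern is to reorder ranked passages so the strongest sit at the edges and the weakest fall in the middle. This is a sketch of one possible heuristic, not a technique the cited work prescribes. The function name is invented, and whether it helps depends on the model and should be measured.

```python
def edges_first(ranked: list[str]) -> list[str]:
    """Given passages ranked best-first, place the best at the start and end, weakest in the middle."""
    front, back = [], []
    for i, passage in enumerate(ranked):
        # Alternate: rank 1 goes to the front, rank 2 to the back, rank 3 to the front, and so on.
        (front if i % 2 == 0 else back).append(passage)
    return front + back[::-1]


ordered = edges_first(["p1", "p2", "p3", "p4", "p5"])
# ordered == ["p1", "p3", "p5", "p4", "p2"]
```

Reordering is not a substitute for trimming. If the lowest-ranked passages have not earned their tokens, dropping them is usually the better first move.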
Multimodal Context Raises The Debugging Cost
Context is not only text. Many real workflows combine an image, a code listing, logs, and a written question in one request. A model that accepts multiple modalities can reason across them. The gain is real: a diagram plus the code it describes plus the error logs gives the model more to work with than any one artifact alone.
Consider an incident-review variant of the assistant. An engineer attaches the architecture diagram for a service, the relevant section of its configuration, and the error logs from an outage, then asks what failed. With all three in context, the model can connect a load balancer in the diagram to a health-check setting in the config to a pattern of timeouts in the logs, an inference no single input supports on its own. That is the case for multimodal context.
The cost shows up in debugging. A text-only request can be logged, replayed, and diffed. A multimodal request carries images and structured files that are expensive to store, difficult to redact, and not easily replayed against a different model. When a multimodal analysis goes wrong, reconstructing what the model actually saw is significantly harder than for a text-only equivalent.
A related failure is contradictory inputs. If the diagram shows a multi-instance setup but the config describes a single instance, the model tends to reconcile the two silently, usually wrongly, and reason straight through the contradiction rather than flag it. Label clearly which artifact is which, and add multimodal inputs only when the reasoning gain is clear and the debugging cost has been accounted for.
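Labelling can be sketched as interleaving a short text label before each artifact. The part structure below (`kind` and `payload` keys) is a hypothetical shape for illustration, not any vendor's request format, and the artifact contents are placeholders. The closing instruction to report contradictions is an added assumption, not something the labelling itself guarantees.

```python
def labeled_parts(artifacts: list[tuple[str, str, object]], question: str) -> list[dict]:
    """artifacts: (label, kind, payload) tuples. Returns request parts with a label before each one."""
    parts = []
    for n, (label, kind, payload) in enumerate(artifacts, start=1):
        parts.append({"kind": "text", "payload": f"Artifact {n}: {label}"})
        parts.append({"kind": kind, "payload": payload})
    parts.append({
        "kind": "text",
        "payload": f"Question: {question}\n"
                   "If the artifacts contradict each other, say so before answering.",
    })
    return parts


parts = labeled_parts(
    [
        ("architecture diagram", "image", b"<png bytes>"),        # placeholder payload
        ("service configuration", "text", "<config section>"),    # placeholder payload
        ("error logs from the outage", "text", "<log excerpt>"),  # placeholder payload
    ],
    question="What failed?",
)
```

The labels also help after the fact: a trace that records "Artifact 2: service configuration" is easier to reconstruct than an unlabelled sequence of blobs.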
A practical test: if the same task can be accomplished by converting the image or file to a text description and passing that instead, do so. The text path is cheaper to log, replay, and debug.
Test Retrieval Before You Test The Model
Because so many "model" failures are really context failures, test retrieval on its own: given a known query, did retrieval return the documents it should? Build this test before the end-to-end test, not after. When a full run fails, the engineer debugging it should not have to guess whether the prompt, retrieval, or model caused the problem. A passing retrieval test narrows the search immediately.
The test is concrete and cheap. Create queries paired with the document IDs that should be retrieved for each one, then track two numbers. Recall asks whether the right documents showed up at all. Precision asks how much of what came back was actually relevant, which is the direct measure of pollution. For the assistant, a billing-event-store query should retrieve the approved-databases record and the payment-path incident. If it also retrieves an unrelated logging-retention record, precision drops and you have found the pollution source before it reached the model. Tracking these numbers turns retrieval quality from a hunch into something you can defend.
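The two numbers can be computed in a few lines. In the sketch below the document IDs are hypothetical names for the three records in the example, and returning 0.0 when a set is empty is a convention chosen here, not the only reasonable one.

```python
def precision_recall(retrieved: list[str], expected: list[str]) -> tuple[float, float]:
    """Precision: share of retrieved docs that were expected. Recall: share of expected docs retrieved."""
    retrieved_set, expected_set = set(retrieved), set(expected)
    hits = retrieved_set & expected_set
    precision = len(hits) / len(retrieved_set) if retrieved_set else 0.0
    recall = len(hits) / len(expected_set) if expected_set else 0.0
    return precision, recall


expected = ["adr-approved-databases", "incident-payment-path"]                          # hypothetical IDs
retrieved = ["adr-approved-databases", "incident-payment-path", "adr-logging-retention"]
precision, recall = precision_recall(retrieved, expected)
# recall is 1.0 (both expected documents came back); precision is 2/3 (one of three is pollution)
```

Run over a fixed set of query and expected-ID pairs, these numbers give a baseline that can be compared before and after any change to chunking, embeddings, or filters.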
Chunking deserves the same scrutiny. A paragraph that depends on the one before it will mislead the model when retrieval surfaces it alone, so how you split documents is part of retrieval quality, not a preprocessing detail. Context has to be engineered and measured, not assumed. Most inconsistency lives in the context, and the trace of the assembled window usually tells you why.
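One common way to reduce the dependent-paragraph problem is to carry the preceding paragraph along with each chunk. The sketch below is one of several possible chunking strategies, not one the article prescribes. It assumes paragraphs are separated by blank lines, and the function name is invented for illustration.

```python
def chunk_with_overlap(text: str, overlap: int = 1) -> list[str]:
    """One chunk per paragraph, each prefixed with up to `overlap` preceding paragraphs."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks = []
    for i in range(len(paragraphs)):
        start = max(0, i - overlap)
        chunks.append("\n\n".join(paragraphs[start:i + 1]))
    return chunks


chunks = chunk_with_overlap("A.\n\nB depends on A.\n\nC.")
# chunks == ["A.", "A.\n\nB depends on A.", "B depends on A.\n\nC."]
```

Overlap costs storage and tokens, so it is worth checking with the retrieval test whether it actually improves the answers for your documents.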
When output quality drifts, look at what the model was shown before you look at the model.
What This Changes In How You Build
Once you treat context as an engineered layer, a few defaults change. You choose the knowledge source instead of dropping everything into the prompt. You assemble the window with order and budget in mind instead of appending until it fits. You measure retrieval with precision and recall, so context failures are diagnosed as context failures. You manage conversation history because a window that grows forever is a system that degrades as it runs. The most useful habit is simple: read the assembled context, the exact bytes the model saw, before blaming the model.
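Reading the assembled context is only possible if it was recorded. A minimal trace record might look like the sketch below. The field names and the `trace_record` function are assumptions for illustration, and a real system would also need to think about redaction and retention before storing full context.

```python
import hashlib
import json


def trace_record(request_id: str, assembled_context: str) -> str:
    """Serialize the exact context sent to the model, with a hash for quick comparison."""
    data = assembled_context.encode("utf-8")
    return json.dumps(
        {
            "request_id": request_id,
            "sha256": hashlib.sha256(data).hexdigest(),  # equal hashes mean byte-identical context
            "bytes": len(data),
            "context": assembled_context,
        },
        ensure_ascii=False,
    )


record = trace_record("req-1", "rules\n\nquestion")
```

Comparing the hashes of two runs answers the opening question directly: if the prompt was the same but the hashes differ, the context changed.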
References
1. Lewis et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS 2020. arXiv:2005.11401
2. Hong, Troynikov & Huber (2025). Context Rot: How Increasing Input Tokens Impacts LLM Performance. Chroma. trychroma.com
3. Liu et al. (2023/2024). Lost in the Middle: How Language Models Use Long Contexts. TACL. arXiv:2307.03172
4. Anthropic (2025). Effective Context Engineering for AI Agents. anthropic.com
5. Google (2026). The New SDLC With Vibe Coding. Full paper
6. Osmani, Saboo & Kartakis (2026). The New Software Lifecycle. Author commentary on Google's paper. addyosmani.com
7. Wu et al. (2025). The Distracting Effect: Understanding Irrelevant Passages in RAG. ACL. ACL Anthology
Part 2 of Building Durable AI Systems. Most model inconsistency is context inconsistency.