Context Engineering: What the Model Sees Is the System (Part 2) ~ Rathish Kumar's Blog

Context Engineering. Source: Rathish Kumar's Blog

Building Durable AI Systems · Part 2 of 5 · builds on the canonical overview and Part 1.

The same prompt, sent to the same model, can produce a sound recommendation one day and a confidently wrong one the next. The prompt did not change. The context around it did.

Context engineering is deciding what the model gets to see before it answers. That choice often matters more than the wording of the prompt. A lot of behavior that looks like model inconsistency is really context inconsistency.

Every answer is built from two piles of knowledge. One is what the model learned during training: broad, frozen at a cutoff date, and the same for everyone. The other is what you put into the context window for this request: current, private, and under your control. The skill is knowing when the model's memory is enough and when you need to bring the evidence with you.

This is also why Google's recent paper on the new software lifecycle frames an agent as more than a model. It is a model plus the harness around it: instructions, tools, memory, retrieval, guardrails, orchestration, and observability.⁵ Context engineering is the part of that harness that decides what evidence and rules the model can reason from right now.

The model is only one input. The context is the system deciding what the model is allowed to know.

Choose The Knowledge Source First

Before you build the context window, decide where the knowledge should come from. These sources are not interchangeable.

Pick The Source By Freshness And Cost. Move To The Next One Only When The One Before Cannot Answer.

1. Pretrained (free · instant)

Broad, frozen at the training cutoff, identical for everyone. Right for stable, general domains.

2. Web search (current · public)

Facts that move faster than the cutoff. Freshness, with noise to filter.

3. Retrieval (RAG) (private · yours)

Your decision records, runbooks, incident history. Knowledge the model could never have learned.¹

4. Fine-tuning (high cost to change)

Bakes a style or pattern into the weights. Suits stable patterns, not fast-moving facts.

Pretrained knowledge is what the model already has. It is instant and free per call, and it works for stable, general topics. Web search adds current public information when the answer changes faster than a training cutoff. Retrieval from your own documents, the RAG path, brings in knowledge the model could never have learned: your decision records, runbooks, and incident history.¹ Fine-tuning changes the model's weights, so use it for stable patterns and style, not facts that will change next quarter.
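The choice above can be written down as a small routing rule. This is a minimal sketch, not a prescribed implementation: the `Source` enum, the `choose_source` function, and its three flags are hypothetical names introduced here for illustration, and a real system would likely weigh cost and latency as well.

```python
from enum import Enum


class Source(Enum):
    PRETRAINED = "pretrained"
    WEB_SEARCH = "web_search"
    RETRIEVAL = "retrieval"
    FINE_TUNING = "fine_tuning"


def choose_source(is_private: bool, changes_after_cutoff: bool,
                  needs_baked_in_style: bool = False) -> Source:
    """Pick a knowledge source for one kind of question (illustrative rule of thumb)."""
    if is_private:
        # The model could never have learned this, so bring the documents.
        return Source.RETRIEVAL
    if changes_after_cutoff:
        # Public facts that move faster than the training cutoff.
        return Source.WEB_SEARCH
    if needs_baked_in_style:
        # Stable patterns or style, not facts that change next quarter.
        return Source.FINE_TUNING
    # Stable, general knowledge: the model's memory is enough.
    return Source.PRETRAINED


print(choose_source(is_private=True, changes_after_cutoff=False))
```

The ordering of the checks is itself an assumption: private knowledge is tested first because no other source can supply it.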

When The Model Barely Saw The Domain

Internal protocols, narrow standards, and proprietary frameworks are where the model's memory is weakest and its confidence is most dangerous. Bring the source text. Add a short glossary for overloaded terms, for example: 'circuit-id: the unique identifier for a leased-line segment in our provisioning system, not a general networking term.' Give one or two worked examples. Then check every claim against the text you supplied.

The trade-off is simple: freshness and specificity cost latency and complexity. Pretrained knowledge is cheap but generic. Retrieval and search add useful evidence, but the extra step can fail. The mistake is using retrieval when memory was enough, or trusting memory when the answer lives in your private data.

The Window Is Built, Not Given

Once you know the sources, you still have to assemble the request. In a real system the context window is a stack of layers, each with its own owner and failure mode.

Six Layers, One Window: Everything You Assemble Collapses Into What The Model Reads

System: role + non-negotiable rules (most stable)

User: the actual request (untrusted input)

Retrieved (RAG): docs fetched at request time (most likely to poison)

Tool: results of calls, folded back in

Conversation: prior turns, which grow without bound

Environment: time, locale, caller's permissions

↓

One assembled window the model reasons over at once

Retrieved context loses quality first.

In most systems, retrieved context is where quality breaks first. It is the least controlled layer, the most variable layer, and the one most often misdiagnosed as a model problem.

The Architecture Review Assistant makes this concrete. Its system context fixes the reviewer's rules. The user asks about a subscription-billing event store at roughly five hundred events a second, leaning toward a database the team already knows. Retrieval adds the decision record for approved databases and an incident report from a single-instance failure on the payment path. The final recommendation is shaped as much by those two passages as by the user's question. If retrieval gets those wrong, the model will be wrong for a perfectly understandable reason.

assembled context: architecture review assistant

SYSTEM

You are a senior architecture reviewer.
    - Always flag single points of failure
    - Ask for RTO/RPO before recommending storage
    - Reference internal ADRs by ID when relevant

RETRIEVED: ADR-042

Approved DBs: PostgreSQL (primary), Redis (cache).
    MongoDB: not approved, requires review-board exception.
    Last reviewed: 18 months ago. Owner: platform-infra.

RETRIEVED: INC-2847

Cause: single RDS instance in payments-service.
    Failover took 4m22s during an AZ outage. SLA breach.
    Remediation: Multi-AZ required for payment-path services.

USER

New event store for subscription billing, ~500/sec at
    peak. Thinking MongoDB since the team knows it.

The answer is shaped as much by the two retrieved panes as by the question.
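A rough sketch of how such a window might be assembled in code follows. The `Layer` dataclass, the `assemble` function, and the bracketed labels are assumptions made for illustration; the passage texts are shortened from the example above.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    name: str      # e.g. "SYSTEM" or "RETRIEVED: ADR-042"
    text: str
    trusted: bool  # system rules are trusted; user and retrieved text are not


def assemble(layers: list[Layer]) -> str:
    """Join layers into one labeled window, skipping empty layers."""
    parts = []
    for layer in layers:
        if not layer.text.strip():
            continue  # an empty layer adds tokens but no evidence
        tag = layer.name if layer.trusted else f"{layer.name} (untrusted)"
        parts.append(f"[{tag}]\n{layer.text.strip()}")
    return "\n\n".join(parts)


window = assemble([
    Layer("SYSTEM", "You are a senior architecture reviewer.", True),
    Layer("RETRIEVED: ADR-042", "Approved DBs: PostgreSQL (primary), Redis (cache).", False),
    Layer("RETRIEVED: INC-2847", "Remediation: Multi-AZ required for payment-path services.", False),
    Layer("TOOL", "", False),
    Layer("USER", "New event store for subscription billing, ~500/sec at peak.", False),
])
print(window)
```

Labeling each layer in the assembled text makes the window easier to read back later, which is the point of treating it as something built rather than given.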

Static Context Costs Money. Dynamic Context Costs Judgment.

A practical question follows: what belongs in every request, and what should be loaded only when the task needs it? Static context is always present, so it is reliable and expensive. Dynamic context is pulled in on demand, so it is cheaper per turn, but only as good as the routing and retrieval that selected it.

Six types of agent context mapped into static context and dynamic context: instructions, knowledge, memory, examples, tools, and guardrails.

One useful way to visualize the static/dynamic split across instructions, knowledge, memory, examples, tools, and guardrails. Image credit: Addy Osmani, from The New Software Lifecycle.⁶

This is where context engineering becomes an operating-cost decision, not just a quality decision. If a rule is business-critical, static context may be worth the token cost. Every request must obey it. If a document matters only for one product area, loading it every time burns tokens and buries signal. A durable system keeps stable constraints close, retrieves task-specific evidence late, and measures whether the dynamic layer brought back the right material.
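One way to picture the split, as a sketch only: static rules travel with every request, while documents are looked up by product area. The routing table, the `build_context` function, and the `logging-retention-record` ID are hypothetical; the two rules and the billing document IDs come from the assistant example.

```python
# Static context: present in every request, so every request pays for it.
STATIC_RULES = [
    "Always flag single points of failure",
    "Ask for RTO/RPO before recommending storage",
]

# Dynamic context: hypothetical routing table, product area -> document IDs.
DYNAMIC_DOCS = {
    "billing": ["ADR-042", "INC-2847"],
    "logging": ["logging-retention-record"],
}


def build_context(area: str) -> dict:
    """Return the static rules plus whatever the router finds for this area."""
    return {
        "static": list(STATIC_RULES),
        "dynamic": list(DYNAMIC_DOCS.get(area, [])),
        # Recording whether routing matched lets you measure the dynamic layer.
        "routed": area in DYNAMIC_DOCS,
    }


print(build_context("billing"))
```

The `routed` flag is the cheap part of measurement: a request that needed evidence but matched no route is a dynamic-layer miss, not a model failure.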

Retrieval has its own cost curve too. A single lookup may be cheap, but production retrieval often includes query rewriting, embedding search, metadata filters, permission checks, reranking, freshness checks, and sometimes another model call before the answer is generated. Each step can improve relevance, but each step also adds latency, operational surface area, and another place for failure. Use retrieval for knowledge that is current, private, narrow, or too large to keep in static context, not as a reflex for every request.

Context Has An Order And A Budget

Assembly is not only about which layers go in. It is also about order and size. Each layer competes for a finite window, and position matters: models tend to use material near the start and end of a long context more reliably than material in the middle.³ Do not bury the evidence the model needs under low-value filler. A useful default is to start with stable, high-authority rules, add the most relevant retrieved passages, put the user's actual request where it cannot be missed, and trim anything that has not earned its tokens.
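The default ordering and trimming described here might look like the following sketch. The word-count token estimate is a deliberately crude assumption (a real system would use the model's tokenizer), and `fit_to_budget` is a hypothetical helper that expects passages already sorted most relevant first.

```python
def estimate_tokens(text: str) -> int:
    # Crude assumption: roughly one token per whitespace-separated word.
    return len(text.split())


def fit_to_budget(rules: str, passages: list[str], request: str, budget: int) -> list[str]:
    """Always keep rules and request; add ranked passages until the budget is spent."""
    used = estimate_tokens(rules) + estimate_tokens(request)
    if used > budget:
        raise ValueError("rules and request alone exceed the budget")
    kept = []
    for passage in passages:  # assumed sorted most relevant first
        cost = estimate_tokens(passage)
        if used + cost > budget:
            break  # stop at the first passage that does not fit, preserving rank order
        kept.append(passage)
        used += cost
    # Order: stable rules first, evidence next, the user's request last.
    return [rules, *kept, request]


print(fit_to_budget("a b", ["c d e", "f g h i", "j"], "k", budget=7))
```

Stopping at the first passage that does not fit is one design choice among several; skipping it and trying smaller, lower-ranked passages would use more of the budget at the cost of admitting weaker evidence.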

Conversation history is the layer that quietly breaks this. Prior turns accumulate until they crowd out system rules and retrieved evidence by volume alone. The fix is to manage history instead of appending forever: summarize older turns into compact state, keep the last few turns verbatim, and drop what no longer matters. That compression is a design choice, because a summary can lose a detail the user mentioned twenty turns ago. Test it directly. Run your eval set against simulated 15- to 20-turn conversations, not only single-turn prompts. A failure that appears at turn 18 will not show up in a standard eval run.
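A minimal sketch of that history management, assuming a caller-supplied summarizer: `compact_history` and `naive_summary` are hypothetical names, and the placeholder summary below only counts turns, where a production system would probably use a model call or structured state.

```python
from typing import Callable


def compact_history(turns: list[str], keep_last: int,
                    summarize: Callable[[list[str]], str]) -> list[str]:
    """Keep the last few turns verbatim and replace older ones with a summary."""
    if len(turns) <= keep_last:
        return list(turns)
    split = len(turns) - keep_last
    older, recent = turns[:split], turns[split:]
    return [f"SUMMARY OF EARLIER TURNS: {summarize(older)}", *recent]


def naive_summary(older: list[str]) -> str:
    # Placeholder only: it loses every detail, which is exactly the risk to test for.
    return f"{len(older)} earlier turns omitted"


turns = [f"turn {i}" for i in range(1, 19)]  # a simulated 18-turn conversation
for line in compact_history(turns, keep_last=3, summarize=naive_summary):
    print(line)
```

Because the summarizer is passed in, the same eval set can be run against different compression strategies to see which one drops the detail that matters at turn 18.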

Retrieval Pollution: The Characteristic Failure

Retrieved context has a failure mode that deserves its own name, because it is misdiagnosed constantly. Retrieval pollution is when you fetch a document that is on-topic but not actually relevant. The model has no way to judge relevance beyond what is in the window, so it reasons confidently from the polluting document, the output is wrong, and the user concludes the model hallucinated. The real fault was the retrieval step.

This is not only a theoretical failure. RAG research often distinguishes between relevant passages, random noise, and distracting passages: documents that are semantically close to the query but do not contain the answer. Those distracting passages are especially dangerous because they look useful to the retriever and plausible to the model. Recent work has shown that irrelevant retrieved passages can reduce answer accuracy, and that stronger retrievers may surface harder distractors because they are better at finding topically similar material.⁷

Retrieval Pollution: Right Topic, Wrong Evidence

1. Useful Evidence

The retriever finds passages that answer the question: current configuration, approved choices, and relevant incident history.

2. Polluting Evidence

It also finds passages that share the topic but not the answer: old defaults, similar services, or unrelated tuning guides.

↓

3. One Mixed Context Window

The model receives both. If the wrong passage is more specific, more recent-looking, or placed where the model attends best, it can steer the answer.

↓

4. Confident Wrong Output

The response looks grounded because it cites supplied context, but the grounding came from plausible wrong evidence.

Pollution is not random noise. It is plausible wrong evidence that earns a place in the context window.

Two related failures sit alongside it. Stale context is a retrieved document that is months out of date and produces a confident, wrong answer keyed to old information. Context overload is dumping everything that looks relevant into the window and burying the signal under volume. All three are retrieval problems, not model problems, and you cannot fix a retrieval problem by changing the prompt.

Retrieval Is A Pipeline, Not A Lookup

The reason retrieval fails in so many ways is that it is a multi-stage pipeline, and each stage has its own failure surface. Treating it as a single "search the docs" step hides where the quality is actually lost.

Four Stages, Four Places To Lose Quality

1. Chunking

The split decides what can be retrieved as a unit. Too small loses context; too large dilutes relevance. Chunk on natural boundaries and keep each passage self-contained.

↓

2. Matching

Pure semantic similarity finds the same topic, not the answer; that is the mechanism behind pollution. Combine semantic + keyword, and filter on metadata (type, owner, recency).

↓

3. Ranking

Order matters because models weigh position. A rerank that puts the most relevant passage first often beats retrieving more passages.

↓

4. Freshness

A passage correct when indexed can be stale now. Carry a last-reviewed date so the model and your validation treat old information with caution.
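The matching and ranking stages can be sketched together. This example assumes each candidate already carries a semantic score in [0, 1] from an embedding search; the `Candidate` fields, the equal weighting, the simple term-overlap keyword score, and the `tuning-guide` document are all illustrative assumptions rather than a recommended scoring formula.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    doc_id: str
    text: str
    doc_type: str
    semantic_score: float  # assumed output of an embedding search, in [0, 1]


def keyword_score(query: str, text: str) -> float:
    """Fraction of query terms that appear in the text."""
    terms = set(query.lower().split())
    words = set(text.lower().split())
    return len(terms & words) / len(terms) if terms else 0.0


def rank(query: str, candidates: list[Candidate], allowed_types: set[str],
         weight: float = 0.5, top_k: int = 2) -> list[str]:
    """Filter on metadata, then order by a blend of semantic and keyword scores."""
    eligible = [c for c in candidates if c.doc_type in allowed_types]
    scored = sorted(
        eligible,
        key=lambda c: weight * c.semantic_score
        + (1 - weight) * keyword_score(query, c.text),
        reverse=True,
    )
    return [c.doc_id for c in scored[:top_k]]


docs = [
    Candidate("ADR-042", "approved databases for billing", "adr", 0.7),
    Candidate("tuning-guide", "database tuning guide for analytics", "guide", 0.9),
    Candidate("INC-2847", "payment path incident multi-az", "incident", 0.6),
]
print(rank("approved databases billing", docs, allowed_types={"adr", "incident"}))
```

Note that the tuning guide has the highest semantic score of the three. It is the metadata filter, not the similarity score, that keeps it out of the window.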

The decision record the assistant retrieves is a concrete example: it was last reviewed eighteen months ago, and surfacing that date is what lets the reviewer flag that the "approved databases" list may itself be due for review, rather than citing it as current gospel.
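Surfacing the date can be as simple as prefixing each passage with it. In this sketch the `annotate_freshness` helper, the header format, and the 365-day threshold are assumptions chosen for illustration, and the dates in the usage example are invented.

```python
from datetime import date


def annotate_freshness(doc_id: str, text: str, last_reviewed: date,
                       today: date, max_age_days: int = 365) -> str:
    """Prefix a passage with its review date and a staleness flag."""
    age_days = (today - last_reviewed).days
    if age_days > max_age_days:
        status = "STALE, verify before relying on it"
    else:
        status = "current"
    return f"[{doc_id} | last reviewed {last_reviewed.isoformat()} | {status}]\n{text}"


print(annotate_freshness(
    "ADR-042",
    "Approved DBs: PostgreSQL (primary), Redis (cache).",
    last_reviewed=date(2024, 1, 1),  # invented date, roughly eighteen months earlier
    today=date(2025, 7, 1),
))
```

Passing `today` in explicitly, rather than reading the clock inside the function, keeps the check reproducible in tests and replays.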

Multimodal Context Raises The Debugging Cost

Context is not only text. Many real workflows combine an image, a code listing, logs, and a written question in one request. A model that accepts multiple modalities can reason across them. The gain is real: a diagram plus the code it describes plus the error logs gives the model more to work with than any one artifact alone.

Consider an incident-review variant of the assistant. An engineer attaches the architecture diagram for a service, the relevant section of its configuration, and the error logs from an outage, then asks what failed. With all three in context, the model can connect a load balancer in the diagram to a health-check setting in the config to a pattern of timeouts in the logs, an inference no single input supports on its own. That is the case for multimodal context.

The Sharper Failure Is Observability

A text-only request can be logged, replayed, and diffed; a multimodal request carries images and structured files that are expensive to store, difficult to redact, and not easily replayed against a different model. When a multimodal analysis goes wrong, reconstructing what the model actually saw is significantly harder than for a text-only equivalent.

A related failure is contradictory inputs: if the diagram shows a multi-instance setup but the config is for a single instance, the model may silently reconcile the two, often wrongly, and reason straight through the contradiction rather than flagging it. Label clearly which artifact is which, and add multimodal inputs only when the reasoning gain is clear and the debugging cost has been accounted for.

A practical test: if the same task can be accomplished by converting the image or file to a text description and passing that instead, do so. The text path is cheaper to log, replay, and debug.

Test Retrieval Before You Test The Model

Because so many "model" failures are really context failures, test retrieval on its own: given a known query, did retrieval return the documents it should? Build this test before the end-to-end test, not after. When a full run fails, the engineer debugging it should not have to guess whether the prompt, retrieval, or model caused the problem. A passing retrieval test narrows the search immediately.

The test is concrete and cheap. Create queries paired with the document IDs that should be retrieved for each one, then track two numbers. Recall asks whether the right documents showed up at all. Precision asks how much of what came back was actually relevant, which is the direct measure of pollution. For the assistant, a billing-event-store query should retrieve the approved-databases record and the payment-path incident. If it also retrieves an unrelated logging-retention record, precision drops and you found the pollution source before it reached the model. Tracking these numbers turns retrieval quality from a hunch into something you can defend.
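The two numbers take only a few lines to compute. This is a sketch using set-based precision and recall over document IDs; the `precision_recall` helper and the `logging-retention-record` ID are hypothetical, while the two expected IDs follow the assistant example.

```python
def precision_recall(retrieved: list[str], expected: set[str]) -> tuple[float, float]:
    """Return (precision, recall) for one query, comparing document IDs."""
    retrieved_set = set(retrieved)
    hits = retrieved_set & expected
    # Precision: share of what came back that was actually wanted.
    precision = len(hits) / len(retrieved_set) if retrieved_set else 0.0
    # Recall: share of what was wanted that actually came back.
    recall = len(hits) / len(expected) if expected else 0.0
    return precision, recall


expected = {"ADR-042", "INC-2847"}
retrieved = ["ADR-042", "INC-2847", "logging-retention-record"]
precision, recall = precision_recall(retrieved, expected)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this run recall is perfect and precision is two thirds: both expected documents arrived, and one in three retrieved documents is pollution.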

Chunking deserves the same scrutiny. A paragraph that depends on the one before it will mislead the model when retrieval surfaces it alone, so how you split documents is part of retrieval quality, not a preprocessing detail. Context has to be engineered and measured, not assumed. Most inconsistency lives in the context, and the trace of the assembled window usually tells you why.
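One hedged way to keep passages self-contained is to split on paragraph boundaries, merge neighbors up to a size limit, and carry the document title into every chunk. The `chunk_by_paragraph` function, the blank-line paragraph convention, and the word limit are assumptions for illustration; this reduces orphaned paragraphs but does not guarantee that every chunk stands alone.

```python
def chunk_by_paragraph(title: str, text: str, max_words: int = 120) -> list[str]:
    """Split on blank lines, merge adjacent paragraphs, prefix each chunk with the title."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[list[str]] = []
    current: list[str] = []
    for para in paragraphs:
        candidate = current + [para]
        if current and sum(len(p.split()) for p in candidate) > max_words:
            # Adding this paragraph would exceed the limit, so close the chunk.
            chunks.append(current)
            current = [para]
        else:
            current = candidate
    if current:
        chunks.append(current)
    # The title travels with every chunk so a passage retrieved alone keeps its origin.
    return [f"{title}\n\n" + "\n\n".join(c) for c in chunks]


for chunk in chunk_by_paragraph("T", "a b c\n\nd e\n\nf g h i", max_words=5):
    print(repr(chunk))
```

A single paragraph longer than the limit is kept whole rather than cut mid-thought, which trades a size guarantee for coherence.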

When output quality drifts, look at what the model was shown before you look at the model.

What This Changes In How You Build

Once you treat context as an engineered layer, a few defaults change. You choose the knowledge source instead of dropping everything into the prompt. You assemble the window with order and budget in mind instead of appending until it fits. You measure retrieval with precision and recall, so context failures are diagnosed as context failures. You manage conversation history because a window that grows forever is a system that degrades as it runs. The most useful habit is simple: read the assembled context, the exact bytes the model saw, before blaming the model.
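Reading the exact bytes requires having kept them. As a minimal sketch, a trace record could store the assembled window alongside a hash of its UTF-8 bytes; the `trace_record` function and its field names are hypothetical, and where the record is stored is left open.

```python
import hashlib
import json


def trace_record(request_id: str, window: str, model: str) -> str:
    """Serialize the assembled window with a content hash for replay and diffing."""
    data = window.encode("utf-8")
    return json.dumps({
        "request_id": request_id,
        "model": model,
        "sha256": hashlib.sha256(data).hexdigest(),  # same window, same hash
        "bytes": len(data),
        "window": window,
    }, ensure_ascii=False)


print(trace_record("req-1", "SYSTEM\nrules", "example-model"))
```

Comparing hashes across two runs of the "same" prompt is a quick way to confirm whether the context really was the same before looking anywhere else.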

References

1. Lewis et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS 2020. arXiv:2005.11401
2. Hong, Troynikov & Huber (2025). Context Rot: How Increasing Input Tokens Impacts LLM Performance. Chroma. trychroma.com
3. Liu et al. (2023/2024). Lost in the Middle: How Language Models Use Long Contexts. TACL. arXiv:2307.03172
4. Anthropic (2025). Effective Context Engineering for AI Agents. anthropic.com
5. Google (2026). The New SDLC With Vibe Coding. Full paper
6. Osmani, Saboo & Kartakis (2026). The New Software Lifecycle. Author commentary on Google's paper. addyosmani.com
7. Wu et al. (2025). The Distracting Effect: Understanding Irrelevant Passages in RAG. ACL. ACL Anthology

Part 2 of Building Durable AI Systems. Most model inconsistency is context inconsistency.

Disclaimer: The views and opinions expressed here are my own and are shared for educational and discussion purposes. They do not represent the views of any past, present, or future employer, client, or organization.

Rathish Kumar B