Building Durable AI Systems
Building Durable AI Systems · Canonical overview. Five deep dives follow, linked at the end.
Every few months a new model arrives, tops the benchmarks, and resets the conversation. The model is the fastest-changing layer in your stack. This is about the part that does not change: the engineering that keeps an AI feature working after the model under it has been replaced.
If you have shipped an AI feature, you have already felt the consequence: a prompt that worked well on one model returns subtly different output after a routine upgrade, and the behavior you tuned for is gone. The model will be replaced several times over the life of a system you build today, and each replacement is a test of every architectural decision you made before it arrived. Architecture strongly shapes the economics of each replacement: the latency, cost, and reliability profile of an AI feature tends to be decided when it is designed, not when the invoice arrives or the incident fires.
That fact is the whole argument of this series. The work that survives a model swap is ordinary engineering: the contracts you put around the model, the context you assemble for it, the boundaries you place on what it can do, the way you measure its output, and the way you operate it under cost and failure pressure.
Prompt quality decides whether a demo impresses. System quality decides whether the feature is still working six months and two model versions later.
Silent degradation is not unique to AI systems, but it becomes unusually difficult to observe in them: a stale document in the retrieval corpus, a model that agrees with a flawed premise, an input shaped by injected content, each one producing well-formatted, confident output that existing monitoring tends not to catch. Three of the layers below exist as the response to exactly this. Context management keeps bad inputs from reaching the model, evaluation detects bad outputs leaving it, and observability captures what happened in between.
This article lays out the complete mental model. The five layers below are not a framework to adopt or a vendor stack to buy. They are the parts of any system built on a probabilistic component, and each one maps to an assumption that breaks when that component enters a system which traditionally exposed clearer execution boundaries. The interface is a contract that used to be implicit in a typed signature and now has to be written and owned. Knowledge is state a conventional component accesses on its own but that here must be assembled at request time. Action is intent and execution, usually coupled, that now have to be separated deliberately. Evaluation covers failures that elsewhere surface as error codes and here surface as silent degradation. Operations is the cost and reliability profile that emerges predictably elsewhere but here is shaped by upstream architectural decisions.
One Running Example
To keep this concrete rather than abstract, one system runs through every section: an Architecture Review Assistant. An engineer submits a design proposal, and the assistant reviews it the way a senior reviewer with an SRE's instincts would. It flags single points of failure, checks the design against the company's own architecture decision records, looks up the services the proposal names, asks for the numbers it needs before recommending anything, and produces a ranked risk assessment with its reasoning logged.
It is a useful example because it is realistic and unforgiving, and because it stresses every layer in the model above. The output feeds engineering decisions, so a confident wrong answer is expensive - that is an evaluation problem. It needs private company knowledge the model was never trained on - that is a knowledge problem. It has to take actions, not just talk - that is an action and trust boundary problem. And it has to keep working when the model underneath it changes - that is what the interface contract is for. Each section below is one of those problems made concrete.
Interface: The Contract You Own
The durable idea is that a prompt is closer to an interface than to an implementation. It is a contract written in natural language: the inputs, the constraints, and the output you expect. The model is the implementation behind that contract, and the implementation changes every time the model is updated. You own and version the interface; the behavior is rented and can shift underneath you. That is why the contract has to be explicit and the output has to be checked rather than trusted.
In the Architecture Review Assistant, the interface is the system prompt and the output schema. The prompt fixes the reviewer's role and its non-negotiable rules: flag single points of failure, ask for recovery-time and recovery-point objectives before recommending a storage technology, and cite internal decision records by ID. The output is structured into clarifying questions, a ranked risk list, and a recommendation, so downstream code can consume it without parsing prose.
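One way to make that contract checkable is to parse every response into a typed structure and reject anything that does not fit. The sketch below is illustrative only: the field names (`clarifying_questions`, `risks`, `recommendation`, `adr_ids`), the JSON encoding, and the `ADR-012` identifier in the usage note are assumptions, not the assistant's actual schema.

```python
from __future__ import annotations

import json
from dataclasses import dataclass

# Hypothetical field names for the three output sections described above.
REQUIRED_KEYS = {"clarifying_questions", "risks", "recommendation"}


@dataclass
class Risk:
    rank: int
    summary: str
    adr_ids: list[str]


@dataclass
class Review:
    clarifying_questions: list[str]
    risks: list[Risk]
    recommendation: str


def parse_review(raw: str) -> Review:
    """Validate raw model output against the contract; raise ValueError on any violation."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"output is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("top-level output must be a JSON object")
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if not isinstance(data["clarifying_questions"], list) or not isinstance(data["risks"], list):
        raise ValueError("clarifying_questions and risks must be lists")
    try:
        risks = [
            Risk(int(r["rank"]), str(r["summary"]), list(r.get("adr_ids") or []))
            for r in data["risks"]
        ]
    except (KeyError, TypeError, ValueError, AttributeError) as exc:
        raise ValueError(f"malformed risk entry: {exc}") from exc
    # The risk list is meant to be ranked, so require ranks 1..n in order.
    if [r.rank for r in risks] != list(range(1, len(risks) + 1)):
        raise ValueError("risks must be ranked 1..n in order")
    return Review(
        clarifying_questions=[str(q) for q in data["clarifying_questions"]],
        risks=risks,
        recommendation=str(data["recommendation"]),
    )
```

A response such as `{"clarifying_questions": ["What is the RPO?"], "risks": [{"rank": 1, "summary": "single database instance", "adr_ids": ["ADR-012"]}], "recommendation": "defer"}` parses into a `Review`; prose, a bare list, or a response missing a section raises `ValueError`, so the caller can retry or fall back instead of passing unchecked text downstream.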
The failure that compounds quietly is prompt drift. Instructions accumulate, an edge case gets handled by adding a clause, then another, until the system prompt is several thousand tokens that no one fully understands or owns. No individual instruction is wrong, but they conflict under specific input combinations in ways that only surface in production. The operational response is prompt ownership: treat the system prompt as a versioned, reviewed artifact with a clear owner, not a running document anyone can append to. A wording change is a contract change, and it deserves the same review and regression testing you would give a change to an API signature. A prompt change that improved benchmark scores in staging once regressed a specific class of production queries the benchmark did not cover. Part 1 develops this into structured output, schemas, and prompt testing.
Knowledge: What The Model Sees At Request Time
A model knows what its training captured. It does not know your architecture decision records, last week's incident, or the service your team shipped yesterday. The durable distinction is between what the model learned during training and what it needs for this specific request. The second is something you assemble at request time, not something the model already has. This assembly, deciding what goes into the context window and in what order, is the layer that most directly determines output quality.
In the assistant, a request for a new subscription-billing event store retrieves two pieces of company knowledge: the decision record listing approved databases, and the incident report from the time a single database instance on the payment path breached its recovery objective during an availability-zone outage. Those retrieved passages shape the recommendation more than the wording of the question does.
The characteristic failure is retrieval pollution: a document that is on-topic but not actually relevant gets pulled into context, the model reasons confidently from it, and the wrong output is blamed on the model when the fault was retrieval. It is the most common context failure and the most consistently misdiagnosed. The operational lesson is to evaluate retrieval on its own, before any end-to-end test: given a known query, does retrieval return the right documents? Most of what looks like model inconsistency is context construction. A retrieval pipeline that looked correct in testing can silently serve a corpus that has not been updated in months - the model performs as designed on inputs that no longer reflect the system. Part 2 covers the full failure surface: knowledge sources, staleness, assembly order, and retrieval evaluation.
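Evaluating retrieval on its own can be as small as a table of known queries with the documents each should return, scored with recall at k. This is a minimal sketch under assumptions: the `retrieve` callable, the document IDs, and the choice of recall@k as the metric are illustrative and not taken from the assistant's real pipeline.

```python
from typing import Callable, Mapping, Sequence


def recall_at_k(
    retrieve: Callable[[str], Sequence[str]],
    cases: Mapping[str, set],
    k: int = 5,
) -> float:
    """Fraction of expected documents that appear in the top-k results, over all cases."""
    hits = 0
    total = 0
    for query, expected in cases.items():
        returned = set(retrieve(query)[:k])
        hits += len(expected & returned)
        total += len(expected)
    return hits / total if total else 0.0


# Hypothetical golden cases: query -> document IDs retrieval must return.
CASES = {
    "approved databases for an event store": {"ADR-007"},
    "payment path outage recovery objective": {"INC-2031", "ADR-007"},
}


def toy_retrieve(query: str) -> list:
    """Stand-in for the real retriever, returning ranked document IDs."""
    canned = {
        "approved databases for an event store": ["ADR-007", "ADR-019"],
        "payment path outage recovery objective": ["INC-2031", "ADR-003"],
    }
    return canned.get(query, [])


score = recall_at_k(toy_retrieve, CASES, k=5)
```

Here two of the three expected documents come back, so `score` is 2/3. A drop in this number points at retrieval before any model output is inspected, which keeps the diagnosis from defaulting to "the model got worse".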
Action: What The Model Is Allowed To Do
When a model can call tools, it stops only producing text and starts triggering effects in other systems. The durable principle is a separation of powers: the model expresses intent by proposing a call, and the infrastructure decides whether to authorize and execute it. The model is an untrusted caller that happens to speak your API, and it should be treated with the same suspicion you would apply to any input crossing a trust boundary.
The Architecture Review Assistant has a tool that queries the internal service registry for owners, service-level agreements, and deploy history, so it can look up a service the proposal names instead of guessing from the description. The model proposes the lookup; the surrounding code validates the parameters against a schema, checks that this caller is allowed to read the registry, runs the query, and returns the result.
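The propose, validate, authorize, execute sequence can be sketched as a single gate that every proposed call passes through. The tool name `lookup_service`, the `registry:read` scope, and the parameter shape are hypothetical; a real system would more likely validate against a JSON Schema and an existing authorization service.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class ToolCall:
    """What the model proposes: a tool name and arguments, nothing more."""
    name: str
    args: dict


class Rejected(Exception):
    """Raised when a proposed call fails validation or authorization."""


ALLOWED_FIELDS = {"owner", "sla", "deploy_history"}  # assumed registry fields
READ_REGISTRY = "registry:read"                      # assumed scope name


def validate_lookup(args: dict) -> None:
    if set(args) != {"service", "fields"}:
        raise Rejected("unexpected or missing parameters")
    if not isinstance(args["service"], str) or not args["service"]:
        raise Rejected("service must be a non-empty string")
    fields = args["fields"]
    if not isinstance(fields, list) or not fields:
        raise Rejected("fields must be a non-empty list")
    if not all(isinstance(f, str) and f in ALLOWED_FIELDS for f in fields):
        raise Rejected("fields contains a value outside the schema")


def execute(call: ToolCall, caller_scopes: set, registry: Callable[[str], dict]) -> dict:
    """Infrastructure decides: validate, authorize, then run the query."""
    if call.name != "lookup_service":
        raise Rejected(f"unknown tool: {call.name}")
    validate_lookup(call.args)
    if READ_REGISTRY not in caller_scopes:
        raise Rejected("caller may not read the registry")
    record = registry(call.args["service"])
    # Return only the requested fields, never the whole record.
    return {f: record.get(f) for f in call.args["fields"]}
```

The model never holds a registry client. It only produces a `ToolCall`, and the gate is the single place where a call can turn into an effect.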
Those execution guardrails earn their place: a missing field, an unauthorized caller, a call outside the schema - all caught before execution. The failure they cannot catch is a model that produces a valid call because the content it was reading contained instructions designed to produce exactly that call. Correct tool, correct parameters, every authorization check passing - and the manipulation already complete at the retrieval stage, before the execution layer saw anything. This is why the trust boundary belongs at the context layer, not the execution layer. Part 3 traces a complete request lifecycle through authorization, orchestration, and bounded execution.
Evaluation: How You Measure Output
A probabilistic component cannot be tested with equality assertions, because the same input can produce different valid outputs. The durable idea is that evaluation is the layer that makes a non-deterministic component operable at all. If you cannot measure output quality, you cannot detect a regression, justify a model upgrade, or tell whether yesterday's change helped. This is the layer teams skip first and regret most.
The assistant is checked by a separate evaluator that asks structural questions of each response: was a single point of failure flagged, was a relevant decision record cited, did it ask about traffic volume before recommending storage? Mechanical checks such as valid structure and required fields run in ordinary deterministic code; only the genuinely semantic judgments use a model as the judge.
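The mechanical half of that evaluator needs no model at all. The following sketch assumes the response has already been parsed into a dictionary and that decision records follow an `ADR-<number>` ID format; both are assumptions made for illustration. Whether a flagged risk really is a single point of failure is a semantic judgment and is deliberately left out.

```python
import re

ADR_ID = re.compile(r"^ADR-\d+$")  # assumed ID format for decision records
REQUIRED = ("clarifying_questions", "risks", "recommendation")


def mechanical_checks(review: dict) -> dict:
    """Deterministic structural checks; each result is a plain boolean."""
    risks = review.get("risks")
    risks = risks if isinstance(risks, list) else []
    ranks = [r.get("rank") for r in risks if isinstance(r, dict)]
    cited = [i for r in risks if isinstance(r, dict) for i in (r.get("adr_ids") or [])]
    return {
        "has_required_fields": all(key in review for key in REQUIRED),
        "risks_ranked_in_order": len(ranks) == len(risks)
        and ranks == list(range(1, len(risks) + 1)),
        "cites_decision_record": any(
            isinstance(i, str) and ADR_ID.match(i) for i in cited
        ),
    }
```

Because these checks are deterministic, they can run on every response and gate the more expensive model-as-judge step: a response that fails structure never needs a semantic verdict.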
One failure mode is structural: models tend to agree with a confidently stated plan, and that behavior is measured and traceable to how they are trained, so it survives across model generations and has to be designed around rather than wished away [1]. A second is invisible to classic monitoring: the confidently wrong answer that throws no error and fires no alert looks identical to a correct one from the outside.
The failure that bites hardest at scale is different in kind: evaluation passes while production degrades. A golden dataset built from hand-selected examples will systematically miss the inputs real users send, so quality metrics stay green in CI while users experience something different. The only reliable fix is continuous sampling from production traffic to expand the evaluation set, an ongoing operational process rather than a one-time exercise. The operational lesson across all three is to separate generation from evaluation, keep mechanical checks deterministic, and treat a measured drop in output quality as an incident rather than a tuning task. Part 4 covers golden datasets, trace collection, metrics, and service-level objectives for quality.
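Continuous sampling from production can start very simply. The sketch below uses reservoir sampling (Algorithm R) to keep a fixed-size uniform sample from a stream of traces of unknown length. Uniform sampling is an assumption made here for illustration; the text does not prescribe a sampling strategy, and a real pipeline might stratify by input type or oversample failures.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, rng: random.Random) -> List[T]:
    """Keep a uniform random sample of up to k items from a stream of any length."""
    sample: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                sample[j] = item
    return sample


# Hypothetical usage: pick 50 of a day's traces for human labelling.
todays_traces = (f"trace-{n}" for n in range(10_000))
to_label = reservoir_sample(todays_traces, k=50, rng=random.Random())
```

The sampled traces still need labels before they join the evaluation set, which is why this is an ongoing process with an owner rather than a script that runs once.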
Operations: Cost, Routing, And Reliability
Every design choice in the layers above resolves into three operational quantities: latency, cost, and reliability. These pull against each other, and pushing hard on one usually costs another, but this is a tension to engineer around rather than a fixed law that forces you to pick two. Better engineering moves the whole frontier. This is where those architectural choices become visible in production.
The assistant applies the standard levers. A small, fast model handles routing and simple classification, and the larger model is reserved for the review step that genuinely needs it [2]. Stable parts of the prompt are cached so only the per-request tail is recomputed. When a model call times out or returns unparseable output, a defined fallback path runs instead of failing the request. Cost here is more than the metered token bill; the larger cost is owning the system - the cognitive load of every added component, the ramp time for new engineers, and the maintenance burden of debugging non-deterministic behavior.
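The fallback lever is the easiest of these to show in code. This sketch assumes the model client is a callable that enforces its own deadline and raises the built-in `TimeoutError`, and that the degraded response is a fixed payload decided in advance; both are illustrative assumptions rather than a description of any particular SDK.

```python
import json
from typing import Callable


def review_with_fallback(
    call_model: Callable[[str], str],
    prompt: str,
    degraded: dict,
) -> dict:
    """Return the parsed review, or the pre-agreed degraded response on failure."""
    try:
        raw = call_model(prompt)
        parsed = json.loads(raw)  # JSONDecodeError is a subclass of ValueError
        if not isinstance(parsed, dict):
            raise ValueError("expected a JSON object")
        return {"status": "ok", "review": parsed}
    except (TimeoutError, ValueError) as exc:
        # Record why the fallback ran so the degradation is observable, not silent.
        return {"status": "degraded", "reason": type(exc).__name__, "review": degraded}
```

The caller always gets a well-formed result with an explicit `status`, so a provider timeout becomes a visible degraded response rather than a failed request.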
The failure to plan for is the pipeline that looks free in testing and is unaffordable at production scale. An equally avoidable failure is a pipeline with no degraded mode: a single provider timeout propagates into a user-facing outage because no one decided in advance what the system should do when a model call fails. Resilience requires a decision, made at design time, about what acceptable degraded behavior looks like. The operational lesson is to build a cost model and a failure budget before you build the pipeline. Part 5 turns these trade-offs into concrete decisions about routing, caching, decomposition, and degraded modes.
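A first cost model can be a few lines of arithmetic written before any pipeline code. Every number in the sketch below is a placeholder: the per-million-token prices, token counts, cache discount, and request volume are invented for illustration and do not reflect any provider's pricing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    """Price per million tokens, in whatever currency the invoice uses."""
    input_per_mtok: float
    output_per_mtok: float


def request_cost(
    price: Price,
    input_tokens: int,
    output_tokens: int,
    cached_input_tokens: int = 0,
    cache_discount: float = 0.0,
) -> float:
    """Cost of one request; cached input tokens are billed at a discounted rate."""
    fresh = input_tokens - cached_input_tokens
    billable_input = fresh + cached_input_tokens * (1.0 - cache_discount)
    return (
        billable_input * price.input_per_mtok
        + output_tokens * price.output_per_mtok
    ) / 1_000_000


def monthly_cost(per_request: float, requests_per_day: int, days: int = 30) -> float:
    return per_request * requests_per_day * days


# Placeholder figures, not real pricing.
large = Price(input_per_mtok=2.0, output_per_mtok=10.0)
uncached = request_cost(large, input_tokens=6000, output_tokens=1000)
cached = request_cost(
    large, input_tokens=6000, output_tokens=1000,
    cached_input_tokens=4000, cache_discount=0.5,
)
monthly = monthly_cost(uncached, requests_per_day=5000)
```

With these placeholder numbers a single review costs about 0.022 uncached and 0.018 with the stable prompt prefix cached, and 5,000 reviews a day comes to roughly 3,300 a month uncached. The figures are not the point; having the formula before the first deployment is what turns "looks free in testing" into a number someone can approve or reject.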
None of this is maintainable without clear ownership, and the questions are concrete. Who reviews a change to the system prompt? Who owns rollback when output quality degrades? Who is paged when a quality objective is breached? Who approves adding a tool to an agent's available actions? These do not need a new process; they fit the ones you already run. A prompt change is a change to a contract, so it goes through code review and change management. Quality degradation is an objective with an owner and an on-call rotation. Tool access is an authorization decision, reviewed like any other access grant. Without that ownership, scope creeps, quality degrades unnoticed, and changes ship unreviewed. These are not hypothetical risks - they are the sequence of events in most teams that built something fast and discovered six months later that nobody could explain why the output had changed.
What To Take From This
Prompt quality matters, and a well-crafted prompt is still the cheapest reliability improvement available. But the prompt is one layer of five, and the other four are where production systems are won or lost. The teams shipping AI features that keep working are not the ones who found the best phrasing. They are the ones who put a contract around the model, controlled what it sees, bounded what it can do, measured what it produces, and operated it under real cost and failure pressure. Those five disciplines are ordinary engineering applied to a new and unusually unpredictable component, and they remain true when the model underneath them is replaced.
The Five Deep Dives
References & Notes
[1] Sharma et al. (2023). Towards Understanding Sycophancy in Language Models. Anthropic. arXiv:2310.13548
[2] Chen, Zaharia & Zou (2023). FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance. arXiv:2305.05176
Canonical overview of the series Building Durable AI Systems. The models keep changing. The engineering is the part you keep.
Continue The Conversation
If you're working on AI systems, data platforms, databases, or large-scale software architecture, I'd be interested to hear what you're building.
LinkedIn: linkedin.com/in/rathishkb
Contact: rathishkumar.in/p/contact-me.html
For a faster response, use one of these subjects:
- AI Systems
- Architecture Review
- Database Engineering
- Platform Engineering
A few lines of context always help.
