June 2026 ~ Rathish Kumar's Blog

Context Engineering: What the Model Sees Is the System (Part 2)

June 28, 2026 / by Rathish Kumar B

Context Engineering. Source: Rathish Kumar's Blog

Building Durable AI Systems · Part 2 of 5 · builds on the canonical overview and Part 1.

The same prompt, sent to the same model, can produce a sound recommendation one day and a confidently wrong one the next. The prompt was usually identical. The context around it was not.

Context engineering is deciding what the model gets to see before it answers. That choice often matters more than the wording of the prompt. A lot of behavior that looks like model inconsistency is really context inconsistency.

Prompt Engineering Is Interface Design (Part 1)

June 21, 2026 / by Rathish Kumar B

Prompt Engineering Is Interface Design

Building Durable AI Systems · Part 1 of 5 · builds on the canonical overview.

A prompt is closer to an interface than to an implementation. It is a contract written in natural language: the inputs it accepts, the constraints it enforces, and the output it promises. The model is the implementation behind that contract, and it changes every time the model is updated.

You own and version the interface; the behavior is rented.

The Durable Prompting Loop

Contract

State the task, constraints, input boundary, and output shape.

Schema

Turn the response into data that ordinary code can validate.

Version

Store the prompt with the model it was validated against.

Test

Measure consistency across representative inputs, not one good demo.

The interface is the artifact you can review. The model behavior behind it is the dependency you have to verify.

Same Intent, Two Contracts: A Chat Question Versus A System Prompt

chat (vague)

USER

Can you review my design?
We're using MongoDB for a new
billing event store.

ASSISTANT

Sure! MongoDB is flexible and
scales well. You might consider
indexes and sharding. Overall it
can work for many use cases...
(shape and depth differ every run)

What most people type. Loose in, unpredictable out.

app (structured)

SYSTEM

You are a senior architecture reviewer.
Rules: flag single points of failure;
ask for RTO/RPO before recommending
storage; cite internal ADRs by ID.

USER

Proposal: MongoDB event store for
subscription billing, ~500/sec peak.
Return: clarifying_questions[],
risks[] (severity), recommendation.

ASSISTANT

{ "clarifying_questions": [
    "Required RTO/RPO?", "Write volume?" ],
  "risks": [ { "severity":"high",
    "description":"single instance on
    payment path", "adr_refs":["ADR-042"] } ],
  "recommendation": { "decision":"...",
    "confidence":"medium" } }

What a system needs: a role, rules, and an output contract.

The second prompt exposes the contract instead of leaving it implicit. That gives the system something to validate.

A senior engineer will object that an interface defines a contract, not guaranteed behavior. Exactly. Model behavior remains probabilistic, which is why the contract has to be explicit and the surrounding system has to validate what came back. The guarantees are weaker than a typed API, so the interface discipline matters more, not less.

The Difference Between Asking And Specifying

Getting a good answer in a chat window has more in common with asking a clear question than with building a system. The chat is forgiving: you see the output, judge it, and rephrase. A production prompt runs unattended, on inputs nobody previewed, and its output flows into code that cannot reread it for tone.

Specifying means naming the contract before the model answers: what role it is playing, which rules must hold, where the input begins, and what shape the output must take. This is the same instinct that makes you validate function arguments instead of hoping callers behave.

The Cost Of Treating Prompts Like Strings

When a prompt is treated as a string, its behavioral dependencies stay hidden. Product logic starts relying on phrasing, ordering, omitted fields, or conventions that were never named as part of a contract. That works until the prompt changes, the model is upgraded, or the input distribution shifts.

The cost does not usually show up as token spend. It shows up as retries, regressions, manual review, debugging time, and production behavior nobody can explain from a diff. Interface thinking makes the dependency visible: the prompt, model, schema, validation rules, and evaluation cases become artifacts the team can review together.

The Anatomy Of A Reliable Prompt

Guidance from the major model providers has converged on a common shape, and it is not one clever sentence. A reliable prompt is a few distinct parts, each doing one job: a role and instruction that set what the model is and what it should do, the rules and constraints it must respect, the input clearly delimited from everything else, and an explicit description of the output. Examples are optional and earn their place only when the format is hard to describe in words.

Anatomy Of A Prompt: Each Part Has One Job, And You Can Change One Without Rewriting The Rest

Role + instruction

Who the model is and the one task to do.You are a senior architecture reviewer. Assess this proposal.

Rules + constraints

What must hold and what to avoid.Flag single points of failure. Ask for RTO/RPO before recommending storage.

Input (delimited)

The data to act on, kept separate from the instructions.Proposal: MongoDB event store, ~500/sec peak.

Output contract

The exact shape you expect back, so you can validate it.clarifying_questions[], risks[] with severity, one recommendation.

Examples (optional)

One or two input→output pairs, only when the format is easier to show than to describe.

Keep instructions and data in separate parts. Mixing them is what lets a stray sentence in the input get read as a command.

The trade-off is real and worth stating plainly. More constraint buys more consistency and costs flexibility. A prompt tuned tightly for backend service specs will produce awkward output on a frontend request. You are choosing where on that line to sit, and the right place depends on how varied the real input is.

⚠

Design-time failure mode

Instructions can be individually reasonable and still conflict under specific input combinations. That failure rarely appears as a clean error in review; it appears later as inconsistent output on unusual inputs.

Structured Output And Schemas

The constraint that does the most work is the output contract. When a prompt feeds another system, prose is a liability: something downstream has to parse it, and parsing free text is where pipelines break. Structured output turns the model's response into data your code can consume directly.

Modern model APIs increasingly support schema-constrained generation and structured outputs directly. That does not remove the need for interface design; it makes the interface more enforceable. The schema is still the contract your system depends on.

A schema also gives you a deterministic place to catch failure. Consider the Architecture Review Assistant from the overview: its response is not a paragraph of advice but a structured object with clarifying questions, ranked risks tagged with severity, citations to internal decision records by ID, and one recommendation. Downstream code renders that without interpreting prose, and a missing severity or unknown record ID fails validation immediately.

In Python, that contract can be a Pydantic model. Libraries such as Instructor use that model both to steer the model toward structured output and to validate the response before your application sees it:

from enum import Enum
from pydantic import BaseModel, Field


class Severity(str, Enum):
    low = "low"
    medium = "medium"
    high = "high"
    critical = "critical"


class Confidence(str, Enum):
    low = "low"
    medium = "medium"
    high = "high"


class Risk(BaseModel):
    description: str = Field(
        description="A concrete architectural risk in the proposal."
    )
    severity: Severity = Field(
        description="Impact level if the risk reaches production."
    )
    adr_refs: list[str] = Field(
        default_factory=list,
        description="Relevant internal ADR IDs, for example ADR-042."
    )


class Recommendation(BaseModel):
    decision: str = Field(
        description="Recommended path: approve, reject, or request changes."
    )
    confidence: Confidence = Field(
        description="Confidence in the recommendation."
    )


class ArchitectureReviewContract(BaseModel):
    clarifying_questions: list[str] = Field(
        description="Questions required before making a final architectural call."
    )
    risks: list[Risk] = Field(
        description="Ranked risks found in the proposed design."
    )
    recommendation: Recommendation

The Field(description="...") annotations are not decoration. They are part of the prompt interface: they describe the semantics of each field to the model and document the contract for the humans maintaining it.

import instructor
from openai import OpenAI

client = instructor.from_openai(OpenAI())

review = client.chat.completions.create(
    model="your-production-model",
    response_model=ArchitectureReviewContract,
    messages=[
        {
            "role": "system",
            "content": (
                "You are a senior architecture reviewer. "
                "Flag single points of failure, ask for RTO/RPO before "
                "recommending storage, and cite internal ADRs by ID."
            ),
        },
        {
            "role": "user",
            "content": "Proposal: MongoDB event store for billing, 500/sec peak.",
        },
    ],
)

for risk in review.risks:
    route_to_review_queue(risk)  # Application code sees typed data, not prose.

That schema is both instruction and gate. A severity outside the allowed set, a recommendation with no decision, or a citation to a record ID that does not exist fails validation in plain code before anything acts on the response. Treat a schema violation as a caught error: log the input and raw output, then either retry once with a corrective prompt that names the violation, or fail to a deterministic fallback. Make the schema part of the contract from the start, not a formatting step bolted on later.

Handling Contract Violations

The model is probabilistic; the contract is not. Sooner or later the model will skip a required key, hallucinate an enum value, or return prose where the system expected a list. The architecture should catch that at the boundary and either repair it in a bounded way or fail before bad data reaches the rest of the system.

import json
from pydantic import ValidationError


class ContractViolation(RuntimeError):
    pass


def run_with_contract_retry(proposal: str, max_attempts: int = 2) -> ArchitectureReviewContract:
    messages = [
        {
            "role": "system",
            "content": (
                "Return only JSON that satisfies ArchitectureReviewContract. "
                "Do not add prose outside the JSON object."
            ),
        },
        {"role": "user", "content": proposal},
    ]
    last_error: ValidationError | None = None

    for _ in range(max_attempts):
        raw = call_model(messages)  # Your provider wrapper; returns a JSON string.

        try:
            return ArchitectureReviewContract.model_validate_json(raw)
        except ValidationError as exc:
            last_error = exc
            messages.extend([
                {"role": "assistant", "content": raw},
                {
                    "role": "user",
                    "content": (
                        "Your previous response violated the output contract. "
                        "Fix only the JSON. Validation errors:\n"
                        f"{json.dumps(exc.errors(), indent=2)}"
                    ),
                },
            ])

    raise ContractViolation("model failed to satisfy ArchitectureReviewContract") from last_error

That is the self-correcting retry loop in its smallest useful form: catch the deterministic error, feed the error back to the rented implementation, and retry within a strict budget. If the model still violates the contract, the application gets a controlled failure instead of malformed data.

A Worked Example: From A Chat Prompt To A Contract

The gap between asking and specifying is easiest to see on a real task. Suppose the team wants to generate release notes from merged pull requests. The chat version is: "Summarize these PRs into release notes." It works in the demo because a human reads the output and fixes it. In production it drifts: categories change, security fixes move around, and entry length varies. Nothing is wrong on any single run, but nothing downstream can rely on the shape.

The contract version states the parts that were left implicit. The role is a release-notes writer for a specific audience. The rules are explicit: group by change type, lead with breaking changes, one line per change, link the PR number. The input is the PR list, clearly delimited. The output is a structured object with a section per category, each containing entries that carry a title, a PR reference, and a breaking-change flag. The model still writes the prose, but it writes it inside a shape the rest of the pipeline can render, sort, and check. The behavior that mattered, breaking changes first, is now enforced by validation rather than left to the model's mood on a given call.

The operational consequence is the part worth keeping. The chat prompt has no failure signal: bad output looks like prose and flows downstream. The contract version fails loudly when the model omits a required field or invents a category, which converts a silent quality problem into a caught error.

Patterns That Scale, And When Not To Use Them

A handful of prompting patterns recur because they map to real task structure rather than to any particular model.

Zero-shot states the task and trusts the model to handle it. It is the cheapest option and the right default for tasks the model does well.
Few-shot supplies a small number of worked examples and is the durable fix when you need a specific format or a consistent style that is easier to show than to describe.¹
Decomposition breaks a complex task into steps you can inspect. Prompting a model to show intermediate reasoning improves performance on hard problems, the original chain-of-thought result.² Current reasoning models do much of that decomposition internally, so a hand-built chain is sometimes redundant. The rule that lasts: decompose externally when you need steps you can audit, test, and monitor separately, not by default. The same decomposition is a cost and latency lever; Part 5 shows the arithmetic.
Verification has a second prompt check the first one's output. It raises quality on high-stakes tasks and adds latency and cost, so it is worth it only when the failure rate of the primary prompt has been measured and judged too high. Part 4 covers how to measure that failure rate.

In practice these compose as a sequence, not a menu. Start zero-shot, add examples when format or style needs demonstration, decompose when the steps need separate auditability, and add verification only after the measured failure rate justifies it. Each step buys reliability and costs latency, so stop as soon as the task is met.

⚠

Failure mode

Over-engineering. A six-step chain with verification loops on a task that a zero-shot prompt handles fine adds latency and failure points for no gain. Match the pattern to the task, and let the model do the reasoning it already does well.

Prompts Change, So Version Them

A prompt is not written once. It changes as you discover inputs it mishandles, as the product's requirements shift, and as the model underneath is upgraded. Each of those is a change to a contract that other code depends on, which means a prompt belongs in version control or a prompt registry, not pasted into application code as a forgotten string.

from langchain import hub

# Pull a reviewed prompt artifact instead of hardcoding the template in code.
# In practice this can be LangChain Hub, LangSmith Prompt Hub, or your own registry.
prompt = hub.pull("rathish/architecture-review:1.3.0")

messages = prompt.invoke({
    "proposal": "MongoDB event store for billing, 500/sec peak",
    "required_output": ArchitectureReviewContract.model_json_schema(),
}).to_messages()

review = client.chat.completions.create(
    model="your-production-model",
    response_model=ArchitectureReviewContract,
    messages=messages,
)

The important part is not the specific registry. It is the separation of concerns: application code calls a named interface version, while prompt text, model choice, schema, owners, and evaluation results remain reviewable artifacts.

The operational consequence shows up at upgrade time. Picture the release-notes prompt running happily for months, then the provider deprecates the model it was tuned against and routes you to a newer one. Overnight the new model starts being more verbose, and the one-line-per-change rule erodes into short paragraphs. If the prompt, its version, and the model it was validated against were stored together, this is a diff and a re-run of the curated evaluation set (the golden dataset that Part 4 develops in full): you see what changed, adjust the constraint, and ship. If they were not, it is a production mystery that starts with someone noticing the release notes look off and ends with a slow reconstruction of which model is even serving traffic. Keep the prompt versioned alongside the model it was validated against, and a model upgrade becomes a reviewable change rather than a surprise.

This is also where a small amount of input discipline pays off. Delimit the input clearly from the instructions, with an explicit marker or a structured field, so the model can tell the difference between what it is supposed to do and the data it is supposed to do it to. That separation keeps a stray instruction inside the data from being read as a command, and Part 3 develops it into a full trust boundary once the model can take actions.

Testing A Prompt Means Testing For Consistency

You cannot assert that a prompt returns one exact string, because a probabilistic model will phrase the same correct answer many ways. What you can test is whether it satisfies its contract across a representative set of inputs: does it return valid structure, stay inside its constraints, and reach the right answer often enough? A prompt that is excellent on your favorite example and wrong one time in five is a defect, not a feature, and you only see that by measuring consistency across a golden dataset rather than quality on a single best case.

Concretely, the test asserts properties rather than text. Against the architecture review contract, the first layer is structural:

import pytest


@pytest.mark.parametrize("proposal", golden_architecture_review_inputs)
def test_architecture_review_contract(proposal: str) -> None:
    review = run_architecture_review(proposal)

    assert isinstance(review, ArchitectureReviewContract)
    assert review.recommendation.decision
    assert all(isinstance(r.severity, Severity) for r in review.risks)
    assert all(ref.startswith("ADR-") for r in review.risks for ref in r.adr_refs)

It says nothing about wording and everything about shape, and it fails the moment a contract regression appears. Semantic quality still needs a broader evaluation pipeline: sampled production traces, golden datasets, and sometimes an LLM-as-a-judge scoring specific properties. That deeper machinery belongs in Part 4; the interface test here is the first gate.

✓

Operational practice

Every production failure should become an evaluation case. Capture the failing input, prompt version, model version, and bad output; add it to the regression set before changing the prompt. The goal is not only to fix one incident, but to prevent the same class of failure from quietly returning.

A contract you cannot test is a contract you are only hoping holds.

The 2026 Implementation Landscape

The architecture is more important than the library, but the ecosystem has converged around a few useful ways to enforce prompt contracts in real systems:

Instructor is the minimalist Python path: define a Pydantic model, ask the model for that response shape, and get typed validation at the boundary.
LangChain / Haystack make sense when the prompt interface is one step in a larger pipeline: retrieval, routing, structured output, evaluation, and observability around the same flow.
Vercel AI SDK is the JavaScript and TypeScript version of the same idea, commonly using Zod schemas to constrain and stream structured output into applications.
BAML treats prompts and outputs as typed interface definitions across languages. It is useful when several services need to share the same AI contract without each team hand-rolling prompt glue.

The tools differ, but the durable shape is the same: declare the interface, validate the output, version the contract, and test it against real inputs.

What Changes In How You Build

Treating the prompt as an interface changes three habits. You write the output contract first and validate against it, the same way you would design an API response before its callers. You keep the prompt versioned alongside the model it was validated against, so upgrades are reviewable. And you measure the prompt's consistency rather than admiring its best output. Good prompts reduce ambiguity. Contracts, versioned and tested, are what turn that reduced ambiguity into reliability.

References

Brown et al. (2020). Language Models are Few-Shot Learners. NeurIPS 2020. arXiv:2005.14165
Wei et al. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. NeurIPS 2022. arXiv:2201.11903

Part 1 of Building Durable AI Systems. The interface is the part you own; the model is the part you rent.

DisclaimerThe views and opinions expressed here are my own and are shared for educational and discussion purposes. They do not represent the views of any past, present, or future employer, client, or organization.

Continue The Conversation

If you're working on AI systems, data platforms, databases, or large-scale software architecture, I'd be interested to hear what you're building.

LinkedIn: Rathish Kumar B
Contact: Contact Me

For a faster response, use one of these subjects:

AI Systems
Architecture Review
Database Engineering
Platform Engineering

A few lines of context always help.

Beyond Prompt Engineering: Building AI Systems That Outlast the Model

June 12, 2026 / by Rathish Kumar B

Building Durable AI Systems

Building Durable AI Systems · Canonical overview. Five deep dives follow, linked at the end.

Every few months a new model arrives, tops the benchmarks, and resets the conversation. The model is the fastest-changing layer in your stack. This is about the part that does not change: the engineering that keeps an AI feature working after the model under it has been replaced.

If you have shipped an AI feature, you have already felt the consequence: a prompt that worked well on one model returns subtly different output after a routine upgrade, and the behavior you tuned for is gone. The model will be replaced several times over the life of a system you build today, and each replacement is a test of every architectural decision you made before it arrived. Architecture strongly shapes the economics of that: the latency, cost, and reliability profile of an AI feature tend to be decided when it is designed, not when the invoice arrives or the incident fires.

That fact is the whole argument of this series. The work that survives a model swap is ordinary engineering: the contracts you put around the model, the context you assemble for it, the boundaries you place on what it can do, the way you measure its output, and the way you operate it under cost and failure pressure.

Prompt quality decides whether a demo impresses. System quality decides whether the feature is still working six months and two model versions later.

Silent degradation is not unique to AI systems, but it becomes unusually difficult to observe in them: a stale document in the retrieval corpus, a model that agrees with a flawed premise, an input shaped by injected content, each one producing well-formatted, confident output that existing monitoring tends not to catch. Three of the layers below exist as the response to exactly this. Context management keeps bad inputs from reaching the model, evaluation detects bad outputs leaving it, and observability captures what happened in between.

This article lays out the complete mental model. The five layers below are not a framework to adopt or a vendor stack to buy. They are the parts of any system built on a probabilistic component, and each one maps to an assumption that breaks when that component enters a system which traditionally exposed clearer execution boundaries. The interface is a contract that used to be implicit in a typed signature and now has to be written and owned. Knowledge is state a conventional component accesses on its own but that here must be assembled at request time. Action is intent and execution, usually coupled, that now have to be separated deliberately. Evaluation covers failures that elsewhere surface as error codes and here surface as silent degradation. Operations is the cost and reliability profile that emerges predictably elsewhere but here is shaped by upstream architectural decisions.

Five Layers Around A Replaceable Model

1Interface: The Contract You Own model lives here

A natural-language contract: inputs, constraints, output shape. The model is the implementation behind it, a replaceable component whose behavior you rent, not own.

↓

2Knowledge: What The Model Sees

What you assemble into the context window at request time: retrieved documents, prior turns, environment facts. Decided per request, not baked into the model.

↓

3Action: What It Is Allowed To Do

Tool calls that trigger effects in other systems. The model proposes intent; your infrastructure authorizes and executes. Two different jobs.

↓

4Evaluation: How You Measure Output

The layer that makes a probabilistic component operable. Without it you cannot detect a regression, justify an upgrade, or see silent degradation.

↓

5Operations: Cost, Routing, Reliability

Latency, cost, and reliability under real load. Where the architectural choices above become visible on the invoice and in the incident channel.

The five layers remain the right engineering concerns regardless of which model sits inside them. That is what makes them durable.

One Running Example

To keep this concrete rather than abstract, one system runs through every section: an Architecture Review Assistant. An engineer submits a design proposal, and the assistant reviews it the way a senior reviewer with an SRE's instincts would. It flags single points of failure, checks the design against the company's own architecture decision records, looks up the services the proposal names, asks for the numbers it needs before recommending anything, and produces a ranked risk assessment with its reasoning logged.

It is a useful example because it is realistic and unforgiving, and because it stresses every layer in the model above. The output feeds engineering decisions, so a confident wrong answer is expensive - that is an evaluation problem. It needs private company knowledge the model was never trained on - that is a knowledge problem. It has to take actions, not just talk - that is an action and trust boundary problem. And it has to keep working when the model underneath it changes - that is what the interface contract is for. Each section below is one of those problems made concrete.

Interface: The Contract You Own

The durable idea is that a prompt is closer to an interface than to an implementation. It is a contract written in natural language: the inputs, the constraints, and the output you expect. The model is the implementation behind that contract, and the implementation changes every time the model is updated. You own and version the interface; the behavior is rented and can shift underneath you. That is why the contract has to be explicit and the output has to be checked rather than trusted.

In the Architecture Review Assistant, the interface is the system prompt and the output schema. The prompt fixes the reviewer's role and its non-negotiable rules: flag single points of failure, ask for recovery-time and recovery-point objectives before recommending a storage technology, and cite internal decision records by ID. The output is structured into clarifying questions, a ranked risk list, and a recommendation, so downstream code can consume it without parsing prose.

The failure that compounds quietly is prompt drift. Instructions accumulate, an edge case gets handled by adding a clause, then another, until the system prompt is several thousand tokens that no one fully understands or owns. No individual instruction is wrong, but they conflict under specific input combinations in ways that only surface in production. The operational response is prompt ownership: treat the system prompt as a versioned, reviewed artifact with a clear owner, not a running document anyone can append to. A wording change is a contract change, and it deserves the same review and regression testing you would give a change to an API signature. A prompt change that improved benchmark scores in staging once regressed a specific class of production queries the benchmark did not cover. Part 1 develops this into structured output, schemas, and prompt testing.

Knowledge: What The Model Sees At Request Time

A model knows what its training captured. It does not know your architecture decision records, last week's incident, or the service your team shipped yesterday. The durable distinction is between what the model learned during training and what it needs for this specific request, and the second is something you assemble at request time rather than something the model is given. This assembly, deciding what goes into the context window and in what order, is the layer that most directly determines output quality.

In the assistant, a request for a new subscription-billing event store retrieves two pieces of company knowledge: the decision record listing approved databases, and the incident report from the time a single database instance on the payment path breached its recovery objective during an availability-zone outage. Those retrieved passages shape the recommendation more than the wording of the question does.

The characteristic failure is retrieval pollution: a document that is on-topic but not actually relevant gets pulled into context, the model reasons confidently from it, and the wrong output is blamed on the model when the fault was retrieval. It is the most common context failure and the most consistently misdiagnosed. The operational lesson is to evaluate retrieval on its own, before any end-to-end test: given a known query, does retrieval return the right documents? Most of what looks like model inconsistency is context construction. A retrieval pipeline that looked correct in testing can silently serve a corpus that has not been updated in months - the model performs as designed on inputs that no longer reflect the system. Part 2 covers the full failure surface: knowledge sources, staleness, assembly order, and retrieval evaluation.

Action: What The Model Is Allowed To Do

When a model can call tools, it stops only producing text and starts triggering effects in other systems. The durable principle is a separation of powers: the model expresses intent by proposing a call, and the infrastructure decides whether to authorize and execute it. The model is an untrusted caller that happens to speak your API, and it should be treated with the same suspicion you would apply to any input crossing a trust boundary.

The Architecture Review Assistant has a tool that queries the internal service registry for owners, service-level agreements, and deploy history, so it can look up a service the proposal names instead of guessing from the description. The model proposes the lookup; the surrounding code validates the parameters against a schema, checks that this caller is allowed to read the registry, runs the query, and returns the result.

Those execution guardrails earn their place: a missing field, an unauthorized caller, a call outside the schema - all caught before execution. The failure they cannot catch is a model that produces a valid call because the content it was reading contained instructions designed to produce exactly that call. Correct tool, correct parameters, every authorization check passing - and the manipulation already complete at the retrieval stage, before the execution layer saw anything. This is why the trust boundary belongs at the context layer, not the execution layer. Part 3 traces a complete request lifecycle through authorization, orchestration, and bounded execution.

Evaluation: How You Measure Output

A probabilistic component cannot be tested with equality assertions, because the same input can produce different valid outputs. The durable idea is that evaluation is the layer that makes a non-deterministic component operable at all. If you cannot measure output quality, you cannot detect a regression, justify a model upgrade, or tell whether yesterday's change helped. This is the layer teams skip first and regret most.

The assistant is checked by a separate evaluator that asks structural questions of each response: was a single point of failure flagged, was a relevant decision record cited, did it ask about traffic volume before recommending storage? Mechanical checks such as valid structure and required fields run in ordinary deterministic code; only the genuinely semantic judgments use a model as the judge.

One failure mode is structural: models tend to agree with a confidently stated plan, and that behavior is measured and traceable to how they are trained, so it survives across model generations and has to be designed around rather than wished away.¹ A second is invisible to classic monitoring: the confidently wrong answer that throws no error and fires no alert looks identical to a correct one from the outside.

The failure that bites hardest at scale is different in kind: evaluation passes while production degrades. A golden dataset built from hand-selected examples will systematically miss the inputs real users send, so quality metrics stay green in CI while users experience something different. The only reliable fix is continuous sampling from production traffic to expand the evaluation set, an ongoing operational process rather than a one-time exercise. The operational lesson across all three is to separate generation from evaluation, keep mechanical checks deterministic, and treat a measured drop in output quality as an incident rather than a tuning task. Part 4 covers golden datasets, trace collection, metrics, and service-level objectives for quality.

Operations: Cost, Routing, And Reliability

Every design choice in the layers above resolves into three operational quantities: latency, cost, and reliability. These pull against each other, and pushing hard on one usually costs another, but this is a tension to engineer around rather than a fixed law that forces you to pick two. Better engineering moves the whole frontier. This is where those architectural choices become visible in production.

The assistant applies the standard levers. A small, fast model handles routing and simple classification, and the larger model is reserved for the review step that genuinely needs it.² Stable parts of the prompt are cached so only the per-request tail is recomputed. When a model call times out or returns unparseable output, a defined fallback path runs instead of failing the request. Cost here is more than the metered token bill; the larger cost is owning the system - the cognitive load of every added component, the ramp time for new engineers, and the maintenance burden of debugging non-deterministic behavior.

The failure to plan for is the pipeline that looks free in testing and is unaffordable at production scale. An equally avoidable failure is a pipeline with no degraded mode: a single provider timeout propagates into a user-facing outage because no one decided in advance what the system should do when a model call fails. Resilience requires a decision, made at design time, about what acceptable degraded behavior looks like. The operational lesson is to build a cost model and a failure budget before you build the pipeline. Part 5 turns these trade-offs into concrete decisions about routing, caching, decomposition, and degraded modes.

None of this is maintainable without clear ownership, and the questions are concrete. Who reviews a change to the system prompt? Who owns rollback when output quality degrades? Who is paged when a quality objective is breached? Who approves adding a tool to an agent's available actions? These do not need a new process; they fit the ones you already run. A prompt change is a change to a contract, so it goes through code review and change management. Quality degradation is an objective with an owner and an on-call rotation. Tool access is an authorization decision, reviewed like any other access grant. Without that ownership, scope creeps, quality degrades unnoticed, and changes ship unreviewed. These are not hypothetical risks - they are the sequence of events in most teams that built something fast and discovered six months later that nobody could explain why the output had changed.

What To Take From This

Prompt quality matters, and a well-crafted prompt is still the cheapest reliability improvement available. But the prompt is one layer of five, and the other four are where production systems are won or lost. The teams shipping AI features that keep working are not the ones who found the best phrasing. They are the ones who put a contract around the model, controlled what it sees, bounded what it can do, measured what it produces, and operated it under real cost and failure pressure. Those five disciplines are ordinary engineering applied to a new and unusually unpredictable component, and they remain true when the model underneath them is replaced.

The Five Deep Dives

Each Layer Above Gets Its Own Article

1Prompt Engineering Is Interface Design

Prompts as contracts, structured output, schemas, and prompt testing.

2Context Engineering: What the Model Sees Is the System

Knowledge sources, context assembly, retrieval, and why the same prompt gives different results.

3Tool Calling: When AI Starts Acting

The move from generation to execution, authorization, and trust boundaries.

4Evaluation Is the Missing Layer in AI Engineering

Testing non-deterministic output, golden datasets, observability, and reliability.

5Operating AI Systems in Production

Latency, cost, routing, model selection, caching, failure budgets, and degraded modes under real load.

References & Notes

Sharma et al. (2023). Towards Understanding Sycophancy in Language Models. Anthropic. arXiv:2310.13548
Chen, Zaharia & Zou (2023). FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance. arXiv:2305.05176

Canonical overview of the series Building Durable AI Systems. The models keep changing. The engineering is the part you keep.

Continue The Conversation

If you're working on AI systems, data platforms, databases, or large-scale software architecture, I'd be interested to hear what you're building.

LinkedIn: Rathish Kumar B
Contact: Contact Me

For a faster response, use one of these subjects:

AI Systems
Architecture Review
Database Engineering
Platform Engineering

A few lines of context always help.