![]() |
| Prompt Engineering Is Interface Design |
Building Durable AI Systems · Part 1 of 5 · builds on the canonical overview.
A prompt is closer to an interface than to an implementation. It is a contract written in natural language: the inputs it accepts, the constraints it enforces, and the output it promises. The model is the implementation behind that contract, and it changes every time the model is updated.
You own and version the interface; the behavior is rented.
Can you review my design? We're using MongoDB for a new billing event store.
Sure! MongoDB is flexible and scales well. You might consider indexes and sharding. Overall it can work for many use cases... (shape and depth differ every run)
What most people type. Loose in, unpredictable out.
You are a senior architecture reviewer. Rules: flag single points of failure; ask for RTO/RPO before recommending storage; cite internal ADRs by ID.
Proposal: MongoDB event store for subscription billing, ~500/sec peak. Return: clarifying_questions[], risks[] (severity), recommendation.
{ "clarifying_questions": [
"Required RTO/RPO?", "Write volume?" ],
"risks": [ { "severity":"high",
"description":"single instance on
payment path", "adr_refs":["ADR-042"] } ],
"recommendation": { "decision":"...",
"confidence":"medium" } }What a system needs: a role, rules, and an output contract.
A senior engineer will object that an interface defines a contract, not guaranteed behavior. Exactly. Model behavior remains probabilistic, which is why the contract has to be explicit and the surrounding system has to validate what came back. The guarantees are weaker than a typed API, so the interface discipline matters more, not less.
The Difference Between Asking And Specifying
Getting a good answer in a chat window has more in common with asking a clear question than with building a system. The chat is forgiving: you see the output, judge it, and rephrase. A production prompt runs unattended, on inputs nobody previewed, and its output flows into code that cannot reread it for tone.
Specifying means naming the contract before the model answers: what role it is playing, which rules must hold, where the input begins, and what shape the output must take. This is the same instinct that makes you validate function arguments instead of hoping callers behave.
The Cost Of Treating Prompts Like Strings
When a prompt is treated as a string, its behavioral dependencies stay hidden. Product logic starts relying on phrasing, ordering, omitted fields, or conventions that were never named as part of a contract. That works until the prompt changes, the model is upgraded, or the input distribution shifts.
The cost does not usually show up as token spend. It shows up as retries, regressions, manual review, debugging time, and production behavior nobody can explain from a diff. Interface thinking makes the dependency visible: the prompt, model, schema, validation rules, and evaluation cases become artifacts the team can review together.
The Anatomy Of A Reliable Prompt
Guidance from the major model providers has converged on a common shape, and it is not one clever sentence. A reliable prompt is a few distinct parts, each doing one job: a role and instruction that set what the model is and what it should do, the rules and constraints it must respect, the input clearly delimited from everything else, and an explicit description of the output. Examples are optional and earn their place only when the format is hard to describe in words.
The trade-off is real and worth stating plainly. More constraint buys more consistency and costs flexibility. A prompt tuned tightly for backend service specs will produce awkward output on a frontend request. You are choosing where on that line to sit, and the right place depends on how varied the real input is.
Instructions can be individually reasonable and still conflict under specific input combinations. That failure rarely appears as a clean error in review; it appears later as inconsistent output on unusual inputs.
Structured Output And Schemas
The constraint that does the most work is the output contract. When a prompt feeds another system, prose is a liability: something downstream has to parse it, and parsing free text is where pipelines break. Structured output turns the model's response into data your code can consume directly.
Modern model APIs increasingly support schema-constrained generation and structured outputs directly. That does not remove the need for interface design; it makes the interface more enforceable. The schema is still the contract your system depends on.
A schema also gives you a deterministic place to catch failure. Consider the Architecture Review Assistant from the overview: its response is not a paragraph of advice but a structured object with clarifying questions, ranked risks tagged with severity, citations to internal decision records by ID, and one recommendation. Downstream code renders that without interpreting prose, and a missing severity or unknown record ID fails validation immediately.
Written out, the assistant's output contract is a small schema:
{
"clarifying_questions": ["string"],
"risks": [
{
"description": "string",
"severity": "low | medium | high | critical",
"adr_refs": ["ADR-042"]
}
],
"recommendation": { "decision": "string", "confidence": "low | medium | high" }
}That schema is both instruction and gate. A severity outside the allowed set, a recommendation with no decision, or a citation to a record ID that does not exist fails validation in plain code before anything acts on the response. Treat a schema violation as a caught error: log the input and raw output, then either retry once with a corrective prompt that names the violation, or fail to a deterministic fallback. Make the schema part of the contract from the start, not a formatting step bolted on later.
A Worked Example: From A Chat Prompt To A Contract
The gap between asking and specifying is easiest to see on a real task. Suppose the team wants to generate release notes from merged pull requests. The chat version is: "Summarize these PRs into release notes." It works in the demo because a human reads the output and fixes it. In production it drifts: categories change, security fixes move around, and entry length varies. Nothing is wrong on any single run, but nothing downstream can rely on the shape.
The contract version states the parts that were left implicit. The role is a release-notes writer for a specific audience. The rules are explicit: group by change type, lead with breaking changes, one line per change, link the PR number. The input is the PR list, clearly delimited. The output is a structured object with a section per category, each containing entries that carry a title, a PR reference, and a breaking-change flag. The model still writes the prose, but it writes it inside a shape the rest of the pipeline can render, sort, and check. The behavior that mattered, breaking changes first, is now enforced by validation rather than left to the model's mood on a given call.
The operational consequence is the part worth keeping. The chat prompt has no failure signal: bad output looks like prose and flows downstream. The contract version fails loudly when the model omits a required field or invents a category, which converts a silent quality problem into a caught error.
Patterns That Scale, And When Not To Use Them
A handful of prompting patterns recur because they map to real task structure rather than to any particular model.
- Zero-shot states the task and trusts the model to handle it. It is the cheapest option and the right default for tasks the model does well.
- Few-shot supplies a small number of worked examples and is the durable fix when you need a specific format or a consistent style that is easier to show than to describe.1
- Decomposition breaks a complex task into steps you can inspect. Prompting a model to show intermediate reasoning improves performance on hard problems, the original chain-of-thought result.2 Current reasoning models do much of that decomposition internally, so a hand-built chain is sometimes redundant. The rule that lasts: decompose externally when you need steps you can audit, test, and monitor separately, not by default. The same decomposition is a cost and latency lever; Part 5 shows the arithmetic.
- Verification has a second prompt check the first one's output. It raises quality on high-stakes tasks and adds latency and cost, so it is worth it only when the failure rate of the primary prompt has been measured and judged too high. Part 4 covers how to measure that failure rate.
In practice these compose as a sequence, not a menu. Start zero-shot, add examples when format or style needs demonstration, decompose when the steps need separate auditability, and add verification only after the measured failure rate justifies it. Each step buys reliability and costs latency, so stop as soon as the task is met.
Over-engineering. A six-step chain with verification loops on a task that a zero-shot prompt handles fine adds latency and failure points for no gain. Match the pattern to the task, and let the model do the reasoning it already does well.
Prompts Change, So Version Them
A prompt is not written once. It changes as you discover inputs it mishandles, as the product's requirements shift, and as the model underneath is upgraded. Each of those is a change to a contract that other code depends on, which means a prompt belongs in version control with a history, not pasted into a configuration field and forgotten.
The operational consequence shows up at upgrade time. Picture the release-notes prompt running happily for months, then the provider deprecates the model it was tuned against and routes you to a newer one. Overnight the new model starts being more verbose, and the one-line-per-change rule erodes into short paragraphs. If the prompt, its version, and the model it was validated against were stored together, this is a diff and a re-run of the curated evaluation set (the golden dataset that Part 4 develops in full): you see what changed, adjust the constraint, and ship. If they were not, it is a production mystery that starts with someone noticing the release notes look off and ends with a slow reconstruction of which model is even serving traffic. Keep the prompt versioned alongside the model it was validated against, and a model upgrade becomes a reviewable change rather than a surprise.
This is also where a small amount of input discipline pays off. Delimit the input clearly from the instructions, with an explicit marker or a structured field, so the model can tell the difference between what it is supposed to do and the data it is supposed to do it to. That separation keeps a stray instruction inside the data from being read as a command, and Part 3 develops it into a full trust boundary once the model can take actions.
Testing A Prompt Means Testing For Consistency
You cannot assert that a prompt returns one exact string, because a probabilistic model will phrase the same correct answer many ways. What you can test is whether it satisfies its contract across a representative set of inputs: does it return valid structure, stay inside its constraints, and reach the right answer often enough? A prompt that is excellent on your favorite example and wrong one time in five is a defect, not a feature, and you only see that by measuring consistency across a golden dataset rather than quality on a single best case.
Concretely, the test asserts properties rather than text. Against the release-notes schema, that is one structural check:
notes = run_prompt(release_notes_prompt, merged_prs) assert is_valid_schema(notes) # parses, required fields present assert all(isinstance(e["breaking"], bool) # the flag is a boolean, not prose for s in notes["sections"] for e in s["entries"])
It says nothing about wording and everything about shape, and it fails the moment a contract regression appears.
Every production failure should become an evaluation case. Capture the failing input, prompt version, model version, and bad output; add it to the regression set before changing the prompt. The goal is not only to fix one incident, but to prevent the same class of failure from quietly returning.
This article stops at contract testing: confirm that the prompt honors the interface it claims to provide. The broader machinery of golden datasets, scoring, and an independent evaluator is the subject of Part 4.
A contract you cannot test is a contract you are only hoping holds.
What Changes In How You Build
Treating the prompt as an interface changes three habits. You write the output contract first and validate against it, the same way you would design an API response before its callers. You keep the prompt versioned alongside the model it was validated against, so upgrades are reviewable. And you measure the prompt's consistency rather than admiring its best output. Good prompts reduce ambiguity. Contracts, versioned and tested, are what turn that reduced ambiguity into reliability.
References
- Brown et al. (2020). Language Models are Few-Shot Learners. NeurIPS 2020. arXiv:2005.14165
- Wei et al. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. NeurIPS 2022. arXiv:2201.11903
Part 1 of Building Durable AI Systems. The interface is the part you own; the model is the part you rent.
Continue The Conversation
If you're working on AI systems, data platforms, databases, or large-scale software architecture, I'd be interested to hear what you're building.
LinkedIn: Rathish Kumar B
Contact: Contact Me
For a faster response, use one of these subjects:
- AI Systems
- Architecture Review
- Database Engineering
- Platform Engineering
A few lines of context always help.
