Turn a grounded prompt into a reliable API boundary with server-side secrets, typed results, bounded retries, safe actions, and useful telemetry.
You can test the stale-key prompt from the previous lesson and get the right answer once. That doesn't make it a production feature.
A production large language model (LLM) call has to survive slow networks, invalid outputs, user cancellation, private data, and retries. Suppose Luna's support screen asks whether stale service-account key acct_10234 is eligible under policy line P-7. The answer must arrive through a boundary that protects credentials, validates evidence, records failure, and keeps rotation-job creation separate. Build that boundary.
Application boundary: Put the model call behind one small application boundary. The rest of the app calls your boundary, not a provider SDK (Software Development Kit) directly.
A hosted model request usually uses an API key or another server-side credential. If you paste it into source code or send it to a browser, anyone who obtains it may be able to make requests against your account. Keep provider keys out of browser code, mobile clients, repositories, and logs.
Locally, read a credential from an environment variable. In a deployed service, use the platform's secret manager and inject the value only into the server process. Don't put a provider key in a public frontend variable.
The lab uses a placeholder rather than a real key and proves only that server configuration is present.
```python
import os

# Lab-only placeholder so the snippet runs without a real credential.
os.environ.setdefault("LLM_API_KEY", "dev_key_not_a_real_secret")

api_key = os.environ.get("LLM_API_KEY", "")
if len(api_key) < 12:
    raise RuntimeError("LLM_API_KEY is missing or too short")

# Report presence only; never print any part of the key.
print({"loaded": True, "server_only": True})
```

```text
{'loaded': True, 'server_only': True}
```

This pattern avoids printing even part of the secret while still proving that server configuration is present. A real app should also separate development and production credentials and rotate a key after any suspected leak.
Treat an LLM API call like a typed task crossing a service boundary.
You don't paste a private support thread into an arbitrary chat box and hope the answer is safe. You send a typed task contract with an account ID, requested decision, policy constraints, deadline, and maybe a "run once" marker. When the result comes back, code checks that it matches the request before committing it.
An LLM request needs the same idea:
| Contract field | LLM app equivalent | Why it matters |
|---|---|---|
| Task name | Request schema | The app knows what it asked for. |
| Account ID | User or task ID | Logs can trace the result. |
| Data handling rule | Privacy rule | Private details aren't logged by default. |
| Deadline | Timeout | The UI doesn't hang forever. |
| Duplicate rotation job check | Idempotency key | A retry doesn't create duplicate side effects. |
| Policy check | Response validation | Unsupported decisions don't enter the product. |
The model sees text and tools. Your application sees contracts. That separation is the first professional habit.
Start by naming what your app owns.
A provider request may use messages, tools, response formats, model IDs, and sampling settings. Those fields differ across providers and change over time. Keep those details out of route handlers.
Instead, define a small internal request.
```python
from dataclasses import dataclass
from typing import Literal

@dataclass(frozen=True)
class LLMTask:
    task: Literal["decide_rotation", "summarize_scan", "draft_reply"]
    account_id: str
    policy_line_ids: tuple[str, ...]
    customer_question: str
    trace_id: str
    prompt_version: str
    timeout_seconds: float = 20.0
    max_retries: int = 2

task = LLMTask(
    task="decide_rotation",
    account_id="acct_10234",
    policy_line_ids=("P-7",),
    customer_question="Can I rotate the stale service-account key?",
    trace_id="trace_acct_10234_01",
    prompt_version="rotation_decision@2",
)

print(task.task, task.account_id, task.policy_line_ids)
```

```text
decide_rotation acct_10234 ('P-7',)
```

This object doesn't know whether you use OpenAI, Anthropic, a local model, or a gateway.
It describes application intent. Notice what it doesn't contain: permission to create a rotation job or grant approval. Deciding eligibility and executing an action are separate boundaries.
The wrapper owns the parts that must be consistent across the product:
| Concern | Wrapper responsibility |
|---|---|
| Model selection | Choose a model or route based on task. |
| Prompt version | Attach the prompt template version used for this call. |
| Timeout | Stop waiting before the user experience breaks. |
| Retry policy | Retry only safe errors, with a cap. |
| Validation | Parse and validate output before returning it. |
| Telemetry | Log latency, tokens, status, and failure reason. |
| Privacy | Redact or avoid logging customer fields. |
If route handlers call the provider directly, these rules drift: one route retries five times while another never retries; one logs raw prompts while another logs nothing; one validates JSON while another trusts a string. That inconsistency becomes technical debt fast.
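One way to keep these rules from drifting is a single per-task policy table that the wrapper consults on every call. The sketch below is a minimal illustration under stated assumptions, not a prescribed design: `TaskPolicy`, `POLICIES`, `policy_for`, the model names `small-model` and `large-model`, and every numeric value are hypothetical placeholders.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TaskPolicy:
    model: str                    # hypothetical model identifier
    timeout_seconds: float
    max_retries: int
    log_raw_prompt: bool = False  # privacy default: raw text is not logged

# Hypothetical values; tune them per product moment and provider.
POLICIES: dict[str, TaskPolicy] = {
    "decide_rotation": TaskPolicy("small-model", timeout_seconds=10.0, max_retries=2),
    "summarize_scan": TaskPolicy("large-model", timeout_seconds=30.0, max_retries=1),
    "draft_reply": TaskPolicy("large-model", timeout_seconds=20.0, max_retries=0),
}

def policy_for(task: str) -> TaskPolicy:
    """Return the policy for a known task, or fail loudly for an unknown one."""
    try:
        return POLICIES[task]
    except KeyError:
        raise ValueError(f"no policy registered for task {task!r}") from None

print(policy_for("decide_rotation"))
```

Because route handlers only pass a task name, changing a timeout or retry cap happens in one place rather than in every endpoint.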
The provider adapter can turn the application contract into its own request payload. The teaching payload below is deliberately provider-shaped rather than SDK-specific: a real adapter maps its messages and required fields into the schema format its provider supports. Keep that translation testable too:
```python
import json

policy_lines = {"P-7": "Stale service-account keys must rotate within 30 days."}
task = {
    "account_id": "acct_10234",
    "credential_age_days": 12,
    "key_stale": True,
    "policy_line_ids": ["P-7"],
}

# Include only the policy lines this task cites.
context = "\n".join(
    f"{line_id}: {policy_lines[line_id]}" for line_id in task["policy_line_ids"]
)
payload = {
    "messages": [
        {"role": "system", "content": "Decide rotation eligibility only from supplied policy lines. Return JSON."},
        {"role": "user", "content": f"{context}\nAccount: {json.dumps(task, sort_keys=True)}"},
    ],
    "required_response_fields": ["decision", "rotation_window_days", "source_line_ids"],
}

# The translation is testable: evidence and account ID must reach the model.
assert "P-7" in payload["messages"][1]["content"]
assert "acct_10234" in payload["messages"][1]["content"]
print(json.dumps(payload, indent=2))
```

```json
{
  "messages": [
    {
      "role": "system",
      "content": "Decide rotation eligibility only from supplied policy lines. Return JSON."
    },
    {
      "role": "user",
      "content": "P-7: Stale service-account keys must rotate within 30 days.\nAccount: {\"account_id\": \"acct_10234\", \"credential_age_days\": 12, \"key_stale\": true, \"policy_line_ids\": [\"P-7\"]}"
    }
  ],
  "required_response_fields": [
    "decision",
    "rotation_window_days",
    "source_line_ids"
  ]
}
```

The flow below is provider-agnostic. In OpenAI's API, function calling and Structured Outputs are provider features that can express tools or output schemas; application validation and action authorization still belong outside the model response.[1][2]
Notice the two outputs:
Production work needs both.
The next snippet uses a fake provider function so you can run the shape without credentials. It carries the exact policy decision from the previous lesson through parse, business validation, logging, and return.
```python
import json
import time
from dataclasses import dataclass
from typing import Any

@dataclass(frozen=True)
class RotationDecision:
    decision: str
    rotation_window_days: int
    source_line_ids: tuple[str, ...]

def fake_provider_call(prompt: str) -> str:
    # Stand-in for a real provider call; always returns the same JSON.
    return json.dumps({
        "decision": "eligible",
        "rotation_window_days": 30,
        "source_line_ids": ["P-7"],
    })

def parse_decision(raw: str, policy_windows: dict[str, int]) -> RotationDecision:
    data: dict[str, Any] = json.loads(raw)
    # Shape checks.
    if data.get("decision") not in {"eligible", "not_eligible", "not_specified"}:
        raise ValueError("decision is outside allowed enum")
    if not isinstance(data.get("rotation_window_days"), int):
        raise ValueError("rotation_window_days must be an integer")
    if not isinstance(data.get("source_line_ids"), list):
        raise ValueError("source_line_ids must be a list")
    # Business check: the citation and window must match supplied policy.
    cited = tuple(data["source_line_ids"])
    if cited != ("P-7",) or data["rotation_window_days"] != policy_windows["P-7"]:
        raise ValueError("decision is unsupported by supplied policy")
    return RotationDecision(
        decision=data["decision"],
        rotation_window_days=data["rotation_window_days"],
        source_line_ids=cited,
    )

def decide_rotation(account_id: str, trace_id: str, now_fn=time.perf_counter) -> RotationDecision:
    start = now_fn()
    raw = fake_provider_call(f"Decide stale-key rotation eligibility for {account_id} using P-7.")
    decision = parse_decision(raw, {"P-7": 30})
    latency_ms = (now_fn() - start) * 1000
    print({"trace_id": trace_id, "status": "ok", "latency_ms": round(latency_ms, 2)})
    return decision

# Injected clock keeps the logged latency deterministic.
fake_times = iter([100.0, 100.00005])
decision = decide_rotation(
    "acct_10234",
    "trace_acct_10234_01",
    now_fn=lambda: next(fake_times),
)
print(decision)
```

```text
{'trace_id': 'trace_acct_10234_01', 'status': 'ok', 'latency_ms': 0.05}
RotationDecision(decision='eligible', rotation_window_days=30, source_line_ids=('P-7',))
```

This example is small, but it already has three production habits:
Each network call needs a timeout.
Without a timeout, one slow provider request can hold a web worker, keep a browser spinner open, and make the app look broken. A beginner often sets an overlong timeout to avoid errors. That hides the problem instead of solving it.
Use a timeout that matches the product moment:
| Product moment | Timeout shape |
|---|---|
| Autocomplete | Very short, fail quietly. |
| Chat reply | Short enough to keep user trust, maybe stream partial output. |
| Document summary | Longer, but show progress or move to a job. |
| Batch enrichment | Background job with retry and checkpointing. |
This runnable timeout keeps the user-facing deadline outside the fake provider. A slow response becomes a named failure state rather than an endless spinner:
```python
import asyncio

async def slow_provider() -> str:
    await asyncio.sleep(0.02)
    return "late decision"

async def call_with_deadline(timeout_seconds: float) -> dict[str, str]:
    try:
        text = await asyncio.wait_for(slow_provider(), timeout=timeout_seconds)
        return {"status": "ok", "text": text}
    except asyncio.TimeoutError:
        # asyncio.TimeoutError is the built-in TimeoutError only from Python 3.11 on;
        # naming it explicitly also works on older versions.
        return {"status": "timeout", "next_step": "show_retry_option"}

print(asyncio.run(call_with_deadline(0.001)))
```

```text
{'status': 'timeout', 'next_step': 'show_retry_option'}
```

Retries need more care.
Retrying a temporary network error can be appropriate for a read-only decision call. Retrying an action after a timeout is different: the server may have created a rotation job even though the client never received the response. If the workflow can call a tool or update a database, enforce idempotency at that action boundary.
Not every error should be retried. A status code alone may not classify the failure: a provider can use the same code for a temporary rate limit and exhausted quota. Inspect provider error details too. Authentication errors, quota exhaustion, and malformed requests need a fix, not more traffic:
```python
def retry_decision(status_code: int, error_kind: str = "") -> str:
    # Check provider error details first: 429 can mean two different things.
    if status_code == 429 and error_kind == "quota_exhausted":
        return "fail_and_fix_quota"
    if status_code in {408, 429, 500, 502, 503, 504}:
        return "retry_with_backoff"
    if status_code in {400, 401, 403}:
        return "fail_and_fix_request"
    return "fail_and_inspect"

cases = [(429, "rate_limit"), (429, "quota_exhausted"), (503, ""), (401, ""), (400, "")]
for status, error_kind in cases:
    print(status, error_kind or "-", retry_decision(status, error_kind))
```

```text
429 rate_limit retry_with_backoff
429 quota_exhausted fail_and_fix_quota
503 - retry_with_backoff
401 - fail_and_fix_request
400 - fail_and_fix_request
```

A good retry policy adds delay between attempts. If retries fire immediately, a brief provider outage becomes a thundering herd from your own servers. Exponential backoff waits a little longer each time: a base near 1 second, then 2, then 4. Backoff alone isn't enough. If a thousand workers all back off by exactly 1, 2, and 4 seconds, they still retry in synchronized waves. Adding random jitter spreads those retries across the window so the herd disperses. The common form is full jitter: pick a random delay between zero and the current exponential ceiling.[3]
Before you run the next snippet, predict the ceiling for attempt 1 and attempt 2.
This fake provider fails on the first two calls, then succeeds. The wrapper computes a longer ceiling each time it retries, then samples a jittered delay below it. In this testable version we inject a no-op sleep function and a fixed random source so the output stays deterministic while you run it.
```python
import time

class RetryableModelError(Exception):
    pass

def flaky_call(attempt: int) -> str:
    # Fails on the first two calls, then succeeds.
    if attempt < 2:
        raise RetryableModelError("temporary rate limit")
    return "ok"

def call_with_retry(max_retries: int = 2, sleep_fn=time.sleep, rand_fn=None) -> str:
    # Fixed stand-in for a random source keeps the lab output deterministic.
    rand_fn = rand_fn or (lambda: 0.5)
    attempts = 0
    while True:
        try:
            return flaky_call(attempts)
        except RetryableModelError as exc:
            if attempts >= max_retries:
                print({"status": "failed", "reason": str(exc)})
                raise
            attempts += 1
            ceiling = 2 ** (attempts - 1)  # 1s, 2s, 4s...
            delay = round(rand_fn() * ceiling, 2)  # full jitter: [0, ceiling)
            print({"status": "retrying", "attempt": attempts, "ceiling_s": ceiling, "delay_s": delay})
            sleep_fn(delay)

result = call_with_retry(sleep_fn=lambda _seconds: None)
print(result)
```

```text
{'status': 'retrying', 'attempt': 1, 'ceiling_s': 1, 'delay_s': 0.5}
{'status': 'retrying', 'attempt': 2, 'ceiling_s': 2, 'delay_s': 1.0}
ok
```

Also log the final status. A retry that succeeds is still operationally important. A provider SDK may implement retry behavior, but your application still owns the timeout, the retry cap, and the idempotency decision.
The eligibility decision is read-only. Creating a rotation job isn't. The action below stores an idempotency key before returning a job ID. If the network drops after the first call, retrying the same key returns the existing rotation job rather than creating another one.
```python
created_rotation_jobs: dict[str, str] = {}

def create_rotation_job_once(account_id: str, idempotency_key: str) -> tuple[str, bool]:
    # A repeated key returns the stored job instead of creating a new one.
    if idempotency_key in created_rotation_jobs:
        return created_rotation_jobs[idempotency_key], False
    job_id = f"job_{account_id}_001"
    created_rotation_jobs[idempotency_key] = job_id
    return job_id, True

key = "create_rotation_job:acct_10234:P-7:v1"
first = create_rotation_job_once("acct_10234", key)
retry_after_timeout = create_rotation_job_once("acct_10234", key)

assert first[0] == retry_after_timeout[0]
assert len(created_rotation_jobs) == 1
print({"first": first, "retry_after_timeout": retry_after_timeout})
```

```text
{'first': ('job_acct_10234_001', True), 'retry_after_timeout': ('job_acct_10234_001', False)}
```

The dictionary makes sequential retries visible, but production code needs a durable, atomic claim. Store the idempotency key in a database row protected by a unique constraint, or use a provider-supported idempotency mechanism. Otherwise concurrent retries can both pass the first check and create duplicates.
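A unique-constraint claim might look like the sketch below. It uses an in-memory SQLite database purely so it runs anywhere; the table name `rotation_jobs`, the function `claim_rotation_job`, and the job ID format are assumptions for illustration, and a production service would use its own durable database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE rotation_jobs ("
    " idempotency_key TEXT PRIMARY KEY,"  # the database enforces uniqueness
    " job_id TEXT NOT NULL)"
)

def claim_rotation_job(conn: sqlite3.Connection, account_id: str, idempotency_key: str) -> tuple[str, bool]:
    """Insert first and let the constraint decide who wins the claim."""
    job_id = f"job_{account_id}_001"
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO rotation_jobs (idempotency_key, job_id) VALUES (?, ?)",
                (idempotency_key, job_id),
            )
        return job_id, True
    except sqlite3.IntegrityError:
        # The key already exists: return the stored job instead of creating one.
        row = conn.execute(
            "SELECT job_id FROM rotation_jobs WHERE idempotency_key = ?",
            (idempotency_key,),
        ).fetchone()
        return row[0], False

key = "create_rotation_job:acct_10234:P-7:v1"
first = claim_rotation_job(conn, "acct_10234", key)
retry = claim_rotation_job(conn, "acct_10234", key)
print(first, retry)
```

The design choice is to attempt the insert rather than check first: a check-then-insert sequence leaves a window in which two concurrent callers both see no row, while the constraint rejects the second insert atomically.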
Free-form text is useful for humans. Applications often need objects.
If the app asks for a rotation decision, downstream code wants fields such as `decision`, `rotation_window_days`, and `source_line_ids`. In OpenAI's API, Structured Outputs with a strict JSON schema constrains responses to that schema (within the subset of JSON Schema the feature supports), while JSON mode targets valid JSON and still has documented edge cases that callers must handle.[2] Neither mode proves an answer follows workspace policy. Your application still handles refusals or incomplete output and checks facts and permissions.
| Field | Schema check | Business check |
|---|---|---|
| `decision` | Must be an allowed enum. | `eligible` needs a cited applicable rule. |
| `rotation_window_days` | Must be an integer. | Must equal the window in policy P-7. |
| `source_line_ids` | Must be an array of strings. | Every ID must exist in supplied evidence. |
The tempting shortcut is stopping once json.loads succeeds. Run both candidates below: both are valid JSON, but one contradicts policy P-7.
```python
import json

policy_windows = {"P-7": 30}
candidates = [
    '{"decision":"eligible","rotation_window_days":30,"source_line_ids":["P-7"]}',
    '{"decision":"eligible","rotation_window_days":90,"source_line_ids":["P-7"]}',
]

def validate(raw: str) -> str:
    data = json.loads(raw)
    source_id = data["source_line_ids"][0]
    # Business check: the claimed window must match the cited policy line.
    if data["rotation_window_days"] != policy_windows.get(source_id):
        return "REJECT unsupported rotation window"
    return "ACCEPT grounded decision"

for candidate in candidates:
    print(validate(candidate))
```

```text
ACCEPT grounded decision
REJECT unsupported rotation window
```

The second object parses correctly and still fails. Shape validation and business validation solve different problems.
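The two columns of the table above can be combined into one validator that runs schema checks first and business checks second. This is a minimal sketch, not the lesson's reference implementation: the function name `validate_decision`, the `evidence` mapping from policy line ID to window, and the error strings are assumptions for illustration.

```python
import json

def validate_decision(raw: str, evidence: dict[str, int]) -> list[str]:
    """Return a list of problems; an empty list means the decision is accepted."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["not valid JSON"]
    if not isinstance(data, dict):
        return ["top-level value must be an object"]

    problems: list[str] = []
    # Schema checks: shape and types only.
    if data.get("decision") not in {"eligible", "not_eligible", "not_specified"}:
        problems.append("decision is outside allowed enum")
    window = data.get("rotation_window_days")
    if not isinstance(window, int) or isinstance(window, bool):
        problems.append("rotation_window_days must be an integer")
    ids = data.get("source_line_ids")
    if not isinstance(ids, list) or not all(isinstance(i, str) for i in ids):
        problems.append("source_line_ids must be an array of strings")
    if problems:
        return problems  # skip business checks on a malformed object

    # Business checks: compare the claim against the supplied evidence.
    unknown = [i for i in ids if i not in evidence]
    if unknown:
        problems.append(f"cited lines not in evidence: {unknown}")
    if data["decision"] == "eligible" and not ids:
        problems.append("eligible needs a cited applicable rule")
    if any(evidence[i] != window for i in ids if i in evidence):
        problems.append("rotation window does not match cited policy")
    return problems

evidence = {"P-7": 30}
good = '{"decision":"eligible","rotation_window_days":30,"source_line_ids":["P-7"]}'
uncited = '{"decision":"eligible","rotation_window_days":30,"source_line_ids":[]}'
print(validate_decision(good, evidence))
print(validate_decision(uncited, evidence))
```

Returning a list of problems rather than raising on the first one gives the caller a failure reason it can log as a status such as `invalid_output`.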
Streaming sends partial output as it's generated.
Use it when the user is waiting and partial text is useful: drafting a customer-facing explanation, a long answer, code generation, or voice. Don't make streaming your default execution mode. Eligibility extraction often works better as a normal request because the app needs one final validated object.
| Use streaming | Prefer non-streaming |
|---|---|
| Human reads text as it appears. | App needs one JSON object. |
| Long answer improves perceived latency. | Task is short and hidden. |
| User may stop generation. | Result must validate before display. |
| UI can handle partial tokens. | Caller wants one atomic success or failure. |
Streaming adds state:
`started`, `streaming`, `completed`, or `failed`

If streaming output later fails validation, the UI needs a plan. For a draft reply, you may show partial text and mark it unapproved. For a rotation-job action, hold the action until final validation passes.
A tiny streaming handler can expose a draft while keeping action authorization outside the text stream:
```python
from collections.abc import Iterator

def fake_stream() -> Iterator[str]:
    chunks = ["Account ", "acct_10234 ", "is eligible ", "under P-7."]
    for chunk in chunks:
        yield chunk

buffer = ""
for chunk in fake_stream():
    buffer += chunk
    print({"event": "delta", "preview": buffer})

# The stream carries display text only; it never authorizes an action.
assert "rotation job" not in buffer.lower()
print({"event": "completed", "display_only": True, "text": buffer})
```

```text
{'event': 'delta', 'preview': 'Account '}
{'event': 'delta', 'preview': 'Account acct_10234 '}
{'event': 'delta', 'preview': 'Account acct_10234 is eligible '}
{'event': 'delta', 'preview': 'Account acct_10234 is eligible under P-7.'}
{'event': 'completed', 'display_only': True, 'text': 'Account acct_10234 is eligible under P-7.'}
```

Provider streaming endpoints commonly send incremental events over a persistent HTTP response. Your job is to buffer them, handle cancellation, and validate the final assembled result before you act on it.
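Buffering, cancellation, and final validation can be sketched as one small consumer that tracks the four stream states. The names `StreamState`, `consume_stream`, `is_cancelled`, and `validate` are hypothetical, and treating a user cancellation as `failed` is an assumption made here for simplicity; a real UI might prefer a separate cancelled state.

```python
from collections.abc import Callable, Iterator
from enum import Enum

class StreamState(Enum):
    STARTED = "started"
    STREAMING = "streaming"
    COMPLETED = "completed"
    FAILED = "failed"

def consume_stream(
    chunks: Iterator[str],
    is_cancelled: Callable[[], bool],
    validate: Callable[[str], bool],
) -> tuple[StreamState, str]:
    state = StreamState.STARTED
    buffer = ""
    for chunk in chunks:
        if is_cancelled():
            # Keep the partial text for display only; nothing may act on it.
            return StreamState.FAILED, buffer
        state = StreamState.STREAMING
        buffer += chunk
    # Validate only the final assembled text before anything acts on it.
    state = StreamState.COMPLETED if validate(buffer) else StreamState.FAILED
    return state, buffer

chunks = ["Account ", "acct_10234 ", "is eligible ", "under P-7."]
ok = consume_stream(iter(chunks), lambda: False, lambda text: "P-7" in text)

# Simulate a user who cancels just before the third chunk is consumed.
cancel_signal = iter([False, False, True])
cancelled = consume_stream(iter(chunks), lambda: next(cancel_signal), lambda text: True)
print(ok)
print(cancelled)
```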
You can't manage what you don't measure.
Each call records enough information to debug, compare, and budget:
| Field | Example | Why it helps |
|---|---|---|
| `trace_id` | `trace_acct_10234_01` | Connects app logs to provider call. |
| `task` | `decide_rotation` | Groups cost by feature. |
| `model` | `provider-model-id` | Finds regressions after model changes. |
| `prompt_version` | `rotation_decision@2` | Connects output to prompt history. |
| `latency_ms` | `840` | Tracks user experience. |
| `input_tokens` | `921` | Explains cost and context size. |
| `output_tokens` | `84` | Finds verbose outputs. |
| `status` | `ok`, `timeout`, `invalid_output` | Supports dashboards and alerts. |
Hosted APIs may charge for input and output tokens. Rates depend on provider and model, so keep the calculation separate from current pricing. With illustrative rates of $2 per million input tokens and $6 per million output tokens, one rotation-decision request costs:
| Step | Tokens | Rate (per million) | Cost |
|---|---|---|---|
| Input (policy line + account context + instructions) | 250 | $2.00 | $0.0005 |
| Output (JSON fields) | 80 | $6.00 | $0.00048 |
| Total | 330 | | $0.00098 (~$0.001) |
That looks tiny, but 10,000 decisions per day becomes about $9.80 per day, or $3,577 per year at those illustrative rates. Log usage fields per task and join them to the rate card selected by your application.
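The arithmetic behind those figures can be checked in a few lines. This sketch reuses the illustrative rates from the table; the names `RATES_PER_MILLION` and `request_cost_usd` are hypothetical, and the rates are not real pricing.

```python
RATES_PER_MILLION = {"input": 2.0, "output": 6.0}  # illustrative only

def request_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Cost of one request at the illustrative per-million-token rates."""
    return (
        input_tokens * RATES_PER_MILLION["input"]
        + output_tokens * RATES_PER_MILLION["output"]
    ) / 1_000_000

per_request = request_cost_usd(250, 80)  # (500 + 480) / 1,000,000
per_day = per_request * 10_000           # 10,000 decisions per day
per_year = per_day * 365
print(f"per request ${per_request:.5f}, per day ${per_day:.2f}, per year ${per_year:.2f}")
```

Keeping the rate card in one mapping means a price change updates every projection without touching the usage logs.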
The log record should help operators without copying customer text into telemetry. If you need to correlate repeated inputs, use a keyed fingerprint rather than a plain hash. The lab key below is another placeholder; production code injects it from secret storage.
```python
import hmac
from hashlib import sha256

trace_id = "trace_acct_10234_01"
customer_question = "Can I rotate the stale service-account key for account acct_10234?"
fingerprint_key = b"dev-only-fingerprint-key"  # placeholder; inject from secret storage
usage = {"input_tokens": 250, "output_tokens": 80}
rates_per_million = {"input": 2.0, "output": 6.0}  # illustrative only
cost_usd = (
    usage["input_tokens"] * rates_per_million["input"]
    + usage["output_tokens"] * rates_per_million["output"]
) / 1_000_000
log_record = {
    "trace_id": trace_id,
    "task": "decide_rotation",
    "prompt_version": "rotation_decision@2",
    # Keyed fingerprint: correlates repeated inputs without storing the text.
    "input_fingerprint": hmac.new(fingerprint_key, customer_question.encode(), sha256).hexdigest()[:12],
    **usage,
    "estimated_cost_usd": round(cost_usd, 6),
    "status": "ok",
}

assert customer_question not in str(log_record)
print(log_record)
```

```text
{'trace_id': 'trace_acct_10234_01', 'task': 'decide_rotation', 'prompt_version': 'rotation_decision@2', 'input_fingerprint': '85ef707a963b', 'input_tokens': 250, 'output_tokens': 80, 'estimated_cost_usd': 0.00098, 'status': 'ok'}
```

Don't log raw prompts by default. A keyed fingerprint isn't anonymization either. Keep it only when correlation value justifies retention, protect its key, and apply access and retention controls.
Prompts can contain names, emails, API keys, private documents, or customer messages. OWASP's 2025 Top 10 for LLM Applications lists prompt injection and sensitive information disclosure among the central application risks; logs that retain prompt data extend the places where that data must be protected.[4] NIST's AI Risk Management Framework organizes risk work around governance, mapping, measuring, and managing risk, which supports traceable controls without defaulting to raw-input retention.[5]
Log IDs, usage, failure classes, and carefully redacted excerpts only when you have a clear retention policy. Add a keyed fingerprint only when you need repeat-input correlation.
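A redacted excerpt could be produced along these lines. Treat it as a best-effort sketch only: the two patterns are assumptions (the secret pattern matches a hypothetical `sk-` or `key_` prefix, not any specific provider's key format), regex redaction misses many kinds of personal data, and the names `EMAIL_RE`, `SECRET_RE`, and `redacted_excerpt` are illustrative.

```python
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Hypothetical secret shape: "sk-" or "key_" followed by 8+ token characters.
SECRET_RE = re.compile(r"\b(?:sk-|key_)[A-Za-z0-9_-]{8,}")

def redacted_excerpt(text: str, max_chars: int = 60) -> str:
    """Mask known sensitive patterns, then truncate to a short excerpt."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = SECRET_RE.sub("[SECRET]", text)
    return text[:max_chars]

sample = "Contact luna@example.com, key sk-abc123def456 for acct_10234."
print(redacted_excerpt(sample))
```

Redacting before truncating matters: cutting the text first could slice a secret in half and leave a fragment that no longer matches the pattern.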
Once the wrapper works, choose the execution shape that matches the job.
| Shape | Use when | Watch out for |
|---|---|---|
| Synchronous request | User needs an immediate answer. | Timeout and loading state. |
| Streaming request | User benefits from partial text. | Partial output can fail later. |
| Background job | Task is slow, retryable, or long-running. | Needs status polling and persistence. |
| OpenAI Batch API | Many offline records can wait. | Results are asynchronous and need reconciliation.[6] |
| OpenAI prompt caching | A long stable prefix repeats across requests. | Hits require exact prefix matches, so put stable content first and inspect cached-token usage.[7] |
The beginner mistake is trying to use one shape for all work.
A support-agent draft, nightly audit-log enrichment job, and bulk rotation-policy evaluation don't need the same execution pattern.
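The choice between shapes can be made explicit in code so that each feature declares what it needs. The function below is a simplified sketch covering the first four rows of the table; `choose_shape` and its three inputs are hypothetical, and real routing would weigh more factors, such as latency budgets and cost.

```python
from typing import Literal

Shape = Literal["synchronous", "streaming", "background_job", "batch"]

def choose_shape(user_waiting: bool, partial_text_useful: bool, record_count: int = 1) -> Shape:
    if not user_waiting and record_count > 1:
        return "batch"           # many offline records can wait
    if not user_waiting:
        return "background_job"  # slow or long-running single task
    if partial_text_useful:
        return "streaming"       # a human reads text as it appears
    return "synchronous"         # one immediate, atomic answer

print(choose_shape(user_waiting=True, partial_text_useful=True))                       # support-agent draft
print(choose_shape(user_waiting=False, partial_text_useful=False))                     # nightly enrichment job
print(choose_shape(user_waiting=False, partial_text_useful=False, record_count=5000))  # bulk evaluation
```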
Read this bad design and write down at least three problems before you look at the answer:
```text
The `/rotation-decision` route calls the provider SDK directly.
It waits up to 90 seconds.
It retries all errors three times.
It returns whatever text the model produced.
It logs the full prompt and answer.
```

Take a minute to think. What's risky? What's missing?
Strong answers usually include these issues:
| Problem | Better design |
|---|---|
| SDK call is scattered in route. | Move it behind a wrapper with a task contract. |
| Long blocking timeout. | Use a shorter timeout or a background job with status polling. |
| Retries all errors. | Retry only safe temporary failures with a cap and backoff. |
| Trusts raw text. | Validate structured output before returning it to the caller. |
| Logs private prompt. | Log trace IDs, metadata, status, and redacted excerpts. |
If you named three of those, you've located real production failure paths.
A model call behaves like a production dependency only when the wrapper owns the operational contract: keep credentials on the server, bound waiting time, retry transient failures with jitter, make actions idempotent, validate policy-grounded outputs, and record useful telemetry without storing raw customer text.
The next lesson will put this wrapper behind a small web workflow. Its form and database record are trustworthy only if this boundary already returns explicit success and failure states.
| Symptom | Cause | Fix |
|---|---|---|
| Credentials appear in commits or browser bundles | API key was treated like public configuration | Load it only in server code through environment injection or a secret manager; rotate exposed keys. |
| One route retries while another logs raw prompts | Provider calls are scattered across endpoints | Send every request through one wrapper that owns timeout, validation, retry, and telemetry policy. |
| Support screen spins without a clear outcome | Client waits too long for one provider response | Set a deadline and return a named timeout state, stream only useful drafts, or enqueue long work. |
| Valid JSON claims a 90-day window for P-7 | Parsing was mistaken for business validation | Check cited policy lines and reject unsupported values. |
| Two rotation jobs exist for one account | An uncertain timeout caused a repeated side effect | Give action calls an idempotency key and return the stored result for repeats. |
| Logs contain customer messages or secrets | Raw prompts were retained by default | Log IDs, usage, failure classes, and keyed fingerprints only when needed. |
1. OpenAI. Function calling. OpenAI API Docs, 2026.
2. OpenAI. Structured outputs. 2024.
3. Brooker, M. (AWS). Exponential Backoff And Jitter. 2015.
4. OWASP Foundation. OWASP Top 10 for Large Language Model Applications. 2025.
5. National Institute of Standards and Technology. Artificial Intelligence Risk Management Framework (AI RMF 1.0). 2023.
6. OpenAI. OpenAI Batch API Guide. 2026.
7. OpenAI. Prompt caching. 2026.