LeetLLM
© 2026 LeetLLM. All rights reserved.

⚙️ Easy · MLOps & Deployment

Calling LLM APIs in Production

Turn a grounded prompt into a reliable API boundary with server-side secrets, typed results, bounded retries, safe actions, and useful telemetry.

16 min read

You can test the stale-key prompt from the previous lesson and get the right answer once. That doesn't make it a production feature.

A production large language model (LLM) call has to survive slow networks, invalid outputs, user cancellation, private data, and retries. Suppose Luna's support screen asks whether stale service-account key acct_10234 is eligible under policy line P-7. The answer must arrive through a boundary that protects credentials, validates evidence, records failure, and keeps rotation-job creation separate. Build that boundary.

Application boundary: Put the model call behind one small application boundary. The rest of the app calls your boundary, not a provider SDK (Software Development Kit) directly.

Figure: Production LLM wrapper for account acct_10234. A compact task contract flows into a server-side wrapper around the model call, then into a validated typed decision. Telemetry captures trace, latency, tokens, and status without raw credential text, while rotation-job creation stays behind a separate idempotent action boundary.
The model endpoint is one dependency inside application-owned controls. The wrapper keeps credentials on the server, bounds waiting and retries, validates the candidate, records redacted telemetry, and leaves rotation-job creation behind a separate idempotent action boundary.

Keep credentials on the server

A hosted model request usually uses an API key or another server-side credential. If you paste it into source code or send it to a browser, anyone who obtains it may be able to make requests against your account. Keep provider keys out of browser code, mobile clients, repositories, and logs.

Locally, read a credential from an environment variable. In a deployed service, use the platform's secret manager and inject the value only into the server process. Don't put a provider key in a public frontend variable.

The lab uses a placeholder rather than a real key and proves only that server configuration is present.

server-secret-check.py

```python
import os

# Lab only: install a placeholder so the check passes without a real key.
os.environ.setdefault("LLM_API_KEY", "dev_key_not_a_real_secret")

api_key = os.environ.get("LLM_API_KEY", "")
if len(api_key) < 12:
    raise RuntimeError("LLM_API_KEY is missing or too short")

# Report presence only; never print any part of the key.
print({"loaded": True, "server_only": True})
```

Server secret check

```text
{'loaded': True, 'server_only': True}
```

This pattern avoids printing even part of the secret while still proving that server configuration is present. A real app should also separate development and production credentials and rotate a key after any suspected leak.

A task contract for the model

Treat an LLM API call like a typed task crossing a service boundary.

You don't paste a private support thread into an arbitrary chat box and hope the answer is safe. You send a typed task contract with an account ID, requested decision, policy constraints, deadline, and maybe a "run once" marker. When the result comes back, code checks that it matches the request before committing it.

An LLM request needs the same idea:

| Contract field | LLM app equivalent | Why it matters |
| --- | --- | --- |
| Task name | Request schema | The app knows what it asked for. |
| Account ID | User or task ID | Logs can trace the result. |
| Data handling rule | Privacy rule | Private details aren't logged by default. |
| Deadline | Timeout | The UI doesn't hang forever. |
| Duplicate rotation job check | Idempotency key | A retry doesn't create duplicate side effects. |
| Policy check | Response validation | Unsupported decisions don't enter the product. |

The model sees text and tools. Your application sees contracts. That separation is the first professional habit.
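
The "check that the result matches the request" step can be sketched as a small guard. This is a minimal illustration, not a prescribed design: `TaskContract`, `Candidate`, and `contract_violations` are hypothetical names, and a real contract would carry more fields.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskContract:
    """What the app asked for (hypothetical shape for illustration)."""
    task: str
    account_id: str
    policy_line_ids: tuple[str, ...]


@dataclass(frozen=True)
class Candidate:
    """What came back from the model, after parsing."""
    task: str
    account_id: str
    source_line_ids: tuple[str, ...]


def contract_violations(contract: TaskContract, candidate: Candidate) -> list[str]:
    """Return reasons the candidate does not answer the request; empty means it matches."""
    problems = []
    if candidate.task != contract.task:
        problems.append("task mismatch")
    if candidate.account_id != contract.account_id:
        problems.append("account mismatch")
    # Every cited line must be one the app actually supplied.
    extra = set(candidate.source_line_ids) - set(contract.policy_line_ids)
    if extra:
        problems.append(f"cites unsupplied lines: {sorted(extra)}")
    return problems


contract = TaskContract("decide_rotation", "acct_10234", ("P-7",))
good = Candidate("decide_rotation", "acct_10234", ("P-7",))
bad = Candidate("decide_rotation", "acct_99999", ("P-9",))

print(contract_violations(contract, good))
print(contract_violations(contract, bad))
```

Returning a list of reasons rather than a boolean keeps the failure loggable: the wrapper can record why a candidate was rejected without logging the candidate itself.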

Building a request contract

Start by naming what your app owns.

A provider request may use messages, tools, response formats, model IDs, and sampling settings. Those fields differ across providers and change over time. Keep those details out of route handlers.

Instead, define a small internal request.

building-a-request-contract.py

```python
from dataclasses import dataclass
from typing import Literal

@dataclass(frozen=True)
class LLMTask:
    task: Literal["decide_rotation", "summarize_scan", "draft_reply"]
    account_id: str
    policy_line_ids: tuple[str, ...]
    customer_question: str
    trace_id: str
    prompt_version: str
    timeout_seconds: float = 20.0
    max_retries: int = 2

task = LLMTask(
    task="decide_rotation",
    account_id="acct_10234",
    policy_line_ids=("P-7",),
    customer_question="Can I rotate the stale service-account key?",
    trace_id="trace_acct_10234_01",
    prompt_version="rotation_decision@2",
)

print(task.task, task.account_id, task.policy_line_ids)
```

Application contract

```text
decide_rotation acct_10234 ('P-7',)
```

This object doesn't know whether you use OpenAI, Anthropic, a local model, or a gateway.

It describes application intent. Notice what it doesn't contain: permission to create a rotation job or grant approval. Deciding eligibility and executing an action are separate boundaries.

What belongs in the wrapper

The wrapper owns the parts that must be consistent across the product:

| Concern | Wrapper responsibility |
| --- | --- |
| Model selection | Choose a model or route based on task. |
| Prompt version | Attach the prompt template version used for this call. |
| Timeout | Stop waiting before the user experience breaks. |
| Retry policy | Retry only safe errors, with a cap. |
| Validation | Parse and validate output before returning it. |
| Telemetry | Log latency, tokens, status, and failure reason. |
| Privacy | Redact or avoid logging customer fields. |

If route handlers call the provider directly, these rules drift: one route retries five times while another never retries; one logs raw prompts while another logs nothing; one validates JSON while another trusts a string. That inconsistency becomes technical debt fast.
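
One way to stop those rules from drifting is to keep them in a single per-task policy table that the wrapper consults. The sketch below is an assumption about how that could look: `CallPolicy`, `POLICIES`, `policy_for`, the route names, and the `draft_reply@1` version string are all hypothetical and illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CallPolicy:
    model_route: str         # placeholder route name, not a real model ID
    prompt_version: str
    timeout_seconds: float
    max_retries: int
    log_customer_text: bool = False  # privacy default: never log raw fields


# Illustrative values only; tune them per product moment.
POLICIES: dict[str, CallPolicy] = {
    "decide_rotation": CallPolicy("small-structured", "rotation_decision@2", 20.0, 2),
    "draft_reply": CallPolicy("general-chat", "draft_reply@1", 30.0, 1),
}


def policy_for(task: str) -> CallPolicy:
    """Fail closed: an unknown task has no policy and therefore no model call."""
    try:
        return POLICIES[task]
    except KeyError:
        raise ValueError(f"no call policy registered for task {task!r}") from None


print(policy_for("decide_rotation"))
```

Because every route asks `policy_for` instead of hard-coding its own numbers, a change to the retry cap or timeout happens in one reviewed place.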

The provider adapter can turn the application contract into its own request payload. The teaching payload below is deliberately provider-shaped rather than SDK-specific: a real adapter maps its messages and required fields into the schema format its provider supports. Keep that translation testable too:

build-provider-payload.py

```python
import json

policy_lines = {"P-7": "Stale service-account keys must rotate within 30 days."}
task = {
    "account_id": "acct_10234",
    "credential_age_days": 12,
    "key_stale": True,
    "policy_line_ids": ["P-7"],
}

context = "\n".join(
    f"{line_id}: {policy_lines[line_id]}" for line_id in task["policy_line_ids"]
)
payload = {
    "messages": [
        {"role": "system", "content": "Decide rotation eligibility only from supplied policy lines. Return JSON."},
        {"role": "user", "content": f"{context}\nAccount: {json.dumps(task, sort_keys=True)}"},
    ],
    "required_response_fields": ["decision", "rotation_window_days", "source_line_ids"],
}

assert "P-7" in payload["messages"][1]["content"]
assert "acct_10234" in payload["messages"][1]["content"]
print(json.dumps(payload, indent=2))
```

Adapter payload

```text
{
  "messages": [
    {
      "role": "system",
      "content": "Decide rotation eligibility only from supplied policy lines. Return JSON."
    },
    {
      "role": "user",
      "content": "P-7: Stale service-account keys must rotate within 30 days.\nAccount: {\"account_id\": \"acct_10234\", \"credential_age_days\": 12, \"key_stale\": true, \"policy_line_ids\": [\"P-7\"]}"
    }
  ],
  "required_response_fields": [
    "decision",
    "rotation_window_days",
    "source_line_ids"
  ]
}
```

What the wrapper does

The flow below is provider-agnostic. In OpenAI's API, function calling and Structured Outputs are provider features that can express tools or output schemas; application validation and action authorization still belong outside the model response.[1][2]

Figure: Multi-lane request trace for account acct_10234. Product, provider, and operations lanes move through receive, attempt 0, jittered backoff, attempt 1, validation, and emit stages; a transient 429 recovers to a validated RotationDecision while every terminal path writes a CallRecord.
The wrapper maintains two synchronized views of one call. The product lane receives only a validated typed result, while the operations lane records attempts, backoff, usage, latency, and final status. Provider or validation failures return a named error and still write the call record.

Notice the two outputs:

  1. The user-facing result.
  2. The operational record.

Production work needs both.
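
A minimal sketch of producing both outputs on every terminal path follows. The `CallRecord` fields, the `run_call` helper, the status names, and the in-memory `call_log` are assumptions for illustration; a real wrapper would write to its own log sink and track real attempt counts.

```python
import json
from dataclasses import asdict, dataclass
from typing import Any, Callable, Optional


@dataclass(frozen=True)
class CallRecord:
    trace_id: str
    status: str    # "ok", "invalid_output", "provider_error", or "unexpected_error"
    attempts: int


call_log: list[dict] = []  # stand-in for a real log sink


def run_call(trace_id: str, provider: Callable[[], str]) -> Optional[Any]:
    """Return the parsed result or None, and always append exactly one CallRecord."""
    status, result = "unexpected_error", None
    try:
        result = json.loads(provider())
        status = "ok"
    except json.JSONDecodeError:
        status = "invalid_output"
    except ConnectionError:
        status = "provider_error"
    finally:
        # Runs on success, on handled failures, and while other exceptions propagate.
        call_log.append(asdict(CallRecord(trace_id, status, attempts=1)))
    return result


def provider_down() -> str:
    raise ConnectionError("provider unreachable")


run_call("t1", lambda: '{"decision": "eligible"}')
run_call("t2", lambda: "not json")
run_call("t3", provider_down)
print([record["status"] for record in call_log])
```

Writing the record in `finally` is the design choice that matters here: the operational view stays complete even when the product view receives nothing.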

Minimal wrapper

The next snippet uses a fake provider function so you can run the shape without credentials. It carries the exact policy decision from the previous lesson through parse, business validation, logging, and return.

minimal-wrapper.py

```python
import json
import time
from dataclasses import dataclass
from typing import Any

@dataclass(frozen=True)
class RotationDecision:
    decision: str
    rotation_window_days: int
    source_line_ids: tuple[str, ...]

def fake_provider_call(prompt: str) -> str:
    return json.dumps({
        "decision": "eligible",
        "rotation_window_days": 30,
        "source_line_ids": ["P-7"],
    })

def parse_decision(raw: str, policy_windows: dict[str, int]) -> RotationDecision:
    data: dict[str, Any] = json.loads(raw)
    if data.get("decision") not in {"eligible", "not_eligible", "not_specified"}:
        raise ValueError("decision is outside allowed enum")
    if not isinstance(data.get("rotation_window_days"), int):
        raise ValueError("rotation_window_days must be an integer")
    if not isinstance(data.get("source_line_ids"), list):
        raise ValueError("source_line_ids must be a list")
    cited = tuple(data["source_line_ids"])
    if cited != ("P-7",) or data["rotation_window_days"] != policy_windows["P-7"]:
        raise ValueError("decision is unsupported by supplied policy")
    return RotationDecision(
        decision=data["decision"],
        rotation_window_days=data["rotation_window_days"],
        source_line_ids=cited,
    )

def decide_rotation(account_id: str, trace_id: str, now_fn=time.perf_counter) -> RotationDecision:
    start = now_fn()
    raw = fake_provider_call(f"Decide stale-key rotation eligibility for {account_id} using P-7.")
    decision = parse_decision(raw, {"P-7": 30})
    latency_ms = (now_fn() - start) * 1000
    print({"trace_id": trace_id, "status": "ok", "latency_ms": round(latency_ms, 2)})
    return decision

fake_times = iter([100.0, 100.00005])
decision = decide_rotation(
    "acct_10234",
    "trace_acct_10234_01",
    now_fn=lambda: next(fake_times),
)
print(decision)
```

Validated decision

```text
{'trace_id': 'trace_acct_10234_01', 'status': 'ok', 'latency_ms': 0.05}
RotationDecision(decision='eligible', rotation_window_days=30, source_line_ids=('P-7',))
```

This example is small, but it already has three production habits:

  • The parser owns the response contract.
  • The wrapper logs a trace ID and status.
  • The caller receives a typed eligibility decision, not raw model text.

Timeouts, retries, and backoff

Each network call needs a timeout.

Without a timeout, one slow provider request can hold a web worker, keep a browser spinner open, and make the app look broken. A beginner often sets an overlong timeout to avoid errors. That hides the problem instead of solving it.

Use a timeout that matches the product moment:

| Product moment | Timeout shape |
| --- | --- |
| Autocomplete | Very short, fail quietly. |
| Chat reply | Short enough to keep user trust, maybe stream partial output. |
| Document summary | Longer, but show progress or move to a job. |
| Batch enrichment | Background job with retry and checkpointing. |

This runnable timeout keeps the user-facing deadline outside the fake provider. A slow response becomes a named failure state rather than an endless spinner:

bound-provider-wait.py

```python
import asyncio

async def slow_provider() -> str:
    await asyncio.sleep(0.02)
    return "late decision"

async def call_with_deadline(timeout_seconds: float) -> dict[str, str]:
    try:
        text = await asyncio.wait_for(slow_provider(), timeout=timeout_seconds)
        return {"status": "ok", "text": text}
    # asyncio.TimeoutError is an alias of the built-in TimeoutError on Python 3.11+,
    # and a separate class before that, so this spelling catches it on both.
    except asyncio.TimeoutError:
        return {"status": "timeout", "next_step": "show_retry_option"}

print(asyncio.run(call_with_deadline(0.001)))
```

Timeout state

```text
{'status': 'timeout', 'next_step': 'show_retry_option'}
```

Retries need more care.

Retrying a temporary network error can be appropriate for a read-only decision call. Retrying an action after a timeout is different: the server may have created a rotation job even though the client never received the response. If the workflow can call a tool or update a database, enforce idempotency at that action boundary.

Not every error should be retried. A status code alone may not classify the failure: a provider can use the same code for a temporary rate limit and exhausted quota. Inspect provider error details too. Authentication errors, quota exhaustion, and malformed requests need a fix, not more traffic:

classify-retryable-failures.py

```python
def retry_decision(status_code: int, error_kind: str = "") -> str:
    if status_code == 429 and error_kind == "quota_exhausted":
        return "fail_and_fix_quota"
    if status_code in {408, 429, 500, 502, 503, 504}:
        return "retry_with_backoff"
    if status_code in {400, 401, 403}:
        return "fail_and_fix_request"
    return "fail_and_inspect"

cases = [(429, "rate_limit"), (429, "quota_exhausted"), (503, ""), (401, ""), (400, "")]
for status, error_kind in cases:
    print(status, error_kind or "-", retry_decision(status, error_kind))
```

Retry policy

```text
429 rate_limit retry_with_backoff
429 quota_exhausted fail_and_fix_quota
503 - retry_with_backoff
401 - fail_and_fix_request
400 - fail_and_fix_request
```

Exponential backoff with jitter

A good retry policy adds delay between attempts. If retries fire immediately, a brief provider outage becomes a thundering herd from your own servers. Exponential backoff waits a little longer each time: a base near 1 second, then 2, then 4. Backoff alone isn't enough. If a thousand workers all back off by exactly 1, 2, and 4 seconds, they still retry in synchronized waves. Adding random jitter spreads those retries across the window so the herd disperses. The common form is full jitter: pick a random delay between zero and the current exponential ceiling.[3]

Before you run the next snippet, predict the ceiling for attempt 1 and attempt 2.

This fake provider fails on the first two calls, then succeeds. The wrapper computes a longer ceiling each time it retries, then samples a jittered delay below it. In this testable version we inject a no-op sleep function and a fixed random source so the output stays deterministic while you run it.

exponential-backoff.py

```python
import time

class RetryableModelError(Exception):
    pass

def flaky_call(attempt: int) -> str:
    if attempt < 2:
        raise RetryableModelError("temporary rate limit")
    return "ok"

def call_with_retry(max_retries: int = 2, sleep_fn=time.sleep, rand_fn=None) -> str:
    rand_fn = rand_fn or (lambda: 0.5)
    attempts = 0
    while True:
        try:
            return flaky_call(attempts)
        except RetryableModelError as exc:
            if attempts >= max_retries:
                print({"status": "failed", "reason": str(exc)})
                raise
            attempts += 1
            ceiling = 2 ** (attempts - 1)  # 1s, 2s, 4s...
            delay = round(rand_fn() * ceiling, 2)  # full jitter: [0, ceiling)
            print({"status": "retrying", "attempt": attempts, "ceiling_s": ceiling, "delay_s": delay})
            sleep_fn(delay)

result = call_with_retry(sleep_fn=lambda _seconds: None)
print(result)
```

Backoff schedule

```text
{'status': 'retrying', 'attempt': 1, 'ceiling_s': 1, 'delay_s': 0.5}
{'status': 'retrying', 'attempt': 2, 'ceiling_s': 2, 'delay_s': 1.0}
ok
```

Also log the final status. A retry that succeeds is still operationally important. A provider SDK may implement its own retry behavior, but your application still owns the timeout, the retry cap, and the idempotency decision.
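
Outside a test, the jitter source is genuinely random and the exponential ceiling is usually capped so late attempts don't wait indefinitely. The sketch below assumes a hypothetical helper `full_jitter_delay` with an 8-second cap; the cap value is an assumption, not something the lesson prescribes.

```python
import random
from typing import Optional


def full_jitter_delay(attempt: int, base_s: float = 1.0, cap_s: float = 8.0,
                      rng: Optional[random.Random] = None) -> float:
    """Delay before retry number `attempt` (1-based), drawn uniformly from 0 up to the ceiling."""
    if rng is None:
        rng = random.Random()
    # The ceiling doubles each attempt (1s, 2s, 4s, ...) until it reaches the cap.
    ceiling = min(cap_s, base_s * 2 ** (attempt - 1))
    return rng.uniform(0.0, ceiling)


rng = random.Random(7)  # seeded only so repeated runs of this demo match
for attempt in range(1, 6):
    ceiling = min(8.0, 2 ** (attempt - 1))
    delay = full_jitter_delay(attempt, rng=rng)
    assert 0.0 <= delay <= ceiling
    print({"attempt": attempt, "ceiling_s": ceiling, "delay_s": round(delay, 3)})
```

Passing the random source in as a parameter keeps the function testable in the same way the injected `sleep_fn` and `rand_fn` do in the lesson's version.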

A timeout doesn't prove nothing happened

The eligibility decision is read-only. Creating a rotation job isn't. The action below stores an idempotency key before returning a job ID. If the network drops after the first call, retrying the same key returns the existing rotation job rather than creating another one.

idempotent-rotation-job.py

```python
created_rotation_jobs: dict[str, str] = {}

def create_rotation_job_once(account_id: str, idempotency_key: str) -> tuple[str, bool]:
    if idempotency_key in created_rotation_jobs:
        return created_rotation_jobs[idempotency_key], False
    job_id = f"job_{account_id}_001"
    created_rotation_jobs[idempotency_key] = job_id
    return job_id, True

key = "create_rotation_job:acct_10234:P-7:v1"
first = create_rotation_job_once("acct_10234", key)
retry_after_timeout = create_rotation_job_once("acct_10234", key)

assert first[0] == retry_after_timeout[0]
assert len(created_rotation_jobs) == 1
print({"first": first, "retry_after_timeout": retry_after_timeout})
```

Idempotent action

```text
{'first': ('job_acct_10234_001', True), 'retry_after_timeout': ('job_acct_10234_001', False)}
```

The dictionary makes sequential retries visible, but production code needs a durable, atomic claim. Store the idempotency key in a database row protected by a unique constraint, or use a provider-supported idempotency mechanism. Otherwise concurrent retries can both pass the first check and create duplicates.
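
The unique-constraint approach can be sketched with Python's built-in `sqlite3` module. The table name `rotation_jobs`, the helper `claim_rotation_job`, and the in-memory database are assumptions for the demo; production code would use its own durable database and schema.

```python
import sqlite3

# In-memory only for the demo; a real service needs a durable store.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE rotation_jobs (idempotency_key TEXT PRIMARY KEY, job_id TEXT NOT NULL)"
)


def claim_rotation_job(account_id: str, idempotency_key: str) -> tuple[str, bool]:
    """Claim the key atomically and return (job_id, created)."""
    job_id = f"job_{account_id}_001"
    with conn:  # commits on success, rolls back on error
        # The PRIMARY KEY constraint makes the database, not Python, decide who wins.
        cursor = conn.execute(
            "INSERT OR IGNORE INTO rotation_jobs (idempotency_key, job_id) VALUES (?, ?)",
            (idempotency_key, job_id),
        )
        created = cursor.rowcount == 1
        row = conn.execute(
            "SELECT job_id FROM rotation_jobs WHERE idempotency_key = ?",
            (idempotency_key,),
        ).fetchone()
    return row[0], created


key = "create_rotation_job:acct_10234:P-7:v1"
first = claim_rotation_job("acct_10234", key)
retry = claim_rotation_job("acct_10234", key)
job_count = conn.execute("SELECT COUNT(*) FROM rotation_jobs").fetchone()[0]
print({"first": first, "retry": retry, "rows": job_count})
```

The check-then-insert race from the dictionary version disappears because there is no separate check: the insert either claims the key or is ignored, and the stored row is the single source of truth for the job ID.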

From text to typed objects

Free-form text is useful for humans. Applications often need objects.

If the app asks for a rotation decision, downstream code wants fields such as decision, rotation_window_days, and source_line_ids. In OpenAI's API, Structured Outputs with a strict JSON schema constrains responses to the supported subset of that schema, while JSON mode only targets valid JSON and still has documented edge cases that callers must handle.[2] Neither mode proves an answer follows workspace policy. Your application still handles refusals or incomplete output and checks facts and permissions.

| Field | Schema check | Business check |
| --- | --- | --- |
| decision | Must be an allowed enum. | eligible needs a cited applicable rule. |
| rotation_window_days | Must be an integer. | Must equal the window in policy P-7. |
| source_line_ids | Must be an array of strings. | Every ID must exist in supplied evidence. |

A common shortcut that breaks

The tempting shortcut is stopping once json.loads succeeds. Run both candidates below: both are valid JSON, but one contradicts policy P-7.

validate-rotation-policy.py

```python
import json

policy_windows = {"P-7": 30}
candidates = [
    '{"decision":"eligible","rotation_window_days":30,"source_line_ids":["P-7"]}',
    '{"decision":"eligible","rotation_window_days":90,"source_line_ids":["P-7"]}',
]

def validate(raw: str) -> str:
    data = json.loads(raw)
    source_id = data["source_line_ids"][0]
    if data["rotation_window_days"] != policy_windows.get(source_id):
        return "REJECT unsupported rotation window"
    return "ACCEPT grounded decision"

for candidate in candidates:
    print(validate(candidate))
```

Business-rule validation

```text
ACCEPT grounded decision
REJECT unsupported rotation window
```

The second object parses correctly and still fails. Shape validation and business validation solve different problems.
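
Keeping the two kinds of checks in separate functions makes the difference explicit. This is a sketch under assumptions: `shape_errors`, `business_errors`, and `validate_candidate` are hypothetical names, and the error strings are illustrative.

```python
import json

ALLOWED_DECISIONS = {"eligible", "not_eligible", "not_specified"}


def shape_errors(data: dict) -> list[str]:
    """Schema-level checks: types and enum membership only."""
    errors = []
    if data.get("decision") not in ALLOWED_DECISIONS:
        errors.append("decision outside enum")
    window = data.get("rotation_window_days")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(window, int) or isinstance(window, bool):
        errors.append("rotation_window_days not an integer")
    ids = data.get("source_line_ids")
    if not isinstance(ids, list) or not all(isinstance(i, str) for i in ids):
        errors.append("source_line_ids not a list of strings")
    return errors


def business_errors(data: dict, policy_windows: dict[str, int]) -> list[str]:
    """Policy-level checks; call only after shape_errors returns nothing."""
    errors = []
    ids = data["source_line_ids"]
    unknown = [i for i in ids if i not in policy_windows]
    if unknown:
        errors.append(f"unknown evidence: {unknown}")
    if data["decision"] == "eligible" and not ids:
        errors.append("eligible without a cited rule")
    known = [i for i in ids if i in policy_windows]
    if any(policy_windows[i] != data["rotation_window_days"] for i in known):
        errors.append("window disagrees with cited policy")
    return errors


def validate_candidate(raw: str, policy_windows: dict[str, int]) -> list[str]:
    data = json.loads(raw)
    if not isinstance(data, dict):
        return ["not a JSON object"]
    # Business rules only make sense once the shape is known to be sound.
    return shape_errors(data) or business_errors(data, policy_windows)


windows = {"P-7": 30}
print(validate_candidate('{"decision":"eligible","rotation_window_days":30,"source_line_ids":["P-7"]}', windows))
print(validate_candidate('{"decision":"eligible","rotation_window_days":30,"source_line_ids":["P-9"]}', windows))
```

A schema-enforcing provider feature can replace most of `shape_errors`, but nothing on the provider side knows which policy lines your application supplied, so `business_errors` stays in your code.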

When to stream

Streaming sends partial output as it's generated.

Use it when the user is waiting and partial text is useful: drafting a customer-facing explanation, a long answer, code generation, or voice. Don't make streaming your default execution mode. Eligibility extraction often works better as a normal request because the app needs one final validated object.

| Use streaming | Prefer non-streaming |
| --- | --- |
| Human reads text as it appears. | App needs one JSON object. |
| Long answer improves perceived latency. | Task is short and hidden. |
| User may stop generation. | Result must validate before display. |
| UI can handle partial tokens. | Caller wants one atomic success or failure. |

Streaming adds state:

  • partial content buffer
  • cancellation path
  • final validation step
  • error display after partial output
  • logs that mark started, streaming, completed, or failed

If streaming output later fails validation, the UI needs a plan. For a draft reply, you may show partial text and mark it unapproved. For a rotation-job action, hold the action until final validation passes.

A tiny streaming handler can expose a draft while keeping action authorization outside the text stream:

when-to-stream.py

```python
from collections.abc import Iterator

def fake_stream() -> Iterator[str]:
    chunks = ["Account ", "acct_10234 ", "is eligible ", "under P-7."]
    for chunk in chunks:
        yield chunk

buffer = ""
for chunk in fake_stream():
    buffer += chunk
    print({"event": "delta", "preview": buffer})

assert "rotation job" not in buffer.lower()
print({"event": "completed", "display_only": True, "text": buffer})
```

Buffered draft stream

```text
{'event': 'delta', 'preview': 'Account '}
{'event': 'delta', 'preview': 'Account acct_10234 '}
{'event': 'delta', 'preview': 'Account acct_10234 is eligible '}
{'event': 'delta', 'preview': 'Account acct_10234 is eligible under P-7.'}
{'event': 'completed', 'display_only': True, 'text': 'Account acct_10234 is eligible under P-7.'}
```

Provider streaming endpoints commonly send incremental events over a persistent HTTP response. Your job is to buffer them, handle cancellation, and validate the final assembled result before you act on it.
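
Cancellation and final validation can be added to the same buffering loop. The sketch below is illustrative: `consume` and the `cancelled` callback are hypothetical names, and the "must cite P-7" check stands in for whatever final validation the product actually needs.

```python
from collections.abc import Callable, Iterator


def fake_stream() -> Iterator[str]:
    yield from ["Account ", "acct_10234 ", "is eligible ", "under P-7."]


def consume(stream: Iterator[str], cancelled: Callable[[], bool]) -> dict:
    """Buffer deltas, stop on cancellation, and approve only a complete, validated result."""
    buffer = ""
    for chunk in stream:
        if cancelled():
            # Partial text may be shown, but it is never approved.
            return {"status": "cancelled", "text": buffer, "approved": False}
        buffer += chunk
    # Stand-in final check: the finished draft must cite the supplied policy line.
    approved = "P-7" in buffer
    return {"status": "completed", "text": buffer, "approved": approved}


print(consume(fake_stream(), cancelled=lambda: False))

# Simulate a user pressing "stop" after two chunks have arrived.
polls = iter([False, False, True])
print(consume(fake_stream(), cancelled=lambda: next(polls)))
```

The `approved` flag is decided once, after the stream ends, so nothing downstream can act on a half-finished draft.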

What to log and what to hide

You can't manage what you don't measure.

Each call should record enough information to debug, compare, and budget:

| Field | Example | Why it helps |
| --- | --- | --- |
| trace_id | trace_acct_10234_01 | Connects app logs to provider call. |
| task | decide_rotation | Groups cost by feature. |
| model | provider-model-id | Finds regressions after model changes. |
| prompt_version | rotation_decision@2 | Connects output to prompt history. |
| latency_ms | 840 | Tracks user experience. |
| input_tokens | 921 | Explains cost and context size. |
| output_tokens | 84 | Finds verbose outputs. |
| status | ok, timeout, invalid_output | Supports dashboards and alerts. |

Estimate usage without logging customer text

Hosted APIs may charge for input and output tokens. Rates depend on provider and model, so keep the calculation separate from current pricing. With illustrative rates of $2 per million input tokens and $6 per million output tokens, one rotation-decision request costs:

| Step | Tokens | Rate (per million) | Cost |
| --- | --- | --- | --- |
| Input (policy line + account context + instructions) | 250 | $2.00 | $0.0005 |
| Output (JSON fields) | 80 | $6.00 | $0.00048 |
| Total | | | ~$0.001 |

That looks tiny, but 10,000 decisions per day becomes about $9.80 per day, or $3,577 per year at those illustrative rates. Log usage fields per task and join them to the rate card selected by your application.
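
The arithmetic above can be reproduced in a few lines. `call_cost_usd` is a hypothetical helper, and the rates and volume are the same illustrative figures used in the text, not real pricing.

```python
def call_cost_usd(input_tokens: int, output_tokens: int,
                  input_rate: float, output_rate: float) -> float:
    """Cost of one call; rates are USD per million tokens."""
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


per_call = call_cost_usd(250, 80, input_rate=2.0, output_rate=6.0)
per_day = per_call * 10_000   # 10,000 decisions per day
per_year = per_day * 365

print(f"per call: ${per_call:.5f}")
print(f"per day:  ${per_day:.2f}")
print(f"per year: ${per_year:,.0f}")
```

Keeping the rates as arguments, rather than constants inside the function, matches the advice to separate the calculation from whatever rate card is current.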

The log record should help operators without copying customer text into telemetry. If you need to correlate repeated inputs, use a keyed fingerprint rather than a plain hash. The lab key below is another placeholder; production code injects it from secret storage.

record-redacted-usage.py

```python
import hmac
from hashlib import sha256

trace_id = "trace_acct_10234_01"
customer_question = "Can I rotate the stale service-account key for account acct_10234?"
fingerprint_key = b"dev-only-fingerprint-key"
usage = {"input_tokens": 250, "output_tokens": 80}
rates_per_million = {"input": 2.0, "output": 6.0}  # illustrative only
cost_usd = (
    usage["input_tokens"] * rates_per_million["input"]
    + usage["output_tokens"] * rates_per_million["output"]
) / 1_000_000
log_record = {
    "trace_id": trace_id,
    "task": "decide_rotation",
    "prompt_version": "rotation_decision@2",
    "input_fingerprint": hmac.new(fingerprint_key, customer_question.encode(), sha256).hexdigest()[:12],
    **usage,
    "estimated_cost_usd": round(cost_usd, 6),
    "status": "ok",
}

assert customer_question not in str(log_record)
print(log_record)
```

Redacted usage record

```text
{'trace_id': 'trace_acct_10234_01', 'task': 'decide_rotation', 'prompt_version': 'rotation_decision@2', 'input_fingerprint': '85ef707a963b', 'input_tokens': 250, 'output_tokens': 80, 'estimated_cost_usd': 0.00098, 'status': 'ok'}
```

Don't log raw prompts by default. A keyed fingerprint isn't anonymization either. Keep it only when correlation value justifies retention, protect its key, and apply access and retention controls.

Prompts can contain names, emails, API keys, private documents, or customer messages. OWASP's 2025 Top 10 for LLM Applications lists prompt injection and sensitive information disclosure among the central application risks; logs that retain prompt data extend the places where that data must be protected.[4] NIST's AI Risk Management Framework organizes risk work around governance, mapping, measuring, and managing risk, which supports traceable controls without defaulting to raw-input retention.[5]


Log IDs, usage, and failure classes. Add carefully redacted excerpts only when you have a clear retention policy, and add a keyed fingerprint only when you need repeat-input correlation.
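
If a redacted excerpt is justified, one possible shape is to replace known sensitive patterns first and truncate afterwards. The two patterns and the `redacted_excerpt` helper below are illustrative assumptions only; they are nowhere near a complete redaction rule set and would miss many kinds of private data.

```python
import re

# Illustrative patterns only; real redaction needs a reviewed, tested rule set.
PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"), "[EMAIL]"),
    (re.compile(r"\b[A-Za-z0-9_]{20,}\b"), "[TOKEN]"),  # long key-like strings
]


def redacted_excerpt(text: str, max_chars: int = 60) -> str:
    """Redact first, then truncate, so a cut never leaves half a secret behind."""
    for pattern, label in PATTERNS:
        text = pattern.sub(label, text)
    return text[:max_chars]


sample = "luna@example.com pasted key dev_key_not_a_real_secret into chat"
print(redacted_excerpt(sample))
```

Pattern-based redaction reduces exposure but does not guarantee it, which is why the default remains not logging raw text at all.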

Choosing the execution shape

Once the wrapper works, choose the execution shape that matches the job.

| Shape | Use when | Watch out for |
| --- | --- | --- |
| Synchronous request | User needs an immediate answer. | Timeout and loading state. |
| Streaming request | User benefits from partial text. | Partial output can fail later. |
| Background job | Task is slow, retryable, or long-running. | Needs status polling and persistence. |
| OpenAI Batch API | Many offline records can wait. | Results are asynchronous and need reconciliation.[6] |
| OpenAI prompt caching | A long stable prefix repeats across requests. | Hits require exact prefix matches, so put stable content first and inspect cached-token usage.[7] |

The beginner mistake is trying to use one shape for all work.

A support-agent draft, nightly audit-log enrichment job, and bulk rotation-policy evaluation don't need the same execution pattern.
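
The selection logic can be written down as a small decision function. This is a deliberately simplified sketch: the `Job` fields, the `execution_shape` helper, and the bulk threshold of 100 records are all assumptions, and real routing would weigh more factors.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    user_waiting: bool         # someone is watching a screen for the result
    partial_text_useful: bool  # a human will read output as it arrives
    record_count: int = 1


def execution_shape(job: Job, bulk_threshold: int = 100) -> str:
    """Pick an execution shape; the threshold is an illustrative assumption."""
    if not job.user_waiting:
        return "batch" if job.record_count >= bulk_threshold else "background_job"
    return "streaming" if job.partial_text_useful else "synchronous"


print(execution_shape(Job(user_waiting=True, partial_text_useful=True)))    # support-agent draft
print(execution_shape(Job(user_waiting=True, partial_text_useful=False)))   # single rotation decision
print(execution_shape(Job(user_waiting=False, partial_text_useful=False)))  # one slow offline task
print(execution_shape(Job(user_waiting=False, partial_text_useful=False, record_count=5000)))  # bulk evaluation
```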

Spot the problems

Read this bad design and write down at least three problems before you look at the answer:

```text
The `/rotation-decision` route calls the provider SDK directly.
It waits up to 90 seconds.
It retries all errors three times.
It returns whatever text the model produced.
It logs the full prompt and answer.
```

Take a minute to think. What's risky? What's missing?

Strong answers usually include these issues:

| Problem | Better design |
| --- | --- |
| SDK call is scattered in route. | Move it behind a wrapper with a task contract. |
| Long blocking timeout. | Use a shorter timeout or a background job with status polling. |
| Retries all errors. | Retry only safe temporary failures with a cap and backoff. |
| Trusts raw text. | Validate structured output before returning it to the caller. |
| Logs private prompt. | Log trace IDs, metadata, status, and redacted excerpts. |

If you named three of those, you've located real production failure paths.
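The "retry only safe temporary failures" fix can be sketched as follows. This is a minimal illustration, not the lesson's wrapper: `ProviderError`, `is_retryable`, and `call_with_retry` are hypothetical names, and the classification (retry 5xx-style outages and a 429 tagged `rate_limit`, never a 401 or an exhausted quota) is an assumption you would adapt to your provider's actual error codes. The delay uses the "full jitter" variant described in the backoff reference.[3]

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")

# Assumed set of transient server-side statuses; adjust to your provider.
TRANSIENT_STATUSES = {500, 502, 503, 504}


class ProviderError(Exception):
    """Hypothetical error type carrying an HTTP status and a provider error code."""

    def __init__(self, status: int, code: str = ""):
        super().__init__(f"provider error {status} {code}".strip())
        self.status = status
        self.code = code


def is_retryable(err: ProviderError) -> bool:
    if err.status == 429:
        # A rate limit may clear on its own; an exhausted quota will not.
        return err.code == "rate_limit"
    return err.status in TRANSIENT_STATUSES


def call_with_retry(
    call: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except ProviderError as err:
            # Give up immediately on permanent failures or when attempts run out.
            if not is_retryable(err) or attempt == max_attempts:
                raise
            # Full jitter: sleep a random time up to the capped exponential delay.
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, cap))
    raise AssertionError("unreachable")
```

Passing `sleep` as a parameter keeps the function testable without real waiting. This sketch suits read-only calls; side-effecting actions also need an idempotency key before any retry is safe.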

Carry the boundary into an app

A model call behaves like a production dependency only when the wrapper owns the operational contract: keep credentials on the server, bound waiting time, retry transient failures with jitter, make actions idempotent, validate policy-grounded outputs, and record useful telemetry without storing raw customer text.
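The "make actions idempotent" part of that contract can be illustrated with a small sketch. It assumes a hypothetical in-memory `RotationJobStore`; a real service would back the key lookup with durable storage and a uniqueness constraint so that concurrent repeats cannot both create a job.

```python
class RotationJobStore:
    """Hypothetical in-memory store that creates at most one job per idempotency key."""

    def __init__(self) -> None:
        self._jobs_by_key: dict = {}
        self._next_id = 1

    def create_rotation_job(self, idempotency_key: str, account_id: str) -> dict:
        # A repeated key returns the stored result instead of creating a second job.
        existing = self._jobs_by_key.get(idempotency_key)
        if existing is not None:
            return existing
        job = {
            "job_id": f"job_{self._next_id}",
            "account_id": account_id,
            "status": "queued",
        }
        self._next_id += 1
        self._jobs_by_key[idempotency_key] = job
        return job
```

If a client times out without knowing whether the job was created, it can resend the request with the same key and receive the original job rather than a duplicate.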

The next lesson will put this wrapper behind a small web workflow. Its form and database record are trustworthy only if this boundary already returns explicit success and failure states.

Mastery check

Key concepts

  • Secure API key handling and environment variables
  • Provider-agnostic LLM request contract
  • Timeouts, retry classification, jitter, and idempotent actions
  • Structured outputs and schema validation
  • Streaming response handling
  • Cost, latency, and error logging

Evaluation rubric

  • Foundational: Stores API keys in environment variables and keeps them out of source code
  • Foundational: Separates application intent from provider-specific API fields
  • Foundational: Turns a slow provider response into a named timeout state
  • Foundational: Retries only transient failures with bounded jittered backoff
  • Intermediate: Keeps side-effecting actions idempotent when a response may have been lost
  • Intermediate: Validates structured responses before trusting model output
  • Intermediate: Logs latency, token usage, model version, and failure reasons for each call
  • Intermediate: Explains when to use streaming, batch work, prompt caching, or background jobs

Common pitfalls

| Symptom | Cause | Fix |
| --- | --- | --- |
| Credentials appear in commits or browser bundles | API key was treated like public configuration | Load it only in server code through environment injection or a secret manager; rotate exposed keys. |
| One route retries while another logs raw prompts | Provider calls are scattered across endpoints | Send every request through one wrapper that owns timeout, validation, retry, and telemetry policy. |
| Support screen spins without a clear outcome | Client waits too long for one provider response | Set a deadline and return a named timeout state, stream only useful drafts, or enqueue long work. |
| Valid JSON claims a 90-day window for P-7 | Parsing was mistaken for business validation | Check cited policy lines and reject unsupported values. |
| Two rotation jobs exist for one account | An uncertain timeout caused a repeated side effect | Give action calls an idempotency key and return the stored result for repeats. |
| Logs contain customer messages or secrets | Raw prompts were retained by default | Log IDs, usage, failure classes, and keyed fingerprints only when needed. |
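The P-7 pitfall above shows why parsing alone is not validation. The sketch below is one hedged way to check a parsed response against cited policy lines. `POLICY_TABLE` and `validate_decision` are hypothetical, and the rule used here (the claimed window must not exceed the smallest window among the cited lines) is an assumption about what "supported" means for this policy.

```python
import json

# Hypothetical policy table mirroring the lesson's example: P-7 allows 30 days.
POLICY_TABLE = {"P-7": {"rotation_window_days": 30}}


def validate_decision(raw_text: str, policy_table: dict) -> dict:
    """Return {"ok": True, "value": ...} or {"ok": False, "reason": ...}."""
    try:
        data = json.loads(raw_text)
    except json.JSONDecodeError:
        return {"ok": False, "reason": "invalid_json"}
    if not isinstance(data, dict):
        return {"ok": False, "reason": "invalid_shape"}

    decision = data.get("decision")
    window = data.get("rotation_window_days")
    line_ids = data.get("source_line_ids")

    # Structural check: right fields, right types (bool is excluded from int).
    shape_ok = (
        isinstance(decision, str)
        and isinstance(window, int)
        and not isinstance(window, bool)
        and isinstance(line_ids, list)
        and len(line_ids) > 0
        and all(isinstance(i, str) for i in line_ids)
    )
    if not shape_ok:
        return {"ok": False, "reason": "invalid_shape"}

    # Business check: every cited line must exist in the policy table.
    if any(i not in policy_table for i in line_ids):
        return {"ok": False, "reason": "unknown_policy_line"}

    # Business check: the claimed window must be supported by the cited lines.
    allowed = min(policy_table[i]["rotation_window_days"] for i in line_ids)
    if window > allowed:
        return {"ok": False, "reason": "unsupported_window"}
    return {"ok": True, "value": data}
```

With this table, a response that cites P-7 but claims a 90-day window parses cleanly and is still rejected with `unsupported_window`, which gives the caller an explicit failure state instead of an unsupported value.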
Complete the lesson

Mastery Check

Answer every question, then check your score. Score above 75% to mark this lesson complete.

1. A browser support screen needs an LLM decision for account acct_10234. Which deployment plan keeps the provider credential behind the right boundary?
2. Two route handlers call the provider SDK directly. One retries five times and logs raw prompts; the other never retries and returns raw text. Which refactor fixes the boundary problem?
3. A read-only eligibility call can fail with 503, 429 rate_limit, 429 quota_exhausted, or 401. Which retry policy should the wrapper use?
4. The policy table says P-7 allows stale-key rotation within 30 days. A parsed model response is {"decision":"eligible","rotation_window_days":90,"source_line_ids":["P-7"]}. What should the wrapper do?
5. A team has four workloads: a long agent draft, one validated eligibility object, 50,000 offline scan records, and requests sharing an exact long policy prefix. Which mapping fits them?
6. A support UI has a 2-second deadline, but the provider has not responded when that deadline expires. What should the application boundary do?
7. A rotation-job request times out after reaching the server, so the client cannot tell whether a rotation job was created. How should the client and action boundary handle the retry?
8. A call uses 250 input tokens at $2 per million and 80 output tokens at $6 per million. Which telemetry plan records useful operations data with the correct estimated cost?
9. A UI streams a draft explanation while a rotation-job action remains possible. The final assembled stream fails validation. What should the handler do?

Next Step
Continue to First AI App End-to-End

You now know the API boundary: requests, retries, validation, and telemetry. The next chapter assembles those pieces into a usable AI workflow with inputs, model calls, outputs, and tests.

References

[1] OpenAI. Function calling. OpenAI API Docs, 2026.
[2] OpenAI. Structured outputs. 2024.
[3] Brooker, M. (AWS). Exponential Backoff And Jitter. 2015.
[4] OWASP Foundation. OWASP Top 10 for Large Language Model Applications. 2025.
[5] National Institute of Standards and Technology. Artificial Intelligence Risk Management Framework (AI RMF 1.0). 2023.
[6] OpenAI. OpenAI Batch API Guide. 2026.
[7] OpenAI. Prompt caching. 2026.