LeetLLM

Structured Output Generation

Build reliable LLM interfaces with JSON mode, structured outputs, schema validation, and grammar-guided decoding.


The previous lesson secured which documents can enter a prompt. This lesson secures what leaves the model: typed data that downstream code can parse, validate, and route without fragile string cleanup.

Structured output generation turns free-form model text into typed records that software can validate and route. This article explains why schemas, constrained decoding, and explicit recovery paths matter whenever LLM output feeds another system.

Imagine an e-commerce warehouse that processes customer emails automatically. A customer writes, "I want to return the blue jacket I bought last week. My order number is A102." Your pipeline needs to extract the order ID and the intent so it can route the message to the returns department and look up the order in the database.

You ask an LLM to extract this information and return JSON. If the model replies with "Sure! Here is the extracted data: {"order_id": "A102", "intent": "return"}", your parser fails on the preamble text before it ever reaches the opening brace. That single failure blocks an entire automation pipeline.

When LLM output feeds software, it must be machine-parseable and validated. A missing comma, an unexpected field name, or a polite introduction can break downstream processing. Structured output generation is the set of techniques that make model outputs conform to a defined schema, replacing fragile string cleanup.

This article covers the practical range from best-effort JSON mode to token-level constraint enforcement via grammar-guided decoding [1], open-source runtimes like Outlines and SGLang [2][3][4], and hosted patterns using OpenAI's Structured Outputs feature.[5]

[Figure: Constrained decoding visual showing candidate tokens, a grammar mask that keeps only the legal token, and a grammar-valid JSON path.]
Constrained decoding enforces structure inside the sampler. Illegal next-token choices disappear before the model can pick them.

The illustration above shows how a grammar engine filters the model's next-token probabilities. Invalid tokens (like a word when a number is required) are masked to negative infinity, so the sampler can only pick grammar-compliant tokens.

Why "please return JSON" is not enough

Before looking at solutions, let's see how naive prompting fails. Suppose you send this prompt to a chat model:

why-please-return-json-is-not-enough.py
prompt = """
Extract order_id and intent from this email:
"I'd like to return order A102. The blue jacket doesn't fit."
Return JSON.
"""

Here are three real ways this can go wrong:

| Failure mode | Example output | Why it breaks |
| --- | --- | --- |
| Preamble text | Here is the JSON: {"order_id": "A102"} | The parser sees Here before the brace and throws |
| Markdown wrapper | ```json\n{"order_id": "A102"}\n``` | The triple backticks and the json language tag are not valid JSON |
| Wrong shape | {"order_id": "A102", "intent": null} | Your database expects intent to be a string, not null |

None of these are model "hallucinations" in the usual sense. The model followed a weak natural-language request. The problem is that your request was underspecified. You asked for JSON, but you didn't enforce JSON.
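To make the three failure modes concrete, here is a minimal sketch that feeds hypothetical raw replies to Python's standard `json` parser. The replies and the `check_reply` helper are illustrative assumptions, not output captured from a real model.

```python
import json

fence = "`" * 3  # a Markdown code fence, built here to keep the example readable

# Hypothetical raw model replies, one per failure mode in the table above.
raw_replies = {
    "preamble": 'Here is the JSON: {"order_id": "A102", "intent": "return"}',
    "markdown_wrapper": f'{fence}json\n{{"order_id": "A102", "intent": "return"}}\n{fence}',
    "wrong_shape": '{"order_id": "A102", "intent": null}',
}


def check_reply(raw: str) -> str:
    """Classify a reply as a syntax failure, a shape failure, or usable."""
    try:
        record = json.loads(raw)
    except json.JSONDecodeError:
        return "syntax error"
    if not isinstance(record, dict) or not isinstance(record.get("intent"), str):
        return "wrong shape"
    return "ok"


results = {name: check_reply(raw) for name, raw in raw_replies.items()}
for name, outcome in results.items():
    print(f"{name}: {outcome}")
```

The first two replies never get past the parser. The third parses cleanly and only fails once code inspects the fields, which is why syntax checks alone aren't enough.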

The fix is to move from asking to enforcing. The rest of this article walks through the enforcement options, roughly from weakest to strongest.

Contract choices and enforcement mechanisms

Engineers need to choose both the contract exposed to application code and the mechanism that enforces it. These choices aren't a strict ladder: a hosted structured-output API may enforce a JSON Schema using constrained decoding internally, while a self-hosted runtime may expose grammar-guided decoding directly. Production machine-consumed output needs a checked contract, regardless of where enforcement runs.

| Choice | Structural contract | Typical implementation |
| --- | --- | --- |
| Prompt-based ("respond in JSON") | None; best-effort formatting only | Most chat APIs |
| JSON mode | Valid JSON syntax on successful completion, with edge cases to handle | OpenAI JSON mode, similar API flags |
| Strict tool calling | Schema-constrained arguments for an action request | Tool-calling APIs and agent runtimes |
| Structured-output API | Schema adherence for supported schemas | OpenAI Structured Outputs, schema-aware SDKs |
| Grammar-guided runtime | Valid path through a supplied grammar or schema translation | Outlines, SGLang, llama.cpp |
[Figure: Structured output choices comparing prompt-only formatting, JSON mode, schema-aware APIs, and grammar-guided runtimes.]
Choose a contract your code can rely on, then understand where enforcement occurs. Hosted schema APIs can use the same constrained-decoding idea exposed by self-hosted grammar runtimes.

To see how these differ in practice, imagine the same returns email. Here's what each tier gives you:

Prompt-based: The model might return valid JSON. It might return a monologue. You need a regex to clean the output, then a JSON parser, then a schema validator, then a retry loop. This is best-effort formatting, not enforcement.

JSON mode: The API constrains successful completions to valid JSON syntax. You still need to explicitly tell the model to produce JSON, detect incomplete outputs, and validate the shape yourself. JSON mode could still return {"foo": "bar"} when you wanted {"order_id": "string", "intent": "string"}. JSON mode is syntax enforcement, not schema enforcement.[5]

Structured-output APIs: The API takes a JSON Schema or Pydantic model and, when the request completes without refusal or truncation, returns output that matches supported schema features. The fields, types, and required properties are enforced within that supported subset.[5]

Grammar-guided decoding: A runtime prevents the sampler from picking tokens that would break its compiled grammar. At every step, invalid tokens have their probabilities set to zero. This describes an enforcement mechanism inside generation, not a stronger semantic guarantee: a grammar-valid payload can still contain wrong values.[1]

This executable comparison demonstrates the remaining gap after JSON syntax succeeds: an object can parse correctly and still fail its application schema.

json-syntax-versus-schema-contract.py
import json

from pydantic import BaseModel, ConfigDict, ValidationError

class DeliveryUpdate(BaseModel):
    model_config = ConfigDict(extra="forbid")

    order_id: str
    intent: str

syntax_valid_but_wrong_shape = json.loads('{"foo": "bar"}')
schema_valid = json.loads('{"order_id": "A102", "intent": "return"}')

try:
    DeliveryUpdate.model_validate(syntax_valid_but_wrong_shape)
except ValidationError:
    print("JSON-only result: rejected by schema")

print("structured record:", DeliveryUpdate.model_validate(schema_valid).model_dump())

Output

JSON-only result: rejected by schema
structured record: {'order_id': 'A102', 'intent': 'return'}

The fulfillment-lane analogy

Imagine a package moving through a fulfillment lane with no guides. It might reach the right chute, or it might drift into the wrong bin. This is prompt-based generation: you ask the model to "please produce JSON," but nothing stops it from outputting a monologue instead.

Grammar-guided decoding is like a guided sortation lane. Rails prevent the package from entering an invalid chute. In the context of an LLM, the rails are a finite state machine (FSM, a computational model that tracks transitions between allowed states) that monitors generation. If the schema requires an integer for quantity, the FSM blocks all non-digit tokens. The decoder can't select a letter token that violates the grammar, the same way a package can't leave the allowed lane.

How grammar-guided decoding works

To understand how runtimes enforce structural compliance, start by examining the token generation process itself. By intercepting the token sampling phase, the runtime restricts selection to tokens that match the desired schema. The following diagram illustrates how grammar constraints act as a filter during the token generation cycle:

[Diagram: LLM proposes next token, Grammar state, Mask invalid to -∞, and Sample valid token.]

Grammar-guided enforcement operates at the token sampling level. It transforms a schema or grammar into constraints that modify the model's output probabilities during generation.

From schema to logit masking

  1. Schema Compilation: For regular languages such as Regex, the runtime can compile the allowed strings into a Deterministic Finite Automaton (DFA) (a state machine where the next state is uniquely determined by the current state and input symbol). For general JSON schemas, especially nested or recursive ones, many systems instead compile to a context-free grammar or another stack-aware parser representation.
    • Example Schema: {"type": "object", "properties": {"age": {"type": "integer"}}}
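    • Simplified Regular Expression (Regex) Equivalent: \{"age":\s*[0-9]+\} (non-negative integers only; a full JSON integer pattern would also allow a leading minus sign and reject leading zeros)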
  2. State Tracking: As the model generates tokens, the engine tracks the current DFA state or parser stack for the active grammar.
  3. Logit Masking: At each step, the engine identifies which tokens are valid transitions.
    • Scenario: The model has generated {"age": .
    • Valid Next Characters / Bytes: 1, 2, 3, ... 9, 0, (space).
    • Invalid Next Characters / Bytes: ", a, b, {, [, etc.
    • Action: The runtime maps those allowed prefixes back to token IDs, then sets the logits (unnormalized scores that softmax turns into probabilities) of all invalid token IDs to $-\infty$. When softmax is applied, their probability becomes 0.

The toy code below shows the core masking idea without requiring a model checkpoint. The grammar state says that only { is legal as the next token, so every other candidate is assigned negative infinity before sampling:

from-schema-to-logit-masking.py
def mask_invalid_tokens(logits: dict[str, float], valid_tokens: set[str]) -> dict[str, float]:
    return {
        token: score if token in valid_tokens else float("-inf")
        for token, score in logits.items()
    }

next_token_logits = {
    "The": 0.45,
    "{": 0.30,
    "Sure": 0.15,
    "JSON": 0.10,
}

masked = mask_invalid_tokens(next_token_logits, valid_tokens={"{"})
chosen = max(masked, key=masked.get)
valid_after_mask = [token for token, score in masked.items() if score != float("-inf")]
print("valid after mask:", valid_after_mask)
print("chosen token:", chosen)

Output

valid after mask: ['{']
chosen token: {
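The toy example picks the highest-scoring legal token. Real samplers usually draw from a probability distribution, so it helps to see that a logit of negative infinity turns into a probability of exactly zero after softmax. The sketch below assumes hypothetical logits for the state right after {"age": and assumes at least one token is legal; it uses only the standard library.

```python
import math
import random


def softmax(logits: dict[str, float]) -> dict[str, float]:
    """Turn logits into probabilities; a logit of -inf becomes exactly 0.0."""
    peak = max(logits.values())  # subtract the max for numerical stability
    weights = {token: math.exp(score - peak) for token, score in logits.items()}
    total = sum(weights.values())
    return {token: weight / total for token, weight in weights.items()}


# Hypothetical logits: the model "prefers" a quote, but the grammar needs a digit or space.
raw_logits = {'"': 2.5, "a": 1.9, "4": 1.2, "2": 0.8, " ": 0.1}
legal = {"4", "2", " "}

masked = {t: (s if t in legal else float("-inf")) for t, s in raw_logits.items()}
probabilities = softmax(masked)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
samples = rng.choices(list(probabilities), weights=list(probabilities.values()), k=1000)

print({token: round(p, 3) for token, p in probabilities.items()})
print("illegal tokens sampled:", sum(token not in legal for token in samples))
```

The remaining probability mass is renormalized across the legal tokens, so the model's relative preferences among valid choices are preserved.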

This enforces structural adherence to the allowed grammar as long as generation can continue normally. In production, you still need to handle refusals, truncation, and semantic mistakes in the returned values.[5]

The tokenizer alignment problem

The clean DFA story above hides the hardest systems detail. Grammars are usually written over characters or bytes, but LLMs don't sample characters. They sample subword tokens. That means the runtime must answer a much harder question than "is } valid here?" It must answer "which of the 100,000+ vocabulary items could extend some valid character prefix from this state?" [1]

That token-to-grammar alignment step is where a lot of the real engineering work lives. Outlines precomputes an index used during guided generation, while SGLang's paper explicitly calls out compressed finite state machines for faster structured decoding.[1][3] If an implementation does this naively, masking can become a noticeable part of per-token latency.

Hosted APIs can hide some of that machinery, but not its cost. OpenAI's docs note this first-schema latency directly: the first request with any new schema can be slower while the API processes the schema, and later requests with the same schema reuse that work. A separate note adds extra first-request latency for fine-tuned models specifically.[5]

Regex vs. context-free grammars

When working with flat data structures, regular expressions (Regex) compiled to DFAs work well. But what if your JSON object contains lists of other objects, or deeply nested dictionaries?

Regular-language constraints can't represent arbitrarily nested structures like general JSON. To enforce nested or recursive schemas, engines need a Context-Free Grammar (CFG) or another stack-aware parser representation. Such representations track open brackets [ and { until matching closing brackets ] and }. Modern libraries handle translation automatically; benchmark nested schemas because parser state and output size can affect latency.

Open-source engines: Outlines, llama.cpp, SGLang, and XGrammar

While hosted Structured Outputs APIs are convenient, open-source runtimes give you direct control over constrained decoding for self-hosted models. For teams deploying open-weight models, managed API features may not fit. Instead, they use inference engines or libraries that support constrained generation natively and can cache compiled grammars or shared prefixes inside the serving stack.

Outlines

Outlines [1][2] provides a high-level Python interface for structured generation. In current releases, you build an Outlines model once, then pass the target type when you call it. The model wrapper reuses the tokenizer and generation machinery across calls, while the schema stays explicit at the call site.

This is an integration example, not a no-dependency local script. It requires outlines, transformers, pydantic, a compatible model backend, and enough local compute to load the chosen model.

outlines.py
from typing import Literal

import outlines
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer

class Character(BaseModel):
    name: str
    role: Literal["Warrior", "Mage", "Rogue"]
    level: int

model_name = "microsoft/Phi-3-mini-4k-instruct"
model = outlines.from_transformers(
    AutoModelForCausalLM.from_pretrained(model_name, device_map="auto"),
    AutoTokenizer.from_pretrained(model_name),
)

raw = model(
    "Create a level-5 fantasy RPG character.",
    output_type=Character,
    max_new_tokens=120,
)
character = Character.model_validate_json(raw)

Enums and required fields can be enforced during decoding, but business rules still belong in post-validation. For example, if level must be between 1 and 100, validate that in your application even if the grammar already narrows the shape.
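A minimal sketch of that post-validation step in plain Python. It doesn't use any Outlines API; `CharacterRecord`, `check_business_rules`, and the 1 to 100 range are illustrative stand-ins for whatever rules your application owns.

```python
from dataclasses import dataclass

ALLOWED_ROLES = {"Warrior", "Mage", "Rogue"}


@dataclass(frozen=True)
class CharacterRecord:
    """Plain stand-in for the already-parsed, grammar-valid record."""
    name: str
    role: str
    level: int


def check_business_rules(record: CharacterRecord) -> list[str]:
    """Return rule violations; an empty list means the record passes."""
    problems = []
    if not record.name.strip():
        problems.append("name must not be blank")
    if record.role not in ALLOWED_ROLES:
        problems.append(f"unknown role: {record.role}")
    if not 1 <= record.level <= 100:
        problems.append(f"level {record.level} is outside 1..100")
    return problems


# Both records have the right shape; only the first satisfies the business rule.
print(check_business_rules(CharacterRecord("Aria", "Mage", 5)))
print(check_business_rules(CharacterRecord("Aria", "Mage", 250)))
```

Returning a list of problems instead of raising on the first one makes it easier to log every violation or feed them into a bounded retry.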

llama.cpp

llama.cpp exposes grammar constraints directly through GBNF and can also convert a subset of JSON Schema into grammars for its server and CLI flows.[6][7] One subtle but important detail from its docs is that the schema is used to constrain decoding, not automatically to teach the model what the fields mean. For plain structured generation, you still want prompt instructions that explain the task and the semantics of the fields you're asking for.[7]

SGLang

SGLang (Structured Generation Language) [3][4] goes further by optimizing the runtime for structured workloads. Its paper describes two separate ideas that are easy to conflate:

  1. RadixAttention reuses KV cache for shared token prefixes across requests.
  2. Compressed finite state machines speed up structured decoding itself.[3]

Under the hood, SGLang stores token-prefix mappings in a radix tree and reuses previously computed KV cache when a new request shares a prompt prefix.[3] That helps Time To First Token (TTFT) when many requests reuse the same system prompt, few-shot examples, or schema instructions. It's a prefix-caching optimization, not a grammar-state cache.
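The prefix-reuse idea can be sketched with an uncompressed token trie. This is a toy under stated assumptions: the token IDs are made up, the `PrefixCache` class is hypothetical, and a real radix tree compresses chains of nodes and manages eviction, none of which appears here.

```python
class PrefixCache:
    """Toy token trie recording which token prefixes already have cached KV state."""

    def __init__(self) -> None:
        self.root: dict[int, dict] = {}

    def insert(self, tokens: list[int]) -> None:
        """Record that KV state exists for every prefix of this token sequence."""
        node = self.root
        for token in tokens:
            node = node.setdefault(token, {})

    def cached_prefix_length(self, tokens: list[int]) -> int:
        """Count how many leading tokens of a new request are already cached."""
        node = self.root
        matched = 0
        for token in tokens:
            if token not in node:
                break
            node = node[token]
            matched += 1
        return matched


# Hypothetical token IDs for a shared system prompt plus schema instructions.
shared_prompt = [101, 7, 7, 42, 9]
request_a = shared_prompt + [500, 501]
request_b = shared_prompt + [600]

cache = PrefixCache()
cache.insert(request_a)

reused = cache.cached_prefix_length(request_b)
print("tokens reused:", reused)
print("tokens left to prefill:", len(request_b) - reused)
```

Only the tokens after the shared prefix need a fresh prefill pass, which is where the TTFT benefit comes from.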

SGLang also provides a domain-specific language (DSL, a specialized syntax for a particular application domain) for interleaving Python control flow with LLM generation. Its structured-output documentation exposes JSON Schema, regex, and grammar constraints; its runtime design separately addresses cache reuse.[4][3]

XGrammar

XGrammar [8] is a grammar engine built specifically to make context-free-grammar decoding cheap enough for production serving. It attacks the tokenizer-alignment cost from the previous section head on. The core idea is to split the vocabulary into context-independent tokens, whose validity can be precomputed regardless of the parser stack, and context-dependent tokens, which must be checked at runtime against the current stack. A persistent stack and overlap with GPU execution then shrink the per-token mask cost further.[8]

The paper reports up to 100x faster grammar processing than evaluated prior approaches and near-zero structured-generation overhead in its end-to-end serving experiments.[8] Serving stacks can expose engines such as XGrammar behind a higher-level structured-output interface; verify which backend and benchmark settings your deployment actually uses.

Function calling vs. structured outputs

Both techniques constrain model outputs, but they solve different orchestration problems in an AI system. It's common to blur them together because both can involve JSON schemas. Some APIs also allow strict tool-argument schemas, but the core distinction still holds: function calling is about selecting actions, while structured outputs are about returning data in a fixed contract.

| Aspect | Function Calling | Structured Outputs |
| --- | --- | --- |
| Primary Goal | Action Selection (Tool Use) | Data Extraction / Formatting |
| Trigger | Application may allow or require a tool call; model supplies arguments | Application requests a format for the returned record |
| Output | Function name + arguments | Arbitrary JSON object |
| Control Flow | Loop: Model -> Code -> Model | Linear: Model -> Parser -> App |

When to use function calling

Use function calling when the model is an agent that needs to interact with the world. In a typical tool loop, the application exposes allowed tools and may let the model choose one or require a specific call. The application executes an accepted call, then returns its result to the model in a subsequent turn. This allows the agent to formulate a final user-facing response based on fetched data.

Here is an example of defining a tool for an agent to fetch order status. If tool choice is left automatic, the model can either call the function with structured arguments or return a normal text response:

when-to-use-function-calling.py
tools = [
    {
        "type": "function",
        "name": "get_order_status",
        "description": "Fetch current fulfillment status for an order.",
        "strict": True,
        "parameters": {
            "type": "object",
            "properties": {
                "order_id": {"type": "string"}
            },
            "required": ["order_id"],
            "additionalProperties": False
        }
    }
]

tool_schema = tools[0]
example_routes = {
    "Where is order A102?": "get_order_status",
    "Hi!": "text_response",
}
print("tool:", tool_schema["name"])
print("strict:", tool_schema["strict"])
print("routes:", example_routes)

Output

tool: get_order_status
strict: True
routes: {'Where is order A102?': 'get_order_status', 'Hi!': 'text_response'}

If your provider supports strict tool schemas, enable them. That improves argument reliability, but the control flow is still tool invocation rather than terminal data extraction. On OpenAI, manual strict tool schemas should set strict: true, list every parameter in required, and close objects with additionalProperties: false. parallel_tool_calls=false is useful when your application expects zero or one call, but that's an orchestration choice, not the definition of strictness.[5]
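Strict schemas narrow what the model can propose, but application code still decides what runs. Here is a minimal sketch of the code side of the Model -> Code -> Model loop. The call dictionary shape, the `execute_tool_call` helper, and the stubbed `get_order_status` are assumptions made for illustration, not any provider's SDK types.

```python
import json


def get_order_status(order_id: str) -> dict:
    """Stand-in for a real order lookup; always reports 'shipped'."""
    return {"order_id": order_id, "status": "shipped"}


TOOLS = {"get_order_status": get_order_status}
EXPECTED_ARGS = {"get_order_status": {"order_id"}}


def execute_tool_call(call: dict) -> dict:
    """Check a model-proposed call against the allowlist, then run it."""
    name = call["name"]
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name}")
    arguments = json.loads(call["arguments"])
    if set(arguments) != EXPECTED_ARGS[name]:
        raise ValueError(f"unexpected arguments for {name}: {sorted(arguments)}")
    return TOOLS[name](**arguments)


# Assumed shape of a model-proposed call: a tool name plus JSON-encoded arguments.
proposed_call = {
    "call_id": "call_1",
    "name": "get_order_status",
    "arguments": '{"order_id": "A102"}',
}
result = execute_tool_call(proposed_call)

# The result goes back to the model in the next turn so it can write the final reply.
tool_result_turn = {"call_id": proposed_call["call_id"], "output": json.dumps(result)}
print(tool_result_turn)
```

Keeping the allowlist and argument check in your own code means a malformed or unexpected call fails loudly before anything touches the order system.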

When to use structured outputs

Use structured outputs when you need to extract data or ensure a reliable interface between the LLM and your code. Unlike the multi-turn loop of function calling, structured outputs are typically terminal, single-turn actions used purely for data formatting and strict schema adherence.

In this example, we define a Pydantic model for a delivery update and pass it directly to OpenAI's parsing helper. The SDK handles the JSON schema conversion for you, and output_parsed gives you a typed object back when the model succeeds.[5] This snippet requires the OpenAI Python SDK and an OPENAI_API_KEY.

when-to-use-structured-outputs.py
from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()

class DeliveryUpdate(BaseModel):
    order_id: str
    status: str
    eta: str

response = client.responses.parse(
    model="gpt-4o-mini",
    input="Order A102 is delayed and now expected Friday.",
    text_format=DeliveryUpdate,
)

update = response.output_parsed

One provider-specific gotcha: hosted structured-output APIs usually implement a subset of JSON Schema, not the full spec. On OpenAI, that means root-level anyOf is not allowed, every field must be required, and every object must opt into closed-world generation with additionalProperties: false.[5] Treat schema design as part of your API integration, not only a prompt-writing detail.
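If you hand-write JSON Schemas, a small pre-flight check can catch those three constraints before a request is sent. The sketch below is an assumption-laden helper, not part of any SDK: `strict_schema_problems` is a hypothetical name, and it checks only the three rules named above, not the provider's full list of supported keywords.

```python
def strict_schema_problems(schema: dict, path: str = "$") -> list[str]:
    """List places where a schema breaks the three rules described above."""
    problems = []
    if path == "$" and "anyOf" in schema:
        problems.append("$: root-level anyOf is not allowed")
    if schema.get("type") == "object":
        properties = schema.get("properties", {})
        missing = sorted(set(properties) - set(schema.get("required", [])))
        if missing:
            problems.append(f"{path}: fields not listed in required: {missing}")
        if schema.get("additionalProperties") is not False:
            problems.append(f"{path}: additionalProperties must be false")
        # Recurse so nested objects are held to the same rules.
        for name, subschema in properties.items():
            problems.extend(strict_schema_problems(subschema, f"{path}.{name}"))
    elif schema.get("type") == "array" and isinstance(schema.get("items"), dict):
        problems.extend(strict_schema_problems(schema["items"], f"{path}[]"))
    return problems


loose_schema = {
    "type": "object",
    "properties": {"order_id": {"type": "string"}, "eta": {"type": "string"}},
    "required": ["order_id"],
}
strict_schema = {**loose_schema, "required": ["order_id", "eta"], "additionalProperties": False}

print(strict_schema_problems(loose_schema))
print(strict_schema_problems(strict_schema))
```

Running a check like this in CI keeps schema mistakes out of production requests, where they would otherwise surface as API errors.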

[Figure: Structured output validation pipeline separating schema validation from semantic business-rule validation.]
Schema-valid JSON is only the first gate. Application code still checks whether the values are true, authorized, and usable before anything downstream trusts them.

The next example keeps these gates separate. Pydantic checks shape and enum values; application code checks whether the order belongs to the authenticated customer.

separate-format-from-semantic-validation.py
from typing import Literal

from pydantic import BaseModel, ConfigDict

class DeliveryUpdate(BaseModel):
    model_config = ConfigDict(extra="forbid")

    order_id: str
    status: Literal["processing", "shipped", "delayed"]

orders_by_customer = {"customer-7": {"A102"}}

def can_show_update(customer_id: str, update: DeliveryUpdate) -> bool:
    return update.order_id in orders_by_customer.get(customer_id, set())

parsed = DeliveryUpdate.model_validate({"order_id": "A102", "status": "delayed"})
print("format gate:", parsed.model_dump())
print("owner allowed:", can_show_update("customer-7", parsed))
print("other customer allowed:", can_show_update("customer-8", parsed))

Output

format gate: {'order_id': 'A102', 'status': 'delayed'}
owner allowed: True
other customer allowed: False

Production patterns

Moving from a prototype that occasionally outputs valid JSON to a production system that processes thousands of requests reliably requires defensive engineering. The following patterns address the most common failure modes and lifecycle challenges associated with structured output generation.

1. Preserve useful intermediate evidence for hard tasks

A subtle but important pattern: strict structured output can hurt task quality on some hard problems when the schema is too tight. A controlled study found measurable reasoning-accuracy reductions under format restrictions on evaluated tasks such as GSM8K, with outcomes affected by format and field ordering.[9]

Don't respond by storing unrestricted chain-of-thought. Instead, design product-visible intermediate fields that can be checked: cited evidence, extracted quantities, calculation steps needed by a tutor, or a reason code used by a reviewer. If a workflow genuinely needs those fields, place them before the final decision and validate them independently.

1-validate-visible-evidence-before-decision.py
from typing import Literal

from pydantic import BaseModel, model_validator

class ReturnDecision(BaseModel):
    evidence_quote: str
    order_id: str
    intent: Literal["return", "exchange", "unknown"]

    @model_validator(mode="after")
    def cited_order_appears_in_evidence(self) -> "ReturnDecision":
        if self.order_id not in self.evidence_quote:
            raise ValueError("order ID is not supported by evidence")
        return self

decision = ReturnDecision.model_validate(
    {
        "evidence_quote": "I want to return the blue jacket from order A102.",
        "order_id": "A102",
        "intent": "return",
    }
)
print("decision:", decision.intent)
print("evidence checked:", decision.order_id in decision.evidence_quote)

Output

decision: return
evidence checked: True

This is not about recovering a hidden "true" chain of thought. The contract exposes evidence that a product or reviewer can inspect when the final decision is wrong.

2. Explicit intermediate fields for complex extraction

A common pitfall is forcing a difficult decision into a final field with no checkable support. For extraction, tutoring, or multi-step workflows, explicit evidence or bounded calculation fields can make the result easier to validate before code acts on it.

The schema below follows the same pattern as OpenAI's math-tutor examples: return a bounded list of explicit steps plus the final answer.

2-explicit-intermediate-fields-for-complex.py
from pydantic import BaseModel

class Step(BaseModel):
    explanation: str
    output: str

class MathSolution(BaseModel):
    steps: list[Step]
    final_answer: str

# Use this when the intermediate fields are valuable to the application.
# Keep them intentional and bounded rather than dumping unstructured prose.

3. Handling schema evolution

Schemas change. If producers and consumers disagree about a payload version, a rollout can break downstream processing even when each individual response is valid JSON.

Production tip: Version your schemas in application code. Choose the validator before sending the request, then attach the corresponding version to the validated record or require a fixed version literal and verify it. Don't let the model choose which contract it claims to satisfy.

3-route-versioned-records-with-code-owned-contracts.py
from typing import Literal

from pydantic import BaseModel

class DeliveryV1(BaseModel):
    order_id: str
    status: str

class DeliveryV2(BaseModel):
    order_id: str
    status: str
    carrier: str

class VersionedDelivery(BaseModel):
    schema_version: Literal["2"]
    payload: DeliveryV2

generated_payload = {"order_id": "A102", "status": "shipped", "carrier": "UPS"}
validated = DeliveryV2.model_validate(generated_payload)
wire_record = VersionedDelivery(schema_version="2", payload=validated)
print("version:", wire_record.schema_version)
print("carrier:", wire_record.payload.carrier)

Output

version: 2
carrier: UPS

4. Recover by failure class, without weakening the contract

Even with structured outputs, failures can happen. The important distinction is between recoverable failures (for example, truncation from max_output_tokens) and policy outcomes such as refusals or content filtering. Don't treat a refusal as an ordinary parse failure and then retry with a looser mode.[5]

Truncation doesn't justify switching from a schema-enforced response to JSON mode: the missing content is still missing and the fallback loses schema enforcement. Prefer a bounded retry with more output budget, a smaller contract, or chunked input. The fake client below mirrors response states so you can unit-test that control flow without an API call:

4-bounded-retry-preserves-schema-contract.py
from dataclasses import dataclass

from pydantic import BaseModel

class DeliveryUpdate(BaseModel):
    order_id: str
    status: str
    eta: str

# Minimal stand-ins for the response objects, just enough to test control flow.
@dataclass
class ContentItem:
    type: str
    refusal: str | None = None

@dataclass
class MessageOutput:
    content: list[ContentItem]

@dataclass
class IncompleteDetails:
    reason: str

@dataclass
class FakeResponse:
    status: str
    output: list[MessageOutput]
    output_parsed: DeliveryUpdate | None = None
    incomplete_details: IncompleteDetails | None = None

class FakeResponsesApi:
    def __init__(self, responses: list[FakeResponse]) -> None:
        self.responses = iter(responses)

    def parse(self, **kwargs) -> FakeResponse:
        # Ignore the request arguments and replay the scripted responses in order.
        return next(self.responses)

class FakeClient:
    def __init__(self, responses: list[FakeResponse]) -> None:
        self.responses = FakeResponsesApi(responses)

def generate_with_bounded_retry(client: FakeClient, input_text: str) -> DeliveryUpdate:
    response = client.responses.parse(
        model="gpt-4o-mini",
        input=input_text,
        text_format=DeliveryUpdate,
        max_output_tokens=120,
    )

    first_content = response.output[0].content[0]

    # A refusal is a policy outcome: surface it and never retry it.
    if first_content.type == "refusal":
        raise RuntimeError(f"policy refusal: {first_content.refusal}")

    if response.status == "completed" and response.output_parsed is not None:
        return response.output_parsed

    if response.status != "incomplete":
        raise RuntimeError(f"Unexpected response status: {response.status}")

    if response.incomplete_details is None:
        raise RuntimeError("Incomplete response did not include a reason")

    # Only truncation is retryable; other incomplete reasons go to the caller.
    if response.incomplete_details.reason != "max_output_tokens":
        raise RuntimeError(
            f"Structured output halted: {response.incomplete_details.reason}"
        )

    # One bounded retry with the same schema and a larger output budget.
    retry = client.responses.parse(
        model="gpt-4o-mini",
        input=input_text,
        text_format=DeliveryUpdate,
        max_output_tokens=400,
    )
    if retry.status != "completed" or retry.output_parsed is None:
        raise RuntimeError("bounded retry did not produce a structured result")
    return retry.output_parsed

output_item = MessageOutput(content=[ContentItem(type="output_text")])
incomplete = FakeResponse(
    status="incomplete",
    output=[output_item],
    incomplete_details=IncompleteDetails(reason="max_output_tokens"),
)
completed = FakeResponse(
    status="completed",
    output=[output_item],
    output_parsed=DeliveryUpdate(order_id="A102", status="delayed", eta="Friday"),
)
update = generate_with_bounded_retry(
    FakeClient([incomplete, completed]),
    "Order A102 is delayed and now expected Friday.",
)
print("retry preserved contract:", update.model_dump())

refused = FakeResponse(
    status="completed",
    output=[MessageOutput(content=[ContentItem(type="refusal", refusal="blocked")])],
)
try:
    generate_with_bounded_retry(FakeClient([refused]), "disallowed request")
except RuntimeError as exc:
    print("refusal routed:", str(exc))
Output
retry preserved contract: {'order_id': 'A102', 'status': 'delayed', 'eta': 'Friday'}
refusal routed: policy refusal: blocked
Recovery flow for structured outputs: a strict schema request can use a bounded schema-preserving retry after truncation, while refusals route to policy handling.
Recovery should preserve the contract. Retry truncation with a bounded budget or smaller task, but route refusals and content-filter stops to policy handling.

5. Flatten deeply nested structures

Deeply nested or recursive schemas can increase grammar state, output length, and debugging complexity. Some hosted APIs support recursive schemas, but support doesn't establish acceptable latency for your workload.[5] Benchmark nested contracts on the target runtime.

Production tip: Keep nesting as shallow as your interface allows, bound recursive outputs, and benchmark the exact schema on your target runtime. If you need tree-shaped data, a flat list of nodes with parent_id references is often easier to generate, validate, and evolve than a deeply recursive JSON object.

5-reconstruct-a-flat-tree-after-validation.py
from pydantic import BaseModel

class FlatNode(BaseModel):
    node_id: str
    parent_id: str | None
    label: str

nodes = [
    FlatNode(node_id="root", parent_id=None, label="shipment"),
    FlatNode(node_id="n1", parent_id="root", label="carrier"),
    FlatNode(node_id="n2", parent_id="root", label="ETA"),
]
children: dict[str | None, list[str]] = {}
for node in nodes:
    children.setdefault(node.parent_id, []).append(node.label)

print("root nodes:", children[None])
print("shipment fields:", children["root"])
Output
root nodes: ['shipment']
shipment fields: ['carrier', 'ETA']

Performance considerations

Grammar-guided decoding computes the set of valid next tokens at every generation step, although optimized engines may hide or greatly reduce the observable overhead for particular workloads. Schema compliance doesn't make latency irrelevant. Measure on the model, schema, batch shape, and runtime you plan to ship.
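To see where the per-step work comes from, here is a toy mask computation. It is a deliberately naive sketch: real engines walk a compiled grammar over token IDs, whereas this version enumerates a tiny made-up set of valid outputs. `VOCAB`, `VALID_OUTPUTS`, and `allowed_tokens` are illustrative names.

```python
# Toy vocabulary and the only two outputs the "grammar" accepts (both made up).
VOCAB = ['{"status":"', "shipped", "delayed", "lost", '"}', "Sure", "!"]
VALID_OUTPUTS = ['{"status":"shipped"}', '{"status":"delayed"}']


def allowed_tokens(prefix: str) -> list[str]:
    """Tokens that keep prefix + token a prefix of at least one valid output."""
    return [
        token
        for token in VOCAB
        if any(valid.startswith(prefix + token) for valid in VALID_OUTPUTS)
    ]


print(allowed_tokens(""))
print(allowed_tokens('{"status":"'))
print(allowed_tokens('{"status":"shipped'))
```

Every call scans the whole vocabulary, which is the cost that precomputed tables and cached automaton states are designed to avoid.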

Where the latency comes from

The latency impact depends on the schema, tokenizer, and runtime design. The main cost centers are:

| Cost source | Why it appears | Common mitigation |
| --- | --- | --- |
| Grammar compilation | The runtime has to convert a schema or grammar into an indexable guide | Compile once and reuse it across requests |
| Per-token masking | Each generation step must compute the valid next-token set | Precompute token-prefix tables, compressed FSMs, or categorize context-independent tokens (XGrammar) |
| First use of a hosted schema | The provider may preprocess and cache a new schema before generation starts | Reuse stable schemas and warm hot paths ahead of time |
| Large prompt prefixes | Long schema instructions still consume prefill work and context | Use server-side structured outputs or prefix caching |

This is why interviewers often ask about TTFT (time to first token) versus TPOT (time per output token). Compilation, hosted schema preprocessing, and large prompt prefixes mostly affect TTFT. Token masking affects TPOT. Systems such as Outlines, SGLang, and XGrammar focus on reducing those costs with precomputation, token categorization, and cache reuse rather than ignoring the cost.[1][3][8][5]
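Measuring the two numbers separately only needs the arrival time of each streamed token. The sketch below assumes you have already recorded those timestamps; `ttft_and_tpot` is a hypothetical helper, the timestamps are made up for illustration, and TPOT is taken here as the mean gap between tokens after the first (definitions vary between tools).

```python
def ttft_and_tpot(request_start: float, token_times: list[float]) -> tuple[float, float]:
    """Return (time to first token, mean time per later token) in seconds."""
    if not token_times:
        raise ValueError("no tokens were produced")
    ttft = token_times[0] - request_start
    if len(token_times) == 1:
        return ttft, 0.0
    tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    return ttft, tpot


# Illustrative timestamps: a first request that pays a one-time setup cost,
# then a later request with the same schema.
cold = ttft_and_tpot(0.0, [0.5, 0.75, 1.0, 1.25])
warm = ttft_and_tpot(0.0, [0.25, 0.5, 0.75, 1.0])
print("cold:", cold)
print("warm:", warm)
```

In this made-up trace only TTFT differs between the two requests, which is the signature you would expect from compilation or schema preprocessing rather than from per-token masking.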

The quality-compliance tradeoff

Constraining the output space can affect generation quality because a grammar eliminates paths outside the contract. The effect depends on task, model, and schema; format-restriction studies show it can be measurable on reasoning tasks.[9]

Production tip: Keep your schemas semantically permissive, but structurally stable. Prefer a stable field set with nullable values or bounded enums over a maze of branching object variants. If your provider uses strict mode, supported-schema limits and closed-object requirements become part of the interface contract.[5]

Batched constrained generation

When processing multiple requests with the same schema (common in data extraction pipelines), the FSM can be compiled once and reused across requests. This avoids repeated per-request compilation work when the runtime supports that reuse.

The code below demonstrates the same batching pattern using current Outlines calls. You build the model wrapper once at startup, then reuse it across prompts with the same target type:

batched-constrained-generation.py
# Assumes ExtractedEntity (a Pydantic model) and batch_prompts (a list of
# prompt strings) are defined elsewhere in the application.

# Step 1: Build the Outlines model once at startup
# model = outlines.from_transformers(...)

# Step 2: Reuse the model with the same schema for each prompt
results = [
    ExtractedEntity.model_validate_json(
        model(prompt, output_type=ExtractedEntity, max_new_tokens=120)
    )
    for prompt in batch_prompts
]
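The compile-once idea can also be sketched without a model runtime. In the example below, `compile_guide` is a hypothetical stand-in for an expensive grammar compile, and `functools.lru_cache` provides the reuse; whether a real runtime caches this way is something to confirm in its documentation.

```python
import json
from functools import lru_cache


@lru_cache(maxsize=32)
def compile_guide(schema_json: str) -> frozenset[str]:
    """Stand-in for an expensive compile; returns the schema's field names."""
    return frozenset(json.loads(schema_json)["properties"])


def guide_for(schema: dict) -> frozenset[str]:
    # Canonical JSON (sorted keys) makes equal schemas share one cache entry.
    return compile_guide(json.dumps(schema, sort_keys=True))


schema = {"type": "object", "properties": {"order_id": {"type": "string"}}}
for _ in range(3):
    guide_for(schema)

info = compile_guide.cache_info()
print("compiles:", info.misses, "cache hits:", info.hits)
```

Keying the cache on a canonical serialization matters: two dicts with the same content but different key order would otherwise compile twice.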

SGLang takes this further with prefix-aware KV cache reuse. If multiple requests share the same prompt prefix, the runtime can reuse the cached state for that prefix instead of rebuilding it from scratch.[3]

Debugging common failures

Even with structured outputs, things go wrong. The difference between a prototype and a production system is knowing what failure looks like, why it happens, and how to fix it. This section turns the most common misconceptions into a debugging guide.

"The model returned valid JSON, so the data must be correct"

Symptom: Your pipeline parses the output successfully, but the values are nonsense. A delivery ETA reads "yesterday" for a package that hasn't shipped yet. A product ID doesn't exist in your catalog.

Cause: Structured outputs enforce format, not accuracy. The model can produce valid JSON with correct types while the values are still fabricated. A {"eta": "yesterday"} value satisfies the schema but is wrong for the actual order record.

Fix: Add application-layer semantic validation. Check that dates are in the future, that order IDs exist in your database, and that enum values match your known set. Use Pydantic validators or explicit Python checks after parsing; avoid relying on bare assert statements, which Python skips when run with -O.
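A minimal sketch of that semantic layer, using only the standard library. `KNOWN_ORDERS` is a hypothetical stand-in for a database lookup, `semantic_errors` is an illustrative helper, and the ISO date format for `eta` is an assumption about the contract.

```python
from datetime import date

KNOWN_ORDERS = {"A102", "A103"}  # hypothetical stand-in for a database lookup


def semantic_errors(record: dict, today: date) -> list[str]:
    """Check a schema-valid record against application state."""
    errors = []
    if record["order_id"] not in KNOWN_ORDERS:
        errors.append("unknown order_id")
    try:
        eta = date.fromisoformat(record["eta"])
    except ValueError:
        errors.append("eta is not an ISO date")
    else:
        if eta < today:
            errors.append("eta is in the past")
    return errors


today = date(2026, 1, 15)  # passed in explicitly so the check is deterministic
print(semantic_errors({"order_id": "A102", "eta": "2026-01-20"}, today))
print(semantic_errors({"order_id": "Z999", "eta": "yesterday"}, today))
print(semantic_errors({"order_id": "A103", "eta": "2026-01-01"}, today))
```

Returning a list of errors instead of raising on the first one makes it easier to log every problem before routing the record to review.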

"The Markdown wrapper trap"

Symptom: Your JSON parser throws a JSONDecodeError even though the output looks like JSON at a glance.

Cause: The model wrapped the JSON in triple backticks with a json label, or added a preamble like "Here is the result:". When you feed the raw output into json.loads(), the extra characters break parsing.

Fix: Treat unparseable or schema-invalid text as a contract failure. For a legacy prompt-only integration, make a bounded retry through a stronger interface or send the item to review; don't silently slice arbitrary text between braces and trust it as the record.

the-markdown-wrapper-trap.py
import json

from pydantic import BaseModel, ValidationError

class DeliveryUpdate(BaseModel):
    order_id: str

def accept_typed_record(raw: str) -> str:
    try:
        payload = json.loads(raw)
        DeliveryUpdate.model_validate(payload)
    except (json.JSONDecodeError, ValidationError):
        return "reject: contract not satisfied"
    return "accept: typed record"

print(accept_typed_record('{"order_id": "A102"}'))
print(accept_typed_record('Here is the JSON: {"order_id": "A102"}'))
print(accept_typed_record("```json\n{\"order_id\": \"A102\"}\n```"))
Output
accept: typed record
reject: contract not satisfied
reject: contract not satisfied

Use structured outputs or a grammar-guided runtime when the system must produce the contract directly rather than repair raw prose.

"A refusal is not a parse error"

Symptom: Your fallback cascade retries a refused request with a looser mode, and the model still refuses. You've spent extra tokens and latency for no gain.

Cause: A refusal or content-filter stop is a policy outcome, not a decoding bug. The model (or the safety layer) has decided not to answer. Loosening the schema doesn't change that decision.

Fix: Surface the refusal to your application layer. Route it to a human reviewer, change the input, or return a polite error to the user. Don't treat refusals as retryable parse failures.[5]

"Grammar-guided decoding is always slow"

Symptom: You've heard constrained decoding adds overhead and assume it's too slow for your use case.

Cause: Naive implementations can be slow, but optimized runtimes reduce the cost substantially. The overhead depends on tokenizer alignment, grammar complexity, and whether you get cache hits.

Fix: Benchmark before deciding. The right question isn't "is there overhead?" It's "where is the overhead, and can I amortize it?" If you process many requests with the same schema, compilation cost may be reused. Compare hosted schema APIs and optimized self-hosted runtimes on your latency and compliance targets; provider-managed enforcement hides implementation work but doesn't guarantee lower latency.[1][3][5]

"JSON mode and structured outputs are the same thing"

Symptom: You enabled JSON mode and assumed the output would match your schema. It returned {"foo": "bar"} when you expected {"name": "string", "age": "integer"}.

Cause: JSON mode enforces valid JSON syntax on successful completion, but not schema compliance. The model might return any valid JSON object.

Fix: Use structured outputs or grammar-guided decoding when you need schema adherence. Use JSON mode only when you need syntactic validity and plan to validate the shape yourself.[5]
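If you do stay on JSON mode, the shape check has to live in your code. The sketch below is a dependency-free illustration of that step for the name/age example; `EXPECTED_TYPES` and `shape_errors` are hypothetical names, and a schema library such as Pydantic would normally replace the hand-written checks.

```python
import json

EXPECTED_TYPES = {"name": str, "age": int}


def shape_errors(raw: str) -> list[str]:
    """Parse JSON-mode output and report every way it misses the expected shape."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return ["not valid JSON"]
    if not isinstance(payload, dict):
        return ["top level is not an object"]
    errors = [f"missing key: {key}" for key in EXPECTED_TYPES if key not in payload]
    errors += [f"unexpected key: {key}" for key in payload if key not in EXPECTED_TYPES]
    errors += [
        f"wrong type for {key}"
        for key, expected in EXPECTED_TYPES.items()
        if key in payload and not isinstance(payload[key], expected)
    ]
    return errors


print(shape_errors('{"foo": "bar"}'))
print(shape_errors('{"name": "Ada", "age": 36}'))
print(shape_errors('{"name": "Ada", "age": "36"}'))
```

The first input is valid JSON and would pass JSON mode, yet it fails all three checks that matter to the consumer.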

"I should use structured outputs for every LLM call"

Symptom: You're wrapping every prompt in a Pydantic model, even for creative writing or open-ended Q&A.

Cause: Over-application of a useful technique. Structured outputs shine when the output feeds into code (APIs, databases, downstream processing). For user-facing text responses, free-form generation is often better.

Fix: Use structured outputs when the consumer is code. Use free-form generation when the consumer is a human. Forcing unnecessary structure wastes tokens on syntax characters and may constrain the model's expressiveness.

Mastery check

Evaluation rubric

  • Foundational: Explain why JSON mode enforces syntax while structured outputs enforce supported schema features.
  • Intermediate: Describe how grammar-guided decoding constrains token sampling to valid next tokens.
  • Advanced: Build a constrained generation pipeline with JSON Schema, Pydantic models, or provider-native structured output helpers.
  • Advanced: Analyze tokenizer alignment, schema compilation, prefix caching, TTFT, and TPOT tradeoffs.
  • Advanced: Design a production pipeline that handles schema validation, refusal states, truncation, repair, and semantic post-validation.

Follow-up questions

Your parser gets valid JSON, but the order ID does not exist in your system. Where do you fix the pipeline?

Fix the pipeline after parsing, not by loosening the schema. The format gate already passed. The failure is semantic: the value does not match real system state. Keep structured outputs for the typed contract, then add application checks for ownership, existence, policy, and world-state freshness before you write to a queue or database.

An assistant sometimes needs to fetch live order status and sometimes only return a typed classification. Which interface should you choose?

Use function calling when the application needs the model to propose a tool action and arguments, including flows where a tool is required. Use structured outputs when every answer should end as a terminal record that your code parses once. Strict tool schemas improve argument shape, but authorization and execution policy still stay in application code.

Your provider rejects a schema with root anyOf and open objects. What should you change?

Treat the provider schema subset as part of the contract. Reshape the schema into a supported form, require every field the provider expects, and close objects with additionalProperties: false when the API requires it. Don't assume "valid JSON Schema" means "accepted by this runtime."
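Closing objects by hand is error-prone on nested schemas, so a small helper can do it mechanically. This is a sketch under stated assumptions: the provider wants `additionalProperties: false` and every property listed in `required` on each object. `close_objects` is a hypothetical helper, and it does not reshape a root `anyOf`, which still needs a manual redesign.

```python
import copy


def close_objects(schema: dict) -> dict:
    """Return a copy where every object is closed and all properties are required."""
    closed = copy.deepcopy(schema)

    def visit(node: object) -> None:
        if isinstance(node, dict):
            if node.get("type") == "object" and "properties" in node:
                node["additionalProperties"] = False
                node["required"] = list(node["properties"])
            for value in node.values():
                visit(value)
        elif isinstance(node, list):
            for item in node:
                visit(item)

    visit(closed)
    return closed


schema = {
    "type": "object",
    "properties": {
        "order_id": {"type": "string"},
        "address": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
        },
    },
}
strict = close_objects(schema)
print(strict["required"], strict["additionalProperties"])
print(strict["properties"]["address"]["required"])
```

Marking everything required changes the meaning of fields that used to be optional, so those usually need to become nullable instead of being silently promoted.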

You need to add a new optional field during rollout. How do you avoid breaking old consumers?

Version the schema in application code and attach that version to each validated record. Then let the backend route records through the matching validator or migration layer. Without explicit versioning, a missing field is ambiguous: it could be an older contract or a broken generation.

A recursive schema works in staging, but latency spikes in production. What should you flatten first?

Flatten the contract before you weaken enforcement. Replace deeply nested trees with smaller objects or a flat list of nodes plus parent_id references. That reduces grammar state, output length, and post-validation complexity while preserving the structure your application needs.

A smaller model satisfies the schema but fills required enum fields with weak guesses. What should you change before loosening constraints?

Keep the structure strict. First simplify the task, split the pipeline into smaller stages, add useful intermediate fields, or upgrade to a stronger model for the hard step. Then add semantic validation so a schema-valid but weak answer does not silently ship downstream.

One strict-schema request is truncated, and another is refused. Which one is retryable?

The truncation case is retryable because it is a length failure: the output hit max_output_tokens before the record was complete. The refusal case is not retryable through a weaker parser because it is a policy result. Handle truncation with a bounded, schema-preserving retry: a larger token budget, a smaller contract, or chunked input. Handle refusal with policy logic, escalation, or a controlled user-facing error.

Common pitfalls

  • JSON parses, but keys or nulls still break downstream code. Cause: JSON mode enforced syntax only. Fix: run schema validation after parsing and reject shape drift before the payload reaches application code.
  • The provider rejects the schema before generation starts. Cause: hosted structured-output APIs usually support only a subset of JSON Schema. Fix: reshape root unions, close objects, and design to the runtime's supported subset.
  • Strict mode keeps failing on surprise keys. Cause: the object stayed open when the provider expected additionalProperties: false. Fix: close the object explicitly and treat unknown keys as contract failures.
  • A weaker model fills required enums with brittle guesses. Cause: values were over-constrained before the task was made easier. Fix: keep structure strict, but simplify the task, add useful intermediate fields, or upgrade the model for the hard step.
  • The record is schema-valid but still wrong. Cause: semantic validation was skipped. Fix: check ownership, existence, dates, policy, and external system state before you route, write, or act.
  • Old consumers break during rollout. Cause: the payload never identified which schema version produced it. Fix: emit or route with an explicit version so validators can distinguish "old contract" from "broken response."
  • Refusals are retried with weaker generation rules. Cause: policy outcomes were treated like parser bugs. Fix: route refusals to policy handling, not raw-text fallbacks.
  • Human-facing prose is wrapped in JSON for no reason. Cause: structured outputs were applied even though no code consumes the answer. Fix: reserve strict contracts for machine-consumed responses.
  • TTFT spikes after a new schema launch. Cause: first-use schema compilation was not accounted for, and nothing warmed the hot path. Fix: reuse stable schemas, warm frequent ones, and measure TTFT separately from TPOT.

Practice: Build a newsletter parser

Here's a concrete exercise to test your understanding. Try it before looking at the solution sketch.

Task: You receive a 500-word newsletter about e-commerce logistics. Build a tool that extracts every mentioned company and classifies the sentiment as positive, neutral, or negative.

Requirements

  1. Define a Pydantic model with two fields: company (str) and sentiment (Literal["positive", "neutral", "negative"]).
  2. Use structured outputs or grammar-guided decoding to enforce the schema.
  3. Handle the case where the model returns no mentions (return an empty list, not a null).
  4. Add a post-validation step that checks the company name against a known set of carriers (e.g., FedEx, UPS, DHL). Flag unknown carriers for review.

Input example

"FedEx announced faster ground delivery this quarter. UPS warned of holiday delays. A new startup, ShipFast, claims to beat both."

Expected output shape

expected-output-shape.json
[
  {"company": "FedEx", "sentiment": "positive"},
  {"company": "UPS", "sentiment": "negative"},
  {"company": "ShipFast", "sentiment": "neutral"}
]

Solution sketch

solution-sketch.py
from typing import Literal
from pydantic import BaseModel, Field

class Mention(BaseModel):
    company: str
    sentiment: Literal["positive", "neutral", "negative"]

class NewsletterAnalysis(BaseModel):
    mentions: list[Mention] = Field(default_factory=list)

KNOWN_CARRIERS = {"FedEx", "UPS", "DHL"}

def unknown_carriers(analysis: NewsletterAnalysis) -> list[str]:
    return [
        mention.company
        for mention in analysis.mentions
        if mention.company not in KNOWN_CARRIERS
    ]

analysis = NewsletterAnalysis.model_validate({
    "mentions": [
        {"company": "FedEx", "sentiment": "positive"},
        {"company": "UPS", "sentiment": "negative"},
        {"company": "ShipFast", "sentiment": "neutral"},
    ]
})

empty = NewsletterAnalysis()
print("mentions:", len(analysis.mentions))
print("unknown carriers:", unknown_carriers(analysis))
print("empty mentions:", empty.mentions)
Output
mentions: 3
unknown carriers: ['ShipFast']
empty mentions: []

Key design decisions:

  • Use a list, not a nullable field: mentions: list[Mention] with default_factory=list is safer than a nullable list because the schema can enforce the list shape even when empty.
  • Post-validate sentiment: The schema ensures the value is one of the three literals, but it can't ensure the sentiment is factually correct. Add a second-pass check for high-stakes classifications.
  • Handle truncation: If the newsletter is long and the output hits max_output_tokens, use a bounded schema-preserving retry, chunk the input, or explicitly version a smaller contract.
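The chunking option from the last bullet could look roughly like this. It is a sketch with two hypothetical helpers: `chunk_sentences` splits naively on ". " (a real pipeline would use a proper sentence splitter), and `merge_mentions` keeps the first mention of each company, which is one possible policy rather than the only sensible one.

```python
def chunk_sentences(text: str, max_words: int) -> list[str]:
    """Greedily pack whole sentences into chunks of roughly max_words words."""
    chunks, current, count = [], [], 0
    for sentence in text.split(". "):
        words = len(sentence.split())
        # Start a new chunk when adding this sentence would exceed the budget.
        # A single sentence longer than max_words still becomes its own chunk.
        if current and count + words > max_words:
            chunks.append(". ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += words
    if current:
        chunks.append(". ".join(current))
    return chunks


def merge_mentions(per_chunk: list[list[dict]]) -> list[dict]:
    """Combine per-chunk results, keeping the first mention of each company."""
    merged = {}
    for mentions in per_chunk:
        for mention in mentions:
            merged.setdefault(mention["company"], mention)
    return list(merged.values())


text = (
    "FedEx announced faster ground delivery this quarter. "
    "UPS warned of holiday delays. "
    "A new startup, ShipFast, claims to beat both."
)
chunks = chunk_sentences(text, max_words=12)
print(len(chunks), "chunks")

merged = merge_mentions([
    [{"company": "FedEx", "sentiment": "positive"}, {"company": "UPS", "sentiment": "negative"}],
    [{"company": "UPS", "sentiment": "neutral"}, {"company": "ShipFast", "sentiment": "neutral"}],
])
print([m["company"] for m in merged])
```

Each chunk is sent with the same schema, so the contract is preserved. If chunks disagree on a company's sentiment, flagging the conflict for review may be safer than picking the first value.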

What this unlocks next

You now understand how to move from "asking for JSON" to enforcing structured output at the token level. You can choose the right enforcement tier for your use case, debug the most common structured-generation failures, and build recovery flows that retry truncation without weakening the contract or wasting tokens on unrecoverable refusals.

Next Step
Continue to ReAct & Plan-and-Execute

There, you'll compare the two core agent control loops: ReAct for tightly coupled tool use, and Plan-and-Execute for longer workflows with explicit planning and replanning. The structured outputs you learned here become the data contracts that feed into those agent loops.

References

[1] Willard, B. T. & Louf, R. (2023). Efficient Guided Generation for Large Language Models. arXiv preprint.
[2] Outlines Developers (2026). Outlines Documentation.
[3] Zheng, L., et al. (2023). SGLang: Efficient Execution of Structured Language Model Programs.
[4] SGLang Project (2026). SGLang Structured Outputs Documentation.
[5] OpenAI (2024). Structured outputs.
[6] Gerganov, G. (2023). llama.cpp: Inference of LLaMA model in pure C/C++.
[7] ggml-org (2026). llama.cpp Grammars Documentation.
[8] Dong, Y., Ruan, C. F., Cai, Y., et al. (2024). XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models.
[9] Tam, Z. R., Wu, C.-K., Tsai, Y.-L., et al. (2024). Let Me Speak Freely? A Study on the Impact of Format Restrictions on Performance of Large Language Models.