Build reliable LLM interfaces with JSON mode, structured outputs, schema validation, and grammar-guided decoding.
The previous lesson secured which documents can enter a prompt. This lesson secures what leaves the model: typed data that downstream code can parse, validate, and route without fragile string cleanup.
Structured output generation turns free-form model text into typed records that software can validate and route. This article explains why schemas, constrained decoding, and explicit recovery paths matter whenever LLM output feeds another system.
Imagine an e-commerce warehouse that processes customer emails automatically. A customer writes, "I want to return the blue jacket I bought last week. My order number is A102." Your pipeline needs to extract the order ID and the intent so it can route the message to the returns department and look up the order in the database.
You ask an LLM to extract this information and return JSON. If the model replies with, "Sure! Here is the extracted data: { 'order_id': 'A102', 'intent': 'return' }", your parser crashes because of the preamble text. That single failure blocks an entire automation pipeline.
When LLM output feeds software, it must be machine-parseable and validated. A missing comma, an unexpected field name, or a polite introduction can break downstream processing. Structured output generation is the set of techniques that make model outputs conform to a defined schema, replacing fragile string cleanup.
This article covers the practical range from best-effort JSON mode to -level constraint enforcement via grammar-guided decoding [1], open-source runtimes like Outlines and SGLang [2][3][4], and hosted patterns using OpenAI's Structured Outputs feature.[5]
The illustration above shows how a grammar engine filters the model's next-token probabilities. Invalid tokens (like a word when a number is required) are masked to negative infinity, so the sampler can only pick grammar-compliant tokens.
Before looking at solutions, let's see how naive prompting fails. Suppose you send this prompt to a chat model:
1prompt = """
2Extract order_id and intent from this email:
3"I'd like to return order A102. The blue jacket doesn't fit."
4Return JSON.
5"""Here are three real ways this can go wrong:
| Failure mode | Example output | Why it breaks |
|---|---|---|
| Preamble text | "Here is the JSON: { 'order_id': 'A102' }" | The parser sees Here before the brace and throws |
| Markdown wrapper | "json\n{ 'order_id': 'A102' }\n" | The triple backticks and newlines are not valid JSON |
| Wrong shape | { "order_id": "A102", "intent": null } | Your database expects intent to be a string, not null |
None of these are model "hallucinations" in the usual sense. The model followed a weak natural-language request. The problem is that your request was underspecified. You asked for JSON, but you didn't enforce JSON.
The fix is to move from asking to enforcing. The rest of this article shows the enforcement tiers, from weakest to strongest.
Engineers need to choose both the contract exposed to application code and the mechanism that enforces it. These choices aren't a strict ladder: a hosted structured-output API may enforce a JSON Schema using constrained decoding internally, while a self-hosted runtime may expose grammar-guided decoding directly. Production machine-consumed output needs a checked contract, regardless of where enforcement runs.
| Choice | Structural contract | Typical implementation |
|---|---|---|
| Prompt-based ("respond in JSON") | None; best-effort formatting only | Most chat APIs |
| JSON mode | Valid JSON syntax on successful completion, with edge cases to handle | OpenAI JSON mode, similar API flags |
| Strict | Schema-constrained arguments for an action request | Tool-calling APIs and agent runtimes |
| Structured-output API | Schema adherence for supported schemas | OpenAI Structured Outputs, schema-aware SDKs |
| Grammar-guided runtime | Valid path through a supplied grammar or schema translation | Outlines, SGLang, llama.cpp |
To see how these differ in practice, imagine the same returns email. Here's what each tier gives you:
Prompt-based: The model might return valid JSON. It might return a monologue. You need a regex to clean the output, then a JSON parser, then a schema validator, then a retry loop. This is best-effort formatting, not enforcement.
JSON mode: The API constrains successful completions to valid JSON syntax. You still need to explicitly tell the model to produce JSON, detect incomplete outputs, and validate the shape yourself. JSON mode could still return {"foo": "bar"} when you wanted {"order_id": "string", "intent": "string"}. JSON mode is syntax enforcement, not schema enforcement.[5]
Structured-output APIs: The API takes a JSON Schema or Pydantic model and, when the request completes without refusal or truncation, returns output that matches supported schema features. The fields, types, and required properties are enforced within that supported subset.[5]
Grammar-guided decoding: A runtime prevents the sampler from picking tokens that would break its compiled grammar. At every step, invalid tokens have their probabilities set to zero. This describes an enforcement mechanism inside generation, not a stronger semantic guarantee: a grammar-valid payload can still contain wrong values.[1]
This executable comparison demonstrates the remaining gap after JSON syntax succeeds: an object can parse correctly and still fail its application schema.
1import json
2
3from pydantic import BaseModel, ConfigDict, ValidationError
4
5class DeliveryUpdate(BaseModel):
6 model_config = ConfigDict(extra="forbid")
7
8 order_id: str
9 intent: str
10
11syntax_valid_but_wrong_shape = json.loads('{"foo": "bar"}')
12schema_valid = json.loads('{"order_id": "A102", "intent": "return"}')
13
14try:
15 DeliveryUpdate.model_validate(syntax_valid_but_wrong_shape)
16except ValidationError:
17 print("JSON-only result: rejected by schema")
18
19print("structured record:", DeliveryUpdate.model_validate(schema_valid).model_dump())1JSON-only result: rejected by schema
2structured record: {'order_id': 'A102', 'intent': 'return'}Imagine a package moving through a fulfillment lane with no guides. It might reach the right chute, or it might drift into the wrong bin. This is prompt-based generation: you ask the model to "please produce JSON," but nothing stops it from outputting a monologue instead.
Grammar-guided decoding is like a guided sortation lane. Rails prevent the package from entering an invalid chute. In the context of an LLM, the rails are a finite state machine (FSM, a computational model that tracks transitions between allowed states) that monitors generation. If the schema requires an integer for quantity, the FSM blocks all non-digit tokens. The decoder can't select a letter token that violates the grammar, the same way a package can't leave the allowed lane.
To understand how runtimes enforce structural compliance, start by examining the token generation process itself. By intercepting the token sampling phase, the runtime restricts selection to tokens that match the desired schema. The following diagram illustrates how grammar constraints act as a filter during the token generation cycle:
Grammar-guided enforcement operates at the token sampling level. It transforms a schema or grammar into constraints that modify the model's output probabilities during generation.
{"type": "object", "properties": {"age": {"type": "integer"}}}\{"age":\s*[0-9]+\}{"age": .1, 2, 3, ... 9, 0, (space).", a, b, {, [, etc.softmax is applied, their probability becomes 0.The toy code below shows the core masking idea without requiring a model checkpoint. The grammar state says that only { is legal as the next token, so every other candidate is assigned negative infinity before sampling:
1def mask_invalid_tokens(logits: dict[str, float], valid_tokens: set[str]) -> dict[str, float]:
2 return {
3 token: score if token in valid_tokens else float("-inf")
4 for token, score in logits.items()
5 }
6
7next_token_logits = {
8 "The": 0.45,
9 "{": 0.30,
10 "Sure": 0.15,
11 "JSON": 0.10,
12}
13
14masked = mask_invalid_tokens(next_token_logits, valid_tokens={"{"})
15chosen = max(masked, key=masked.get)
16valid_after_mask = [token for token, score in masked.items() if score != float("-inf")]
17print("valid after mask:", valid_after_mask)
18print("chosen token:", chosen)1valid after mask: ['{']
2chosen token: {This enforces structural adherence to the allowed grammar as long as generation can continue normally. In production, you still need to handle refusals, truncation, and semantic mistakes in the returned values.[5]
The clean DFA story above hides the hardest systems detail. Grammars are usually written over characters or bytes, but LLMs don't sample characters. They sample subword tokens. That means the runtime must answer a much harder question than "is } valid here?" It must answer "which of the 100,000+ vocabulary items could extend some valid character prefix from this state?" [1]
That token-to-grammar alignment step is where a lot of the real engineering work lives. Outlines precomputes an index used during guided generation, while SGLang's paper explicitly calls out compressed finite state machines for faster structured decoding.[1][3] If an implementation does this naively, masking can become a noticeable part of per-token latency.
Hosted APIs can hide some of that machinery, but not its cost. OpenAI's docs note this first-schema latency directly: the first request with any new schema can be slower while the API processes the schema, and later requests with the same schema reuse that work. A separate note adds extra first-request latency for fine-tuned models specifically.[5]
When working with flat data structures, regular expressions (Regex) compiled to DFAs work well. But what if your JSON object contains lists of other objects, or deeply nested dictionaries?
Regular-language constraints can't represent arbitrarily nested structures like general JSON. To enforce nested or recursive schemas, engines need a Context-Free Grammar (CFG) or another stack-aware parser representation. Such representations track open brackets [ and { until matching closing brackets ] and }. Modern libraries handle translation automatically; benchmark nested schemas because parser state and output size can affect latency.
While hosted Structured Outputs APIs are convenient, open-source runtimes give you direct control over constrained decoding for self-hosted models. For teams deploying open-weight models, managed API features may not fit. Instead, they use inference engines or libraries that support constrained generation natively and can cache compiled grammars or shared prefixes inside the serving stack.
Outlines [1][2] provides a high-level Python interface for structured generation. In current releases, you build an Outlines model once, then pass the target type when you call it. The model wrapper reuses the tokenizer and generation machinery across calls, while the schema stays explicit at the call site.
This is an integration example, not a no-dependency local script. It requires outlines, transformers, pydantic, a compatible model backend, and enough local compute to load the chosen model.
1from typing import Literal
2
3import outlines
4from pydantic import BaseModel
5from transformers import AutoModelForCausalLM, AutoTokenizer
6
7class Character(BaseModel):
8 name: str
9 role: Literal["Warrior", "Mage", "Rogue"]
10 level: int
11
12model_name = "microsoft/Phi-3-mini-4k-instruct"
13model = outlines.from_transformers(
14 AutoModelForCausalLM.from_pretrained(model_name, device_map="auto"),
15 AutoTokenizer.from_pretrained(model_name),
16)
17
18raw = model(
19 "Create a level-5 fantasy RPG character.",
20 output_type=Character,
21 max_new_tokens=120,
22)
23character = Character.model_validate_json(raw)Enums and required fields can be enforced during decoding, but business rules still belong in post-validation. For example, if level must be between 1 and 100, validate that in your application even if the grammar already narrows the shape.
llama.cpp exposes grammar constraints directly through GBNF and can also convert a subset of JSON Schema into grammars for its server and CLI flows.[6][7] One subtle but important detail from its docs is that the schema is used to constrain decoding, not automatically to teach the model what the fields mean. For plain structured generation, you still want prompt instructions that explain the task and the semantics of the fields you're asking for.[7]
SGLang (Structured Generation Language) [3][4] goes further by optimizing the runtime for structured workloads. Its paper describes two separate ideas that are easy to conflate:
Under the hood, SGLang stores token-prefix mappings in a radix tree and reuses previously computed KV cache when a new request shares a prompt prefix.[3] That helps Time To First Token (TTFT) when many requests reuse the same system prompt, few-shot examples, or schema instructions. It's a prefix-caching optimization, not a grammar-state cache.
SGLang also provides a domain-specific language (DSL, a specialized syntax for a particular application domain) for interleaving Python control flow with LLM generation. Its structured-output documentation exposes JSON Schema, regex, and grammar constraints; its runtime design separately addresses cache reuse.[4][3]
XGrammar [8] is a grammar engine built specifically to make context-free-grammar decoding cheap enough for production serving. It attacks the tokenizer-alignment cost from the previous section head on. The core idea is to split the vocabulary into context-independent tokens, whose validity can be precomputed regardless of the parser stack, and context-dependent tokens, which must be checked at runtime against the current stack. A persistent stack and overlap with GPU execution then shrink the per-token mask cost further.[8]
The paper reports up to 100x faster grammar processing than evaluated prior approaches and near-zero structured-generation overhead in its end-to-end serving experiments.[8] Serving stacks can expose engines such as XGrammar behind a higher-level structured-output interface; verify which backend and benchmark settings your deployment actually uses.
Both techniques constrain model outputs, but they solve different orchestration problems in an AI system. It's common to blur them together because both can involve JSON schemas. Some APIs also allow strict tool-argument schemas, but the core distinction still holds: function calling is about selecting actions, while structured outputs are about returning data in a fixed contract.
| Aspect | Function Calling | Structured Outputs |
|---|---|---|
| Primary Goal | Action Selection (Tool Use) | Data Extraction / Formatting |
| Trigger | Application may allow or require a tool call; model supplies arguments | Application requests a format for the returned record |
| Output | Function name + arguments | Arbitrary JSON object |
| Control Flow | Loop: Model -> Code -> Model | Linear: Model -> Parser -> App |
Use function calling when the model is an agent that needs to interact with the world. In a typical tool loop, the application exposes allowed tools and may let the model choose one or require a specific call. The application executes an accepted call, then returns its result to the model in a subsequent turn. This allows the agent to formulate a final user-facing response based on fetched data.
Here is an example of defining a tool for an agent to fetch order status. If tool choice is left automatic, the model can either call the function with structured arguments or return a normal text response:
1tools = [
2 {
3 "type": "function",
4 "name": "get_order_status",
5 "description": "Fetch current fulfillment status for an order.",
6 "strict": True,
7 "parameters": {
8 "type": "object",
9 "properties": {
10 "order_id": {"type": "string"}
11 },
12 "required": ["order_id"],
13 "additionalProperties": False
14 }
15 }
16]
17
18tool_schema = tools[0]
19example_routes = {
20 "Where is order A102?": "get_order_status",
21 "Hi!": "text_response",
22}
23print("tool:", tool_schema["name"])
24print("strict:", tool_schema["strict"])
25print("routes:", example_routes)1tool: get_order_status
2strict: True
3routes: {'Where is order A102?': 'get_order_status', 'Hi!': 'text_response'}If your provider supports strict tool schemas, enable them. That improves argument reliability, but the control flow is still tool invocation rather than terminal data extraction. On OpenAI, manual strict tool schemas should set strict: true, list every parameter in required, and close objects with additionalProperties: false. parallel_tool_calls=false is useful when your application expects zero or one call, but that's an orchestration choice, not the definition of strictness.[5]
Use structured outputs when you need to extract data or ensure a reliable interface between the LLM and your code. Unlike the multi-turn loop of function calling, structured outputs are typically terminal, single-turn actions used purely for data formatting and strict schema adherence.
In this example, we define a Pydantic model for a delivery update and pass it directly to OpenAI's parsing helper. The SDK handles the JSON schema conversion for you, and output_parsed gives you a typed object back when the model succeeds.[5] This snippet requires the OpenAI Python SDK and an OPENAI_API_KEY.
1from openai import OpenAI
2from pydantic import BaseModel
3
4client = OpenAI()
5
6class DeliveryUpdate(BaseModel):
7 order_id: str
8 status: str
9 eta: str
10
11response = client.responses.parse(
12 model="gpt-4o-mini",
13 input="Order A102 is delayed and now expected Friday.",
14 text_format=DeliveryUpdate,
15)
16
17update = response.output_parsedOne provider-specific gotcha: hosted structured-output APIs usually implement a subset of JSON Schema, not the full spec. On OpenAI, that means root-level anyOf is not allowed, every field must be required, and every object must opt into closed-world generation with additionalProperties: false.[5] Treat schema design as part of your API integration, not only a prompt-writing detail.
The next example keeps these gates separate. Pydantic checks shape and enum values; application code checks whether the order belongs to the authenticated customer.
1from typing import Literal
2
3from pydantic import BaseModel, ConfigDict
4
5class DeliveryUpdate(BaseModel):
6 model_config = ConfigDict(extra="forbid")
7
8 order_id: str
9 status: Literal["processing", "shipped", "delayed"]
10
11orders_by_customer = {"customer-7": {"A102"}}
12
13def can_show_update(customer_id: str, update: DeliveryUpdate) -> bool:
14 return update.order_id in orders_by_customer.get(customer_id, set())
15
16parsed = DeliveryUpdate.model_validate({"order_id": "A102", "status": "delayed"})
17print("format gate:", parsed.model_dump())
18print("owner allowed:", can_show_update("customer-7", parsed))
19print("other customer allowed:", can_show_update("customer-8", parsed))1format gate: {'order_id': 'A102', 'status': 'delayed'}
2owner allowed: True
3other customer allowed: FalseMoving from a prototype that occasionally outputs valid JSON to a production system that processes thousands of requests reliably requires defensive engineering. The following patterns address the most common failure modes and lifecycle challenges associated with structured output generation.
A subtle but important pattern: strict structured output can hurt task quality on some hard problems when the schema is too tight. A controlled study found measurable reasoning-accuracy reductions under format restrictions on evaluated tasks such as GSM8K, with outcomes affected by format and field ordering.[9]
Don't respond by storing unrestricted chain-of-thought. Instead, design product-visible intermediate fields that can be checked: cited evidence, extracted quantities, calculation steps needed by a tutor, or a reason code used by a reviewer. If a workflow genuinely needs those fields, place them before the final decision and validate them independently.
1from typing import Literal
2
3from pydantic import BaseModel, model_validator
4
5class ReturnDecision(BaseModel):
6 evidence_quote: str
7 order_id: str
8 intent: Literal["return", "exchange", "unknown"]
9
10 @model_validator(mode="after")
11 def cited_order_appears_in_evidence(self) -> "ReturnDecision":
12 if self.order_id not in self.evidence_quote:
13 raise ValueError("order ID is not supported by evidence")
14 return self
15
16decision = ReturnDecision.model_validate(
17 {
18 "evidence_quote": "I want to return the blue jacket from order A102.",
19 "order_id": "A102",
20 "intent": "return",
21 }
22)
23print("decision:", decision.intent)
24print("evidence checked:", decision.order_id in decision.evidence_quote)1decision: return
2evidence checked: TrueThis is not about recovering a hidden "true" chain of thought. The contract exposes evidence that a product or reviewer can inspect when the final decision is wrong.
A common pitfall is forcing a difficult decision into a final field with no checkable support. For extraction, tutoring, or multi-step workflows, explicit evidence or bounded calculation fields can make the result easier to validate before code acts on it.
The schema below follows the same pattern as OpenAI's math-tutor examples: return a bounded list of explicit steps plus the final answer.
1from pydantic import BaseModel
2
3class Step(BaseModel):
4 explanation: str
5 output: str
6
7class MathSolution(BaseModel):
8 steps: list[Step]
9 final_answer: str
10
11# Use this when the intermediate fields are valuable to the application.
12# Keep them intentional and bounded rather than dumping unstructured prose.
13Schemas change. If producers and consumers disagree about a payload version, a rollout can break downstream processing even when each individual response is valid JSON.
Production tip: Version your schemas in application code. Choose the validator before sending the request, then attach the corresponding version to the validated record or require a fixed version literal and verify it. Don't let the model choose which contract it claims to satisfy.
1from typing import Literal
2
3from pydantic import BaseModel
4
5class DeliveryV1(BaseModel):
6 order_id: str
7 status: str
8
9class DeliveryV2(BaseModel):
10 order_id: str
11 status: str
12 carrier: str
13
14class VersionedDelivery(BaseModel):
15 schema_version: Literal["2"]
16 payload: DeliveryV2
17
18generated_payload = {"order_id": "A102", "status": "shipped", "carrier": "UPS"}
19validated = DeliveryV2.model_validate(generated_payload)
20wire_record = VersionedDelivery(schema_version="2", payload=validated)
21print("version:", wire_record.schema_version)
22print("carrier:", wire_record.payload.carrier)1version: 2
2carrier: UPSEven with structured outputs, failures can happen. The important distinction is between recoverable failures (for example, truncation from max_output_tokens) and policy outcomes such as refusals or content filtering. Don't treat a refusal as an ordinary parse failure and then retry with a looser mode.[5]
Truncation doesn't justify switching from a schema-enforced response to JSON mode: the missing content is still missing and the fallback loses schema enforcement. Prefer a bounded retry with more output budget, a smaller contract, or chunked input. The fake client below mirrors response states so you can unit-test that control flow without an API call:
1from dataclasses import dataclass
2
3from pydantic import BaseModel
4
5class DeliveryUpdate(BaseModel):
6 order_id: str
7 status: str
8 eta: str
9
10@dataclass
11class ContentItem:
12 type: str
13 refusal: str | None = None
14
15@dataclass
16class MessageOutput:
17 content: list[ContentItem]
18
19@dataclass
20class IncompleteDetails:
21 reason: str
22
23@dataclass
24class FakeResponse:
25 status: str
26 output: list[MessageOutput]
27 output_parsed: DeliveryUpdate | None = None
28 incomplete_details: IncompleteDetails | None = None
29
30class FakeResponsesApi:
31 def __init__(self, responses: list[FakeResponse]) -> None:
32 self.responses = iter(responses)
33
34 def parse(self, **kwargs) -> FakeResponse:
35 return next(self.responses)
36
37class FakeClient:
38 def __init__(self, responses: list[FakeResponse]) -> None:
39 self.responses = FakeResponsesApi(responses)
40
41def generate_with_bounded_retry(client: FakeClient, input_text: str) -> DeliveryUpdate:
42 response = client.responses.parse(
43 model="gpt-4o-mini",
44 input=input_text,
45 text_format=DeliveryUpdate,
46 max_output_tokens=120,
47 )
48
49 first_content = response.output[0].content[0]
50
51 if first_content.type == "refusal":
52 raise RuntimeError(f"policy refusal: {first_content.refusal}")
53
54 if response.status == "completed" and response.output_parsed is not None:
55 return response.output_parsed
56
57 if response.status != "incomplete":
58 raise RuntimeError(f"Unexpected response status: {response.status}")
59
60 if response.incomplete_details is None:
61 raise RuntimeError("Incomplete response did not include a reason")
62
63 if response.incomplete_details.reason != "max_output_tokens":
64 raise RuntimeError(
65 f"Structured output halted: {response.incomplete_details.reason}"
66 )
67
68 retry = client.responses.parse(
69 model="gpt-4o-mini",
70 input=input_text,
71 text_format=DeliveryUpdate,
72 max_output_tokens=400,
73 )
74 if retry.status != "completed" or retry.output_parsed is None:
75 raise RuntimeError("bounded retry did not produce a structured result")
76 return retry.output_parsed
77
78output_item = MessageOutput(content=[ContentItem(type="output_text")])
79incomplete = FakeResponse(
80 status="incomplete",
81 output=[output_item],
82 incomplete_details=IncompleteDetails(reason="max_output_tokens"),
83)
84completed = FakeResponse(
85 status="completed",
86 output=[output_item],
87 output_parsed=DeliveryUpdate(order_id="A102", status="delayed", eta="Friday"),
88)
89update = generate_with_bounded_retry(
90 FakeClient([incomplete, completed]),
91 "Order A102 is delayed and now expected Friday.",
92)
93print("retry preserved contract:", update.model_dump())
94
95refused = FakeResponse(
96 status="completed",
97 output=[MessageOutput(content=[ContentItem(type="refusal", refusal="blocked")])],
98)
99try:
100 generate_with_bounded_retry(FakeClient([refused]), "disallowed request")
101except RuntimeError as exc:
102 print("refusal routed:", str(exc))1retry preserved contract: {'order_id': 'A102', 'status': 'delayed', 'eta': 'Friday'}
2refusal routed: policy refusal: blocked
Deeply nested or recursive schemas can increase grammar state, output length, and debugging complexity. Some hosted APIs support recursive schemas, but support doesn't establish acceptable latency for your workload.[5] Benchmark nested contracts on the target runtime.
Production tip: Keep nesting as shallow as your interface allows, bound recursive outputs, and benchmark the exact schema on your target runtime. If you need tree-shaped data, a flat list of nodes with
parent_idreferences is often easier to generate, validate, and evolve than a deeply recursive JSON object.
1from pydantic import BaseModel
2
3class FlatNode(BaseModel):
4 node_id: str
5 parent_id: str | None
6 label: str
7
8nodes = [
9 FlatNode(node_id="root", parent_id=None, label="shipment"),
10 FlatNode(node_id="n1", parent_id="root", label="carrier"),
11 FlatNode(node_id="n2", parent_id="root", label="ETA"),
12]
13children: dict[str | None, list[str]] = {}
14for node in nodes:
15 children.setdefault(node.parent_id, []).append(node.label)
16
17print("root nodes:", children[None])
18print("shipment fields:", children["root"])1root nodes: ['shipment']
2shipment fields: ['carrier', 'ETA']Grammar-guided decoding performs valid-token work during generation, although optimized engines may hide or greatly reduce observable overhead for particular workloads. Schema compliance doesn't make latency irrelevant. Measure on the model, schema, batch shape, and runtime you plan to ship.
The latency impact depends on the schema, tokenizer, and runtime design. The main cost centers are:
| Cost source | Why it appears | Common mitigation |
|---|---|---|
| Grammar compilation | The runtime has to convert a schema or grammar into an indexable guide | Compile once and reuse it across requests |
| Per-token masking | Each generation step must compute the valid next-token set | Precompute token-prefix tables, compressed FSMs, or categorize context-independent tokens (XGrammar) |
| First use of a hosted schema | The provider may preprocess and cache a new schema before generation starts | Reuse stable schemas and warm hot paths ahead of time |
| Large prompt prefixes | Long schema instructions still consume prefill work and context | Use server-side structured outputs or prefix caching |
This is why interviewers often ask about TTFT versus TPOT. Compilation, hosted schema preprocessing, and large prompt prefixes mostly affect TTFT. Token masking affects TPOT. Systems such as Outlines, SGLang, and XGrammar focus on reducing those costs with precomputation, token categorization, and cache reuse rather than ignoring the cost.[1][3][8][5]
Constraining the output space can affect generation quality because a grammar eliminates paths outside the contract. The effect depends on task, model, and schema; format-restriction studies show it can be measurable on reasoning tasks.[9]
Production tip: Keep your schemas semantically permissive, but structurally stable. Prefer a stable field set with nullable values or bounded enums over a maze of branching object variants. If your provider uses strict mode, supported-schema limits and closed-object requirements become part of the interface contract.[5]
When processing multiple requests with the same schema (common in data extraction pipelines), the FSM can be compiled once and reused across requests. This avoids repeated per-request compilation work when the runtime supports that reuse.
The code below demonstrates the same batching pattern using current Outlines calls. You build the model wrapper once at startup, then reuse it across prompts with the same target type:
1# Step 1: Build the Outlines model once at startup
2# model = outlines.from_transformers(...)
3
4# Step 2: Reuse the model with the same schema for each prompt
5results = [
6 ExtractedEntity.model_validate_json(
7 model(prompt, output_type=ExtractedEntity, max_new_tokens=120)
8 )
9 for prompt in batch_prompts
10]SGLang takes this further with prefix-aware KV cache reuse. If multiple requests share the same prompt prefix, the runtime can reuse prefetched state instead of rebuilding it from scratch.[3]
Even with structured outputs, things go wrong. The difference between a prototype and a production system is knowing what failure looks like, why it happens, and how to fix it. This section turns the most common misconceptions into a debugging guide.
Symptom: Your pipeline parses the output successfully, but the values are nonsense. A delivery ETA reads "yesterday" for a package that hasn't shipped yet. A product ID doesn't exist in your catalog.
Cause: Structured outputs enforce format, not accuracy. The model can produce valid JSON with correct types while the values are still fabricated. A {"eta": "yesterday"} value satisfies the schema but is wrong for the actual order record.
Fix: Add application-layer semantic validation. Check that dates are in the future, that order IDs exist in your database, and that enum values match your known set. Use Pydantic validators or plain Python assertions after parsing.
Symptom: Your JSON parser throws a JSONDecodeError even though the output looks like JSON at a glance.
Cause: The model wrapped the JSON in triple backticks with a json label, or added a preamble like "Here is the result:". When you feed the raw output into json.loads(), the extra characters break parsing.
Fix: Treat unparseable or schema-invalid text as a contract failure. For a legacy prompt-only integration, make a bounded retry through a stronger interface or send the item to review; don't silently slice arbitrary text between braces and trust it as the record.
1import json
2
3from pydantic import BaseModel, ValidationError
4
5class DeliveryUpdate(BaseModel):
6 order_id: str
7
8def accept_typed_record(raw: str) -> str:
9 try:
10 payload = json.loads(raw)
11 DeliveryUpdate.model_validate(payload)
12 except (json.JSONDecodeError, ValidationError):
13 return "reject: contract not satisfied"
14 return "accept: typed record"
15
16print(accept_typed_record('{"order_id": "A102"}'))
17print(accept_typed_record('Here is the JSON: {"order_id": "A102"}'))
18print(accept_typed_record("```json\n{\"order_id\": \"A102\"}\n```"))1accept: typed record
2reject: contract not satisfied
3reject: contract not satisfiedUse structured outputs or a grammar-guided runtime when the system must produce the contract directly rather than repair raw prose.
Symptom: Your fallback cascade retries a refused request with a looser mode, and the model still refuses. You've spent extra tokens and latency for no gain.
Cause: A refusal or content-filter stop is a policy outcome, not a decoding bug. The model (or the safety layer) has decided not to answer. Loosening the schema doesn't change that decision.
Fix: Surface the refusal to your application layer. Route it to a human reviewer, change the input, or return a polite error to the user. Don't treat refusals as retryable parse failures.[5]
Symptom: You've heard constrained decoding adds overhead and assume it's too slow for your use case.
Cause: Naive implementations can be slow, but optimized runtimes reduce the cost a lot. The overhead depends on tokenizer alignment, grammar complexity, and whether you get cache hits.
Fix: Benchmark before deciding. The right question isn't "is there overhead?" It's "where is the overhead, and can I amortize it?" If you process many requests with the same schema, compilation cost may be reused. Compare hosted schema APIs and optimized self-hosted runtimes on your latency and compliance targets; provider-managed enforcement hides implementation work but doesn't guarantee lower latency.[1][3][5]
Symptom: You enabled JSON mode and assumed the output would match your schema. It returned {"foo": "bar"} when you expected {"name": "string", "age": "integer"}.
Cause: JSON mode enforces valid JSON syntax on successful completion, but not schema compliance. The model might return any valid JSON object.
Fix: Use structured outputs or grammar-guided decoding when you need schema adherence. Use JSON mode only when you need syntactic validity and plan to validate the shape yourself.[5]
Symptom: You're wrapping every prompt in a Pydantic model, even for creative writing or open-ended Q&A.
Cause: Over-application of a useful technique. Structured outputs shine when the output feeds into code (APIs, databases, downstream processing). For user-facing text responses, free-form generation is often better.
Fix: Use structured outputs when the consumer is code. Use free-form generation when the consumer is a human. Forcing unnecessary structure wastes tokens on syntax characters and may constrain the model's expressiveness.
Fix the pipeline after parsing, not by loosening the schema. The format gate already passed. The failure is semantic: the value does not match real system state. Keep structured outputs for the typed contract, then add application checks for ownership, existence, policy, and world-state freshness before you write to a queue or database.
Use function calling when the application needs the model to propose a tool action and arguments, including flows where a tool is required. Use structured outputs when every answer should end as a terminal record that your code parses once. Strict tool schemas improve argument shape, but authorization and execution policy still stay in application code.
anyOf and open objects. What should you change?Treat the provider schema subset as part of the contract. Reshape the schema into a supported form, require every field the provider expects, and close objects with additionalProperties: false when the API requires it. Don't assume "valid JSON Schema" means "accepted by this runtime."
Version the schema and emit that version in the payload. Then let the backend route records through the matching validator or migration layer. Without explicit versioning, a missing field is ambiguous: it could be an older contract or a broken generation.
Flatten the contract before you weaken enforcement. Replace deeply nested trees with smaller objects or a flat list of nodes plus parent_id references. That reduces grammar state, output length, and post-validation complexity while preserving the structure your application needs.
Keep the structure strict. First simplify the task, split the pipeline into smaller stages, add useful intermediate fields, or upgrade to a stronger model for the hard step. Then add semantic validation so a schema-valid but weak answer does not silently ship downstream.
The truncation case is retryable because it is a transport or length failure. The refusal case is not retryable through a weaker parser because it is a policy result. Handle truncation with a bounded fallback or a larger token budget. Handle refusal with policy logic, escalation, or a controlled user-facing error.
additionalProperties: false. Fix: close the object explicitly and treat unknown keys as contract failures.Here's a concrete exercise to test your understanding. Try it before looking at the solution sketch.
Task: You receive a 500-word newsletter about e-commerce logistics. Build a tool that extracts every mentioned company and classifies the sentiment as positive, neutral, or negative.
company (str) and sentiment (Literal["positive", "neutral", "negative"])."FedEx announced faster ground delivery this quarter. UPS warned of holiday delays. A new startup, ShipFast, claims to beat both."
1[
2 {"company": "FedEx", "sentiment": "positive"},
3 {"company": "UPS", "sentiment": "negative"},
4 {"company": "ShipFast", "sentiment": "neutral"}
5]1from typing import Literal
2from pydantic import BaseModel, Field
3
4class Mention(BaseModel):
5 company: str
6 sentiment: Literal["positive", "neutral", "negative"]
7
8class NewsletterAnalysis(BaseModel):
9 mentions: list[Mention] = Field(default_factory=list)
10
11KNOWN_CARRIERS = {"FedEx", "UPS", "DHL"}
12
13def unknown_carriers(analysis: NewsletterAnalysis) -> list[str]:
14 return [
15 mention.company
16 for mention in analysis.mentions
17 if mention.company not in KNOWN_CARRIERS
18 ]
19
20analysis = NewsletterAnalysis.model_validate({
21 "mentions": [
22 {"company": "FedEx", "sentiment": "positive"},
23 {"company": "UPS", "sentiment": "negative"},
24 {"company": "ShipFast", "sentiment": "neutral"},
25 ]
26})
27
28empty = NewsletterAnalysis()
29print("mentions:", len(analysis.mentions))
30print("unknown carriers:", unknown_carriers(analysis))
31print("empty mentions:", empty.mentions)1mentions: 3
2unknown carriers: ['ShipFast']
3empty mentions: []Key design decisions:
mentions: list[Mention] with default_factory=list is safer than a nullable list because the schema can enforce the list shape even when empty.max_output_tokens, use a bounded schema-preserving retry, chunk the input, or explicitly version a smaller contract.You now understand how to move from "asking for JSON" to enforcing structured output at the token level. You can choose the right enforcement tier for your use case, debug the most common structured-generation failures, and build fallback cascades that handle truncation without wasting tokens on unrecoverable refusals.
Efficient Guided Generation for Large Language Models.
Willard, B. T. & Louf, R. · 2023 · arXiv preprint
Outlines Documentation
Outlines Developers · 2026
SGLang: Efficient Execution of Structured Language Model Programs.
Zheng, L., et al. · 2023
SGLang Structured Outputs Documentation
SGLang Project · 2026
Structured outputs
OpenAI · 2024
llama.cpp: Inference of LLaMA model in pure C/C++
Gerganov, G. · 2023
llama.cpp Grammars Documentation
ggml-org · 2026
XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models
Dong, Y., Ruan, C. F., Cai, Y., et al. · 2024
Let Me Speak Freely? A Study on the Impact of Format Restrictions on Performance of Large Language Models
Tam, Z. R., Wu, C.-K., Tsai, Y.-L., et al. · 2024