LearnAdvanced Agents & RetrievalStructured Output Generation

🤖HardLLM Agents & Tool Use

Structured Output Generation

Build reliable LLM interfaces with JSON mode, structured outputs, schema validation, and grammar-guided decoding.

41 min read

Learning path

Step 115 of 158 in the full curriculum

RAG Security & Access Control ReAct & Plan-and-Execute

A secure RAG prompt controls which documents enter the model. Structured output controls what leaves it: typed data that downstream code can parse, validate, and route without fragile string cleanup.

Structured output generation turns free-form model text into typed records that software can validate and route. Schemas, constrained decoding, and explicit recovery paths matter whenever LLM output feeds another system.

A CI triage pipeline that processes build logs automatically needs typed fields. A developer writes, "Run RUN-842 failed in unit-tests after auth-cache timed out." Your pipeline needs to extract the run ID and failure type so it can route the incident to the right workflow and look up the run in CI.

You ask an LLM to extract this information and return JSON. If the model replies with, "Extracted data: { 'run_id': 'RUN-842', 'failure_type': 'test_failure' }", your parser crashes because of the preamble text. That single failure blocks an entire automation pipeline.

When LLM output feeds software, it must be machine-parseable and validated. A missing comma, an unexpected field name, or a polite introduction can break downstream processing. Structured output generation is the set of techniques that make model outputs conform to a defined schema, replacing fragile string cleanup.

The practical range runs from best-effort JSON mode to token-level constraint enforcement via grammar-guided decoding ^{[1]Reference 1Efficient Guided Generation for Large Language Models.https://arxiv.org/abs/2307.09702}, open-source runtimes like Outlines and SGLang ^{[2]Reference 2Outlines Documentationhttps://dottxt-ai.github.io/outlines/latest/}^{[3]Reference 3SGLang: Efficient Execution of Structured Language Model Programs.https://arxiv.org/abs/2312.07104}^{[4]Reference 4SGLang Structured Outputs Documentationhttps://docs.sglang.io/docs/advanced_features/structured_outputs}, and hosted patterns using OpenAI's Structured Outputs feature.^{[5]Reference 5Structured outputshttps://developers.openai.com/api/docs/guides/structured-outputs}

Constrained decoding trace where competing next-token proposals hit a grammar mask, illegal tokens drop out, and only one legal JSON path continues. — Grammar-guided decoding changes the sampler itself. Invalid next tokens disappear before sampling, so only legal structure can continue.

The illustration above shows how a grammar engine filters the model's next-token probabilities. Invalid tokens (like a word when a number is required) are masked to negative infinity, so the sampler can only pick grammar-compliant tokens.

Why "please return JSON" isn't enough

Before looking at solutions, watch how naive prompting fails. Suppose you send this prompt to a chat model:

why-please-return-json-is-not-enough.py

prompt = """
Extract run_id and failure_type from this build log:
"Run RUN-842 failed in unit-tests after auth-cache timed out."
Return JSON.
"""

Three real failures look like this:

Failure mode	Example output	Why it breaks
Preamble text	"Here is the JSON: `{ 'run_id': 'RUN-842' }`"	The parser sees `Here` before the brace and throws
Markdown wrapper	"`json\n{ 'run_id': 'RUN-842' }\n`"	The triple backticks and newlines aren't valid JSON
Wrong shape	`{ "run_id": "RUN-842", "failure_type": null }`	Your database expects `failure_type` to be a string, not null

None of these are model "hallucinations" in the usual sense. The model followed a weak natural-language request. Your request was underspecified. You asked for JSON, but you didn't enforce JSON.

Move from asking to enforcing. The enforcement tiers run from weakest to strongest.

Contract choices and enforcement mechanisms

Engineers need to choose both the contract exposed to application code and the mechanism that enforces it. These choices aren't a strict ladder: a hosted structured-output API may enforce a JSON Schema using constrained decoding internally, while a self-hosted runtime may expose grammar-guided decoding directly. Production machine-consumed output needs a checked contract, regardless of where enforcement runs.

Choice	Structural contract	Typical implementation
Prompt-based ("respond in JSON")	None; best-effort formatting only	Most chat APIs
JSON mode	Valid JSON syntax on successful completion, with edge cases to handle	OpenAI JSON mode, similar API flags
Strict function calling	Schema-constrained arguments for an action request	Tool-calling APIs and agent runtimes
Structured-output API	Schema adherence for supported schemas	OpenAI Structured Outputs, schema-aware SDKs
Grammar-guided runtime	Valid path through a supplied grammar or schema translation	Outlines, SGLang, llama.cpp

A structured-output ladder comparing prompt-only formatting, JSON mode, structured-output APIs, and grammar runtimes from weak asking to strong token-level enforcement. — These tiers differ in where structure gets enforced. Prompts ask, JSON mode guarantees syntax, structured APIs check supported schema rules, and grammar runtimes block invalid token paths during sampling.

To see how these differ in practice, imagine the same CI failure log. Each tier changes where enforcement happens:

Prompt-based: The model might return valid JSON. It might return a monologue. You need a regex to clean the output, then a JSON parser, then a schema validator, then a retry loop. This is best-effort formatting, not enforcement.

JSON mode: The API constrains successful completions to valid JSON syntax. You still need to explicitly tell the model to produce JSON, detect incomplete outputs, and validate the shape yourself. JSON mode could still return {"foo": "bar"} when you wanted {"run_id": "string", "failure_type": "string"}. JSON mode is syntax enforcement, not schema enforcement.^{[5]Reference 5Structured outputshttps://developers.openai.com/api/docs/guides/structured-outputs}

Structured-output APIs: The API takes a JSON Schema or Pydantic model and, when the request completes without refusal or truncation, returns output that matches supported schema features. The fields, types, and required properties are enforced within that supported subset.^{[5]Reference 5Structured outputshttps://developers.openai.com/api/docs/guides/structured-outputs}

Grammar-guided decoding: A runtime prevents the sampler from picking tokens that would break its compiled grammar. At every step, invalid tokens have their probabilities set to zero. This describes an enforcement mechanism inside generation, not a stronger semantic guarantee: a grammar-valid payload can still contain wrong values.^{[1]Reference 1Efficient Guided Generation for Large Language Models.https://arxiv.org/abs/2307.09702}

This executable comparison demonstrates the remaining gap after JSON syntax succeeds: an object can parse correctly and still fail its application schema.

json-syntax-versus-schema-contract.py

import json

from pydantic import BaseModel, ConfigDict, ValidationError

class RunTriage(BaseModel):
    model_config = ConfigDict(extra="forbid")

    run_id: str
    failure_type: str

syntax_valid_but_wrong_shape = json.loads('{"foo": "bar"}')
schema_valid = json.loads('{"run_id": "RUN-842", "failure_type": "test_failure"}')

try:
    RunTriage.model_validate(syntax_valid_but_wrong_shape)
except ValidationError:
    print("JSON-only result: rejected by schema")

print("structured record:", RunTriage.model_validate(schema_valid).model_dump())

Output

JSON-only result: rejected by schema
structured record: {'run_id': 'RUN-842', 'failure_type': 'test_failure'}

The parser-gate analogy

Source code moving through a compiler parser with no grammar might be a valid statement, or it might be text that only looks close. This is prompt-based generation: you ask the model to "please produce JSON," but nothing stops it from outputting a monologue instead.

Grammar-guided decoding is like a parser gate during compilation. The grammar prevents the output from taking an invalid syntax path. In the context of an LLM, the gate is a finite state machine (FSM, a computational model that tracks transitions between allowed states) that monitors generation. Once the parser state expects a non-negative integer for retry_count, the FSM blocks letter tokens and any other token prefix that can't extend a legal integer. The decoder can't select a letter token that violates the grammar, the same way a compiler parser rejects a malformed statement.

How grammar-guided decoding works

To understand how runtimes enforce structural compliance, start by examining the token generation process itself. By intercepting the token sampling phase, the runtime restricts selection to tokens that match the desired schema. Grammar constraints act as a filter during token generation:

Diagram showing LLM proposes next token, Grammar state, Sample valid token, and Mask invalid to -∞. — LLM proposes next token, Grammar state, Sample valid token, and Mask invalid to -∞.

Grammar-guided enforcement operates at the token sampling level. It transforms a schema or grammar into constraints that modify the model's output probabilities during generation.

From schema to logit masking

Schema Compilation: For regular languages such as Regex, the runtime can compile the allowed strings into a Deterministic Finite Automaton (DFA) (a state machine where the next state is uniquely determined by the current state and input symbol). For general JSON schemas, especially nested or recursive ones, many systems instead compile to a context-free grammar or another stack-aware parser representation.
- Example Schema: {"type": "object", "properties": {"age": {"type": "integer", "minimum": 0}}}
- Simplified regular-expression equivalent: \{"age":\s*(0|[1-9][0-9]*)\}
State Tracking: As the model generates tokens, the engine tracks the current DFA state or parser stack for the active grammar.
Logit Masking: At each step, the engine identifies which tokens are valid transitions.
- Scenario: The model has generated {"age": .
- Valid Next Characters / Bytes: 1, 2, 3, ... 9, 0, (space).
- Invalid Next Characters / Bytes: ", a, b, {, [, etc.
- Action: The runtime maps those allowed prefixes back to token IDs, then sets the logits (unnormalized probabilities) of all invalid token IDs to $-\infty$ . When softmax is applied, their probability becomes 0.

The toy code below shows the core masking idea without requiring a model checkpoint. The grammar state says that only { is legal as the next token, so every other candidate is assigned negative infinity before sampling:

from-schema-to-logit-masking.py

def mask_invalid_tokens(logits: dict[str, float], valid_tokens: set[str]) -> dict[str, float]:
    return {
        token: score if token in valid_tokens else float("-inf")
        for token, score in logits.items()
    }

next_token_logits = {
    "The": 0.45,
    "{": 0.30,
    "Sure": 0.15,
    "JSON": 0.10,
}

masked = mask_invalid_tokens(next_token_logits, valid_tokens={"{"})
chosen = max(masked, key=masked.get)
valid_after_mask = [token for token, score in masked.items() if score != float("-inf")]
print("valid after mask:", valid_after_mask)
print("chosen token:", chosen)

Output

valid after mask: ['{']
chosen token: {

This enforces structural adherence to the allowed grammar as long as generation can continue normally. In production, you still need to handle refusals, truncation, and semantic mistakes in the returned values.^{[5]Reference 5Structured outputshttps://developers.openai.com/api/docs/guides/structured-outputs}

The tokenizer alignment problem

The clean DFA story above hides the hardest systems detail. Grammars are usually written over characters or bytes, but LLMs don't sample characters. They sample subword tokens. That tokenization boundary means the runtime must answer a much harder question than "is } valid here?" It must answer "which of the 100,000+ vocabulary items could extend some valid character prefix from this state?" ^{[1]Reference 1Efficient Guided Generation for Large Language Models.https://arxiv.org/abs/2307.09702}

That token-to-grammar alignment step is where a lot of the real engineering work lives. Outlines precomputes an index used during guided generation, while SGLang's paper explicitly calls out compressed finite state machines for faster structured decoding.^{[1]Reference 1Efficient Guided Generation for Large Language Models.https://arxiv.org/abs/2307.09702}^{[3]Reference 3SGLang: Efficient Execution of Structured Language Model Programs.https://arxiv.org/abs/2312.07104} If an implementation does this naively, masking can become a noticeable part of per-token latency.

Hosted APIs can hide some of that machinery. OpenAI's current docs include a general first-schema latency caveat for Structured Outputs: the first request with a schema can add processing latency, while later requests with the same schema avoid that extra work. The same page also notes a fine-tuned-model-specific version of that caveat for response_format, where other models don't have that particular limitation.^{[5]Reference 5Structured outputshttps://developers.openai.com/api/docs/guides/structured-outputs}

Regex vs. context-free grammars

When working with flat data structures, regular expressions (Regex) compiled to DFAs work well. But what if your JSON object contains object lists or several nesting levels?

Regular-language constraints can't represent arbitrarily nested structures like general JSON. To enforce nested or recursive schemas, engines need a Context-Free Grammar (CFG) or another stack-aware parser representation. Such representations track open brackets [ and { until matching closing brackets ] and }. Modern libraries handle translation automatically; benchmark nested schemas because parser state and output size can affect latency.

Open-source engines: Outlines, llama.cpp, SGLang, and XGrammar

While hosted Structured Outputs APIs are convenient, open-source runtimes give you direct control over constrained decoding for self-hosted models. For teams deploying open-weight models, managed API features may not fit. Instead, they use inference engines or libraries that support constrained generation natively and can cache compiled grammars or shared prefixes inside the serving stack.

Outlines

Outlines ^{[1]Reference 1Efficient Guided Generation for Large Language Models.https://arxiv.org/abs/2307.09702}^{[2]Reference 2Outlines Documentationhttps://dottxt-ai.github.io/outlines/latest/} provides a high-level Python interface for structured generation. In current releases, you build an Outlines model once, then create a reusable Generator for a hot target type. That keeps the model wrapper, tokenizer, and schema-bound generation machinery off the per-request setup path.

This is an integration example, not a no-dependency local script. It requires outlines, transformers, pydantic, a compatible model backend, and enough local compute to load the chosen model.

outlines.py

from typing import Literal

import outlines
from outlines import Generator
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer

class Character(BaseModel):
    name: str
    role: Literal["Warrior", "Mage", "Rogue"]
    level: int

model_name = "microsoft/Phi-3-mini-4k-instruct"
model = outlines.from_transformers(
    AutoModelForCausalLM.from_pretrained(model_name, device_map="auto"),
    AutoTokenizer.from_pretrained(model_name),
)

character_generator = Generator(model, Character)
raw = character_generator(
    "Create a level-5 fantasy RPG character.",
    max_new_tokens=120,
)
character = Character.model_validate_json(raw)

Enums and required fields can be enforced during decoding, but business rules still belong in post-validation. For example, if level must be between 1 and 100, validate that in your application even if the grammar already narrows the shape.

llama.cpp

llama.cpp exposes grammar constraints directly through GBNF and can also convert a subset of JSON Schema into grammars for its server and CLI flows.^{[6]Reference 6llama.cpp: Inference of LLaMA model in pure C/C++https://github.com/ggml-org/llama.cpp}^{[7]Reference 7llama.cpp Grammars Documentationhttps://github.com/ggml-org/llama.cpp/blob/master/grammars/README.md} One subtle but important detail from its docs is that the schema is used to constrain decoding, not automatically to teach the model what the fields mean. For plain structured generation, you still want prompt instructions that explain the task and the semantics of the fields you're asking for.^{[7]Reference 7llama.cpp Grammars Documentationhttps://github.com/ggml-org/llama.cpp/blob/master/grammars/README.md}

SGLang

SGLang (Structured Generation Language) ^{[3]Reference 3SGLang: Efficient Execution of Structured Language Model Programs.https://arxiv.org/abs/2312.07104}^{[4]Reference 4SGLang Structured Outputs Documentationhttps://docs.sglang.io/docs/advanced_features/structured_outputs} goes further by optimizing the runtime for structured workloads. Its paper describes two separate ideas that are easy to conflate:

RadixAttention reuses KV cache for shared token prefixes across requests.
Compressed finite state machines speed up structured decoding itself.^{[3]Reference 3SGLang: Efficient Execution of Structured Language Model Programs.https://arxiv.org/abs/2312.07104}

Under the hood, SGLang stores token-prefix mappings in a radix tree and reuses previously computed KV cache when a new request shares a prompt prefix.^{[3]Reference 3SGLang: Efficient Execution of Structured Language Model Programs.https://arxiv.org/abs/2312.07104} That helps Time To First Token (TTFT) when many requests reuse the same system prompt, few-shot examples, or schema instructions. It's a prefix-caching optimization, not a grammar-state cache.

SGLang also provides a domain-specific language (DSL, a specialized syntax for a particular application domain) for interleaving Python control flow with LLM generation. Its structured-output documentation exposes JSON Schema, regex, and grammar constraints; its runtime design separately addresses cache reuse.^{[4]Reference 4SGLang Structured Outputs Documentationhttps://docs.sglang.io/docs/advanced_features/structured_outputs}^{[3]Reference 3SGLang: Efficient Execution of Structured Language Model Programs.https://arxiv.org/abs/2312.07104}

XGrammar

XGrammar ^{[8]Reference 8XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Modelshttps://arxiv.org/abs/2411.15100} is a grammar engine built specifically to make context-free-grammar decoding cheap enough for production serving. It attacks the tokenizer-alignment cost from the previous section head on. It splits the vocabulary into context-independent tokens, whose validity can be precomputed regardless of the parser stack, and context-dependent tokens, which must be checked at runtime against the current stack. A persistent stack and overlap with GPU execution then shrink the per-token mask cost further.^{[8]Reference 8XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Modelshttps://arxiv.org/abs/2411.15100}

The paper reports up to 100x faster grammar processing than evaluated prior approaches and near-zero structured-generation overhead in its end-to-end serving experiments.^{[8]Reference 8XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Modelshttps://arxiv.org/abs/2411.15100} Serving stacks can expose engines such as XGrammar behind a higher-level structured-output interface; verify which backend and benchmark settings your deployment uses.

Function calling vs. structured outputs

Both techniques constrain model outputs, but they solve different orchestration problems in an AI system. It's common to blur them together because both can involve JSON schemas. Some APIs also allow strict tool-argument schemas, but the core distinction still holds: function calling is about selecting actions, while structured outputs are about returning data in a fixed contract.

Aspect	Function Calling	Structured Outputs
Primary Goal	Action Selection (Tool Use)	Data Extraction / Formatting
Trigger	Application may allow or require a tool call; model supplies arguments	Application requests a format for the returned record
Output	Function name + arguments	Arbitrary JSON object
Control Flow	Loop: Model -> Code -> Model	Linear: Model -> Parser -> App

When to use function calling

Use function calling when the model is an agent that needs to interact with the world. In a typical tool loop, the application exposes allowed tools and may let the model choose one or require a specific call. The application executes an accepted call, then returns its result to the model in a subsequent turn. This allows the agent to formulate a final user-facing response based on fetched data.

This example defines a tool for an agent to fetch CI run status. If tool choice is left automatic, the model can either call the function with structured arguments or return a normal text response:

when-to-use-function-calling.py

tools = [
    {
        "type": "function",
        "name": "get_run_status",
        "description": "Fetch current CI status for a run.",
        "strict": True,
        "parameters": {
            "type": "object",
            "properties": {
                "run_id": {"type": "string"}
            },
            "required": ["run_id"],
            "additionalProperties": False
        }
    }
]

tool_schema = tools[0]
example_routes = {
    "What happened to RUN-842?": "get_run_status",
    "Hi!": "text_response",
}
print("tool:", tool_schema["name"])
print("strict:", tool_schema["strict"])
print("routes:", example_routes)

Output

tool: get_run_status
strict: True
routes: {'What happened to RUN-842?': 'get_run_status', 'Hi!': 'text_response'}

If your provider supports strict tool schemas, enable them. That improves argument reliability, but the control flow is still tool invocation rather than terminal data extraction. On OpenAI, manual strict tool schemas should set strict: true, list every parameter in required, and close objects with additionalProperties: false. parallel_tool_calls=false is useful when your application expects zero or one call, but that's an orchestration choice, not the definition of strictness.^{[5]Reference 5Structured outputshttps://developers.openai.com/api/docs/guides/structured-outputs}

When to use structured outputs

Use structured outputs when you need to extract data or create a reliable interface between the LLM and your code. Unlike the multi-turn loop of function calling, structured outputs are typically terminal, single-turn actions used purely for data formatting and strict schema adherence.

This example defines a Pydantic model for a run-status update and passes it directly to OpenAI's parsing helper. The SDK handles the JSON schema conversion for you, and output_parsed gives you a typed object back when the model succeeds.^{[5]Reference 5Structured outputshttps://developers.openai.com/api/docs/guides/structured-outputs} This snippet requires the OpenAI Python SDK and an OPENAI_API_KEY.

when-to-use-structured-outputs.py

from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()

class RunStatusUpdate(BaseModel):
    run_id: str
    status: str
    failing_job: str

response = client.responses.parse(
    model="gpt-4o-mini",
    input="Run RUN-842 failed in unit-tests.",
    text_format=RunStatusUpdate,
)

update = response.output_parsed

One provider-specific gotcha: hosted structured-output APIs usually implement a subset of JSON Schema, not the full spec. On OpenAI, that means root-level anyOf isn't allowed, every field must be required, and every object must opt into closed-world generation with additionalProperties: false.^{[5]Reference 5Structured outputshttps://developers.openai.com/api/docs/guides/structured-outputs} Treat schema design as part of your API integration, not as a prompt-writing detail alone.

Typed record moves through schema gate and semantic gate. Clean output continues to acceptance, while bad shape or bad real-world meaning stops on separate reject branches. — Schema-valid JSON is only first gate. Shape passes first, then application code checks ownership, policy, and source-of-truth state before anything downstream trusts it.

The next example keeps these gates separate. Pydantic checks shape and enum values; application code checks whether the authenticated engineer can access the run's repository.

separate-format-from-semantic-validation.py

from typing import Literal

from pydantic import BaseModel, ConfigDict

class RunStatusUpdate(BaseModel):
    model_config = ConfigDict(extra="forbid")

    run_id: str
    status: Literal["queued", "passed", "failed"]

project_runs = {"project-alpha": {"RUN-842"}}

def can_show_update(project_id: str, update: RunStatusUpdate) -> bool:
    return update.run_id in project_runs.get(project_id, set())

parsed = RunStatusUpdate.model_validate({"run_id": "RUN-842", "status": "failed"})
print("format gate:", parsed.model_dump())
print("project allowed:", can_show_update("project-alpha", parsed))
print("other project allowed:", can_show_update("project-beta", parsed))

Output

format gate: {'run_id': 'RUN-842', 'status': 'failed'}
project allowed: True
other project allowed: False

Production patterns

Moving from a prototype that occasionally outputs valid JSON to a production system that processes thousands of requests reliably requires defensive engineering. These patterns address the most common failure modes and lifecycle challenges associated with structured output generation.

1. Preserve useful intermediate evidence for hard tasks

A subtle but important pattern: strict structured output can hurt task quality on some hard problems when the schema is too tight. A controlled study found measurable reasoning-accuracy reductions under format restrictions on evaluated tasks such as GSM8K, with outcomes affected by format and field ordering.^{[9]Reference 9Let Me Speak Freely? A Study on the Impact of Format Restrictions on Performance of Large Language Modelshttps://arxiv.org/abs/2408.02442}

Don't respond by storing unrestricted chain-of-thought. Instead, design product-visible intermediate fields that can be checked: cited evidence, extracted quantities, calculation steps needed by a tutor, or a reason code used by a reviewer. If a workflow genuinely needs those fields, place them before the final decision and validate them independently.

1-validate-visible-evidence-before-decision.py

from typing import Literal

from pydantic import BaseModel, model_validator

class IncidentDecision(BaseModel):
    evidence_quote: str
    run_id: str
    decision: Literal["retry_tests", "rollback", "escalate", "unknown"]

    @model_validator(mode="after")
    def cited_run_appears_in_evidence(self) -> "IncidentDecision":
        if self.run_id not in self.evidence_quote:
            raise ValueError("run ID is not supported by evidence")
        return self

decision = IncidentDecision.model_validate(
    {
        "evidence_quote": "Run RUN-842 failed in unit-tests after auth-cache timed out.",
        "run_id": "RUN-842",
        "decision": "retry_tests",
    }
)
print("decision:", decision.decision)
print("evidence checked:", decision.run_id in decision.evidence_quote)

Output

decision: retry_tests
evidence checked: True

This isn't about recovering a hidden "true" chain of thought. The contract exposes evidence that a product or reviewer can inspect when the final decision is wrong.

2. Explicit intermediate fields for complex extraction

A common pitfall is forcing a difficult decision into a final field with no checkable support. For extraction, tutoring, or multi-step workflows, explicit evidence or bounded calculation fields can make the result easier to validate before code acts on it.

The schema below follows the same pattern as OpenAI's math-tutor examples: return a bounded list of explicit steps plus the final answer.

2-explicit-intermediate-fields-for-complex.py

from pydantic import BaseModel

class Step(BaseModel):
    explanation: str
    output: str

class MathSolution(BaseModel):
    steps: list[Step]
    final_answer: str

# Use this when the application needs the intermediate fields.
# Keep them intentional and bounded rather than dumping unstructured text.

3. Handling schema evolution

Schemas change. If producers and consumers disagree about a payload version, a rollout can break downstream processing even when each individual response is valid JSON.

Production tip: Version your schemas in application code. Choose the validator before sending the request, then attach the corresponding version to the validated record or require a fixed version literal and verify it. Don't let the model choose which contract it claims to satisfy.

3-route-versioned-records-with-code-owned-contracts.py

from typing import Literal

from pydantic import BaseModel

class RunStatusV1(BaseModel):
    run_id: str
    status: str

class RunStatusV2(BaseModel):
    run_id: str
    status: str
    failing_job: str

class VersionedRunStatus(BaseModel):
    schema_version: Literal["2"]
    payload: RunStatusV2

generated_payload = {"run_id": "RUN-842", "status": "failed", "failing_job": "unit-tests"}
validated = RunStatusV2.model_validate(generated_payload)
wire_record = VersionedRunStatus(schema_version="2", payload=validated)
print("version:", wire_record.schema_version)
print("failing job:", wire_record.payload.failing_job)

Output

version: 2
failing job: unit-tests

4. Recover by failure class, without weakening the contract

Even with structured outputs, failures can happen. The important distinction is between recoverable failures (for example, truncation from max_output_tokens) and policy outcomes such as refusals or content filtering. Don't treat a refusal as an ordinary parse failure and then retry with a looser mode.^{[5]Reference 5Structured outputshttps://developers.openai.com/api/docs/guides/structured-outputs}

Truncation doesn't justify switching from a schema-enforced response to JSON mode: the missing content is still missing and the fallback loses schema enforcement. Prefer a bounded retry with more output budget, a smaller contract, or chunked input. The fake client below mirrors response states so you can unit-test that control flow without an API call:

4-bounded-retry-preserves-schema-contract.py

from dataclasses import dataclass

from pydantic import BaseModel

class RunStatusUpdate(BaseModel):
    run_id: str
    status: str
    failing_job: str

@dataclass
class ContentItem:
    type: str
    refusal: str | None = None

@dataclass
class MessageOutput:
    content: list[ContentItem]

@dataclass
class IncompleteDetails:
    reason: str

@dataclass
class FakeResponse:
    status: str
    output: list[MessageOutput]
    output_parsed: RunStatusUpdate | None = None
    incomplete_details: IncompleteDetails | None = None

class FakeResponsesApi:
    def __init__(self, responses: list[FakeResponse]) -> None:
        self.responses = iter(responses)

    def parse(self, **kwargs) -> FakeResponse:
        return next(self.responses)

class FakeClient:
    def __init__(self, responses: list[FakeResponse]) -> None:
        self.responses = FakeResponsesApi(responses)

def generate_with_bounded_retry(client: FakeClient, input_text: str) -> RunStatusUpdate:
    response = client.responses.parse(
        model="gpt-4o-mini",
        input=input_text,
        text_format=RunStatusUpdate,
        max_output_tokens=120,
    )

    first_content = response.output[0].content[0]

    if first_content.type == "refusal":
        raise RuntimeError(f"policy refusal: {first_content.refusal}")

    if response.status == "completed" and response.output_parsed is not None:
        return response.output_parsed

    if response.status != "incomplete":
        raise RuntimeError(f"Unexpected response status: {response.status}")

    if response.incomplete_details is None:
        raise RuntimeError("Incomplete response did not include a reason")

    if response.incomplete_details.reason != "max_output_tokens":
        raise RuntimeError(
            f"Structured output halted: {response.incomplete_details.reason}"
        )

    retry = client.responses.parse(
        model="gpt-4o-mini",
        input=input_text,
        text_format=RunStatusUpdate,
        max_output_tokens=400,
    )
    if retry.status != "completed" or retry.output_parsed is None:
        raise RuntimeError("bounded retry did not produce a structured result")
    return retry.output_parsed

output_item = MessageOutput(content=[ContentItem(type="output_text")])
incomplete = FakeResponse(
    status="incomplete",
    output=[output_item],
    incomplete_details=IncompleteDetails(reason="max_output_tokens"),
)
completed = FakeResponse(
    status="completed",
    output=[output_item],
    output_parsed=RunStatusUpdate(run_id="RUN-842", status="failed", failing_job="unit-tests"),
)
update = generate_with_bounded_retry(
    FakeClient([incomplete, completed]),
    "Run RUN-842 failed in unit-tests.",
)
print("retry preserved contract:", update.model_dump())

refused = FakeResponse(
    status="completed",
    output=[MessageOutput(content=[ContentItem(type="refusal", refusal="blocked")])],
)
try:
    generate_with_bounded_retry(FakeClient([refused]), "disallowed request")
except RuntimeError as exc:
    print("refusal routed:", str(exc))

Output

retry preserved contract: {'run_id': 'RUN-842', 'status': 'failed', 'failing_job': 'unit-tests'}
refusal routed: policy refusal: blocked

5. Flatten nested structures

Schemas with many nesting levels or recursive shapes can increase grammar state, output length, and debugging complexity. Some hosted APIs support recursive schemas, but support doesn't establish acceptable latency for your workload.^{[5]Reference 5Structured outputshttps://developers.openai.com/api/docs/guides/structured-outputs} Benchmark nested contracts on the target runtime.

Production tip: Keep nesting as shallow as your interface allows, bound recursive outputs, and benchmark the exact schema on your target runtime. If you need tree-shaped data, a flat list of nodes with parent_id references is often easier to generate, validate, and evolve than a recursive JSON object.

5-reconstruct-a-flat-tree-after-validation.py

from pydantic import BaseModel

class FlatNode(BaseModel):
    node_id: str
    parent_id: str | None
    label: str

nodes = [
    FlatNode(node_id="root", parent_id=None, label="ci_run"),
    FlatNode(node_id="n1", parent_id="root", label="failing_job"),
    FlatNode(node_id="n2", parent_id="root", label="log_url"),
]
children: dict[str | None, list[str]] = {}
for node in nodes:
    children.setdefault(node.parent_id, []).append(node.label)

print("root nodes:", children[None])
print("run fields:", children["root"])

Output

root nodes: ['ci_run']
run fields: ['failing_job', 'log_url']

Performance considerations

Grammar-guided decoding performs valid-token work during generation, although optimized engines may hide or greatly reduce observable overhead for particular workloads. Schema compliance doesn't make latency irrelevant. Measure on the model, schema, batch shape, and runtime you plan to ship.

Where the latency comes from

The latency impact depends on the schema, tokenizer, and runtime design. The main cost centers are:

Cost source	Why it appears	Common mitigation
Grammar compilation	The runtime has to convert a schema or grammar into an indexable guide	Compile once and reuse it across requests
Per-token masking	Each generation step must compute the valid next-token set	Precompute token-prefix tables, compressed FSMs, or categorize context-independent tokens (XGrammar)
First use of a hosted schema	Some providers or model paths may preprocess and cache a new schema before generation starts	Reuse stable schemas and warm hot paths when your provider documents or measurements justify it
Large prompt prefixes	Long schema instructions still consume prefill work and context	Use server-side structured outputs or prefix caching

Interviewers often ask about TTFT versus TPOT for this reason. Compilation, hosted schema preprocessing, and large prompt prefixes mostly affect TTFT. Token masking affects TPOT. Systems such as Outlines, SGLang, and XGrammar focus on reducing those costs with precomputation, token categorization, and cache reuse rather than ignoring the cost.^{[1]Reference 1Efficient Guided Generation for Large Language Models.https://arxiv.org/abs/2307.09702}^{[3]Reference 3SGLang: Efficient Execution of Structured Language Model Programs.https://arxiv.org/abs/2312.07104}^{[8]Reference 8XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Modelshttps://arxiv.org/abs/2411.15100}^{[5]Reference 5Structured outputshttps://developers.openai.com/api/docs/guides/structured-outputs}

The quality-compliance tradeoff

Constraining the output space can affect generation quality because a grammar eliminates paths outside the contract. The effect depends on task, model, and schema; format-restriction studies show it can be measurable on reasoning tasks.^{[9]Reference 9Let Me Speak Freely? A Study on the Impact of Format Restrictions on Performance of Large Language Modelshttps://arxiv.org/abs/2408.02442}

Production tip: Keep your schemas semantically permissive, but structurally stable. Prefer a stable field set with nullable values or bounded enums over a maze of branching object variants. If your provider uses strict mode, supported-schema limits and closed-object requirements become part of the interface contract.^{[5]Reference 5Structured outputshttps://developers.openai.com/api/docs/guides/structured-outputs}

Batched constrained generation

When processing multiple requests with the same schema (common in data extraction pipelines), the FSM can be compiled once and reused across requests. This avoids repeated per-request compilation work when the runtime supports that reuse.

Current Outlines calls support the same batching pattern. Build the model wrapper and schema-bound generator once at startup, then reuse that generator across prompts:

batched-constrained-generation.py

# Step 1: Build the Outlines model once at startup
# model = outlines.from_transformers(...)
# generator = outlines.Generator(model, ExtractedEntity)

# Step 2: Reuse the schema-bound generator for each prompt
results = [
    ExtractedEntity.model_validate_json(
        generator(prompt, max_new_tokens=120)
    )
    for prompt in batch_prompts
]

SGLang takes this further with prefix-aware KV cache reuse. If multiple requests share the same prompt prefix, the runtime can reuse prefetched state instead of rebuilding it from scratch.^{[3]Reference 3SGLang: Efficient Execution of Structured Language Model Programs.https://arxiv.org/abs/2312.07104}

Debugging common failures

Even with structured outputs, things go wrong. The difference between a prototype and a production system is knowing what failure looks like, why it happens, and how to fix it. This table turns common misconceptions into a debugging guide.

"The model returned valid JSON, so the data must be correct"

Symptom: Your pipeline parses the output successfully, but the values are nonsense. A CI status reads "passed" for a run that failed. A run ID doesn't exist in your build system.
Cause: Structured outputs enforce format, not accuracy. The model can produce valid JSON with correct types while the values are still fabricated. A {"status": "passed"} value satisfies the schema but is wrong for a failed run.
Fix: Add application-layer semantic validation. Check that run IDs exist in CI, that status values match source-of-truth logs, and that enum values match your known set. Use Pydantic validators or plain Python assertions after parsing.

"The Markdown wrapper trap"

Symptom: Your JSON parser throws a JSONDecodeError even though the output looks like JSON at a glance.
Cause: The model wrapped the JSON in triple backticks with a json label, or added a preamble like "Here is the result:". When you feed the raw output into json.loads(), the extra characters break parsing.
Fix: Treat unparseable or schema-invalid text as a contract failure. For a legacy prompt-only integration, make a bounded retry through a stronger interface or send the item to review; don't silently slice arbitrary text between braces and trust it as the record.

the-markdown-wrapper-trap.py

import json

from pydantic import BaseModel, ValidationError

class RunStatusUpdate(BaseModel):
    run_id: str

def accept_typed_record(raw: str) -> str:
    try:
        payload = json.loads(raw)
        RunStatusUpdate.model_validate(payload)
    except (json.JSONDecodeError, ValidationError):
        return "reject: contract not satisfied"
    return "accept: typed record"

print(accept_typed_record('{"run_id": "RUN-842"}'))
print(accept_typed_record('Here is the JSON: {"run_id": "RUN-842"}'))
print(accept_typed_record("```json\n{\"run_id\": \"RUN-842\"}\n```"))

Output

accept: typed record
reject: contract not satisfied
reject: contract not satisfied

Use structured outputs or a grammar-guided runtime when the system must produce the contract directly rather than repair free-form text.

"A refusal isn't a parse error"

Symptom: Your fallback cascade retries a refused request with a looser mode, and the model still refuses. You've spent extra tokens and latency for no gain.
Cause: A refusal or content-filter stop is a policy outcome, not a decoding bug. The model (or the safety layer) has decided not to answer. Loosening the schema doesn't change that decision.
Fix: Surface the refusal to your application layer. Route it to a human reviewer, change the input, or return a polite error to the user. Don't treat refusals as retryable parse failures.^{[5]Reference 5Structured outputshttps://developers.openai.com/api/docs/guides/structured-outputs}

"Grammar-guided decoding is always slow"

Symptom: You've heard constrained decoding adds overhead and assume it's too slow for your use case.
Cause: Naive implementations can be slow, but optimized runtimes reduce the cost a lot. The overhead depends on tokenizer alignment, grammar complexity, and whether you get cache hits.
Fix: Benchmark before deciding. The right question isn't "is there overhead?" It's "where is the overhead, and can I amortize it?" If you process many requests with the same schema, compilation cost may be reused. Compare hosted schema APIs and optimized self-hosted runtimes on your latency and compliance targets; provider-managed enforcement hides implementation work but doesn't guarantee lower latency.^{[1]Reference 1Efficient Guided Generation for Large Language Models.https://arxiv.org/abs/2307.09702}^{[3]Reference 3SGLang: Efficient Execution of Structured Language Model Programs.https://arxiv.org/abs/2312.07104}^{[5]Reference 5Structured outputshttps://developers.openai.com/api/docs/guides/structured-outputs}

"JSON mode and structured outputs are the same thing"

Symptom: You enabled JSON mode and assumed the output would match your schema. It returned {"foo": "bar"} when you expected {"name": "string", "age": "integer"}.
Cause: JSON mode enforces valid JSON syntax on successful completion, but not schema compliance. The model might return any valid JSON object.
Fix: Use structured outputs or grammar-guided decoding when you need schema adherence. Use JSON mode only when you need syntactic validity and plan to validate the shape yourself.^{[5]Reference 5Structured outputshttps://developers.openai.com/api/docs/guides/structured-outputs}

"I should use structured outputs for every LLM call"

Symptom: You're wrapping every prompt in a Pydantic model, even for creative writing or open-ended Q&A.
Cause: Over-application of a useful technique. Structured outputs shine when the output feeds into code (APIs, databases, downstream processing). For user-facing text responses, free-form generation is often better.
Fix: Use structured outputs when the consumer is code. Use free-form generation when the consumer is a human. Forcing unnecessary structure wastes tokens on syntax characters and may constrain the model's expressiveness.

Mastery check

Evaluation rubric

Foundational: Explain why JSON mode enforces syntax while structured outputs enforce supported schema features.
Intermediate: Describe how grammar-guided decoding constrains token sampling to valid next tokens.
Advanced: Build a constrained generation pipeline with JSON Schema, Pydantic models, or provider-native structured output helpers.
Advanced: Analyze tokenizer alignment, schema compilation, prefix caching, TTFT, and TPOT tradeoffs.
Advanced: Design a production pipeline that handles schema validation, refusal states, truncation, repair, and semantic post-validation.

Follow-up questions

Your parser gets valid JSON, but the run ID doesn't exist in CI. Where do you fix the pipeline?

Fix the pipeline after parsing, not by loosening the schema. The format gate already passed. The failure is semantic: the value doesn't match real system state. Keep structured outputs for the typed contract, then add application checks for ownership, existence, policy, and world-state freshness before you write to a queue or database.

An assistant sometimes needs to fetch live run status and sometimes only return a typed classification. Which interface should you choose?

Use function calling when the application needs the model to propose a tool action and arguments, including flows where a tool is required. Use structured outputs when every answer should end as a terminal record that your code parses once. Strict tool schemas improve argument shape, but authorization and execution policy still stay in application code.

Your provider rejects a schema with root `anyOf` and open objects. What should you change?

Treat the provider schema subset as part of the contract. Reshape the schema into a supported form, require every field the provider expects, and close objects with additionalProperties: false when the API requires it. Don't assume "valid JSON Schema" means "accepted by this runtime."

You need to add a new optional field during rollout. How do you avoid breaking old consumers?

Version the schema and emit that version in the payload. Then let the backend route records through the matching validator or migration layer. Without explicit versioning, a missing field is ambiguous: it could be an older contract or a broken generation.

A recursive schema works in staging, but latency spikes in production. What should you flatten first?

Flatten the contract before you weaken enforcement. Replace nested trees with smaller objects or a flat list of nodes plus parent_id references. That reduces grammar state, output length, and post-validation complexity while preserving the structure your application needs.

A smaller model satisfies the schema but fills required enum fields with weak guesses. What should you change before loosening constraints?

Keep the structure strict. First simplify the task, split the pipeline into smaller stages, add useful intermediate fields, or upgrade to a stronger model for the hard step. Then add semantic validation so a schema-valid but weak answer doesn't silently ship downstream.

One strict-schema request is truncated, and another is refused. Which one is retryable?

The truncation case is retryable because it's a transport or length failure. The refusal case isn't retryable through a weaker parser because it's a policy result. Handle truncation with a bounded fallback or a larger token budget. Handle refusal with policy logic, escalation, or a controlled user-facing error.

Common pitfalls

JSON parses, but keys or nulls still break downstream code. Cause: JSON mode enforced syntax only. Fix: run schema validation after parsing and reject shape drift before the payload reaches application code.
The provider rejects the schema before generation starts. Cause: hosted structured-output APIs usually support only a subset of JSON Schema. Fix: reshape root unions, close objects, and design to the runtime's supported subset.
Strict mode keeps failing on surprise keys. Cause: the object stayed open when the provider expected additionalProperties: false. Fix: close the object explicitly and treat unknown keys as contract failures.
A weaker model fills required enums with brittle guesses. Cause: values were over-constrained before the task was made easier. Fix: keep structure strict, but simplify the task, add useful intermediate fields, or upgrade the model for the hard step.
The record is schema-valid but still wrong. Cause: semantic validation was skipped. Fix: check ownership, existence, dates, policy, and external system state before you route, write, or act.
Old consumers break during rollout. Cause: the payload never identified which schema version produced it. Fix: emit or route with an explicit version so validators can distinguish "old contract" from "broken response."
Refusals are retried with weaker generation rules. Cause: policy outcomes were treated like parser bugs. Fix: route refusals to policy handling, not raw-text fallbacks.
Human-facing text is wrapped in JSON for no reason. Cause: structured outputs were applied even though no code consumes the answer. Fix: reserve strict contracts for machine-consumed responses.
TTFT spikes after a new schema launch. Cause: schema compilation or hot-path reuse was ignored. Fix: reuse stable schemas, warm frequent ones, and measure TTFT separately from TPOT.

Practice: build an incident-digest parser

Here's a concrete exercise to test your understanding. Try it before looking at the solution sketch.

Task: You receive a 500-word incident digest about engineering services. Build a tool that extracts every mentioned service and classifies its operational status as healthy, degraded, outage, or unknown.

Requirements

Define a Pydantic model with two fields: service (str) and status (Literal["healthy", "degraded", "outage", "unknown"]).
Use structured outputs or grammar-guided decoding to enforce the schema.
Handle the case where the model returns no mentions (return an empty list, not a null).
Add a post-validation step that checks the service name against a known service catalog (e.g., auth-api, vector-indexer, billing-export). Flag unknown services for review.

Input example

"auth-api recovered after elevated 5xx errors. vector-indexer remains degraded. report-worker has no known impact."

Expected output shape

expected-output-shape.json

[
  {"service": "auth-api", "status": "healthy"},
  {"service": "vector-indexer", "status": "degraded"},
  {"service": "report-worker", "status": "unknown"}
]

Solution sketch

Click to expand solution sketch

solution-sketch.py

from typing import Literal
from pydantic import BaseModel

class Mention(BaseModel):
    service: str
    status: Literal["healthy", "degraded", "outage", "unknown"]

class IncidentDigest(BaseModel):
    mentions: list[Mention]

KNOWN_SERVICES = {"auth-api", "vector-indexer", "billing-export"}

def unknown_services(digest: IncidentDigest) -> list[str]:
    return [
        mention.service
        for mention in digest.mentions
        if mention.service not in KNOWN_SERVICES
    ]

digest = IncidentDigest.model_validate({
    "mentions": [
        {"service": "auth-api", "status": "healthy"},
        {"service": "vector-indexer", "status": "degraded"},
        {"service": "report-worker", "status": "unknown"},
    ]
})

empty = IncidentDigest(mentions=[])
print("mentions:", len(digest.mentions))
print("unknown services:", unknown_services(digest))
print("empty mentions:", empty.mentions)

Output

mentions: 3
unknown services: ['report-worker']
empty mentions: []

Key design decisions:

Require a list, not a nullable or defaulted field: mentions: list[Mention] makes mentions required in generated JSON Schema and represents no matches as {"mentions": []}. That shape is directly compatible with OpenAI strict mode, which requires every field.
Post-validate status: The schema restricts the value to one of the three literals, but it can't prove the service status is factually correct. Add a second-pass check for high-stakes classifications.
Handle truncation: If the digest is long and the output hits max_output_tokens, use a bounded schema-preserving retry, chunk the input, or explicitly version a smaller contract.

Where structured output leads

Schema-constrained generation can enforce structure during decoding. JSON mode enforces JSON syntax only, while prompt-only formatting remains best effort. The right enforcement tier, failure diagnosis, and fallback cascade determine whether truncation, refusals, and malformed output become recoverable errors or wasted tokens.

Next Step

Continue to ReAct & Plan-and-Execute

There, you'll compare the two core agent control loops: <span data-glossary="react">ReAct</span> for tightly coupled tool use, and Plan-and-Execute for longer workflows with explicit planning and replanning. The structured outputs you learned here become the data contracts that feed into those agent loops.

PreviousRAG Security & Access Control

Share this article

X Facebook LinkedIn Bluesky Reddit Hacker News Email

References

Efficient Guided Generation for Large Language Models.

Willard, B. T. & Louf, R. · 2023 · arXiv preprint

Outlines Documentation

Outlines Developers · 2026

SGLang: Efficient Execution of Structured Language Model Programs.

Zheng, L., et al. · 2023

SGLang Structured Outputs Documentation

SGLang Project · 2026

Structured outputs

OpenAI · 2024

llama.cpp: Inference of LLaMA model in pure C/C++

Gerganov, G. · 2023

llama.cpp Grammars Documentation

ggml-org · 2026

XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models

Dong, Y., Ruan, C. F., Cai, Y., et al. · 2024

Let Me Speak Freely? A Study on the Impact of Format Restrictions on Performance of Large Language Models

Tam, Z. R., Wu, C.-K., Tsai, Y.-L., et al. · 2024

Discussion

Questions and insights from fellow learners.

Discussion loads when you reach this section.

Structured Output Generation

When does structured output matter most?

Why "please return JSON" isn't enough

Why does "return JSON" fail as an interface contract?

Contract choices and enforcement mechanisms

What is the difference between JSON mode and structured outputs?

The parser-gate analogy

What does grammar-guided decoding enforce?

How grammar-guided decoding works

From schema to logit masking

Why can a schema-valid result still be wrong?

The tokenizer alignment problem

Why does tokenizer alignment make constrained decoding harder than parsing characters?

Regex vs. context-free grammars

When is a regular expression enough, and when do you need a CFG?

Open-source engines: Outlines, llama.cpp, SGLang, and XGrammar

Outlines

What should you build once at startup in an Outlines-style integration?

llama.cpp

Why does a constrained decoder still need task instructions?

SGLang

Why should you separate prefix caching from grammar-state optimization?

XGrammar

How does XGrammar make grammar-guided decoding cheaper?

Function calling vs. structured outputs

When should you choose function calling instead of structured outputs?

When to use function calling

Why does a strict tool schema not replace authorization checks?

When to use structured outputs

What does additionalProperties: false protect you from?

Production patterns

1. Preserve useful intermediate evidence for hard tasks

Why are visible evidence fields not the same as asking for hidden chain of thought?

2. Explicit intermediate fields for complex extraction

When should you add intermediate fields to a schema?

3. Handling schema evolution

Why should a schema version travel with the validated record?

4. Recover by failure class, without weakening the contract

Which structured-output failures are retryable?

5. Flatten nested structures

Why is a flat list with parent_id often better than recursive JSON?

Performance considerations

Where the latency comes from

Which constrained-generation costs affect TTFT versus TPOT?

The quality-compliance tradeoff

What is the practical schema-design tradeoff?

Batched constrained generation

When does batching or cache reuse pay off for structured generation?

Debugging common failures

"The model returned valid JSON, so the data must be correct"

What is the difference between format validation and semantic validation?

"The Markdown wrapper trap"

Why should a machine-consumed pipeline reject Markdown wrappers?

"A refusal isn't a parse error"

Why should refusals not fall back to looser generation?

"Grammar-guided decoding is always slow"

What benchmark should you run before rejecting constrained decoding?

"JSON mode and structured outputs are the same thing"

What should still happen after JSON mode?

"I should use structured outputs for every LLM call"

What is the simplest decision rule for using structured outputs?

Mastery check

Evaluation rubric

Follow-up questions

Your parser gets valid JSON, but the run ID doesn't exist in CI. Where do you fix the pipeline?

An assistant sometimes needs to fetch live run status and sometimes only return a typed classification. Which interface should you choose?

Your provider rejects a schema with root anyOf and open objects. What should you change?

You need to add a new optional field during rollout. How do you avoid breaking old consumers?

A recursive schema works in staging, but latency spikes in production. What should you flatten first?

A smaller model satisfies the schema but fills required enum fields with weak guesses. What should you change before loosening constraints?

One strict-schema request is truncated, and another is refused. Which one is retryable?

Common pitfalls

Practice: build an incident-digest parser

Requirements

Input example

Expected output shape

Solution sketch

Why does the incident-digest parser flag report-worker for review?

Where structured output leads

Mastery Check

Your provider rejects a schema with root `anyOf` and open objects. What should you change?