Prompt Optimization with DSPy

Move beyond manual prompt editing. Use DSPy to search prompt and few-shot candidates from data, then release only after held-out evaluation.

Model merging showed one way to adapt a model without full retraining: combine existing checkpoints, evaluate the candidate, and deploy only after it clears release gates. DSPy applies the same discipline to prompts. Instead of hand-editing instructions until they feel right, you define a prompt program, measure it against examples, compile it offline, and release only a tested artifact.

A developer assistant that answers CI and deploy-policy questions can start with hand-tuned instructions, but test logs, runbook wording, and model revisions keep changing. A prompt that worked last month may regress after any one of those changes.

DSPy [1] treats prompts as tunable program state. You define modules and a metric, provide examples of good incident-resolution behavior, and let a DSPy optimizer search for instructions and demonstrations that score well on development data. A separate holdout test and production monitoring still determine whether the compiled candidate is worth deploying.

[1] DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines. https://arxiv.org/abs/2310.03714

Figure: DSPy optimization path from stable task signature through metric-guided search to a compiled prompt candidate, contrasted with unmeasured manual edits. DSPy separates task interface, program flow, metric, and optimizer so prompt variants become scored candidates rather than untracked string edits.

The problem with manual prompting

Model coupling: A prompt tuned for one model family can degrade badly on another.
Fragility: Small wording changes can materially shift quality.
Weak iteration loop: Without a fixed dataset and metric, there is no systematic way to compare a revised instruction with the current baseline.
Pipeline complexity: When a pipeline chains multiple LM calls, each prompt change can affect downstream behavior.
Untracked deployment state: It is often unclear which prompt, model, dataset, and metric produced a given release.

DSPy's thesis

The compiler analogy is useful with a limit: a DSPy program defines a relatively stable interface and control flow, while compilation searches prompt instructions and demonstrations for one configured LM and metric. Unlike a software compiler, DSPy compilation is empirical optimization. It can overfit its data and doesn't prove correctness.

Instead of writing prompts, write programs. A program defines:

What you want the LLM to do (signatures)
How to compose LLM calls (modules)
What good looks like (metrics)

Then let an optimizer propose prompt instructions and few-shot examples for your chosen model and metric, and compare the compiled candidate with a baseline on held-out data.

Core concepts

Signatures: declarative Input/Output specs

A signature defines the task without specifying how to prompt. It abstracts away the string formatting and acts like a Python function signature for an LLM call: you declare the inputs and outputs, but you don't write the body.

This runnable code uses both inline signatures for quick experiments and class-based typed signatures for production. The examples stay with one deploy-policy assistant so the pieces fit together:

signatures-declarative-input-output-specs.py

import dspy

# Simple signatures (inline)
classify = dspy.Predict("ticket_text -> urgency: str")
qa = dspy.Predict("policy_context, ticket_text -> resolution: str")

# Typed signatures (recommended for production)
class GenerateResolution(dspy.Signature):
    """Generate an incident resolution using retrieved policy documents."""
    policy_context: str = dspy.InputField(desc="Retrieved policy passages")
    ticket_text: str = dspy.InputField(desc="The engineer's message")
    resolution: str = dspy.OutputField(desc="Concise resolution with policy citation")

print("inline inputs:", ", ".join(classify.signature.input_fields.keys()))
print("inline outputs:", ", ".join(classify.signature.output_fields.keys()))
print("typed inputs:", ", ".join(GenerateResolution.input_fields.keys()))
print("typed outputs:", ", ".join(GenerateResolution.output_fields.keys()))

Output

inline inputs: ticket_text
inline outputs: urgency
typed inputs: policy_context, ticket_text
typed outputs: resolution

The practical shift is that you stop writing the full prompt text by hand. DSPy derives the initial prompt scaffolding from the signature docstring and field descriptions, then optimizers can rewrite instructions and demos during compilation.

Modules: composable LLM building blocks

Modules compose signatures into larger components. They wrap signatures, hold state such as demonstrations, and define execution flow. This separates program structure from compiled prompt state. If you swap a language model or change a signature, keep the structure where appropriate, then recompile and evaluate rather than assuming the previous artifact transfers.

Build a policy pipeline that takes an engineer's ticket plus policy context and generates a resolution with a Chain-of-Thought signature. In production, a retrieval system can fill policy_context; for the core optimization example, keeping context explicit makes the program easier to test. The logic flows like standard Python code:

modules-composable-llm-building-blocks.py

import dspy

class GenerateResolution(dspy.Signature):
    """Generate an incident resolution using retrieved policy documents."""
    policy_context: str = dspy.InputField(desc="Retrieved policy passages")
    ticket_text: str = dspy.InputField(desc="The engineer's message")
    resolution: str = dspy.OutputField(desc="Concise resolution with policy citation")

class PolicyPipeline(dspy.Module):
    def __init__(self):
        super().__init__()
        self.generate_resolution = dspy.ChainOfThought(GenerateResolution)

    def forward(self, ticket_text: str, policy_context: str) -> dspy.Prediction:
        return self.generate_resolution(
            policy_context=policy_context,
            ticket_text=ticket_text,
        )

pipeline = PolicyPipeline()
predictors = dict(pipeline.named_predictors())
signature = predictors["generate_resolution.predict"].signature

print("predictors:", ", ".join(predictors.keys()))
print("module inputs:", ", ".join(signature.input_fields.keys()))
print("module outputs:", ", ".join(signature.output_fields.keys()))

Output

predictors: generate_resolution.predict
module inputs: policy_context, ticket_text
module outputs: reasoning, resolution

Notice that GenerateResolution only declares resolution. dspy.ChainOfThought(GenerateResolution) prepends a reasoning field automatically, so you keep the base signature focused on the task interface and let the module add a generated rationale. That rationale isn't the same thing as the DSPy execution trace introduced later: the trace records predictor calls, inputs, and outputs while the program runs.

Metrics: defining a score

Optimization requires a clear signal to know whether changes improve the pipeline. Metrics evaluate the system's output by returning a numerical score (often 0.0 to 1.0). Unlike traditional machine learning, where metrics usually compare numeric predictions or class labels against targets, LLM metrics often require semantic evaluation to judge text quality.

A metric function can be as simple as an exact string match or as complex as a separate LLM acting as an automated judge. Because an optimizer can run the metric repeatedly, it must strike a balance between accuracy and computational cost.

This example shows a metric for our policy pipeline. It combines a strict exact-match check, suitable only for tiny tests, with a simple keyword-based faithfulness check: the resolution must mention a policy and an owner-approval requirement that the provided context also contains. Both arguments are ordinary DSPy objects: an example from your dataset and a prediction from your program.

metrics-defining-good.py

import re
import dspy

def resolution_accuracy(example: dspy.Example, prediction: dspy.Prediction, trace=None) -> float:
    """Metric: does the predicted resolution match the gold resolution?"""
    if prediction.resolution.strip().lower() == example.resolution.strip().lower():
        return 1.0
    return 0.0

def policy_faithfulness(example: dspy.Example, prediction: dspy.Prediction, trace=None) -> float:
    """Metric: does the resolution cite a policy and reuse context facts?"""
    resolution = prediction.resolution.lower()
    context = example.policy_context.lower()
    cites_policy = "policy" in resolution
    mentions_approval = bool(re.search(r"owner approval|approval from the owner", resolution))
    context_supports_approval = "owner approval" in context
    return 1.0 if cites_policy and mentions_approval and context_supports_approval else 0.0

def policy_metric(example: dspy.Example, prediction: dspy.Prediction, trace=None) -> float:
    return 0.5 * resolution_accuracy(example, prediction) + 0.5 * policy_faithfulness(example, prediction)

example = dspy.Example(
    ticket_text="The payment tests failed after the deploy candidate.",
    policy_context="Deploy Policy: failing payment tests block release until rollback owner approval.",
    resolution="Per the deploy policy, block the release until rollback owner approval.",
).with_inputs("ticket_text", "policy_context")

good = dspy.Prediction(resolution="Per the deploy policy, block the release until rollback owner approval.")
bad = dspy.Prediction(resolution="Retry the deploy and watch the dashboard.")

print(f"good score: {policy_metric(example, good):.1f}")
print(f"bad score: {policy_metric(example, bad):.1f}")

Output

good score: 1.0
bad score: 0.0

For open-ended incident resolutions, you may add an LLM-as-a-judge metric when deterministic checks are too narrow. Judge metrics can vary across runs and can prefer verbose answers or be influenced by output ordering [2]. Lower-temperature decoding may reduce sampling variance but doesn't remove systematic bias. Before optimizing against a judge, calibrate it on held-out human-labeled examples, document an acceptance threshold appropriate to the task, and keep deterministic policy checks for critical requirements.

[2] Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. https://arxiv.org/abs/2306.05685
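One way to make the calibration step concrete is to measure how often a judge agrees with human labels on held-out examples before using it as an optimization target. The sketch below is plain Python with no LM call. The records, the `judge_agreement` and `false_accepts` helpers, and the 0.9 threshold are hypothetical placeholders, not DSPy APIs or recommended values.

```python
# Hypothetical held-out records: a human pass/fail label and the judge's verdict
# for the same resolution. In practice the judge verdicts would come from an LM call.
calibration_set = [
    {"id": "ci-301", "human": 1, "judge": 1},
    {"id": "ci-302", "human": 0, "judge": 0},
    {"id": "rollback-301", "human": 1, "judge": 1},
    {"id": "rollback-302", "human": 0, "judge": 1},  # judge accepted a bad answer
    {"id": "access-301", "human": 1, "judge": 1},
]

def judge_agreement(records: list[dict]) -> float:
    """Fraction of examples where the judge verdict equals the human label."""
    matches = sum(1 for record in records if record["human"] == record["judge"])
    return matches / len(records)

def false_accepts(records: list[dict]) -> list[str]:
    """IDs the judge passed although the human label says fail."""
    return [r["id"] for r in records if r["judge"] == 1 and r["human"] == 0]

ACCEPTANCE_THRESHOLD = 0.9  # assumed value; choose and document it per task

agreement = judge_agreement(calibration_set)
print("judge_agreement:", agreement)
print("false_accepts:", false_accepts(calibration_set))
print("judge_usable_for_optimization:", agreement >= ACCEPTANCE_THRESHOLD)
```

Listing false accepts separately matters for policy tasks, because a judge that passes a wrong resolution is usually more costly than one that rejects a correct answer.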

Comparison: Traditional Prompting vs. DSPy

Feature	Traditional Prompting	DSPy Optimization
Effort	Manual string tweaking	Programmatic logic
Portability	Hard to move between models	Recompile for each target model
Evaluation	Often ad hoc unless instrumented	Explicit metric and dataset split
Few-shotting	Hand-picked examples	Optimizer-selected or bootstrapped candidates
Versioning	Ad-hoc prompt changes	Compiled, versioned artifacts
Iteration	Manual comparisons	Search plus held-out verification

Building a pipeline from scratch

Moving from manual prompting to programmatic optimization is best understood through one concrete pipeline. The deploy-policy resolver above can be written from signature to compiled artifact.

Define what you want

Start by declaring what the model should do, not how. Our GenerateResolution signature says: "Give me a ticket and some policy context, and I'll produce a resolution." If you want explicit reasoning, choose a module like dspy.ChainOfThought and let it extend the signature for you. Don't write any prompt text yet.

Wire the modules

Combine DSPy modules to create your pipeline. Chain together ChainOfThought, Predict, retrieval modules, or ordinary Python control flow, much as you would stack layers in PyTorch. The structure reflects your logic, not your prompt strings. Our PolicyPipeline receives policy passages from the dataset or retrieval layer, then generates a resolution.

Write the metric

Write a function that scores your pipeline's output quality. It should return a number between 0 and 1. Good metrics measure the behavior you care about (accuracy, faithfulness, style alignment), not string similarity alone. For our deploy-policy pipeline, we might combine resolution_accuracy and policy_faithfulness into a single composite score.

Gather examples

Start with representative examples that cover important developer-assistant slices: CI failures, rollback approvals, break-glass access, and adversarial requests. Sample count is an experimental question; expand or rebalance the set when slice metrics show instability or missing coverage.
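A quick coverage count can show which slices are thin or missing before any compile budget is spent. This is a minimal sketch that assumes each example ID starts with its slice name. The IDs, the slice names, and the `slice_counts` helper are hypothetical.

```python
from collections import Counter

# Hypothetical example IDs whose prefix encodes the slice.
example_ids = ["ci-001", "ci-002", "rollback-001", "access-001", "ci-003"]
required_slices = {"ci", "rollback", "access", "adversarial"}

def slice_counts(ids: list[str]) -> Counter:
    """Count examples per slice, taking the text before the first hyphen as the slice."""
    return Counter(example_id.split("-", 1)[0] for example_id in ids)

counts = slice_counts(example_ids)
missing = sorted(required_slices - set(counts))

print("slice_counts:", dict(counts))
print("missing_slices:", missing)
```

Here the adversarial slice has no examples at all, so any score reported on this set says nothing about adversarial requests.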

Keep separate train, validation, and test splits. Use training examples for bootstrapping, validation examples for selecting a compiled candidate, and test examples only for the promotion decision. MIPROv2 can construct a validation split when you omit one, but an explicit split makes leakage and optimization budget easier to audit.

protect-train-validation-test-splits.py

train_ids = {"ci-001", "rollback-001", "access-001"}
validation_ids = {"ci-101", "rollback-101"}
test_ids = {"ci-201", "rollback-201"}

def assert_disjoint_splits(*splits: set[str]) -> None:
    for index, left in enumerate(splits):
        for right in splits[index + 1:]:
            overlap = left & right
            if overlap:
                raise ValueError(f"leaked examples: {sorted(overlap)}")

assert_disjoint_splits(train_ids, validation_ids, test_ids)

try:
    assert_disjoint_splits(train_ids, validation_ids, {"rollback-001"})
except ValueError as exc:
    print("leakage_rejected:", "rollback-001" in str(exc))

print("clean_split_sizes:", [len(train_ids), len(validation_ids), len(test_ids)])

Output

leakage_rejected: True
clean_split_sizes: [3, 2, 2]

Compile

Run an optimizer such as BootstrapFewShot or MIPROv2 on your development machine or training instance. It searches instruction and demonstration candidates, then returns the best candidate it observed under the selection metric. That candidate may tie or lose to the uncompiled baseline on a held-out test set, so measure both before deployment. Compilation consumes LM calls; rerun it when the model, metric, adapter, or task distribution changes.
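One hedged way to decide when a rerun is due is to fingerprint the inputs that went into the last compile and compare them with the current configuration. The sketch below uses only the standard library. The field names and values are hypothetical, and a hash like this only detects declared changes such as a new model revision or metric version. Drift in the task distribution still has to be caught by monitoring.

```python
import hashlib
import json

def compile_fingerprint(config: dict) -> str:
    """Hash the inputs that should trigger a recompile when any of them changes."""
    canonical = json.dumps(config, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]

# All values are hypothetical placeholders.
released = {
    "model": "hosted-model@2026-04-01",
    "metric": "policy_metric-v2",
    "adapter": "chat",
    "trainset": "tickets-2026-03",
}
current = {**released, "model": "hosted-model@2026-05-01"}

needs_recompile = compile_fingerprint(current) != compile_fingerprint(released)
print("needs_recompile:", needs_recompile)
```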

Deploy

Save either the compiled program state to JSON or the full program as a serialized artifact. In production, load the right form and run inference. There's no optimizer loop at runtime; serving uses the optimized prompts and demos.

How the optimizer works

The optimizer is the engine that searches alternative program state. Older DSPy papers and internals often call this a teleprompter, but current docs use the term optimizer. It takes your program, data, and metric, and outputs a compiled candidate with selected parameters such as instructions and demonstrations.

Prompt optimization behaves like hyperparameter search over prompt programs. Suppose you have 20 deploy-policy tickets in your validation set. The optimizer might try:

Version A: zero-shot with a generic instruction
Version B: three few-shot examples from successful traces
Version C: a rewritten instruction plus five optimized demos

It scores candidates on validation tickets and keeps the version with the highest observed validation score. A final test comparison protects against selecting a prompt that happened to fit validation examples unusually well.

Formally, we want to find the parameter set $\phi$ that maximizes the expected metric $M$ over a dataset $D$:

$$\phi^* = \arg\max_{\phi} \mathbb{E}_{x \in D} [M(x, \text{Program}_\phi(x))]$$

Where $\phi$ represents program parameters (instructions and demonstrations), $D$ is typically a held-out validation set, and $M$ is the metric we optimize (for example, accuracy or F1 score, which balances precision and recall).

This is a discrete optimization problem, not gradient descent. The parameters $\phi$ include natural-language instruction strings and sets of few-shot examples, neither of which exposes a gradient through a hosted LM call. DSPy optimizers use methods such as random search, surrogate-guided search, or reflective evolution depending on the optimizer selected. Cost depends on candidate count, evaluation set, metric, and configured LMs.

Reading the formula

Try different instruction and demonstration configurations $\phi$, run the resulting program on examples $x$ from the evaluation set $D$, score each output with $M$, and pick $\phi^*$, the configuration with the highest average score. In practice, the expectation $\mathbb{E}$ is estimated with the sample mean over the evaluation set.
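Written out, the sample-mean estimate replaces the expectation with an average over the evaluation examples. This restates the paragraph above with the same symbols. The count $n$, the indexed examples $x_i$, and the hat on $\hat{\phi}$ (marking an estimate of $\phi^*$) are notation introduced here for illustration.

```latex
\hat{\phi} = \arg\max_{\phi} \frac{1}{n} \sum_{i=1}^{n} M\bigl(x_i, \text{Program}_\phi(x_i)\bigr),
\qquad x_1, \dots, x_n \in D
```

For instance, a candidate with two validation scores of 1.0 and 0.9 has an estimated value of $(1.0 + 0.9) / 2 = 0.95$. With so few examples, that estimate can move a lot when a single example changes.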

select-on-validation-promote-on-test.py

candidates = {
    "baseline": {"validation": [0.7, 0.7], "test": [0.8, 0.8]},
    "compiled_a": {"validation": [1.0, 0.9], "test": [0.6, 0.7]},
}

def mean(values: list[float]) -> float:
    return sum(values) / len(values)

selected = max(candidates, key=lambda name: mean(candidates[name]["validation"]))
baseline_test = mean(candidates["baseline"]["test"])
selected_test = mean(candidates[selected]["test"])

print("validation_winner:", selected)
print("selected_test_score:", round(selected_test, 2))
print("promote_over_baseline:", selected_test >= baseline_test)

Output

validation_winner: compiled_a
selected_test_score: 0.65
promote_over_baseline: False

The optimizer search loop runs from program and examples to a compiled artifact. Validation picks a candidate inside the optimizer; holdout and release gates still decide promotion.

Figure: DSPy optimizer sweep where program, examples, and metric feed a search over prompt candidates, a validation winner is selected, and a separate holdout gate decides whether that candidate can ship.

Mechanism: BootstrapFewShot

BootstrapFewShot runs a teacher program (often the same unoptimized DSPy program, using the same configured LM) and turns metric-accepted execution traces into demonstrations.

Bootstrapping: The optimizer runs the teacher program on the training set.
Filtering: It checks which outputs satisfy the metric.
Trace Generation: For successful outputs, it captures the trace (inputs, intermediate steps, outputs).
Selection: It keeps up to the configured number of metric-accepted traces as demonstrations in the compiled candidate.
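The four steps above can be sketched in plain Python. This is a simplified stand-in, not DSPy's implementation: the `bootstrap_demos` helper, the stub teacher, and the dictionary-based examples are hypothetical, and no LM call is made.

```python
from typing import Callable

def bootstrap_demos(
    teacher: Callable[[dict], dict],
    trainset: list[dict],
    metric: Callable[[dict, dict], float],
    max_demos: int,
) -> list[dict]:
    """Run the teacher on each example and keep metric-accepted runs as demos."""
    demos = []
    for example in trainset:
        if len(demos) >= max_demos:            # selection: respect the demo budget
            break
        prediction = teacher(example)          # bootstrapping
        if metric(example, prediction) < 1.0:  # filtering
            continue
        # Trace capture: store the inputs and outputs of the successful run.
        demos.append({"inputs": example["ticket_text"], "outputs": prediction})
    return demos

# Stub teacher standing in for an LM-backed program.
def stub_teacher(example: dict) -> dict:
    text = example["ticket_text"].lower()
    return {"urgency": "high" if "failed" in text else "low"}

def exact_urgency(example: dict, prediction: dict) -> float:
    return float(example["urgency"] == prediction["urgency"])

trainset = [
    {"ticket_text": "Payment tests failed after deploy.", "urgency": "high"},
    {"ticket_text": "Docs typo in runbook.", "urgency": "low"},
    {"ticket_text": "Rollback approval is stuck.", "urgency": "high"},  # stub gets this wrong
]

demos = bootstrap_demos(stub_teacher, trainset, exact_urgency, max_demos=4)
print("accepted_demos:", len(demos))
```

The third ticket never becomes a demonstration because the teacher's output fails the metric. That shows how a weak teacher or an overly strict metric can leave the optimizer with few demos to work with.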

If successful traces exist, BootstrapFewShot can place accepted traces into a candidate program's demonstrations. Compare the compiled result with the uncompiled program; bootstrapping isn't itself a quality guarantee. This runnable snippet prepares the optimizer without making an LM request. The commented compile call is the boundary that requires a configured LM:

configure-bootstrapfewshot-without-lm-calls.py

import dspy

def resolution_accuracy(example, prediction, trace=None):
    return float(example.resolution == prediction.resolution)

trainset = [
    dspy.Example(
        ticket_text="The payment tests failed after the deploy candidate.",
        resolution="Per the deploy policy, block the release until rollback owner approval.",
        policy_context="Deploy Policy: failing payment tests block release until rollback owner approval.",
    ).with_inputs("ticket_text", "policy_context"),
]

optimizer = dspy.BootstrapFewShot(
    metric=resolution_accuracy,
    max_bootstrapped_demos=4,
    max_labeled_demos=16,
)

# After dspy.configure(lm=...), compilation makes LM calls:
# compiled = optimizer.compile(PolicyPipeline(), trainset=trainset)
print("train_examples:", len(trainset))
print("max_bootstrapped_demos:", optimizer.max_bootstrapped_demos)
print("max_labeled_demos:", optimizer.max_labeled_demos)

Output

train_examples: 1
max_bootstrapped_demos: 4
max_labeled_demos: 16

Mechanism: MIPROv2

MIPROv2 is DSPy's documented optimizer for jointly searching instruction text and few-shot examples. It uses grounded instruction proposal plus surrogate-model-guided search over candidate prompt configurations [3].

[3] Optimizing Instructions and Demonstrations for Multi-Stage Language Model Programs. https://arxiv.org/abs/2406.11695

Proposal: An LLM generates multiple candidate instructions for the task based on the data and program structure.
Search: The optimizer explores the space of (Instruction, Few-Shot Set) pairs.
Evaluation: It evaluates candidates on a validation set, optionally with minibatches, to select a candidate for independent testing.
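To make the shape of that search concrete, the sketch below scores (instruction, demo set) pairs on minibatches and then re-scores the leaders on the full validation set. It is a deliberately simplified illustration under stated assumptions: plain random sampling replaces the surrogate-guided search, a deterministic stub replaces LM calls and instruction proposals, and every name in it is hypothetical rather than part of DSPy.

```python
import itertools
import random

instructions = ["Answer briefly.", "Cite the policy that applies.", "Explain, then cite the policy."]
demo_sets = [(), ("demo-a",), ("demo-a", "demo-b")]

def stub_score(instruction: str, demos: tuple, example_id: int) -> float:
    """Deterministic stand-in for running the program and metric on one example."""
    score = 0.4
    if "policy" in instruction:
        score += 0.3
    score += 0.1 * len(demos)
    if example_id % 5 == 0:  # pretend every fifth example is hard
        score -= 0.2
    return round(score, 2)

def evaluate(candidate: tuple, example_ids: list[int]) -> float:
    """Mean stub score of one (instruction, demos) pair over the given examples."""
    instruction, demos = candidate
    return sum(stub_score(instruction, demos, i) for i in example_ids) / len(example_ids)

rng = random.Random(0)
validation_ids = list(range(20))
search_space = list(itertools.product(instructions, demo_sets))

# Trial phase: score a random subset of candidates, each on its own random minibatch.
trials = rng.sample(search_space, k=5)
minibatch_scores = {c: evaluate(c, rng.sample(validation_ids, k=5)) for c in trials}

# Final phase: re-score the top minibatch candidates on the full validation set.
top = sorted(minibatch_scores, key=minibatch_scores.get, reverse=True)[:2]
full_scores = {c: evaluate(c, validation_ids) for c in top}
best = max(full_scores, key=full_scores.get)
print("selected_for_testing:", best, round(full_scores[best], 3))
```

Minibatch scores are noisy, which is why the sketch re-scores the leaders on the full validation set before choosing a candidate for independent testing.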

This runnable example exercises the manual-control guard without performing optimization. In MIPROv2 manual mode, if you provide num_candidates yourself, set auto=None and pass num_trials to compile(...). MIPROv2 also requires configured prompt and task LMs before a real compile run:

configure-miprov2-manual-budget.py

import dspy

def metric(example, prediction, trace=None):
    return 1.0

# Construction does not send a request; compile would use these configured LMs.
lm = dspy.LM("openai/not-called-in-this-example", api_key="unused")

optimizer = dspy.MIPROv2(
    metric=metric,
    prompt_model=lm,
    task_model=lm,
    auto=None,
    num_candidates=4,
)
student = dspy.Predict("ticket_text -> resolution")
trainset = [
    dspy.Example(
        ticket_text="My rollback approval is stuck.",
        resolution="Check rollback approval status.",
    ).with_inputs("ticket_text"),
]

try:
    optimizer.compile(student, trainset=trainset)
except ValueError as exc:
    print("missing_num_trials_rejected:", "num_trials" in str(exc))
print("manual_candidate_count:", optimizer.num_candidates)

Output

missing_num_trials_rejected: True
manual_candidate_count: 4

MIPROv2 uses optuna during its compile search path. On machines that run this optimizer, install dspy[optuna].

If you leave valset=None, the current implementation constructs one from the provided trainset. That convenience path can change with DSPy releases and can't replace a declared test set. Pass explicit train, validation, and test splits for comparisons you intend to deploy.

The execution trace: why the graph matters

In traditional prompt engineering, the evaluation function often only sees the final string output of the language model. If a multi-step prompt outputs a correct answer, the system may assume success even if an intermediate query rewrite or reasoning step was poor.

DSPy addresses this gap with execution traces collected during optimization. Trace-aware optimizers can use successful predictor calls for demonstration bootstrapping, while trace-aware metrics can provide intermediate feedback. For user-facing prompt or response inspection, use LM history rather than depending on private trace object shapes.

Trace collection supports program-level debugging and bootstrapping:

Granular Attribution: If a multi-stage LM pipeline fails, the trace identifies which predictor produced the bad intermediate output.
Intermediate Supervision: You can write metrics that inspect intermediate predictions, such as a generated policy search query or reasoning field, rather than only grading the final answer.
Trace-based Few-Shot Generation: When BootstrapFewShot finds a successful execution path, it can reuse that successful trace as few-shot supervision for future runs.

That makes LM pipelines more inspectable during optimization, while metrics and held-out tests remain responsible for quality decisions.
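Intermediate supervision and attribution can be illustrated without DSPy at all. The sketch below assumes a hypothetical `Step` record per predictor call. It is not DSPy's internal trace format, which the text above advises against depending on.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Step:
    """Hypothetical trace record; not DSPy's internal trace format."""
    predictor: str
    inputs: dict
    outputs: dict

def intermediate_metric(gold_resolution: str, final_resolution: str, steps: list[Step]) -> float:
    """Score the final answer, but require every generated search query to be non-empty."""
    queries = [s.outputs.get("search_query", "") for s in steps if s.predictor == "generate_query"]
    if not queries or any(not q.strip() for q in queries):
        return 0.0
    return float(final_resolution.strip().lower() == gold_resolution.strip().lower())

def first_failing_predictor(steps: list[Step]) -> Optional[str]:
    """Granular attribution: name the first step whose output fields are all empty."""
    for step in steps:
        if not any(str(value).strip() for value in step.outputs.values()):
            return step.predictor
    return None

steps = [
    Step("generate_query", {"ticket_text": "Payment tests failed."}, {"search_query": ""}),
    Step("generate_resolution", {"policy_context": ""}, {"resolution": "Block the release."}),
]

score = intermediate_metric("Block the release.", "Block the release.", steps)
print("score:", score)
print("first_failing_predictor:", first_failing_predictor(steps))
```

The final answer matches the gold resolution, yet the score is 0.0 because the query step produced nothing. A metric that only graded the final answer would have accepted this run.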

Choosing an optimizer

Choosing the right optimizer depends on your data availability, latency constraints, and the acceptable cost of compilation. When beginning a new project, you might not have enough examples to run complex optimizers. Starting with LabeledFewShot gives you a baseline from provided labeled examples.

As you collect representative examples, you can compare BootstrapFewShot against your baseline. This optimizer uses your LM and metric to accept useful traces. When joint search over instructions and demos is worth additional compile budget, evaluate MIPROv2.

Optimizer	Best For	Mechanism	Typical Setup	Cost
LabeledFewShot	Baseline with labeled demos	Samples or inserts provided examples	Small labeled training set	Low
BootstrapFewShot	Trace-based demo candidate	Bootstraps metric-accepted traces into demos	Representative training examples	Medium
BootstrapFewShotWithRandomSearch / BootstrapRS	Better demo search	Scores many candidate demo sets	Explicit validation split helps	High
MIPROv2	Joint instruction and demo search	Grounded proposals plus surrogate-guided search	Separate validation set plus more metric budget	Very High
GEPA	Reflection-based instruction search	Reflective rewrites from traces, Pareto candidate selection	Reflection LM plus feedback-aware metric	Very High

The optimization process itself calls the LLM repeatedly, which incurs API costs and takes time. Balance the compilation cost against the expected improvement in inference accuracy.
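A back-of-envelope estimate helps with that balance before launching a run. The sketch below assumes every candidate is scored on every evaluation example. Real optimizers may use minibatches and caching, or add proposal and judge calls, so treat the result as a rough figure. All numbers and helper names are hypothetical.

```python
def estimated_lm_calls(num_candidates: int, eval_examples: int, calls_per_example: int) -> int:
    """LM calls if every candidate is scored on every evaluation example."""
    return num_candidates * eval_examples * calls_per_example

def estimated_cost_usd(lm_calls: int, tokens_per_call: int, usd_per_million_tokens: float) -> float:
    """Token-based cost estimate using a single blended price per million tokens."""
    return lm_calls * tokens_per_call * usd_per_million_tokens / 1_000_000

# Hypothetical numbers for illustration only.
calls = estimated_lm_calls(num_candidates=10, eval_examples=50, calls_per_example=2)
cost = estimated_cost_usd(calls, tokens_per_call=1_500, usd_per_million_tokens=2.0)

print("estimated_lm_calls:", calls)
print("estimated_cost_usd:", round(cost, 2))
```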

Current DSPy docs also list GEPA, a reflective prompt-evolution optimizer, and BootstrapFinetune for adapting model weights rather than only prompt parameters. The examples stay focused on BootstrapFewShot, BootstrapRS, and MIPROv2 because they expose demo bootstrapping and instruction search directly.

GEPA uses a reflection LM and textual feedback from execution outcomes to propose instruction revisions, while its Pareto selection can retain candidates that perform differently across objectives [4]. The paper reports strong results against its evaluated reinforcement-learning baselines, but that result doesn't select an optimizer for a new application. GEPA needs a suitable reflection LM and a feedback-aware metric, so compare its measured gains and compile cost with simpler optimizers.

[4] GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning. https://arxiv.org/abs/2507.19457

Saving and loading programs

Current DSPy exposes two save/load patterns, and mixing them up is a common production mistake.

State-only save: program.save("./optimized_policy.json", save_program=False) writes program state such as signatures, demonstrations, and module settings to a JSON file. You must recreate compatible Python program structure before calling .load(...).
Whole-program save: program.save("./optimized_policy/", save_program=True) serializes architecture and state together into a directory. You later restore a trusted artifact with dspy.load("./optimized_policy/", allow_pickle=True).

State-only JSON is safer and easier to review. Whole-program loading is more convenient, but it uses cloudpickle to serialize the program, so treat the saved directory like executable code and only load trusted artifacts.

state-only-save-and-load.py

from pathlib import Path
from tempfile import TemporaryDirectory
import dspy

program = dspy.Predict("ticket_text -> resolution")

with TemporaryDirectory() as temp_dir:
    artifact = Path(temp_dir) / "policy_state.json"
    program.save(str(artifact), save_program=False)

    restored = dspy.Predict("ticket_text -> resolution")
    restored.load(str(artifact))

    print("state_file_created:", artifact.exists())
    print("restored_outputs:", ", ".join(restored.signature.output_fields.keys()))

Output

state_file_created: True
restored_outputs: resolution

For a trusted whole-program artifact, the corresponding API is compiled_policy.save("./optimized_policy_v3/", save_program=True) followed by dspy.load("./optimized_policy_v3/", allow_pickle=True). Because that path deserializes a cloudpickled program, only load an artifact you trust and version.

Production deployment

Compilation invokes LMs repeatedly and belongs outside the request path. Its cost depends on optimizer settings, models, metric calls, and dataset size. Separate compile time from serving time in production:

Compile Time: Run the optimizer offline on a dev box or training instance.
Save: Export either state JSON or a whole-program artifact.
Run Time: In production, load the artifact and execute the already-optimized program. No optimizer loop runs online. If you load a whole-program artifact, you must opt into pickle loading explicitly.

Offline compilation and low-latency production execution stay separate:

Figure: DSPy lifecycle showing offline compilation, independent holdout gating, versioned release artifacts, and online serving. Compile candidates offline, release only after holdout gates pass, and trigger a new comparison from measured drift or model change.

Figure: Boundary between a released DSPy artifact, read-only runtime loading, and measured recompilation triggers, with versioned serving context. A released DSPy artifact isn't just prompt text. Version artifact, target model, evaluation metric, dataset split, optimizer settings, and source program together so behavior can be explained and rolled back.
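One way to keep those pieces together is a small release manifest stored next to the artifact. This is a minimal sketch using only the standard library. The field names, the values, and the `build_manifest` helper are hypothetical, and the artifact bytes stand in for a saved state JSON file.

```python
import hashlib

def build_manifest(artifact_bytes: bytes, **context: str) -> dict:
    """Bundle an artifact hash with the context needed to explain or roll back a release."""
    return {"artifact_sha256": hashlib.sha256(artifact_bytes).hexdigest(), **context}

REQUIRED_FIELDS = {
    "artifact_sha256", "program_version", "target_model",
    "metric", "dataset_split", "optimizer_settings",
}

def missing_fields(manifest: dict) -> list[str]:
    """Required manifest fields that were not recorded."""
    return sorted(REQUIRED_FIELDS - set(manifest))

# Hypothetical values; optimizer_settings is left out on purpose to show the check.
manifest = build_manifest(
    b'{"demos": []}',
    program_version="policy-v3",
    target_model="hosted-model@2026-04-01",
    metric="policy_metric-v2",
    dataset_split="tickets-2026-03/test",
)

print("missing_fields:", missing_fields(manifest))
```

A promotion script could refuse to ship any artifact whose manifest reports missing fields, which keeps unexplained releases out of production.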

Use the state-only JSON path from the previous section for ordinary deployments. Reserve whole-program serialization for trusted internal artifacts where convenience is worth the cloudpickle loading risk.

block-artifact-promotion-on-critical-slices.py

baseline = {"ci_failure": 0.84, "rollback": 0.91}
compiled = {"ci_failure": 0.95, "rollback": 0.83}
minimum = {"ci_failure": 0.84, "rollback": 0.88}

failed = {
    task: score
    for task, score in compiled.items()
    if score < minimum[task]
}
improved_average = sum(compiled.values()) > sum(baseline.values())

print("improved_average:", improved_average)
print("failed_release_gates:", sorted(failed))
print("promote:", not failed)

Output

improved_average: True
failed_release_gates: ['rollback']
promote: False

Model swapping is a new experiment

A stable DSPy program interface lets you run a new compilation experiment when you swap underlying models without rewriting all orchestration logic. Because instructions and few-shot examples can be model-specific, recompilation is a candidate-generation step, not evidence that a smaller or cheaper LM meets the old release gates.

For example, you might prototype the deploy-policy pipeline on a stronger hosted model, then recompile the same DSPy program for a smaller open-weight model that your serving stack can run.

The optimizer can search new few-shot examples and instruction phrasing for the new model. That's prompt-state optimization, not weight distillation or a substitute for evaluating the target LM.

Cross-model calibration

When swapping models, don't assume an old compiled artifact transfers. Instruction phrasing, adapter behavior, demonstrations, and provider revisions can all change measured output.

Maintain a compilation matrix: for each program version and exact target-model revision, store and evaluate its artifact separately. A provider-side model update is a new comparison, even when the model alias is unchanged.

require-artifact-for-exact-model-revision.py

artifacts = {
    ("policy-v3", "hosted-model@2026-04-01"): "policy-v3_hosted-2026-04-01.json",
    ("policy-v3", "local-model@rev7"): "policy-v3_local-rev7.json",
}

def artifact_for(program_version: str, model_revision: str) -> str:
    key = (program_version, model_revision)
    if key not in artifacts:
        raise KeyError("compile and evaluate an artifact for this model revision")
    return artifacts[key]

print("known_artifact:", artifact_for("policy-v3", "local-model@rev7"))
try:
    artifact_for("policy-v3", "local-model@rev8")
except KeyError:
    print("uncompiled_revision_rejected:", True)

Output

known_artifact: policy-v3_local-rev7.json
uncompiled_revision_rejected: True

Multi-hop pipeline patterns

For complex deploy-policy tickets that span multiple policies, DSPy's modularity can represent multi-step retrieval. This sketch requires a configured search function and LM, so it's architecture pseudocode rather than an offline runnable example:

multi-hop-pipeline-patterns.py

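import dspy

class GenerateResolution(dspy.Signature):
    """Generate an incident resolution using retrieved policy documents."""
    policy_context: str = dspy.InputField(desc="Retrieved policy passages")
    ticket_text: str = dspy.InputField(desc="The engineer's message")
    resolution: str = dspy.OutputField(desc="Concise resolution with policy citation")

class MultiHopPolicy(dspy.Module):
    def __init__(self, search, num_hops=3):
        super().__init__()
        # `search` is a caller-supplied callable: search(query, k) -> list of passage strings.
        self.search = search
        # One query generator per hop, so each hop can carry its own prompt state.
        self.generate_query = [dspy.ChainOfThought("context, ticket_text -> search_query") for _ in range(num_hops)]
        self.generate_resolution = dspy.ChainOfThought(GenerateResolution)

    def forward(self, ticket_text: str) -> dspy.Prediction:
        context = []
        for hop_module in self.generate_query:
            # Each hop sees the passages gathered so far and proposes the next query.
            search_query = hop_module(context="\n".join(context), ticket_text=ticket_text).search_query
            policies = self.search(search_query, k=3)
            context.extend(policies)

        # Answer from the accumulated passages once all hops have run.
        return self.generate_resolution(policy_context="\n".join(context), ticket_text=ticket_text)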

Each query-generation module can expose its own prompt state to optimization. Whether optimizing multiple hops improves retrieval and final answers requires trace inspection and held-out evaluation; a larger search surface can also spend more compile budget.
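To follow the control flow without an LM or a retriever, the hop loop can be sketched with stand-ins. Everything named here is hypothetical: `CORPUS`, `fake_search`, and `fake_query` are illustrative stubs rather than DSPy APIs, and the duplicate check is an addition that the pseudocode above doesn't perform. The sketch only shows how context accumulates from hop to hop.

```python
# Hypothetical in-memory corpus standing in for a policy retriever.
CORPUS = {
    "payment tests": "Deploy Policy: failing payment tests block release until rollback.",
    "rollback": "Rollback Policy: a rollback needs owner approval.",
    "owner": "Access Policy: owners are listed in the service catalog.",
}

def fake_search(query: str, k: int = 3) -> list[str]:
    """Return up to k passages whose key appears in the query."""
    hits = [text for key, text in CORPUS.items() if key in query.lower()]
    return hits[:k]

def fake_query(context: list[str], ticket_text: str) -> str:
    """Stand-in for an LM query generator: chase a term not yet retrieved."""
    seen = "\n".join(context).lower()
    for key, text in CORPUS.items():
        if key in seen and text not in context:
            return key
    return ticket_text

def multi_hop(ticket_text: str, num_hops: int = 3) -> tuple[list[str], list[str]]:
    context: list[str] = []
    queries: list[str] = []
    for _ in range(num_hops):
        query = fake_query(context, ticket_text)
        queries.append(query)
        for passage in fake_search(query, k=3):
            if passage not in context:  # skip passages found on an earlier hop
                context.append(passage)
    return queries, context

ticket = "The payment tests failed after the deploy candidate."
queries, context = multi_hop(ticket)
print("queries:", queries)
print("passages_collected:", len(context))
```

In this toy run each hop follows a term surfaced by the previous passage, which is the behavior a real query-generation module would have to learn and a trace would let you inspect.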

When optimization stalls

Check these failure modes early because each can consume compile budget without producing a deployable artifact.

The "exact match" trap

Symptom: Your metric uses exact string comparison on long-form text, so almost no output ever scores 1.0. Bootstrapping finds few metric-accepted traces, and candidates look equally bad to the optimizer.
Cause: You're treating generation like classification. A resolution that says "Run rollback after owner approval" and one that says "Rollback approved: run it now" may mean the same thing in context, but exact match gives them a 0.
Fix: Use an LLM-as-judge metric, or at least a heuristic like "Does the resolution mention the relevant policy and required approval?" Exact match is fine for short labels (urgency levels), but it's the wrong tool for open-ended generation.
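As a minimal sketch of that heuristic, the metric below gives partial credit for each required element instead of demanding an identical string. The `required_terms` field and the `SimpleNamespace` stand-ins for `dspy.Example` and `dspy.Prediction` are assumptions made so the example runs offline. A real metric would receive DSPy objects and would probably need a judge or stronger matching than substring checks.

```python
from types import SimpleNamespace

def exact_match(example, prediction, trace=None) -> float:
    """Strict comparison: any rewording scores 0."""
    return float(prediction.resolution.strip().lower() == example.resolution.strip().lower())

def required_elements(example, prediction, trace=None) -> float:
    """Fraction of required terms (a hypothetical field) found in the resolution."""
    text = prediction.resolution.lower()
    terms = example.required_terms
    return sum(term in text for term in terms) / len(terms)

example = SimpleNamespace(
    resolution="Per the deploy policy, block the release until rollback owner approval.",
    required_terms=["deploy policy", "owner approval"],
)
paraphrase = SimpleNamespace(
    resolution="The deploy policy blocks this release; get rollback owner approval first."
)
partial = SimpleNamespace(resolution="Block the release per the deploy policy.")
off_policy = SimpleNamespace(resolution="Retry the deploy and watch the dashboard.")

for name, prediction in [("paraphrase", paraphrase), ("partial", partial), ("off_policy", off_policy)]:
    print(name, exact_match(example, prediction), required_elements(example, prediction))
```

Exact match scores all three predictions 0, so it can't separate the acceptable paraphrase from the off-policy answer. The graded heuristic does, which gives the optimizer a signal to climb.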

Data hunger

Symptom: You delay compiling because you think you need thousands of labeled tickets.
Cause: You're importing habits from traditional deep learning, where more data usually helps. The DSPy optimizers covered here search over prompt configurations, not model weights.
Fix: Start with a small, representative split, measure uncertainty and slice coverage, then add examples where evaluation shows gaps. Don't claim a required sample count before measuring the task.
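One way to measure uncertainty and slice coverage on a small split is a percentile bootstrap over per-example scores plus a count per slice. The scores, slice labels, and the threshold of three examples per slice below are made-up illustration values, and `bootstrap_interval` is a hypothetical helper rather than part of DSPy.

```python
import random
from collections import Counter

def bootstrap_interval(scores: list[float], resamples: int = 2000, seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap interval (roughly 95%) for the mean score."""
    rng = random.Random(seed)
    n = len(scores)
    # Resample with replacement and record the mean of each resample.
    means = sorted(sum(rng.choices(scores, k=n)) / n for _ in range(resamples))
    return means[int(0.025 * resamples)], means[int(0.975 * resamples) - 1]

# Hypothetical per-ticket metric scores and slice labels for ten validation tickets.
scores = [1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0]
slices = ["ci_failure"] * 6 + ["rollback"] * 3 + ["access"]

mean_score = sum(scores) / len(scores)
lower, upper = bootstrap_interval(scores)
thin_slices = sorted(name for name, count in Counter(slices).items() if count < 3)

print("mean:", mean_score)
print("interval:", (lower, upper))
print("slices_needing_examples:", thin_slices)
```

A wide interval or a thin slice is the cue to label more examples in that area before trusting a comparison between candidates.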

Not inspecting the compiled prompt

Symptom: The compiled pipeline works on your test set, but fails strangely in production and you can't explain why.
Cause: You didn't inspect the optimizer output. You don't know which few-shot examples it selected or how it rewrote the instructions.
Fix: After compilation, inspect prompt history with the documented dspy.inspect_history(n=...) helper or the configured LM's history method. If selected examples are edge cases or instructions are misleading, revise training coverage or the metric and rerun the comparison.

Hidden production mistakes

Some DSPy failures don't look like prompt failures at first:

Treating DSPy as a prompt management library instead of an optimizer. If you never define a metric and compile against examples, you're using the interface layer but skipping the learning loop.
Passing num_candidates or num_trials to MIPROv2 while leaving auto enabled. Manual mode requires auto=None, num_candidates on the optimizer, and num_trials during compile(...).
Installing only the base package on machines that run MIPROv2. The optimizer imports optuna during compilation, so compile hosts should install dspy[optuna].
Over-optimizing on a tiny validation set. If a candidate wins because it memorized five tickets, it won't survive real developer traffic.
Assuming a JSON save captures your Python program architecture. State-only JSON needs the same module class recreated before .load(...).
Loading whole-program artifacts from untrusted sources. dspy.load(...) restores a cloudpickled program, so treat it like executable code.
Ignoring compile cost. Heavy optimizers can call LMs many times, so budget compile runs separately from serving traffic.
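The untrusted-artifact item above can be enforced by pinning each released artifact to a content digest and refusing anything that doesn't match. This is a sketch, not a DSPy feature: `sha256_of`, `verify_artifact`, and the `trusted` mapping are hypothetical names, the file content is a placeholder, and the actual load call is left out so the example runs without DSPy. A whole-program directory would need every file hashed, which this single-file version doesn't do.

```python
import hashlib
from pathlib import Path
from tempfile import TemporaryDirectory

def sha256_of(path: Path) -> str:
    """Hex digest of the file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def verify_artifact(path: Path, trusted: dict[str, str]) -> Path:
    """Return the path only if its digest matches the recorded one."""
    expected = trusted.get(path.name)
    if expected is None or sha256_of(path) != expected:
        raise PermissionError(f"refusing to load unverified artifact: {path.name}")
    return path

checks = {}
with TemporaryDirectory() as temp_dir:
    artifact = Path(temp_dir) / "policy_state.json"
    artifact.write_text('{"placeholder": "released state"}')
    trusted = {artifact.name: sha256_of(artifact)}  # recorded at release time

    checks["verified"] = verify_artifact(artifact, trusted) == artifact

    # Simulate a modified file arriving under the same name.
    artifact.write_text('{"placeholder": "tampered state"}')
    try:
        verify_artifact(artifact, trusted)
        checks["tampered_rejected"] = False
    except PermissionError:
        checks["tampered_rejected"] = True

print(checks)
```

A digest check only proves the file is the one you released. It doesn't make a pickled program safe to load from a source you don't control.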

Quick recall prompts

Use these questions to check whether you can reason about DSPy beyond syntax.

How does MIPROv2 differ from simple few-shot selection? MIPROv2 jointly optimizes instruction text and demonstration sets. BootstrapFewShot mainly bootstraps demos from successful traces. MIPROv2 searches a larger space by proposing grounded instructions, pairing them with demo candidates, and using surrogate-guided search over measured candidate programs. [3]

Why is model coupling a serious issue in traditional prompt engineering? A prompt can become overfit to one model family. When you move to a new provider, checkpoint, tokenizer, adapter, or decoding setup, the old wording may degrade. DSPy doesn't remove that risk, but it makes the migration controlled: keep the program interface and metric, then recompile for the target model.

What does the trace parameter give a metric? It can expose intermediate predictor calls rather than only the final answer. That lets a metric grade a generated search query, a reasoning field, or a policy-selection step inside a multi-stage program. Exact trace internals can change across DSPy versions, so production metrics should depend on documented behavior and local tests, not private object shapes.
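A trace-aware metric might look like the sketch below. It assumes the trace is a list of `(predictor, inputs, outputs)` entries and that `trace` is `None` during plain evaluation. Verify both against the DSPy version you run before relying on them. The `must_mention` field, the 12-word query limit, and the dictionary stand-ins for predictor outputs are all hypothetical.

```python
from types import SimpleNamespace

def query_aware_metric(example, prediction, trace=None):
    """Score the final answer; when a trace is supplied, also check generated queries."""
    answer_ok = example.must_mention in prediction.resolution.lower()
    if trace is None:
        # Plain evaluation: only the final prediction is graded.
        return float(answer_ok)
    # Assumed trace shape: one (predictor, inputs, outputs) entry per predictor call.
    queries = [outputs["search_query"] for _, _, outputs in trace if "search_query" in outputs]
    queries_ok = bool(queries) and all(len(query.split()) <= 12 for query in queries)
    # Stricter pass/fail decision when deciding whether to keep a trace as a demo.
    return answer_ok and queries_ok

example = SimpleNamespace(must_mention="owner approval")
prediction = SimpleNamespace(resolution="Block the release until rollback owner approval.")

good_trace = [
    ("hop_1", {"ticket_text": "payment tests failed"}, {"search_query": "payment test failure deploy policy"}),
    ("final", {}, {"resolution": prediction.resolution}),
]
bad_trace = [
    ("hop_1", {}, {"search_query": "please find every policy that might possibly apply to this failing payment test deploy"}),
]

print("evaluation score:", query_aware_metric(example, prediction))
print("accept good trace:", query_aware_metric(example, prediction, trace=good_trace))
print("accept bad trace:", query_aware_metric(example, prediction, trace=bad_trace))
```

Returning a float for evaluation and a strict boolean when a trace is present keeps rambling intermediate queries out of the demonstrations even when the final answer happens to be right.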

How are state-only JSON artifacts different from whole-program saves? program.save("./artifact.json", save_program=False) stores state. You recreate the Python module class, then call .load(...). program.save("./artifact_dir/", save_program=True) stores the program with cloudpickle, then dspy.load("./artifact_dir/", allow_pickle=True) restores it. JSON is safer and easier to review. Whole-program loading is convenient but only belongs in trusted environments.

Before you continue

Confirm you can:

Explain why a DSPy Signature is an interface, while a prompt template mixes interface and implementation.
Describe how optimizers select or generate useful few-shot examples.
Write a custom metric that scores the behavior you care about, not string similarity alone.
Contrast manual prompt editing with programmatic optimization against train and validation examples.
Explain the production lifecycle: compile offline, version the artifact, serve without an optimizer loop, monitor drift, and recompile when behavior changes.
Choose state-only JSON or whole-program serialization and justify the security tradeoff.

Key takeaways

DSPy treats prompts as parameters to be optimized, not strings to be manually crafted.
Signatures define intent; Modules define structure; Optimizers fill in the details.
BootstrapFewShot proposes demonstration-augmented programs from metric-accepted traces; compare its candidate against the baseline.
Compile offline, serve without an optimizer loop: Optimization happens before deployment. Re-run it when your model, metric, or data distribution changes.
Program structure can transfer; compiled state might not: When you change models, recompile and compare against release gates instead of blindly reusing old prompt artifacts.

Next Step

Continue to Model Versioning & Deployment

DSPy turns a prompt program into a compiled, evaluated artifact. Model versioning packages that prompt state with its model, dataset, evaluation receipt, deployment policy, and rollback target so the tested candidate can move through release safely.


References

[1] Khattab, O., et al. (2023). DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines. arXiv preprint. https://arxiv.org/abs/2310.03714

[2] Zheng, L., et al. (2023). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. NeurIPS 2023.

[3] Opsahl-Ong, K., et al. (2024). Optimizing Instructions and Demonstrations for Multi-Stage Language Model Programs. https://arxiv.org/abs/2406.11695

[4] Agrawal, L. A., Tan, S., Soylu, D., et al. (2025). GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning.


