Prompt Optimization with DSPy

Move beyond manual prompt editing. Use DSPy to search prompt and few-shot candidates from data, then release only after held-out evaluation.


Model merging showed one way to adapt a model without full retraining: combine existing checkpoints, evaluate the candidate, and deploy only after it clears release gates. DSPy applies the same discipline to prompts. Instead of hand-editing instructions until they feel right, you define a prompt program, measure it against examples, compile it offline, and release only a tested artifact.

Programmatic prompt optimization treats prompts as tunable programs evaluated against examples. This chapter introduces DSPy modules, metrics, and optimizers so candidate prompt artifacts are selected through measured iteration instead of hand editing alone.

Imagine trying to write a support workflow for delayed shipments. You could list every step in extreme detail, but if the carrier API, policy wording, or model changes, the prompt might fail. This is what manual prompting feels like after the last few chapters: you've learned to craft chain-of-thought prompts and sprinkle in few-shot examples, but those techniques still leave you hand-tuning strings for every new model or data shift.

DSPy (Declarative Self-improving Python)[1] changes the workflow. Instead of hand-tuning every instruction, you provide examples of good order-resolution behavior and let an optimizer search for prompts and demonstrations that score well on a declared metric. That optimization can find a better candidate on development data; a separate holdout test and production monitoring still determine whether it is worth deploying.

Figure: DSPy optimization path from stable task signature through metric-guided search to a compiled prompt candidate, contrasted with unmeasured manual edits.
DSPy separates task interface, program flow, metric, and optimizer so prompt variants become scored candidates rather than untracked string edits.

The problem with manual prompting

Why manual prompts fail at scale

  1. Model coupling: A prompt tuned for one model family can degrade badly on another.
  2. Fragility: Small wording changes can materially shift quality.
  3. Weak iteration loop: How do you systematically compare a revised instruction with the current baseline?
  4. Pipeline complexity: When a pipeline chains multiple LM calls, each prompt change can affect downstream behavior.
  5. Untracked deployment state: Which prompt, model, dataset, and metric produced last Tuesday's release?

DSPy's thesis

The compiler analogy is useful with a limit: a DSPy program defines a relatively stable interface and control flow, while compilation searches prompt instructions and demonstrations for one configured LM and metric. Unlike a software compiler, DSPy compilation is empirical optimization. It can overfit its data and does not prove correctness.

Instead of writing prompts, write programs. A program defines:

  • What you want the LLM to do (signatures)
  • How to compose LLM calls (modules)
  • What good looks like (metrics)

Then let an optimizer propose prompt instructions and few-shot examples for your chosen model and metric, and compare the compiled candidate with a baseline on held-out data.

Core concepts

Signatures: declarative Input/Output specs

A signature defines the task without specifying how to prompt. It abstracts away the string formatting, acting as a declarative type signature for an LLM call. Think of it like a function signature in Python: you declare the inputs and outputs, but you don't write the body.

The following runnable code shows both inline signatures for quick experiments and class-based typed signatures for production. We'll use a customer support example throughout this article so you can see how the pieces fit together:

signatures-declarative-input-output-specs.py
```python
import dspy

# Simple signatures (inline)
classify = dspy.Predict("ticket_text -> urgency: str")
qa = dspy.Predict("policy_context, ticket_text -> resolution: str")

# Typed signatures (recommended for production)
class GenerateResolution(dspy.Signature):
    """Generate a support resolution using retrieved policy documents."""
    policy_context: str = dspy.InputField(desc="Retrieved policy passages")
    ticket_text: str = dspy.InputField(desc="The customer's message")
    resolution: str = dspy.OutputField(desc="Concise resolution with policy citation")

print("inline inputs:", ", ".join(classify.signature.input_fields.keys()))
print("inline outputs:", ", ".join(classify.signature.output_fields.keys()))
print("typed inputs:", ", ".join(GenerateResolution.input_fields.keys()))
print("typed outputs:", ", ".join(GenerateResolution.output_fields.keys()))
```

Output

```text
inline inputs: ticket_text
inline outputs: urgency
typed inputs: policy_context, ticket_text
typed outputs: resolution
```

The practical shift is that you stop writing the full prompt text by hand. DSPy derives the initial prompt scaffolding from the signature docstring and field descriptions, then optimizers can rewrite instructions and demos during compilation.

Modules: composable LLM building blocks

Modules compose signatures into larger components. They wrap signatures, hold state such as demonstrations, and define execution flow. This separates program structure from compiled prompt state. If you swap a language model or change a signature, keep the structure where appropriate, then recompile and evaluate rather than assuming the previous artifact transfers.

Here's the support pipeline we'll build. It takes a customer ticket plus policy context and generates a resolution with a Chain-of-Thought signature. In production, a retrieval system can fill policy_context; for the core optimization example, keeping context explicit makes the program easier to test. Notice how the logic flows like standard Python code:

modules-composable-llm-building-blocks.py
```python
import dspy

class GenerateResolution(dspy.Signature):
    """Generate a support resolution using retrieved policy documents."""
    policy_context: str = dspy.InputField(desc="Retrieved policy passages")
    ticket_text: str = dspy.InputField(desc="The customer's message")
    resolution: str = dspy.OutputField(desc="Concise resolution with policy citation")

class SupportPipeline(dspy.Module):
    def __init__(self):
        super().__init__()
        self.generate_resolution = dspy.ChainOfThought(GenerateResolution)

    def forward(self, ticket_text: str, policy_context: str) -> dspy.Prediction:
        return self.generate_resolution(
            policy_context=policy_context,
            ticket_text=ticket_text,
        )

pipeline = SupportPipeline()
predictors = dict(pipeline.named_predictors())
signature = predictors["generate_resolution.predict"].signature

print("predictors:", ", ".join(predictors.keys()))
print("module inputs:", ", ".join(signature.input_fields.keys()))
print("module outputs:", ", ".join(signature.output_fields.keys()))
```

Output

```text
predictors: generate_resolution.predict
module inputs: policy_context, ticket_text
module outputs: reasoning, resolution
```

Notice that GenerateResolution only declares resolution. dspy.ChainOfThought(GenerateResolution) prepends a reasoning field automatically, so you keep the base signature focused on the task interface and let the module add the reasoning trace.

Metrics: defining "Good"

Optimization requires a clear signal to know if changes improve the pipeline. Metrics evaluate the system's output by returning a numerical score (often 0.0 to 1.0). Unlike traditional machine learning, where a metric usually compares numeric predictions or class labels, LLM metrics often require semantic evaluation to judge text quality.

A metric function can be as simple as an exact string match or as complex as a separate LLM acting as an automated judge. Because the optimizer runs the metric hundreds of times, it must strike a balance between accuracy and computational cost.

The code below shows a metric for our support pipeline. It combines a strict exact-match shortcut for tiny tests with a simple policy-faithfulness check that verifies the answer cites information from the provided context. Both arguments are ordinary DSPy objects: an example from your dataset and a prediction from your program.

metrics-defining-good.py
```python
import re
import dspy

def resolution_accuracy(example: dspy.Example, prediction: dspy.Prediction, trace=None) -> float:
    """Metric: does the predicted resolution match the gold resolution?"""
    if prediction.resolution.strip().lower() == example.resolution.strip().lower():
        return 1.0
    return 0.0

def policy_faithfulness(example: dspy.Example, prediction: dspy.Prediction, trace=None) -> float:
    """Metric: does the resolution cite a policy and reuse context facts?"""
    resolution = prediction.resolution.lower()
    context = example.policy_context.lower()
    cites_policy = "policy" in resolution
    mentions_credit = bool(re.search(r"\$5|5 dollar", resolution))
    context_supports_credit = "$5 credit" in context or "5 dollar credit" in context
    return 1.0 if cites_policy and mentions_credit and context_supports_credit else 0.0

def support_metric(example: dspy.Example, prediction: dspy.Prediction, trace=None) -> float:
    return 0.5 * resolution_accuracy(example, prediction) + 0.5 * policy_faithfulness(example, prediction)

example = dspy.Example(
    ticket_text="My package is late.",
    policy_context="Late Delivery Policy: orders delayed >24h qualify for a $5 credit.",
    resolution="Per the late-delivery policy, you qualify for a $5 credit.",
).with_inputs("ticket_text", "policy_context")

good = dspy.Prediction(resolution="Per the late-delivery policy, you qualify for a $5 credit.")
bad = dspy.Prediction(resolution="Sorry, please keep waiting.")

print(f"good score: {support_metric(example, good):.1f}")
print(f"bad score: {support_metric(example, bad):.1f}")
```

Output

```text
good score: 1.0
bad score: 0.0
```

For open-ended support resolutions, you may add an LLM-as-a-judge metric when deterministic checks are too narrow. Judge metrics can vary across runs and can prefer verbose answers or be influenced by output ordering.[2] Lower-temperature decoding may reduce sampling variance but does not remove systematic bias. Before optimizing against a judge, calibrate it on held-out human-labeled examples, document an acceptance threshold appropriate to the task, and keep deterministic policy checks for critical requirements.
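
One way to make the calibration step concrete is to measure how often the judge agrees with human labels on a held-out set before trusting it as a metric. The sketch below is a minimal illustration in plain Python: the label lists, the `judge_agreement` helper, and the 0.85 threshold are hypothetical placeholders, not DSPy APIs or recommended values.

```python
# Hypothetical held-out labels: 1 = acceptable resolution, 0 = not acceptable.
human_labels = [1, 0, 1, 1, 0, 1, 0, 1]
judge_labels = [1, 0, 1, 0, 0, 1, 1, 1]

def judge_agreement(human: list[int], judge: list[int]) -> float:
    """Fraction of held-out examples where the judge matches the human label."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    matches = sum(h == j for h, j in zip(human, judge))
    return matches / len(human)

# Assumed threshold for illustration; choose one appropriate to the task.
ACCEPTANCE_THRESHOLD = 0.85

agreement = judge_agreement(human_labels, judge_labels)
judge_is_usable = agreement >= ACCEPTANCE_THRESHOLD

print("agreement:", agreement)
print("judge_is_usable:", judge_is_usable)
```

Raw agreement is the simplest possible check. If acceptable and unacceptable answers are heavily imbalanced, a chance-corrected statistic or per-slice agreement may be more informative.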

Comparison: Traditional Prompting vs. DSPy

| Feature | Traditional Prompting | DSPy Optimization |
| --- | --- | --- |
| Effort | Manual string tweaking | Programmatic logic |
| Portability | Hard to move between models | Recompile for each target model |
| Evaluation | Often ad hoc unless instrumented | Explicit metric and dataset split |
| Few-shotting | Hand-picked examples | Optimizer-selected or bootstrapped candidates |
| Versioning | Ad-hoc prompt changes | Compiled, versioned artifacts |
| Iteration | Manual comparisons | Search plus held-out verification |

Building a pipeline from scratch

Moving from manual prompting to programmatic optimization is best understood by walking through one concrete pipeline. Let's build the support ticket resolver we sketched above, from signature to compiled artifact.

Define what you want

Start by declaring what the model should do, not how. Our GenerateResolution signature says: "Give me a ticket and some policy context, and I'll produce a resolution." If you want explicit reasoning, choose a module like dspy.ChainOfThought and let it extend the signature for you. Don't write any prompt text yet.

Wire the modules

Combine DSPy modules to create your pipeline. Chain together ChainOfThought, Predict, retrieval modules, and ordinary Python control flow, much as you would stack layers in PyTorch. The structure reflects your logic, not your prompt strings. Our SupportPipeline receives policy passages from the dataset or retrieval layer, then generates a resolution.

Write the metric

Write a function that scores your pipeline's output quality. It should return a number between 0 and 1. Good metrics measure the behavior you care about (accuracy, faithfulness, style alignment), not only string similarity. For our support pipeline, we might combine resolution_accuracy and policy_faithfulness into a single composite score.

Gather examples

Start with representative examples that cover important support slices: late deliveries, refunds, damaged items, and adversarial requests. Sample count is an experimental question; expand or rebalance the set when slice metrics show instability or missing coverage.
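
A quick coverage count can show which slices are thin or missing before any optimizer runs. This is a minimal sketch in plain Python; the row layout and the `slice` tag are hypothetical conventions for illustration, not DSPy fields.

```python
from collections import Counter

# Hypothetical dataset rows; the "slice" tags are illustrative, not a DSPy field.
examples = [
    {"id": "late-001", "slice": "late_delivery"},
    {"id": "late-002", "slice": "late_delivery"},
    {"id": "refund-001", "slice": "refund"},
    {"id": "damage-001", "slice": "damaged_item"},
]
required_slices = {"late_delivery", "refund", "damaged_item", "adversarial"}

# Count examples per slice and list required slices with no coverage at all.
slice_counts = Counter(row["slice"] for row in examples)
missing_slices = sorted(required_slices - set(slice_counts))

print("slice_counts:", dict(slice_counts))
print("missing_slices:", missing_slices)
```

Here the adversarial slice has no examples, so any aggregate score would say nothing about that behavior.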

Keep separate train, validation, and test splits. Use training examples for bootstrapping, validation examples for selecting a compiled candidate, and test examples only for the promotion decision. MIPROv2 can construct a validation split when you omit one, but an explicit split makes leakage and optimization budget easier to audit.

protect-train-validation-test-splits.py
```python
train_ids = {"late-001", "refund-001", "damage-001"}
validation_ids = {"late-101", "refund-101"}
test_ids = {"late-201", "refund-201"}

def assert_disjoint_splits(*splits: set[str]) -> None:
    for index, left in enumerate(splits):
        for right in splits[index + 1:]:
            overlap = left & right
            if overlap:
                raise ValueError(f"leaked examples: {sorted(overlap)}")

assert_disjoint_splits(train_ids, validation_ids, test_ids)

try:
    assert_disjoint_splits(train_ids, validation_ids, {"refund-001"})
except ValueError as exc:
    print("leakage_rejected:", "refund-001" in str(exc))

print("clean_split_sizes:", [len(train_ids), len(validation_ids), len(test_ids)])
```

Output

```text
leakage_rejected: True
clean_split_sizes: [3, 2, 2]
```

Compile

Run an optimizer such as BootstrapFewShot or MIPROv2 on your development machine or training instance. It searches instruction and demonstration candidates, then returns the best candidate it observed under its selection metric. That candidate may tie or lose to the uncompiled baseline on a held-out test set, so measure both before deployment. Compilation consumes LM calls; rerun it when the model, metric, adapter, or task distribution changes.

Deploy

Save either the compiled program state to JSON or the full program as a serialized artifact. In production, load the artifact with the method that matches how it was saved, then run inference. There's no optimizer loop at runtime; serving uses the optimized prompts and demos.

How the optimizer works

The optimizer is the engine that searches alternative program state. Older DSPy papers and internals often call this a teleprompter, but current docs center the term optimizer. It takes your program, data, and metric, and outputs a compiled candidate with selected parameters such as instructions and demonstrations.

Think of it like a hyperparameter search for prompts. Suppose you have 20 support tickets in your validation set. The optimizer might try:

  • Version A: zero-shot with a generic instruction
  • Version B: three few-shot examples from successful traces
  • Version C: a rewritten instruction plus five optimized demos

It scores candidates on validation tickets and keeps the version with the highest observed validation score. A final test comparison protects against selecting a prompt that happened to fit validation examples unusually well.

Formally, we want to find the parameter set $\phi$ that maximizes the expected metric $M$ over a dataset $D$:

$$\phi^* = \arg\max_{\phi} \mathbb{E}_{x \in D} \left[ M(x, \text{Program}_\phi(x)) \right]$$

Where $\phi$ represents program parameters (instructions and demonstrations), $D$ is typically a held-out validation set, and $M$ is the metric we optimize (for example, accuracy or F1 score, which balances precision and recall).

This is a discrete optimization problem, not gradient descent. The parameters $\phi$ include natural-language instruction strings and sets of few-shot examples, neither of which expose a gradient through a hosted LM call. DSPy optimizers use methods such as random search, surrogate-guided search, or reflective evolution depending on the optimizer selected. Cost depends on candidate count, evaluation set, metric, and configured LMs.
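
To see why this is a discrete search, it can help to enumerate a tiny configuration space and sample from it. The sketch below is plain Python under stated assumptions: `toy_score` is a hypothetical stand-in for running the program on validation data and averaging the metric, and the instruction strings and demo IDs are invented for illustration. It shows simple random search only, not any specific DSPy optimizer.

```python
import itertools
import random

instructions = ["Resolve the ticket.", "Resolve the ticket and cite the policy."]
demo_pool = ["late-001", "refund-001", "damage-001"]

def toy_score(instruction: str, demos: tuple[str, ...]) -> float:
    """Stand-in for running the program on validation data and averaging the metric."""
    score = 0.5
    if "cite the policy" in instruction:
        score += 0.2
    score += 0.1 * len(demos)
    return round(score, 2)

# Enumerate the discrete space: each instruction paired with each demo subset of size 0-2.
search_space = [
    (instruction, demos)
    for instruction in instructions
    for size in range(3)
    for demos in itertools.combinations(demo_pool, size)
]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
trials = rng.sample(search_space, k=5)  # budget: only five configurations are scored
best = max(trials, key=lambda candidate: toy_score(*candidate))

print("search_space_size:", len(search_space))
print("best_candidate:", best, toy_score(*best))
```

Even this toy space has 14 configurations, and it grows combinatorially with more instructions and larger demo pools, which is why real optimizers budget their trials.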

Reading the formula

Try different instruction and demonstration configurations $\phi$, run the resulting program on examples $x$ from the evaluation set $D$, score each output with $M$, and pick $\phi^*$, the configuration with the highest average score. In practice, the expectation $\mathbb{E}$ is estimated with the sample mean over the evaluation set.

select-on-validation-promote-on-test.py
```python
candidates = {
    "baseline": {"validation": [0.7, 0.7], "test": [0.8, 0.8]},
    "compiled_a": {"validation": [1.0, 0.9], "test": [0.6, 0.7]},
}

def mean(values: list[float]) -> float:
    return sum(values) / len(values)

selected = max(candidates, key=lambda name: mean(candidates[name]["validation"]))
baseline_test = mean(candidates["baseline"]["test"])
selected_test = mean(candidates[selected]["test"])

print("validation_winner:", selected)
print("selected_test_score:", round(selected_test, 2))
print("promote_over_baseline:", selected_test >= baseline_test)
```

Output

```text
validation_winner: compiled_a
selected_test_score: 0.65
promote_over_baseline: False
```

The following architecture diagram shows the optimization path from program and examples to a compiled artifact. The search loop happens inside the optimizer; the important boundary is that candidates are scored on held-out examples before one is selected.

Figure: DSPy optimizer loop selecting a prompt candidate on validation data before an independent test and release-gate decision.
The optimizer selects a candidate on validation examples; an independent holdout test and release gates decide whether that artifact can serve traffic.

Mechanism: BootstrapFewShot

This optimizer works by "teaching" the student model using a teacher model (often the same model).

  1. Bootstrapping: The optimizer runs the unoptimized program on the training set.
  2. Filtering: It checks which outputs satisfy the metric.
  3. Trace Generation: For successful outputs, it captures the trace (inputs, intermediate steps, outputs).
  4. Selection: It keeps up to the configured number of metric-accepted traces as demonstrations in the compiled candidate.
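
The four steps above can be sketched in plain Python without any LM call. Everything here is a hypothetical stand-in: `teacher_program` is a rule-based substitute for an LM-backed program, and `bootstrap_demos` is an illustrative helper, not DSPy's implementation.

```python
def teacher_program(ticket_text: str) -> dict:
    """Hypothetical teacher: returns an answer plus the intermediate step behind it."""
    if "late" in ticket_text.lower():
        return {"reasoning": "Delay exceeds 24h.", "resolution": "Issue a $5 credit."}
    return {"reasoning": "No matching policy.", "resolution": "Escalate to an agent."}

def metric(gold: dict, predicted: dict) -> bool:
    return gold["resolution"] == predicted["resolution"]

trainset = [
    {"ticket_text": "My package is late.", "resolution": "Issue a $5 credit."},
    {"ticket_text": "My item arrived broken.", "resolution": "Send a replacement."},
    {"ticket_text": "Late again, second time.", "resolution": "Issue a $5 credit."},
]

def bootstrap_demos(trainset: list[dict], max_demos: int) -> list[dict]:
    demos = []
    for example in trainset:
        prediction = teacher_program(example["ticket_text"])  # 1. bootstrapping
        if not metric(example, prediction):  # 2. filtering
            continue
        # 3. trace capture: inputs, intermediate step, and output together
        demos.append({"ticket_text": example["ticket_text"], **prediction})
        if len(demos) >= max_demos:  # 4. selection cap
            break
    return demos

demos = bootstrap_demos(trainset, max_demos=4)
print("accepted_demos:", len(demos))
```

The broken-item ticket is rejected because the teacher's answer fails the metric, so only the two late-delivery traces become demonstrations. That also shows a limit of the approach: slices the teacher cannot solve contribute no demos.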

If successful traces exist, BootstrapFewShot can place accepted traces into a candidate program's demonstrations. Compare the compiled result with the uncompiled program; bootstrapping is not itself a quality guarantee. This runnable snippet prepares the optimizer without making an LM request. The commented compile call is the boundary that requires a configured LM:

configure-bootstrapfewshot-without-lm-calls.py
```python
import dspy

def resolution_accuracy(example, prediction, trace=None):
    return float(example.resolution == prediction.resolution)

trainset = [
    dspy.Example(
        ticket_text="My package was supposed to arrive yesterday but tracking hasn't updated.",
        resolution="Per the late-delivery policy, you qualify for a $5 credit.",
        policy_context="Late Delivery Policy: orders delayed >24h qualify for a $5 credit.",
    ).with_inputs("ticket_text", "policy_context"),
]

optimizer = dspy.BootstrapFewShot(
    metric=resolution_accuracy,
    max_bootstrapped_demos=4,
    max_labeled_demos=16,
)

# After dspy.configure(lm=...), compilation makes LM calls:
# compiled = optimizer.compile(SupportPipeline(), trainset=trainset)
print("train_examples:", len(trainset))
print("max_bootstrapped_demos:", optimizer.max_bootstrapped_demos)
print("max_labeled_demos:", optimizer.max_labeled_demos)
```

Output

```text
train_examples: 1
max_bootstrapped_demos: 4
max_labeled_demos: 16
```

Mechanism: MIPROv2

MIPROv2 is DSPy's documented optimizer for jointly searching instruction text and few-shot examples. It uses grounded instruction proposal plus surrogate-model-guided search over candidate prompt configurations.[3]

  1. Proposal: An LLM generates multiple candidate instructions for the task based on the data and program structure.
  2. Search: The optimizer explores the space of (Instruction, Few-Shot Set) pairs.
  3. Evaluation: It evaluates candidates on a validation set, optionally with minibatches, to select a candidate for independent testing.
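
The budget trade-off behind minibatch evaluation can be illustrated with a two-pass screen: score every candidate cheaply on a few examples, then fully evaluate only a shortlist. This plain-Python sketch is not MIPROv2's algorithm, which uses a surrogate model to decide which configurations to try; the scores, candidate names, and `MINIBATCH_SIZE` are invented for illustration.

```python
# Hypothetical per-example validation scores for three (instruction, demo set) candidates.
per_example_scores = {
    "cand_a": [1.0, 0.0, 1.0, 1.0, 0.0, 1.0],
    "cand_b": [1.0, 1.0, 0.0, 1.0, 1.0, 1.0],
    "cand_c": [0.0, 0.0, 1.0, 0.0, 1.0, 0.0],
}

def mean(values: list[float]) -> float:
    return sum(values) / len(values)

MINIBATCH_SIZE = 3  # assumed budget knob for illustration

# Cheap pass: score every candidate on the first few validation examples only.
minibatch_means = {
    name: mean(scores[:MINIBATCH_SIZE]) for name, scores in per_example_scores.items()
}

# Expensive pass: fully evaluate only the top two candidates from the cheap pass.
shortlist = sorted(minibatch_means, key=minibatch_means.get, reverse=True)[:2]
full_means = {name: mean(per_example_scores[name]) for name in shortlist}
selected = max(full_means, key=full_means.get)

print("shortlist:", shortlist)
print("selected_for_testing:", selected)
```

On the minibatch, cand_a and cand_b tie, and only the full pass separates them. Minibatch scores are noisy estimates, so they are better suited to pruning weak candidates than to picking a winner.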

The following runnable example exercises the manual-control guard without performing optimization. In DSPy 3.0.3, if you provide num_candidates yourself, set auto=None and pass num_trials to compile(...). MIPROv2 also requires configured prompt and task LMs before a real compile run:

configure-miprov2-manual-budget.py
```python
import dspy

def metric(example, prediction, trace=None):
    return 1.0

# Construction does not send a request; compile would use these configured LMs.
lm = dspy.LM("openai/not-called-in-this-example", api_key="unused")

optimizer = dspy.MIPROv2(
    metric=metric,
    prompt_model=lm,
    task_model=lm,
    auto=None,
    num_candidates=4,
)
student = dspy.Predict("ticket_text -> resolution")
trainset = [
    dspy.Example(
        ticket_text="My refund is late.",
        resolution="Check refund status.",
    ).with_inputs("ticket_text"),
]

try:
    optimizer.compile(student, trainset=trainset)
except ValueError as exc:
    print("missing_num_trials_rejected:", "num_trials" in str(exc))
print("manual_candidate_count:", optimizer.num_candidates)
```

Output

```text
missing_num_trials_rejected: True
manual_candidate_count: 4
```

MIPROv2 imports optuna during its compile search path in the currently tested DSPy version. On machines that run this optimizer, install dspy[optuna].

If you leave valset=None, the currently tested implementation constructs one from the provided trainset. That convenience path can change with DSPy releases and cannot replace a declared test set. Pass explicit train, validation, and test splits for comparisons you intend to deploy.
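
One way to keep explicit splits stable across reruns is to assign each example by a hash of its ID rather than by a random shuffle. The sketch below is a plain-Python illustration; the `assign_split` helper, the ID format, and the 60/20/20 bucket boundaries are assumptions chosen for the example.

```python
import hashlib

def assign_split(example_id: str) -> str:
    """Map an example ID to a split using a stable hash, so reruns never reshuffle."""
    bucket = int(hashlib.sha256(example_id.encode("utf-8")).hexdigest(), 16) % 10
    if bucket < 6:
        return "train"
    if bucket < 8:
        return "validation"
    return "test"

# Hypothetical ticket IDs for illustration.
example_ids = [f"ticket-{n:03d}" for n in range(50)]
splits: dict[str, list[str]] = {"train": [], "validation": [], "test": []}
for example_id in example_ids:
    splits[assign_split(example_id)].append(example_id)

print({name: len(ids) for name, ids in splits.items()})
```

Because assignment depends only on the ID, adding new examples later never moves an existing example from test into train. Split sizes are approximate for small datasets, so check them before relying on the proportions.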

The execution trace: why the graph matters

In traditional prompt engineering, the evaluation function often only sees the final string output of the language model. If a multi-step prompt outputs a correct answer, the system may assume success even if an intermediate query rewrite or reasoning step was poor.

DSPy addresses this blind spot through execution traces collected during optimization. Trace-aware optimizers can use successful predictor calls for demonstration bootstrapping, while trace-aware metrics can provide intermediate feedback. For user-facing prompt or response inspection, use LM history rather than depending on private trace object shapes.

Trace collection supports program-level debugging and bootstrapping:

  1. Granular Attribution: If a multi-stage LM pipeline fails, the trace helps you pinpoint which predictor produced the bad intermediate output.
  2. Intermediate Supervision: You can write metrics that inspect intermediate predictions, such as a generated policy search query or reasoning field, rather than only grading the final answer.
  3. Trace-based Few-Shot Generation: When BootstrapFewShot finds a successful execution path, it can reuse that successful trace as few-shot supervision for future runs.
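
Intermediate supervision can be sketched as a metric that grades an intermediate field separately from the final answer. The example below uses plain `SimpleNamespace` objects as stand-ins for DSPy examples and predictions; the `search_query` and `expected_policy` field names are hypothetical, and the metric deliberately ignores `trace` so it does not depend on trace object shapes.

```python
from types import SimpleNamespace

def staged_metric(example, prediction, trace=None) -> float:
    """Score the intermediate search query and the final resolution separately."""
    query_ok = example.expected_policy.lower() in prediction.search_query.lower()
    answer_ok = prediction.resolution.strip().lower() == example.resolution.strip().lower()
    return 0.5 * query_ok + 0.5 * answer_ok

example = SimpleNamespace(
    expected_policy="late delivery",
    resolution="You qualify for a $5 credit.",
)
# Right answer, but the intermediate query searched the wrong policy.
lucky = SimpleNamespace(search_query="refund policy", resolution="You qualify for a $5 credit.")
sound = SimpleNamespace(search_query="late delivery policy", resolution="You qualify for a $5 credit.")

print("lucky:", staged_metric(example, lucky))
print("sound:", staged_metric(example, sound))
```

A final-answer-only metric would score both predictions identically. The staged version gives the lucky run half credit, which keeps it from being favored as a demonstration.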

That makes LM pipelines more inspectable during optimization, while metrics and held-out tests remain responsible for quality decisions.

Choosing an optimizer

Choosing the right optimizer depends on your data availability, latency constraints, and the acceptable cost of compilation. When beginning a new project, you might not have enough examples to run complex optimizers. Starting with LabeledFewShot allows you to establish a baseline using manually curated examples.

As you collect representative examples, you can compare BootstrapFewShot against your baseline. This optimizer uses your LM and metric to accept useful traces. When joint search over instructions and demos is worth additional compile budget, evaluate MIPROv2.

| Optimizer | Best For | Mechanism | Typical Setup | Cost |
| --- | --- | --- | --- | --- |
| LabeledFewShot | Baseline with hand-picked demos | Just inserts provided examples | Small curated example set | Low |
| BootstrapFewShot | Trace-based demo candidate | Bootstraps metric-accepted traces into demos | Representative training examples | Medium |
| BootstrapFewShotWithRandomSearch / BootstrapRS | Better demo search | Scores many candidate demo sets | Explicit validation split helps | High |
| MIPROv2 | Joint instruction and demo search | Grounded proposals plus surrogate-guided search | Separate validation set plus more metric budget | Very High |
| GEPA | Reflection-based instruction search | Reflective rewrites from traces, Pareto candidate selection | Reflection LM plus feedback-aware metric | Very High |

Remember that the optimization process itself calls the LLM repeatedly, which incurs API costs and takes time. Balance the compilation cost against the expected improvement in inference accuracy.
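
A rough back-of-envelope estimate can make that trade-off visible before a compile run. The function and every number below are assumptions for illustration. Real optimizers also spend calls on instruction proposal, bootstrapping, and any LM-based metric, and minibatching can reduce the evaluation term.

```python
def estimate_compile_calls(num_candidates: int, eval_examples: int, lm_calls_per_example: int) -> int:
    """Upper-bound estimate: every candidate is scored on every evaluation example."""
    return num_candidates * eval_examples * lm_calls_per_example

# Assumed numbers for illustration only.
calls = estimate_compile_calls(num_candidates=10, eval_examples=50, lm_calls_per_example=2)
assumed_cost_per_call_usd = 0.002
estimated_cost = calls * assumed_cost_per_call_usd

print("estimated_lm_calls:", calls)
print(f"estimated_cost_usd: {estimated_cost:.2f}")
```

With these assumed inputs, the estimate is 1,000 LM calls. Doubling either the candidate count or the evaluation set doubles the bill, so those two knobs are usually the first to tune.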

Current DSPy docs also list GEPA, a reflective prompt-evolution optimizer, and BootstrapFinetune for adapting model weights rather than only prompt parameters. This article stays focused on BootstrapFewShot, BootstrapRS, and MIPROv2 because they expose demo bootstrapping and instruction search directly.

GEPA uses a reflection LM and textual feedback from execution outcomes to propose instruction revisions, while its Pareto selection can retain candidates that perform differently across objectives.[4] The paper reports strong results against its evaluated reinforcement-learning baselines, but that result does not select an optimizer for a new application. GEPA needs a suitable reflection LM and a feedback-aware metric, so compare its measured gains and compile cost with simpler optimizers.
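
The idea of retaining candidates that win on different objectives can be illustrated with a generic Pareto filter. This is a plain-Python sketch of Pareto dominance only, not GEPA's implementation; the candidate names, objectives, and scores are invented.

```python
# Hypothetical per-objective scores (higher is better) for four instruction candidates.
candidates = {
    "cand_a": {"accuracy": 0.90, "faithfulness": 0.60},
    "cand_b": {"accuracy": 0.70, "faithfulness": 0.85},
    "cand_c": {"accuracy": 0.65, "faithfulness": 0.55},
    "cand_d": {"accuracy": 0.80, "faithfulness": 0.80},
}

def dominates(left: dict[str, float], right: dict[str, float]) -> bool:
    """True if left is at least as good everywhere and strictly better somewhere."""
    at_least_as_good = all(left[key] >= right[key] for key in left)
    strictly_better = any(left[key] > right[key] for key in left)
    return at_least_as_good and strictly_better

# Keep every candidate that no other candidate dominates.
pareto_front = sorted(
    name
    for name, scores in candidates.items()
    if not any(dominates(other, scores) for other in candidates.values())
)
print("pareto_front:", pareto_front)
```

Only cand_c is dropped, because cand_a beats it on both objectives. The other three each win somewhere, so a single averaged score would hide trade-offs that the front keeps visible.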

Saving and loading programs

Current DSPy exposes two save/load patterns, and mixing them up is a common production mistake.

  1. State-only save: program.save("./optimized_support.json", save_program=False) writes program state such as signatures, demonstrations, and module settings to a JSON file. You must recreate compatible Python program structure before calling .load(...).
  2. Whole-program save: program.save("./optimized_support/", save_program=True) serializes architecture and state together into a directory. You later restore it with dspy.load("./optimized_support/").

State-only JSON is safer and easier to review. Whole-program loading is more convenient, but it uses cloudpickle to serialize the program, so treat the saved directory like executable code and only load trusted artifacts.

state-only-save-and-load.py
```python
from pathlib import Path
from tempfile import TemporaryDirectory
import dspy

program = dspy.Predict("ticket_text -> resolution")

with TemporaryDirectory() as temp_dir:
    artifact = Path(temp_dir) / "support_state.json"
    program.save(str(artifact), save_program=False)

    restored = dspy.Predict("ticket_text -> resolution")
    restored.load(str(artifact))

    print("state_file_created:", artifact.exists())
    print("restored_outputs:", ", ".join(restored.signature.output_fields.keys()))
```
Output
```text
state_file_created: True
restored_outputs: resolution
```

For a trusted whole-program artifact, the corresponding API is compiled_support.save("./optimized_support_v3/", save_program=True) followed by dspy.load("./optimized_support_v3/"). Because that path deserializes a cloudpickled program, only load an artifact you trust and version.
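One way to put "only load an artifact you trust" into practice is to compare a file's digest with one recorded at release time before any loader touches it. This is a minimal sketch using only the standard library; `sha256_of`, `is_trusted`, and the stand-in file are hypothetical and not part of DSPy.

```python
import hashlib
from pathlib import Path
from tempfile import TemporaryDirectory


def sha256_of(path: Path) -> str:
    """Hex digest of the file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def is_trusted(path: Path, trusted_digests: set[str]) -> bool:
    """Only allow artifacts whose digest was recorded at release time."""
    return sha256_of(path) in trusted_digests


with TemporaryDirectory() as temp_dir:
    artifact = Path(temp_dir) / "program.pkl"  # stand-in file, not a real DSPy artifact
    artifact.write_bytes(b"released-bytes")
    trusted = {sha256_of(artifact)}  # would be recorded by the release pipeline

    ok_before = is_trusted(artifact, trusted)
    artifact.write_bytes(b"tampered-bytes")  # simulate a modified artifact
    ok_after = is_trusted(artifact, trusted)

print("trusted_before_change:", ok_before)
print("trusted_after_change:", ok_after)
```

A digest match only shows the file is the one that was released. It does not make an unreviewed pickle safe, so the release step itself still needs review.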

Production deployment

Compilation invokes LMs repeatedly and belongs outside the request path. Its cost depends on optimizer settings, models, metric calls, and dataset size. Separate compile time from serving time in production:

  1. Compile Time: Run the optimizer offline on a dev box or training instance.
  2. Save: Export either state JSON or a whole-program artifact.
  3. Run Time: In production, load the artifact and execute the already-optimized program. No optimizer loop runs online. If you load a whole-program artifact, you must opt into pickle loading explicitly.

The workflow diagram below shows the separation between offline compilation and low-latency production execution:

[Figure: DSPy lifecycle showing offline compilation, independent holdout gating, versioned release artifacts, and online serving.]
Compile candidates offline, release only after holdout gates pass, and trigger a new comparison from measured drift or model change.
[Figure: Boundary between a released DSPy artifact, read-only runtime loading, and measured recompilation triggers, with versioned serving context.]
A released DSPy artifact is not just prompt text. Version artifact, target model, evaluation metric, dataset split, optimizer settings, and source program together so behavior can be explained and rolled back.
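A small release manifest is one way to keep those pieces together. The sketch below is an assumption about how you might record them: the `ReleaseManifest` class, its field names, and every value are hypothetical, not a DSPy format.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ReleaseManifest:
    """Everything needed to explain or roll back one released artifact."""
    artifact_path: str
    target_model: str
    metric_name: str
    dataset_split: str
    optimizer_settings: dict
    source_commit: str


# Hypothetical values for illustration only.
manifest = ReleaseManifest(
    artifact_path="support-v3_local-rev7.json",
    target_model="local-model@rev7",
    metric_name="resolution_quality_v2",
    dataset_split="holdout-2026-04",
    optimizer_settings={"optimizer": "MIPROv2", "auto": "light"},
    source_commit="abc1234",
)
print(json.dumps(asdict(manifest), indent=2, sort_keys=True))
```

Storing the manifest next to the artifact means a rollback restores the model revision and metric along with the prompt state.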

Use the state-only JSON path from the previous section for ordinary deployments. Reserve whole-program serialization for trusted internal artifacts where convenience is worth the cloudpickle loading risk.

block-artifact-promotion-on-critical-slices.py
```python
baseline = {"late_delivery": 0.84, "refund": 0.91}
compiled = {"late_delivery": 0.95, "refund": 0.83}
minimum = {"late_delivery": 0.84, "refund": 0.88}

failed = {
    task: score
    for task, score in compiled.items()
    if score < minimum[task]
}
improved_average = sum(compiled.values()) > sum(baseline.values())

print("improved_average:", improved_average)
print("failed_release_gates:", sorted(failed))
print("promote:", not failed)
```
Output
```text
improved_average: True
failed_release_gates: ['refund']
promote: False
```

Model swapping is a new experiment

A stable DSPy program interface lets you swap the underlying model and run a new compilation experiment without rewriting the orchestration logic. Because instructions and few-shot examples can be model-specific, recompilation is a candidate-generation step, not evidence that a smaller or cheaper LM meets the old release gates.

For example, you might prototype the support pipeline on a stronger hosted model, then recompile the same DSPy program for a smaller open model such as Llama-3.1-8B-Instruct or Qwen2.5-7B-Instruct.

The optimizer can search new few-shot examples and instruction phrasing for the new model. That is prompt-state optimization, not weight distillation or a substitute for evaluating the target LM.

Cross-model calibration

When swapping models, do not assume an old compiled artifact transfers. Instruction phrasing, adapter behavior, demonstrations, and provider revisions can all change measured output.

Maintain a compilation matrix: for each program version and exact target-model revision, store and evaluate its artifact separately. A provider-side model update is a new comparison, even when the model alias is unchanged.

require-artifact-for-exact-model-revision.py
```python
artifacts = {
    ("support-v3", "hosted-model@2026-04-01"): "support-v3_hosted-2026-04-01.json",
    ("support-v3", "local-model@rev7"): "support-v3_local-rev7.json",
}

def artifact_for(program_version: str, model_revision: str) -> str:
    key = (program_version, model_revision)
    if key not in artifacts:
        raise KeyError("compile and evaluate an artifact for this model revision")
    return artifacts[key]

print("known_artifact:", artifact_for("support-v3", "local-model@rev7"))
try:
    artifact_for("support-v3", "local-model@rev8")
except KeyError:
    print("uncompiled_revision_rejected:", True)
```
Output
```text
known_artifact: support-v3_local-rev7.json
uncompiled_revision_rejected: True
```

Multi-hop pipeline patterns

For complex support tickets that span multiple policies, DSPy's modularity can represent multi-step retrieval. This sketch requires a configured retriever and LM, so it is architecture pseudocode rather than an offline runnable example:

multi-hop-pipeline-patterns.py
```python
import dspy

# GenerateResolution: a dspy.Signature assumed to be defined elsewhere,
# with policy_context and ticket_text as inputs.
class MultiHopSupport(dspy.Module):
    def __init__(self, num_hops=3):
        super().__init__()
        self.retrieve = dspy.Retrieve(k=3)
        # One query-generation module per hop, so each hop carries its own prompt state.
        self.generate_query = [dspy.ChainOfThought("context, ticket_text -> search_query") for _ in range(num_hops)]
        self.generate_resolution = dspy.ChainOfThought(GenerateResolution)

    def forward(self, ticket_text: str) -> dspy.Prediction:
        context = []
        for hop_module in self.generate_query:
            # Each hop sees the passages gathered by earlier hops.
            search_query = hop_module(context="\n".join(context), ticket_text=ticket_text).search_query
            policies = self.retrieve(search_query).passages
            context.extend(policies)

        return self.generate_resolution(policy_context="\n".join(context), ticket_text=ticket_text)
```

Each query-generation module can expose its own prompt state to optimization. Whether optimizing multiple hops improves retrieval and final answers requires trace inspection and held-out evaluation; a larger search surface can also spend more compile budget.
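The control flow of a multi-hop loop can be exercised offline with stand-ins for the LM and retriever. This sketch is not DSPy code: `fake_generate_query`, `fake_retrieve`, and the `POLICIES` table are hypothetical stubs that only show how context accumulates across hops.

```python
# Hypothetical policy snippets keyed by a search term.
POLICIES = {
    "late delivery": "Policy 12: late deliveries get a shipping refund.",
    "refund": "Policy 7: refunds are issued within 5 business days.",
}


def fake_generate_query(context: list[str], ticket_text: str) -> str:
    """Stand-in for an LM call: ask about the first topic not yet covered."""
    for topic, passage in POLICIES.items():
        if topic in ticket_text.lower() and passage not in context:
            return topic
    return ""


def fake_retrieve(query: str) -> list[str]:
    """Stand-in for a retriever: exact-key lookup."""
    return [POLICIES[query]] if query in POLICIES else []


def multi_hop(ticket_text: str, num_hops: int = 3) -> list[str]:
    context: list[str] = []
    for _ in range(num_hops):
        query = fake_generate_query(context, ticket_text)
        context.extend(fake_retrieve(query))  # later hops see earlier passages
    return context


context = multi_hop("My order had a late delivery and I want a refund.")
print("passages_retrieved:", len(context))
```

The third hop retrieves nothing here, which is the kind of wasted step that trace inspection would surface in a real pipeline.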

When beginners get stuck

These failure modes are worth checking early because each can consume compile budget without producing a deployable artifact.

The "exact match" trap

Symptom: Your metric uses exact string comparison on long-form text, so almost no output ever scores 1.0 and the optimizer has almost no passing traces to learn from.

Cause: You're treating generation like classification. A resolution that says "Your package will arrive tomorrow" and one that says "Estimated delivery: tomorrow" mean the same thing, but exact match gives them a 0.

Fix: Use an LLM-as-judge metric, or at least a heuristic like "Does the resolution mention a policy number and a timeframe?" Exact match is fine for short labels (urgency levels), but it's the wrong tool for open-ended generation.
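As a sketch of the heuristic option, the metric below checks for a policy number and a timeframe with regular expressions. It follows the usual `(example, pred, trace=None)` metric shape, but the patterns, the `resolution` field name, and the sample texts are assumptions for illustration.

```python
import re
from types import SimpleNamespace

POLICY_RE = re.compile(r"\bpolicy\s+\d+\b", re.IGNORECASE)
TIMEFRAME_RE = re.compile(r"\b(today|tomorrow|\d+\s+(?:business\s+)?days?)\b", re.IGNORECASE)


def heuristic_metric(example, pred, trace=None) -> float:
    """Score 0.0, 0.5, or 1.0 by how many required elements the resolution mentions."""
    text = pred.resolution
    checks = [bool(POLICY_RE.search(text)), bool(TIMEFRAME_RE.search(text))]
    return sum(checks) / len(checks)


# SimpleNamespace stands in for DSPy example and prediction objects.
example = SimpleNamespace(resolution="Your package will arrive tomorrow under policy 12.")
pred = SimpleNamespace(resolution="Estimated delivery: tomorrow, per Policy 12.")

print("exact_match:", float(pred.resolution == example.resolution))
print("heuristic:", heuristic_metric(example, pred))
```

The two resolutions differ as strings, so exact match scores 0.0, while the heuristic gives full credit because both required elements are present.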

Data hunger

Symptom: You delay compiling because you think you need thousands of labeled tickets.

Cause: You're importing habits from traditional deep learning, where more data usually helps. DSPy optimizers search over prompt configurations, not model weights.

Fix: Start with a small, representative split, measure uncertainty and slice coverage, then add examples where evaluation shows gaps. Do not claim a required sample count before measuring the task.
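One way to measure that uncertainty is a confidence interval on the pass rate. The sketch below uses the Wilson score interval with z = 1.96 (roughly 95%); the pass counts are hypothetical and the helper is not part of DSPy.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate Wilson score interval for a pass rate."""
    if n == 0:
        raise ValueError("need at least one example")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


# Same observed pass rate (80%), different evaluation-set sizes.
for passed, total in [(16, 20), (160, 200)]:
    low, high = wilson_interval(passed, total)
    print(f"n={total}: pass_rate=0.80, interval=({low:.2f}, {high:.2f})")
```

If two candidates' intervals overlap heavily on a small split, that is a signal to add examples to the weak slices before trusting the comparison.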

Not inspecting the compiled prompt

Symptom: The compiled pipeline works on your test set, but fails strangely in production and you can't explain why.

Cause: You treated the optimizer as a black box. You don't know which few-shot examples it selected or how it rewrote the instructions.

Fix: After compilation, inspect prompt history with the documented dspy.inspect_history(n=...) helper or the configured LM's history method. If selected examples are edge cases or instructions are misleading, revise training coverage or the metric and rerun the comparison.
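A complementary offline check is to read the demonstrations stored in a saved state file. The sketch below walks any JSON structure and lists items under keys named `demos`. That key name and the sample layout are assumptions; the real file layout depends on your DSPy version and program, so confirm it against an artifact you saved yourself.

```python
import json


def find_demos(node, path=""):
    """Yield (path, demo) for every item under any key named 'demos'."""
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}" if path else key
            if key == "demos" and isinstance(value, list):
                for demo in value:
                    yield child, demo
            else:
                yield from find_demos(value, child)
    elif isinstance(node, list):
        for index, item in enumerate(node):
            yield from find_demos(item, f"{path}[{index}]")


# Hypothetical saved state, written inline instead of read from disk.
state = json.loads("""
{
  "generate_resolution": {
    "demos": [
      {"ticket_text": "Package is 3 days late", "resolution": "Refund shipping under policy 12."}
    ]
  }
}
""")

for location, demo in find_demos(state):
    print(location, "->", demo["ticket_text"])
```

Reviewing this list in a pull request makes it easier to spot edge-case demonstrations before the artifact ships.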

Hidden production mistakes

Some DSPy failures don't look like prompt failures at first:

  • Treating DSPy as a prompt management library instead of an optimizer. If you never define a metric and compile against examples, you're using the interface layer but skipping the learning loop.
  • Passing num_candidates or num_trials to MIPROv2 while leaving auto enabled. Manual mode requires auto=None, num_candidates on the optimizer, and num_trials during compile(...).
  • Installing only the base package on machines that run MIPROv2. The optimizer imports optuna during compilation, so compile hosts should install dspy[optuna].
  • Over-optimizing on a tiny validation set. If a candidate wins because it memorized five tickets, it won't survive real support traffic.
  • Assuming a JSON save captures your Python program architecture. State-only JSON needs the same module class recreated before .load(...).
  • Loading whole-program artifacts from untrusted sources. dspy.load(...) restores a cloudpickled program, so treat it like executable code.
  • Ignoring compile cost. Heavy optimizers can call LMs many times, so budget compile runs separately from serving traffic.
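The MIPROv2 budget rule in the list above can be turned into a pre-flight check on your own configuration. This is a hypothetical helper that validates a plain settings bundle before any optimizer is built; it does not call DSPy, and `check_mipro_settings` is not a DSPy function.

```python
def check_mipro_settings(auto, num_candidates=None, num_trials=None) -> list[str]:
    """Return human-readable problems with a hypothetical settings bundle."""
    problems = []
    manual = num_candidates is not None or num_trials is not None
    if manual and auto is not None:
        problems.append("manual budgets set while auto is enabled; use auto=None")
    if auto is None and (num_candidates is None or num_trials is None):
        problems.append("manual mode needs both num_candidates and num_trials")
    return problems


print(check_mipro_settings(auto="light", num_trials=20))
print(check_mipro_settings(auto=None, num_candidates=8, num_trials=20))
```

Running a check like this in CI catches the conflicting-settings mistake before a compile job spends any LM budget.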

Practice checkpoints

Use these questions to check whether you can reason about DSPy beyond syntax.

How does MIPROv2 differ from simple few-shot selection?

MIPROv2 jointly optimizes instruction text and demonstration sets. BootstrapFewShot mainly bootstraps demos from successful traces. MIPROv2 searches a larger space by proposing grounded instructions, pairing them with demo candidates, and using Bayesian optimization over measured candidate programs.[3]
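A toy example shows why searching instructions and demos together can matter. The sketch below enumerates every pair exhaustively, which is a plain stand-in for the Bayesian optimization MIPROv2 uses; the instructions, demo-set names, and scores are hypothetical.

```python
from itertools import product

# Hypothetical candidates and a toy score table standing in for metric runs.
instructions = ["Cite the policy.", "Be brief."]
demo_sets = ["demos_a", "demos_b"]
measured = {
    ("Cite the policy.", "demos_a"): 0.81,
    ("Cite the policy.", "demos_b"): 0.88,
    ("Be brief.", "demos_a"): 0.84,
    ("Be brief.", "demos_b"): 0.79,
}

# Demo-only selection: the instruction stays fixed at "Be brief."
best_demo_only = max(demo_sets, key=lambda d: measured[("Be brief.", d)])
# Joint selection: every (instruction, demo set) pair competes.
best_joint = max(product(instructions, demo_sets), key=measured.__getitem__)

print("demo_only_choice:", ("Be brief.", best_demo_only))
print("joint_choice:", best_joint)
```

With the instruction fixed, `demos_a` looks best, but the highest-scoring program pairs the other instruction with `demos_b`, a combination that demo-only selection never evaluates.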

Why is model coupling a serious issue in traditional prompt engineering?

A prompt can become overfit to one model family. When you move to a new provider, checkpoint, tokenizer, adapter, or decoding setup, the old wording may degrade. DSPy doesn't remove that risk, but it makes the migration controlled: keep the program interface and metric, then recompile for the target model.

What does the trace parameter give a metric?

It can expose intermediate predictor calls rather than only the final answer. That lets a metric grade a generated search query, a reasoning field, or a policy-selection step inside a multi-stage program. Exact trace internals can change across DSPy versions, so production metrics should depend on documented behavior and local tests, not private object shapes.
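A common way to use the parameter without touching trace internals is to branch on whether a trace was supplied: return a graded score for evaluation and a strict pass/fail while bootstrapping. This is a minimal sketch; the field names `policy_id`, `resolution`, and `search_query` are assumptions, and `SimpleNamespace` stands in for DSPy objects.

```python
from types import SimpleNamespace


def resolution_metric(example, pred, trace=None):
    """Graded score for evaluation; strict pass/fail when a trace is supplied."""
    mentions_policy = example.policy_id in pred.resolution
    has_query = bool(getattr(pred, "search_query", "").strip())
    score = (mentions_policy + has_query) / 2
    if trace is not None:
        # During bootstrapping, accept only traces that satisfy every check.
        return score == 1.0
    return score


example = SimpleNamespace(policy_id="policy 12")
pred = SimpleNamespace(resolution="Refund shipping under policy 12.", search_query="")

print("evaluation_score:", resolution_metric(example, pred))
print("bootstrap_accepts:", resolution_metric(example, pred, trace=[]))
```

The same prediction earns partial credit during evaluation but is rejected as a demonstration, which keeps weak traces out of the compiled prompt.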

How are state-only JSON artifacts different from whole-program saves?

program.save("./artifact.json", save_program=False) stores state. You recreate the Python module class, then call .load(...). program.save("./artifact_dir/", save_program=True) stores the program with cloudpickle, then dspy.load("./artifact_dir/") restores it. JSON is safer and easier to review. Whole-program loading is convenient but only belongs in trusted environments.

What you should be able to defend

By the end, you should be able to:

  • Explain why a DSPy Signature is an interface, while a prompt template mixes interface and implementation.
  • Describe how optimizers select or generate useful few-shot examples.
  • Write a custom metric that scores the behavior you care about, not only string similarity.
  • Contrast manual prompt editing with programmatic optimization against train and validation examples.
  • Explain the production lifecycle: compile offline, version the artifact, serve without an optimizer loop, monitor drift, and recompile when behavior changes.
  • Choose state-only JSON or whole-program serialization and justify the security tradeoff.

What to remember

  • DSPy treats prompts as parameters to be optimized, not strings to be manually crafted.
  • Signatures define intent; Modules define structure; Optimizers fill in the details.
  • BootstrapFewShot proposes demonstration-augmented programs from metric-accepted traces; compare its candidate against the baseline.
  • Compile offline, serve without an optimizer loop: Optimization happens before deployment. Re-run it when your model, metric, or data distribution changes.
  • Program structure can transfer; compiled state might not: When you change models, recompile and compare against release gates instead of blindly reusing old prompt artifacts.
Next Step
Continue to Recursive Language Models (RLM)

DSPy turns a prompt program into a compiled artifact served without an online optimizer loop. RLM explores a different inference-time pattern: letting a model invoke recursive computation over inputs that do not fit in one direct prompt.

References

[1] Khattab, O., et al. (2023). DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines. arXiv preprint.

[2] Zheng, L., et al. (2023). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. NeurIPS 2023.

[3] Opsahl-Ong, K., et al. (2024). Optimizing Instructions and Demonstrations for Multi-Stage Language Model Programs.

[4] Agrawal, L. A., Tan, S., Soylu, D., et al. (2025). GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning.