Hand-tuning prompts is a dead end. You tweak wording, add examples, rearrange instructions — and the moment you swap models or change your data distribution, everything breaks. DSPy flips this by treating prompts as compiled artifacts. You define what you want (a signature), write how to compose reasoning steps (a module), and then let an optimizer find the best prompt configuration for your metric on your data.
Here’s the fastest way to get a working DSPy pipeline:
```python
import dspy

lm = dspy.LM("openai/gpt-4o-mini")
dspy.configure(lm=lm)

classify = dspy.Predict("text -> label: bool")
result = classify(text="I absolutely love this product!")
print(result.label)  # True
```
That’s a zero-shot classifier in a handful of lines. But the real power comes when you compile it with an optimizer against labeled data: the optimizer automatically discovers the best instructions, few-shot examples, and reasoning patterns for your specific task.
## Define Signatures and Modules
A Signature declares what goes in and what comes out. You can use inline strings for quick prototyping or class-based definitions when you need more control.
```python
import dspy
from typing import Literal

# Inline signature — fast and simple
qa = dspy.ChainOfThought("question -> answer")

# Class-based signature — typed outputs, field descriptions, docstring as instruction
class SentimentClassifier(dspy.Signature):
    """Classify the sentiment of a product review."""

    review: str = dspy.InputField(desc="a product review from a customer")
    sentiment: Literal["positive", "negative", "neutral"] = dspy.OutputField()

classify = dspy.Predict(SentimentClassifier)
result = classify(review="The battery dies after two hours. Waste of money.")
print(result.sentiment)  # negative
```
The docstring on the class becomes the task instruction sent to the LM. The `desc` on each field tells the model what that field represents. DSPy handles all the prompt formatting behind the scenes.
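To build intuition for that mapping, here is a toy renderer. This is not DSPy's actual formatting (its adapters produce a richer, model-specific prompt); it only sketches the idea that the docstring becomes the instruction header and each field's `desc` becomes a field description:

```python
def render_prompt(instruction, field_descs, inputs):
    """Toy sketch of signature-to-prompt rendering. Illustrative only;
    DSPy's real adapters handle this differently under the hood."""
    lines = [instruction, ""]
    # Field descriptions tell the model what each field means
    for name, desc in field_descs.items():
        lines.append(f"{name}: {desc}")
    lines.append("")
    # Then the actual input values are filled in
    for name, value in inputs.items():
        lines.append(f"{name}: {value}")
    return "\n".join(lines)

prompt = render_prompt(
    "Classify the sentiment of a product review.",
    {"review": "a product review from a customer"},
    {"review": "Great value for the price."},
)
print(prompt)
```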
Modules wrap signatures with reasoning strategies:
- `dspy.Predict` — straightforward input-to-output prediction
- `dspy.ChainOfThought` — adds a reasoning field so the model thinks step-by-step before answering
- `dspy.ReAct` — agent loop that can call external tools
- `dspy.ProgramOfThought` — generates and executes code to derive the answer

You compose modules into programs by subclassing `dspy.Module`:
```python
import dspy

class MultiHopQA(dspy.Module):
    def __init__(self):
        super().__init__()
        self.generate_query = dspy.ChainOfThought("context, question -> search_query")
        self.generate_answer = dspy.ChainOfThought("context, question -> answer")

    def forward(self, question):
        context = ""
        for _ in range(2):
            query = self.generate_query(context=context, question=question).search_query
            # In a real pipeline, you'd call a retriever here
            context += f" [Search: {query}]"
        return self.generate_answer(context=context, question=question)
```
This is standard Python. No prompt templates, no f-strings, no JSON wrangling.
## Compile with Optimizers
Optimizers (historically called “teleprompters”) take your program, a training set, and a metric, then search for the prompt configuration that maximizes your metric. The compiled program has the same interface — it just performs better.
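That contract is easy to state in plain Python. Here is a deliberately tiny "optimizer" (all names are illustrative; real DSPy optimizers search a far richer space of instructions, demonstrations, and reasoning patterns): try candidate configurations, score each with the metric on the training set, keep the best.

```python
def compile_program(make_program, candidate_instructions, trainset, metric):
    """Toy optimizer sketch: pick the candidate instruction whose program
    scores best under the metric. Illustrative only, not DSPy's algorithm."""
    best_score, best_program = -1.0, None
    for instruction in candidate_instructions:
        program = make_program(instruction)
        # Average metric score over the training set
        score = sum(metric(ex, program(ex)) for ex in trainset) / len(trainset)
        if score > best_score:
            best_score, best_program = score, program
    return best_program  # same interface as the input, just better-scoring
```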
First, you need training data as `dspy.Example` objects:
```python
import dspy

trainset = [
    dspy.Example(question="What is the capital of France?", answer="Paris").with_inputs("question"),
    dspy.Example(question="Who wrote Hamlet?", answer="William Shakespeare").with_inputs("question"),
    dspy.Example(question="What is the boiling point of water in Celsius?", answer="100").with_inputs("question"),
    dspy.Example(question="What planet is closest to the Sun?", answer="Mercury").with_inputs("question"),
    dspy.Example(question="Who painted the Mona Lisa?", answer="Leonardo da Vinci").with_inputs("question"),
    dspy.Example(question="What is the chemical symbol for gold?", answer="Au").with_inputs("question"),
    dspy.Example(question="What year did World War II end?", answer="1945").with_inputs("question"),
    dspy.Example(question="What is the speed of light in km/s?", answer="299792").with_inputs("question"),
]

devset = [
    dspy.Example(question="What is the largest ocean?", answer="Pacific").with_inputs("question"),
    dspy.Example(question="Who discovered penicillin?", answer="Alexander Fleming").with_inputs("question"),
    dspy.Example(question="What is the square root of 144?", answer="12").with_inputs("question"),
]
```
Then define a metric and compile:
```python
import dspy
from dspy.teleprompt import BootstrapFewShot

def exact_match(example, pred, trace=None):
    # Passes if the gold answer appears anywhere in the prediction
    return example.answer.lower().strip() in pred.answer.lower().strip()

program = dspy.ChainOfThought("question -> answer")

optimizer = BootstrapFewShot(
    metric=exact_match,
    max_bootstrapped_demos=4,
    max_labeled_demos=4,
    max_rounds=1,
)

compiled_program = optimizer.compile(student=program, trainset=trainset)
```
`BootstrapFewShot` works by running the teacher (your original program) on training examples, keeping the ones that pass the metric, and inserting those as few-shot demonstrations into the prompt. The compiled program now carries those demonstrations and uses them automatically.
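The core loop can be sketched in a few lines of plain Python (a conceptual sketch, not the real implementation):

```python
def bootstrap_demos(program, trainset, metric, max_demos=4):
    """Sketch of BootstrapFewShot's core idea: run the teacher,
    keep only the traces that pass the metric, use them as demos."""
    demos = []
    for example in trainset:
        pred = program(example)        # run the teacher on a training example
        if metric(example, pred):      # keep it only if the metric passes
            demos.append((example, pred))
        if len(demos) >= max_demos:
            break
    return demos  # these get inserted into the prompt as few-shot demonstrations
```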
For larger datasets (50+ examples), step up to `BootstrapFewShotWithRandomSearch`. For 200+ examples, use `MIPROv2`, which jointly optimizes instructions and few-shot examples using Bayesian optimization:
```python
from dspy.teleprompt import MIPROv2

optimizer = MIPROv2(
    metric=exact_match,
    auto="medium",  # "light", "medium", or "heavy" search budgets
)

compiled_program = optimizer.compile(
    program.deepcopy(),
    trainset=trainset,
    max_bootstrapped_demos=3,
    max_labeled_demos=4,
)
```
## Evaluate and Save Compiled Programs
Run evaluation on your dev set to measure the gain from optimization:
```python
from dspy.evaluate import Evaluate

evaluator = Evaluate(
    devset=devset,
    metric=exact_match,
    num_threads=4,
    display_progress=True,
    display_table=5,
)

# Evaluate baseline (unoptimized)
baseline_score = evaluator(program)
print(f"Baseline: {baseline_score}")

# Evaluate compiled version
compiled_score = evaluator(compiled_program)
print(f"Compiled: {compiled_score}")
```
The `display_table` argument prints a table showing inputs, expected outputs, predictions, and whether each passed the metric. This makes it easy to spot failure patterns.
Save and reload compiled programs so you don’t re-optimize every time:
```python
# Save the optimized program
compiled_program.save("optimized_qa_v1.json")

# Load it later
loaded_program = dspy.ChainOfThought("question -> answer")
loaded_program.load("optimized_qa_v1.json")

# Use it
result = loaded_program(question="What is the atomic number of carbon?")
print(result.answer)
```
The saved JSON contains the optimized instructions and bootstrapped demonstrations. You can version these alongside your code.
## Full Example: Sentiment Classification Pipeline
Here’s a complete, end-to-end example that ties everything together — defining a typed signature, building training data, compiling with an optimizer, and evaluating the result:
```python
import dspy
from typing import Literal
from dspy.teleprompt import BootstrapFewShot
from dspy.evaluate import Evaluate

# Configure the LM
lm = dspy.LM("openai/gpt-4o-mini")
dspy.configure(lm=lm)

# Define a typed signature
class ReviewSentiment(dspy.Signature):
    """Classify product review sentiment."""

    review: str = dspy.InputField(desc="a customer product review")
    sentiment: Literal["positive", "negative", "neutral"] = dspy.OutputField()

# Training data
trainset = [
    dspy.Example(review="Amazing quality, exceeded expectations!", sentiment="positive").with_inputs("review"),
    dspy.Example(review="Broke after one week. Total garbage.", sentiment="negative").with_inputs("review"),
    dspy.Example(review="It works fine, nothing special.", sentiment="neutral").with_inputs("review"),
    dspy.Example(review="Best purchase I've made all year.", sentiment="positive").with_inputs("review"),
    dspy.Example(review="Arrived damaged and customer support ghosted me.", sentiment="negative").with_inputs("review"),
    dspy.Example(review="Decent for the price, does what it says.", sentiment="neutral").with_inputs("review"),
    dspy.Example(review="Five stars, would buy again.", sentiment="positive").with_inputs("review"),
    dspy.Example(review="Returned it immediately. Unusable.", sentiment="negative").with_inputs("review"),
]

devset = [
    dspy.Example(review="Pretty good but the manual is confusing.", sentiment="neutral").with_inputs("review"),
    dspy.Example(review="Love it! My whole family uses it now.", sentiment="positive").with_inputs("review"),
    dspy.Example(review="Screen cracked on day two.", sentiment="negative").with_inputs("review"),
]

# Metric
def sentiment_match(example, pred, trace=None):
    return example.sentiment.lower() == pred.sentiment.lower()

# Compile
program = dspy.Predict(ReviewSentiment)
optimizer = BootstrapFewShot(metric=sentiment_match, max_bootstrapped_demos=4, max_labeled_demos=4)
compiled = optimizer.compile(student=program, trainset=trainset)

# Evaluate
evaluator = Evaluate(devset=devset, metric=sentiment_match, num_threads=2, display_progress=True)
score = evaluator(compiled)
print(f"Compiled accuracy: {score}")

# Save
compiled.save("sentiment_v1.json")
```
## Common Errors and Fixes
**`ValueError: Too few labeled examples`**
You need at least as many training examples as `max_labeled_demos`. If you set `max_labeled_demos=16` but only have 5 examples, reduce the parameter:
```python
optimizer = BootstrapFewShot(metric=my_metric, max_labeled_demos=4, max_bootstrapped_demos=2)
```
**`openai.AuthenticationError: Incorrect API key`**
DSPy uses LiteLLM under the hood. Set `OPENAI_API_KEY` as an environment variable before running:
```bash
export OPENAI_API_KEY="sk-..."
```
Or pass it directly:
```python
lm = dspy.LM("openai/gpt-4o-mini", api_key="sk-...")
```
**`AttributeError: 'str' object has no attribute 'answer'`**
You’re probably calling the program with a positional argument. DSPy modules take keyword arguments named after the signature’s input fields and return a `Prediction` object whose attributes match the output fields:
```python
# Wrong: positional argument
result = compiled("What is 2+2?")

# Right: keyword argument matching the signature's input field
result = compiled(question="What is 2+2?")
print(result.answer)
```
**Compiled program performs worse than baseline**
This usually means your metric is too loose or your training set is too small. Tighten the metric to reject borderline outputs, and aim for at least 20 diverse training examples. Also check that `with_inputs()` marks only the input fields — if you accidentally mark output fields as inputs, the optimizer sees them during training and the metric becomes meaningless.
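A toy stand-in for `dspy.Example` (illustrative only, not the real class) makes the failure mode concrete: whatever you mark with `with_inputs` is exactly what the program gets to see, and everything else stays reserved for the metric.

```python
class ToyExample:
    """Minimal stand-in for dspy.Example, just to show what with_inputs controls."""
    def __init__(self, **fields):
        self._fields = fields
        self._input_keys = set()

    def with_inputs(self, *keys):
        self._input_keys = set(keys)
        return self

    def inputs(self):
        # What the program sees at prediction time
        return {k: v for k, v in self._fields.items() if k in self._input_keys}

    def labels(self):
        # What only the metric should see
        return {k: v for k, v in self._fields.items() if k not in self._input_keys}

ex = ToyExample(review="Great!", sentiment="positive").with_inputs("review")
print(ex.inputs())  # {'review': 'Great!'}
print(ex.labels())  # {'sentiment': 'positive'}
```

If `sentiment` were also passed to `with_inputs`, it would show up in `inputs()` and the model would be handed the answer during training.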
**`dspy.utils.DspyError: ... trace is None`**
Some optimizers pass `trace=None` during evaluation but a non-`None` trace during bootstrapping. Always handle both cases in your metric:
```python
def my_metric(example, pred, trace=None):
    correct = example.answer.lower() == pred.answer.lower()
    if trace is not None:
        # Bootstrapping: return a strict bool so only clean demos are kept
        return correct
    # Evaluation: a float works here too (e.g., for partial credit)
    return float(correct)
```