Hand-tuning prompts is a dead end. You tweak wording, add examples, rearrange instructions — and the moment you swap models or change your data distribution, everything breaks. DSPy flips this by treating prompts as compiled artifacts. You define what you want (a signature), write how to compose reasoning steps (a module), and then let an optimizer find the best prompt configuration for your metric on your data.
Here’s the fastest way to get a working DSPy pipeline:
```python
import dspy

lm = dspy.LM("openai/gpt-4o-mini")
dspy.configure(lm=lm)

classify = dspy.Predict("text -> label: bool")
result = classify(text="I absolutely love this product!")
print(result.label)  # True
```
That’s a zero-shot classifier in a handful of lines. But the real power comes when you compile it with an optimizer against labeled data: the optimizer automatically discovers the best instructions, few-shot examples, and reasoning patterns for your specific task.
## Define Signatures and Modules
A Signature declares what goes in and what comes out. You can use inline strings for quick prototyping or class-based definitions when you need more control.
```python
import dspy
from typing import Literal

# Inline signature — fast and simple
qa = dspy.ChainOfThought("question -> answer")

# Class-based signature — typed outputs, field descriptions, docstring as instruction
class SentimentClassifier(dspy.Signature):
    """Classify the sentiment of a product review."""

    review: str = dspy.InputField(desc="a product review from a customer")
    sentiment: Literal["positive", "negative", "neutral"] = dspy.OutputField()

classify = dspy.Predict(SentimentClassifier)
result = classify(review="The battery dies after two hours. Waste of money.")
print(result.sentiment)  # negative
```
The docstring on the class becomes the task instruction sent to the LM. The `desc` on each field tells the model what that field represents. DSPy handles all the prompt formatting behind the scenes.
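To build intuition for that mapping, here is a toy renderer. This is not DSPy's actual formatting (its adapters produce a richer, model-specific prompt); it only sketches the idea that the docstring becomes the instruction header and each field's `desc` becomes a field description:

```python
def render_prompt(instruction, field_descs, inputs):
    """Toy sketch of signature-to-prompt rendering. Illustrative only;
    DSPy's real adapters handle this differently under the hood."""
    lines = [instruction, ""]
    # Field descriptions tell the model what each field means
    for name, desc in field_descs.items():
        lines.append(f"{name}: {desc}")
    lines.append("")
    # Then the actual input values are filled in
    for name, value in inputs.items():
        lines.append(f"{name}: {value}")
    return "\n".join(lines)

prompt = render_prompt(
    "Classify the sentiment of a product review.",
    {"review": "a product review from a customer"},
    {"review": "Great value for the price."},
)
print(prompt)
```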
Modules wrap signatures with reasoning strategies:
- `dspy.Predict` — straightforward input-to-output prediction
- `dspy.ChainOfThought` — adds a reasoning field so the model thinks step-by-step before answering
- `dspy.ReAct` — agent loop that can call external tools
- `dspy.ProgramOfThought` — generates and executes code to derive the answer

You compose modules into programs by subclassing `dspy.Module`:
```python
import dspy

class MultiHopQA(dspy.Module):
    def __init__(self):
        super().__init__()
        self.generate_query = dspy.ChainOfThought("context, question -> search_query")
        self.generate_answer = dspy.ChainOfThought("context, question -> answer")

    def forward(self, question):
        context = ""
        for _ in range(2):
            query = self.generate_query(context=context, question=question).search_query
            # In a real pipeline, you'd call a retriever here
            context += f" [Search: {query}]"
        return self.generate_answer(context=context, question=question)
```
This is standard Python. No prompt templates, no f-strings, no JSON wrangling.
## Compile with Optimizers
Optimizers (historically called “teleprompters”) take your program, a training set, and a metric, then search for the prompt configuration that maximizes your metric. The compiled program has the same interface — it just performs better.
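That contract is easy to state in plain Python. Here is a deliberately tiny "optimizer" (all names are illustrative; real DSPy optimizers search a far richer space of instructions, demonstrations, and reasoning patterns): try candidate configurations, score each with the metric on the training set, keep the best.

```python
def compile_program(make_program, candidate_instructions, trainset, metric):
    """Toy optimizer sketch: pick the candidate instruction whose program
    scores best under the metric. Illustrative only, not DSPy's algorithm."""
    best_score, best_program = -1.0, None
    for instruction in candidate_instructions:
        program = make_program(instruction)
        # Average metric score over the training set
        score = sum(metric(ex, program(ex)) for ex in trainset) / len(trainset)
        if score > best_score:
            best_score, best_program = score, program
    return best_program  # same interface as the input, just better-scoring
```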
First, you need training data as `dspy.Example` objects:
```python
import dspy

trainset = [
    dspy.Example(question="What is the capital of France?", answer="Paris").with_inputs("question"),
    dspy.Example(question="Who wrote Hamlet?", answer="William Shakespeare").with_inputs("question"),
    dspy.Example(question="What is the boiling point of water in Celsius?", answer="100").with_inputs("question"),
    dspy.Example(question="What planet is closest to the Sun?", answer="Mercury").with_inputs("question"),
    dspy.Example(question="Who painted the Mona Lisa?", answer="Leonardo da Vinci").with_inputs("question"),
    dspy.Example(question="What is the chemical symbol for gold?", answer="Au").with_inputs("question"),
    dspy.Example(question="What year did World War II end?", answer="1945").with_inputs("question"),
    dspy.Example(question="What is the speed of light in km/s?", answer="299792").with_inputs("question"),
]

devset = [
    dspy.Example(question="What is the largest ocean?", answer="Pacific").with_inputs("question"),
    dspy.Example(question="Who discovered penicillin?", answer="Alexander Fleming").with_inputs("question"),
    dspy.Example(question="What is the square root of 144?", answer="12").with_inputs("question"),
]
```
Then define a metric and compile:
```python
import dspy
from dspy.teleprompt import BootstrapFewShot

def exact_match(example, pred, trace=None):
    # Passes if the gold answer appears anywhere in the prediction
    return example.answer.lower().strip() in pred.answer.lower().strip()

program = dspy.ChainOfThought("question -> answer")

optimizer = BootstrapFewShot(
    metric=exact_match,
    max_bootstrapped_demos=4,
    max_labeled_demos=4,
    max_rounds=1,
)

compiled_program = optimizer.compile(student=program, trainset=trainset)
```
`BootstrapFewShot` works by running the teacher (your original program) on training examples, keeping the ones that pass the metric, and inserting those as few-shot demonstrations into the prompt. The compiled program now carries those demonstrations and uses them automatically.
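The core loop can be sketched in a few lines of plain Python (a conceptual sketch, not the real implementation):

```python
def bootstrap_demos(program, trainset, metric, max_demos=4):
    """Sketch of BootstrapFewShot's core idea: run the teacher,
    keep only the traces that pass the metric, use them as demos."""
    demos = []
    for example in trainset:
        pred = program(example)        # run the teacher on a training example
        if metric(example, pred):      # keep it only if the metric passes
            demos.append((example, pred))
        if len(demos) >= max_demos:
            break
    return demos  # these get inserted into the prompt as few-shot demonstrations
```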
For larger datasets (50+ examples), step up to `BootstrapFewShotWithRandomSearch`. For 200+ examples, use `MIPROv2`, which jointly optimizes instructions and few-shot examples using Bayesian optimization:
```python
from dspy.teleprompt import MIPROv2

optimizer = MIPROv2(
    metric=exact_match,
    auto="medium",  # "light", "medium", or "heavy" search budgets
)

compiled_program = optimizer.compile(
    program.deepcopy(),
    trainset=trainset,
    max_bootstrapped_demos=3,
    max_labeled_demos=4,
)
```
## Evaluate and Save Compiled Programs
Run evaluation on your dev set to measure the gain from optimization:
```python
from dspy.evaluate import Evaluate

evaluator = Evaluate(
    devset=devset,
    metric=exact_match,
    num_threads=4,
    display_progress=True,
    display_table=5,
)

# Evaluate baseline (unoptimized)
baseline_score = evaluator(program)
print(f"Baseline: {baseline_score}")

# Evaluate compiled version
compiled_score = evaluator(compiled_program)
print(f"Compiled: {compiled_score}")
```
The `display_table` argument prints a table showing inputs, expected outputs, predictions, and whether each passed the metric. This makes it easy to spot failure patterns.
Save and reload compiled programs so you don’t re-optimize every time:
```python
# Save the optimized program
compiled_program.save("optimized_qa_v1.json")

# Load it later
loaded_program = dspy.ChainOfThought("question -> answer")
loaded_program.load("optimized_qa_v1.json")

# Use it
result = loaded_program(question="What is the atomic number of carbon?")
print(result.answer)
```
The saved JSON contains the optimized instructions and bootstrapped demonstrations. You can version these alongside your code.
## Full Example: Sentiment Classification Pipeline
Here’s a complete, end-to-end example that ties everything together — defining a typed signature, building training data, compiling with an optimizer, and evaluating the result:
```python
import dspy
from typing import Literal
from dspy.teleprompt import BootstrapFewShot
from dspy.evaluate import Evaluate

# Configure the LM
lm = dspy.LM("openai/gpt-4o-mini")
dspy.configure(lm=lm)

# Define a typed signature
class ReviewSentiment(dspy.Signature):
    """Classify product review sentiment."""

    review: str = dspy.InputField(desc="a customer product review")
    sentiment: Literal["positive", "negative", "neutral"] = dspy.OutputField()

# Training data
trainset = [
    dspy.Example(review="Amazing quality, exceeded expectations!", sentiment="positive").with_inputs("review"),
    dspy.Example(review="Broke after one week. Total garbage.", sentiment="negative").with_inputs("review"),
    dspy.Example(review="It works fine, nothing special.", sentiment="neutral").with_inputs("review"),
    dspy.Example(review="Best purchase I've made all year.", sentiment="positive").with_inputs("review"),
    dspy.Example(review="Arrived damaged and customer support ghosted me.", sentiment="negative").with_inputs("review"),
    dspy.Example(review="Decent for the price, does what it says.", sentiment="neutral").with_inputs("review"),
    dspy.Example(review="Five stars, would buy again.", sentiment="positive").with_inputs("review"),
    dspy.Example(review="Returned it immediately. Unusable.", sentiment="negative").with_inputs("review"),
]

devset = [
    dspy.Example(review="Pretty good but the manual is confusing.", sentiment="neutral").with_inputs("review"),
    dspy.Example(review="Love it! My whole family uses it now.", sentiment="positive").with_inputs("review"),
    dspy.Example(review="Screen cracked on day two.", sentiment="negative").with_inputs("review"),
]

# Metric
def sentiment_match(example, pred, trace=None):
    return example.sentiment.lower() == pred.sentiment.lower()

# Compile
program = dspy.Predict(ReviewSentiment)
optimizer = BootstrapFewShot(metric=sentiment_match, max_bootstrapped_demos=4, max_labeled_demos=4)
compiled = optimizer.compile(student=program, trainset=trainset)

# Evaluate
evaluator = Evaluate(devset=devset, metric=sentiment_match, num_threads=2, display_progress=True)
score = evaluator(compiled)
print(f"Compiled accuracy: {score}")

# Save
compiled.save("sentiment_v1.json")
```
## Common Errors and Fixes
**`ValueError: Too few labeled examples`**
You need at least as many training examples as `max_labeled_demos`. If you set `max_labeled_demos=16` but only have 5 examples, reduce the parameter:
```python
optimizer = BootstrapFewShot(metric=my_metric, max_labeled_demos=4, max_bootstrapped_demos=2)
```
**`openai.AuthenticationError: Incorrect API key`**
DSPy uses LiteLLM under the hood. Set `OPENAI_API_KEY` as an environment variable before running:
```bash
export OPENAI_API_KEY="sk-..."
```
Or pass it directly:
```python
lm = dspy.LM("openai/gpt-4o-mini", api_key="sk-...")
```
**`AttributeError: 'str' object has no attribute 'answer'`**
You’re probably calling the program with a positional argument. DSPy modules take keyword arguments named after the signature’s input fields and return a `Prediction` object whose attributes match the output fields:
```python
# Wrong: positional argument
result = compiled("What is 2+2?")

# Right: keyword argument matching the signature's input field
result = compiled(question="What is 2+2?")
print(result.answer)
```
**Compiled program performs worse than baseline**
This usually means your metric is too loose or your training set is too small. Tighten the metric to reject borderline outputs, and aim for at least 20 diverse training examples. Also check that `with_inputs()` marks only the input fields — if you accidentally mark output fields as inputs, the optimizer sees them during training and the metric becomes meaningless.
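A toy stand-in for `dspy.Example` (illustrative only, not the real class) makes the failure mode concrete: whatever you mark with `with_inputs` is exactly what the program gets to see, and everything else stays reserved for the metric.

```python
class ToyExample:
    """Minimal stand-in for dspy.Example, just to show what with_inputs controls."""
    def __init__(self, **fields):
        self._fields = fields
        self._input_keys = set()

    def with_inputs(self, *keys):
        self._input_keys = set(keys)
        return self

    def inputs(self):
        # What the program sees at prediction time
        return {k: v for k, v in self._fields.items() if k in self._input_keys}

    def labels(self):
        # What only the metric should see
        return {k: v for k, v in self._fields.items() if k not in self._input_keys}

ex = ToyExample(review="Great!", sentiment="positive").with_inputs("review")
print(ex.inputs())  # {'review': 'Great!'}
print(ex.labels())  # {'sentiment': 'positive'}
```

If `sentiment` were also passed to `with_inputs`, it would show up in `inputs()` and the model would be handed the answer during training.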
**`dspy.utils.DspyError: ... trace is None`**
Some optimizers pass `trace=None` during evaluation but a non-`None` trace during bootstrapping. Always handle both cases in your metric:
```python
def my_metric(example, pred, trace=None):
    correct = example.answer.lower() == pred.answer.lower()
    if trace is not None:
        # Bootstrapping: return a strict bool so only clean demos are kept
        return correct
    # Evaluation: a float works here too (e.g., for partial credit)
    return float(correct)
```