You ship an LLM feature, it works great in demos, then a user triggers a hallucination that costs you a support ticket (or worse). The fix is not more prompt tweaking – it is automated evaluation. DeepEval gives you pytest-style test suites for LLM outputs, with research-backed metrics for hallucination, faithfulness, relevance, and whatever custom criteria you need.

Install DeepEval and Run Your First Test

pip install -U deepeval

DeepEval’s LLM-as-a-judge metrics need an API key. Set it before running anything:

export OPENAI_API_KEY="sk-..."

If you skip this, you will get an error like:

openai.AuthenticationError: Error code: 401 - {'error': {'message': 'Incorrect API key provided: sk-...', 'type': 'invalid_request_error'}}

DeepEval auto-loads .env.local then .env from the current working directory at import time, so you can also drop your key there.

Here is a minimal evaluation using GEval, DeepEval’s most flexible metric. It uses chain-of-thought prompting to judge LLM output against any criteria you define:

from deepeval import assert_test
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.metrics import GEval

def test_refund_policy_correctness():
    correctness = GEval(
        name="Correctness",
        criteria="Determine if the 'actual output' is factually correct based on the 'expected output'.",
        evaluation_params=[
            LLMTestCaseParams.ACTUAL_OUTPUT,
            LLMTestCaseParams.EXPECTED_OUTPUT,
        ],
        threshold=0.5,
    )

    test_case = LLMTestCase(
        input="What is your return policy?",
        actual_output="You have 30 days to return items for a full refund at no extra cost.",
        expected_output="All customers are eligible for a 30-day full refund at no extra cost.",
    )

    assert_test(test_case, [correctness])

Run it with:

deepeval test run test_eval.py

This command wraps pytest under the hood. You get pass/fail results, scores, and reasoning for each metric – directly in your terminal.
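To build intuition for what a criteria-based score looks like, here is a toy stand-in for the judge: it scores the actual output by how much of the expected output's vocabulary it covers. This is not how GEval works – GEval prompts an LLM with chain-of-thought – it is only a deterministic sketch you can run offline:

```python
def toy_correctness_score(actual: str, expected: str) -> float:
    """Fraction of the expected output's words that appear in the actual
    output. A crude, deterministic stand-in for an LLM judge."""
    expected_words = set(expected.lower().split())
    actual_words = set(actual.lower().split())
    if not expected_words:
        return 1.0
    return len(expected_words & actual_words) / len(expected_words)


# Same strings as the test case above: partial overlap, partial score.
score = toy_correctness_score(
    "You have 30 days to return items for a full refund at no extra cost.",
    "All customers are eligible for a 30-day full refund at no extra cost.",
)
```

The real metric is far more robust to paraphrase, but the shape is the same: compute a score in [0, 1], then compare it against the threshold.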

Detect Hallucinations with Built-In Metrics

The HallucinationMetric compares the model’s output against a provided context. If the output contains claims not supported by the context, the score drops. This is the metric you want when your LLM is supposed to answer strictly from a knowledge base.

from deepeval.metrics import HallucinationMetric
from deepeval.test_case import LLMTestCase

def test_no_hallucination():
    metric = HallucinationMetric(
        threshold=0.5,
        model="gpt-4.1",
        include_reason=True,
    )

    test_case = LLMTestCase(
        input="Who founded the company?",
        actual_output="The company was founded by Jane Smith in 2019.",
        context=[
            "The company was founded by John Doe in 2020. Jane Smith joined as CTO in 2021."
        ],
    )

    metric.measure(test_case)
    print(f"Score: {metric.score}")   # 0.0 - 1.0, higher means more hallucination
    print(f"Reason: {metric.reason}")
    assert metric.is_successful()

That test case will fail because the output attributes the founding to Jane Smith (CTO, not founder) and says 2019 instead of 2020. The metric catches both factual errors.
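Conceptually, the metric extracts claims from the output and checks each one against the context. A minimal offline sketch of that idea (not DeepEval's implementation, which uses an LLM judge and real claim extraction, only naive substring matching):

```python
def toy_hallucination_score(claims: list[str], context: str) -> float:
    """Fraction of claims NOT supported by the context, so higher means
    more hallucination. Claims are matched naively by substring."""
    if not claims:
        return 0.0
    unsupported = [c for c in claims if c.lower() not in context.lower()]
    return len(unsupported) / len(claims)


context = "The company was founded by John Doe in 2020."
# Neither claim appears in the context, so the score is 1.0.
score = toy_hallucination_score(
    ["founded by Jane Smith", "founded in 2019"], context
)
```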

Measure Faithfulness for RAG Pipelines

If you are building retrieval-augmented generation, FaithfulnessMetric is more targeted than the hallucination metric. It specifically checks whether the output contradicts the retrieval context – the actual documents your RAG system pulled.

from deepeval.metrics import FaithfulnessMetric
from deepeval.test_case import LLMTestCase

def test_rag_faithfulness():
    metric = FaithfulnessMetric(
        threshold=0.7,
        model="gpt-4.1",
        include_reason=True,
    )

    test_case = LLMTestCase(
        input="What are the side effects of ibuprofen?",
        actual_output="Common side effects include stomach pain, nausea, and dizziness. "
                      "It can also cause liver failure in rare cases.",
        retrieval_context=[
            "Ibuprofen side effects include stomach pain, nausea, dizziness, and headache. "
            "Serious but rare side effects include kidney problems and gastrointestinal bleeding."
        ],
    )

    metric.measure(test_case)
    assert metric.is_successful()

This will flag the “liver failure” claim because the retrieval context mentions kidney problems, not liver failure. The distinction between context (for hallucination) and retrieval_context (for faithfulness) matters – use the right parameter for the right metric.
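A small guard can catch the context/retrieval_context mix-up before you waste an evaluation run. This is a hypothetical helper, not part of DeepEval, which does its own parameter validation; it just makes the rule explicit:

```python
# Which test-case field each metric reads from.
REQUIRED_FIELD = {
    "HallucinationMetric": "context",
    "FaithfulnessMetric": "retrieval_context",
}


def check_fields(metric_name: str, test_case_fields: dict) -> None:
    """Raise early if the test case is missing the field its metric reads."""
    field = REQUIRED_FIELD[metric_name]
    if not test_case_fields.get(field):
        raise ValueError(
            f"{metric_name} needs a non-empty '{field}' on the test case"
        )
```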

Build a Custom Metric

DeepEval’s built-in metrics cover the common cases, but you will eventually need something domain-specific. Maybe you need to check that outputs never mention competitor products, or that medical responses always include a disclaimer. Create a custom metric by subclassing BaseMetric:

from deepeval.metrics import BaseMetric
from deepeval.test_case import LLMTestCase


class DisclaimerMetric(BaseMetric):
    """Checks that medical responses include a disclaimer."""

    def __init__(self, threshold: float = 1.0):
        self.threshold = threshold

    def measure(self, test_case: LLMTestCase) -> float:
        disclaimer_phrases = [
            "consult a doctor",
            "medical professional",
            "not medical advice",
            "seek professional",
            "talk to your doctor",
        ]
        output_lower = test_case.actual_output.lower()
        has_disclaimer = any(phrase in output_lower for phrase in disclaimer_phrases)

        self.score = 1.0 if has_disclaimer else 0.0
        self.reason = (
            "Output contains medical disclaimer."
            if has_disclaimer
            else "Output is missing a medical disclaimer."
        )
        self.success = self.score >= self.threshold
        return self.score

    async def a_measure(self, test_case: LLMTestCase) -> float:
        return self.measure(test_case)

    def is_successful(self) -> bool:
        if self.error is not None:
            self.success = False
        else:
            self.success = self.score >= self.threshold
        return self.success

    @property
    def __name__(self):
        return "Disclaimer Check"

Use it exactly like any built-in metric:

from deepeval import assert_test
from deepeval.test_case import LLMTestCase


def test_medical_disclaimer():
    metric = DisclaimerMetric(threshold=1.0)
    test_case = LLMTestCase(
        input="What should I take for a headache?",
        actual_output="You can take ibuprofen or acetaminophen for a headache.",
    )
    assert_test(test_case, [metric])

That test will fail because the output has no disclaimer. The custom metric slots into assert_test, evaluate(), and CI/CD pipelines with zero extra wiring.
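Because the scoring logic is plain Python, you can also factor it out and unit-test it without any API calls or DeepEval at all. A sketch of the same check as a standalone function:

```python
DISCLAIMER_PHRASES = (
    "consult a doctor",
    "medical professional",
    "not medical advice",
    "seek professional",
    "talk to your doctor",
)


def has_medical_disclaimer(text: str) -> bool:
    """True if the output contains any recognized disclaimer phrase."""
    lowered = text.lower()
    return any(phrase in lowered for phrase in DISCLAIMER_PHRASES)
```

Keeping the check pure like this means fast, deterministic unit tests for the logic itself, with the DeepEval wrapper exercised separately.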

Run Multiple Metrics on a Dataset

Real evaluation means running several metrics across many test cases. Use EvaluationDataset with pytest parametrization:

import pytest
from deepeval import assert_test
from deepeval.dataset import EvaluationDataset
from deepeval.metrics import HallucinationMetric, GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

dataset = EvaluationDataset(test_cases=[
    LLMTestCase(
        input="What is our refund policy?",
        actual_output="Full refund within 30 days.",
        expected_output="30-day full refund at no extra cost.",
        context=["All customers get a 30-day full refund at no extra cost."],
    ),
    LLMTestCase(
        input="Do you ship internationally?",
        actual_output="Yes, we ship to over 50 countries.",
        expected_output="We ship to 50+ countries worldwide.",
        context=["We offer international shipping to over 50 countries."],
    ),
])

hallucination = HallucinationMetric(threshold=0.5)
correctness = GEval(
    name="Correctness",
    criteria="Is the actual output factually aligned with the expected output?",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.EXPECTED_OUTPUT],
    threshold=0.5,
)

@pytest.mark.parametrize("test_case", dataset, ids=lambda tc: tc.input[:40])
def test_customer_support_bot(test_case: LLMTestCase):
    assert_test(test_case, [hallucination, correctness])

Run the full suite:

deepeval test run test_eval.py -v

Each test case gets evaluated against every metric. Failures show which metric failed, the score, and the reasoning – so you know exactly what went wrong.
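If you also want an aggregate number for a dashboard or CI gate, a pass-rate summary over (metric, score, threshold) tuples is easy to compute yourself. This is a sketch, not a DeepEval API:

```python
def pass_rate(results: list[tuple[str, float, float]]) -> float:
    """results: (metric_name, score, threshold) per evaluated test case.
    Returns the fraction of results whose score met the threshold."""
    if not results:
        return 0.0
    passed = sum(1 for _, score, threshold in results if score >= threshold)
    return passed / len(results)


# One pass, one fail: rate is 0.5.
rate = pass_rate([
    ("Correctness", 0.8, 0.5),
    ("Correctness", 0.4, 0.5),
])
```

Note this assumes "higher is better" semantics; HallucinationMetric would need the comparison flipped.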

Common Errors and Fixes

AssertionError from assert_test: This means a metric missed its threshold (scored below it for most metrics; above it for HallucinationMetric, where lower is better). Check metric.score and metric.reason for details. Loosen the threshold during development, then tighten it as your system improves.

Rate limit errors (HTTP 429): DeepEval retries with exponential backoff (initial 1s, base 2, jitter 2s, cap 5s), but if you are running hundreds of test cases, you will hit OpenAI rate limits. Use async_mode=True (the default) and consider batching your dataset.
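Batching can be as simple as slicing your test cases into fixed-size chunks and pausing between them. A standard-library sketch (the chunk size and sleep interval are arbitrary choices, not DeepEval defaults):

```python
import time
from typing import Iterator


def batched(items: list, size: int) -> Iterator[list]:
    """Yield consecutive chunks of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Usage sketch: evaluate each chunk, then back off briefly.
# for chunk in batched(test_cases, 25):
#     evaluate(test_cases=chunk, metrics=[metric])
#     time.sleep(2)  # crude pacing between batches
```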

Wrong context parameter: HallucinationMetric uses context (a list of strings). FaithfulnessMetric uses retrieval_context. Mix these up and the metric will either complain about a missing parameter or evaluate against the wrong text – either way, your test is not checking what you think it is. Always double-check which parameter your metric expects.

Slow evaluation runs: Each LLM-as-a-judge metric makes API calls. For large test suites, use a faster model like gpt-4.1-mini during development:

metric = HallucinationMetric(threshold=0.5, model="gpt-4.1-mini")

Switch to gpt-4.1 or o1 for your CI gate where accuracy matters more than speed.
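One way to switch judge models between dev and CI without editing every test is an environment variable. The EVAL_MODEL name here is my own convention, not a DeepEval setting:

```python
import os


def judge_model(default: str = "gpt-4.1-mini") -> str:
    """Pick the judge model from EVAL_MODEL, falling back to a fast default."""
    return os.environ.get("EVAL_MODEL", default)


# Locally this resolves to the fast default; in CI, export EVAL_MODEL=gpt-4.1.
# metric = HallucinationMetric(threshold=0.5, model=judge_model())
```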