The Core Idea

LLMs mess up. They return broken JSON, hallucinate fields, ignore schema constraints, and go off-topic. Instead of crashing or returning garbage to users, you can build chains that catch these failures and retry with the error fed back into the prompt. The LLM gets a second (or third) shot at producing correct output, and your system stays reliable.

The pattern is simple: call the LLM, validate the output, and if validation fails, re-prompt with the error message attached. Most issues resolve within 1-2 retries.
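In skeleton form, the loop looks like this. `call_llm` here is a stand-in for a real API call, rigged to fail once so the feedback path actually runs:

```python
import json


def call_llm(prompt: str) -> str:
    # Stand-in for a real LLM call: returns broken JSON until the
    # prompt contains error feedback from a previous attempt.
    if "previous response failed" in prompt:
        return '{"sentiment": "positive"}'
    return '{"sentiment": "positive"'  # missing closing brace


def run_with_feedback(prompt: str, max_retries: int = 2) -> dict:
    last_error = None
    for _ in range(max_retries + 1):
        full_prompt = prompt
        if last_error:
            full_prompt += f"\nYour previous response failed: {last_error}"
        try:
            return json.loads(call_llm(full_prompt))
        except json.JSONDecodeError as e:
            last_error = e
    raise RuntimeError(f"Gave up after {max_retries + 1} attempts: {last_error}")


result = run_with_feedback("Classify this review.")
```

The rest of this post builds the production version of exactly this loop: a real schema, a reusable decorator, and semantic checks on top.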

Structured Outputs with Pydantic Validation

Start by defining what “correct” looks like. Pydantic models are the best tool for this because they give you typed fields, custom validators, and clear error messages you can feed back to the LLM.

from pydantic import BaseModel, field_validator
from typing import Literal


class SentimentResult(BaseModel):
    sentiment: Literal["positive", "negative", "neutral"]
    confidence: float
    reasoning: str

    @field_validator("confidence")
    @classmethod
    def confidence_in_range(cls, v: float) -> float:
        if not 0.0 <= v <= 1.0:
            raise ValueError(f"Confidence must be between 0.0 and 1.0, got {v}")
        return round(v, 3)

    @field_validator("reasoning")
    @classmethod
    def reasoning_not_empty(cls, v: str) -> str:
        if len(v.strip()) < 10:
            raise ValueError("Reasoning must be at least 10 characters with actual explanation")
        return v.strip()

This model rejects confidence scores outside 0-1, enforces a minimum reasoning length, and restricts sentiment to three allowed values. When validation fails, Pydantic gives you a structured error message that tells the LLM exactly what went wrong.
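To see what that feedback looks like, trigger a failure on a trimmed copy of the model (redefined here so the snippet runs standalone):

```python
from typing import Literal

from pydantic import BaseModel, ValidationError, field_validator


class SentimentDemo(BaseModel):
    # Trimmed copy of SentimentResult, kept minimal for this demo
    sentiment: Literal["positive", "negative", "neutral"]
    confidence: float

    @field_validator("confidence")
    @classmethod
    def confidence_in_range(cls, v: float) -> float:
        if not 0.0 <= v <= 1.0:
            raise ValueError(f"Confidence must be between 0.0 and 1.0, got {v}")
        return round(v, 3)


try:
    SentimentDemo(sentiment="positive", confidence=1.5)
except ValidationError as e:
    feedback = str(e)  # names the field and the rule that failed

print(feedback)
```

The error string names the offending field and repeats your validator's message, which is exactly the shape of feedback the retry prompt needs.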

The Retry Decorator

Here’s a reusable decorator that handles retries with exponential backoff. It catches both Pydantic validation errors and JSON parsing failures.

import time
import json
import functools
from pydantic import ValidationError


def llm_retry(max_retries: int = 3, base_delay: float = 1.0):
    """Retry decorator for LLM calls with exponential backoff and error feedback."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            last_error = None

            for attempt in range(max_retries + 1):
                if attempt > 0:
                    delay = base_delay * (2 ** (attempt - 1))
                    time.sleep(delay)

                try:
                    # Pass the previous error to the function so it can include it in the prompt
                    kwargs["_correction_hint"] = str(last_error) if last_error else None
                    return func(*args, **kwargs)

                # Catch plain ValueError too, so custom semantic validators can trigger retries
                except (ValidationError, json.JSONDecodeError, KeyError, ValueError) as e:
                    last_error = e
                    print(f"Attempt {attempt + 1}/{max_retries + 1} failed: {e}")

            raise RuntimeError(
                f"All {max_retries + 1} attempts failed. Last error: {last_error}"
            )

        return wrapper

    return decorator

The key detail: each retry passes the error message to the function via _correction_hint. The function uses this to build a corrective prompt.

Self-Healing Prompt Chain

Now wire the pieces together. The LLM call function receives the correction hint and appends it to the conversation, giving the model explicit feedback about what went wrong.

from openai import OpenAI

client = OpenAI()


@llm_retry(max_retries=3, base_delay=0.5)
def analyze_sentiment(text: str, _correction_hint: str | None = None) -> SentimentResult:
    messages = [
        {
            "role": "system",
            "content": (
                "You are a sentiment analysis engine. Return valid JSON matching this schema exactly:\n"
                '{"sentiment": "positive|negative|neutral", "confidence": 0.0-1.0, "reasoning": "string"}\n'
                "No markdown, no explanation outside the JSON."
            ),
        },
        {"role": "user", "content": f"Analyze the sentiment of this text:\n\n{text}"},
    ]

    # Feed the error back to the LLM on retries
    if _correction_hint:
        messages.append(
            {
                "role": "user",
                "content": (
                    f"Your previous response failed validation with this error:\n"
                    f"{_correction_hint}\n\n"
                    f"Fix the issue and return valid JSON."
                ),
            }
        )

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        temperature=0.1,
    )

    raw = response.choices[0].message.content.strip()

    # Strip markdown code fences if present
    if raw.startswith("```"):
        raw = raw.split("\n", 1)[1].rsplit("```", 1)[0].strip()

    parsed = json.loads(raw)
    return SentimentResult(**parsed)

On the first call, the LLM just gets the normal prompt. If it returns broken JSON or fails validation, the retry loop catches the error, waits, and calls again with the error message appended. The model now knows exactly what went wrong – “confidence must be between 0.0 and 1.0, got 1.5” is a clear instruction to fix.
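The fence-stripping step is easy to get subtly wrong, so it helps to pull it into a small helper you can unit-test against the formats your model actually emits. A sketch of the same logic:

```python
def strip_code_fences(raw: str) -> str:
    """Remove a leading ``` or ```json fence and a trailing ``` if present."""
    raw = raw.strip()
    if raw.startswith("```"):
        # Drop the opening fence line, then cut everything after the closing fence
        raw = raw.split("\n", 1)[1].rsplit("```", 1)[0].strip()
    return raw


cleaned = strip_code_fences('```json\n{"sentiment": "neutral"}\n```')
```

Unfenced input passes through unchanged, so the helper is safe to call on every response.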

Validator Functions for Complex Checks

Sometimes you need validation beyond what Pydantic handles. Build standalone validator functions that check semantic correctness.

def validate_sentiment_response(result: SentimentResult, original_text: str) -> None:
    """Run extra checks that go beyond schema validation."""

    # Check for contradictory reasoning
    negative_words = {"terrible", "awful", "worst", "hate", "horrible"}
    positive_words = {"great", "excellent", "love", "amazing", "best"}

    reasoning_lower = result.reasoning.lower()

    if result.sentiment == "positive" and any(w in reasoning_lower for w in negative_words):
        raise ValueError(
            f"Contradiction: sentiment is 'positive' but reasoning contains negative language. "
            f"Re-evaluate the text."
        )

    if result.sentiment == "negative" and any(w in reasoning_lower for w in positive_words):
        raise ValueError(
            f"Contradiction: sentiment is 'negative' but reasoning contains positive language. "
            f"Re-evaluate the text."
        )

    # Check confidence makes sense
    if result.sentiment == "neutral" and result.confidence > 0.95:
        raise ValueError(
            "A 'neutral' sentiment with >0.95 confidence is suspicious. "
            "Re-evaluate whether the text is truly neutral."
        )

Plug this into the chain by calling it after Pydantic validation succeeds:

@llm_retry(max_retries=3, base_delay=0.5)
def analyze_sentiment_strict(text: str, _correction_hint: str | None = None) -> SentimentResult:
    result = analyze_sentiment.__wrapped__(text, _correction_hint=_correction_hint)
    validate_sentiment_response(result, text)
    return result

Now the chain catches both structural errors (bad JSON, wrong types) and semantic errors (contradictory reasoning).

Running the Full Chain

if __name__ == "__main__":
    test_texts = [
        "This product completely exceeded my expectations. Best purchase this year.",
        "The service was okay, nothing special but nothing terrible either.",
        "Absolute disaster. Waited 3 hours and the food was cold.",
    ]

    for text in test_texts:
        try:
            result = analyze_sentiment(text)
            print(f"Text: {text[:50]}...")
            print(f"  Sentiment: {result.sentiment} ({result.confidence})")
            print(f"  Reasoning: {result.reasoning}\n")
        except RuntimeError as e:
            print(f"Failed after all retries: {e}\n")

This processes each text, retrying failures automatically. Most malformed outputs get fixed on the first retry because the LLM can see what it did wrong.

Common Errors and Fixes

json.JSONDecodeError on every attempt: The model keeps wrapping output in markdown code fences. The fence-stripping logic above handles this, but if you’re using a different model, check what prefix it adds. Some models open the fence with a bare ``` while others use ```json, and your stripping logic has to handle both.

Pydantic ValidationError for missing fields: The model sometimes omits optional-looking fields. Make your system prompt explicit about every required field. List them out. Don’t rely on the model inferring from examples alone.
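One way to keep the prompt and the validator from drifting apart is to generate the field list from the model itself; Pydantic v2's model_json_schema makes this a one-liner. Shown here on a small hypothetical extraction model:

```python
from typing import Literal

from pydantic import BaseModel


class Invoice(BaseModel):  # hypothetical extraction model, for illustration only
    vendor: str
    total: float
    currency: Literal["USD", "EUR"]


# Derive the required-field list for the system prompt from the same
# model that validates the output, so the two can never disagree.
schema = Invoice.model_json_schema()
required_fields = ", ".join(schema["required"])

prompt_line = f"Your JSON response MUST include these fields: {required_fields}"
```

Build your system prompt from `prompt_line` and the model definition becomes the single source of truth for both prompting and validation.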

RateLimitError masking validation retries: Your retry decorator catches ValidationError and JSONDecodeError, but rate limit errors from the API need separate handling. Add openai.RateLimitError to your except clause, or better, use the tenacity library for API-level retries and keep your decorator focused on validation retries.

from openai import RateLimitError

# In the decorator, handle rate limits differently
except RateLimitError as e:
    last_error = e
    delay = base_delay * (2 ** attempt)  # Longer backoff for rate limits
    time.sleep(delay)
except (ValidationError, json.JSONDecodeError) as e:
    last_error = e

Infinite correction loops: If the model can’t fix a particular error after 3 tries, it probably won’t fix it on attempt 4. Three retries is a good default. Beyond that, fall back to a different prompt template or a more capable model.
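The fallback can be expressed as a small generic wrapper. This is a sketch: `run` is any callable such as `lambda t, m: analyze_sentiment(t, model=m)`, where the `model` parameter is an assumption (the functions above hardcode gpt-4o):

```python
def with_model_fallback(run, text, models):
    """Try each model in order until one succeeds.

    `run` is any callable taking (text, model); llm_retry raises
    RuntimeError once a model exhausts its attempts, which is the
    signal to escalate to the next one.
    """
    last_error = None
    for model in models:
        try:
            return run(text, model)
        except RuntimeError as e:
            last_error = e
    raise RuntimeError(f"All models failed. Last error: {last_error}")


# Demonstration with a stand-in runner that fails on the cheap model
def fake_run(text, model):
    if model == "gpt-4o-mini":
        raise RuntimeError("gave up after 4 attempts")
    return {"model": model, "text": text}


result = with_model_fallback(fake_run, "some review", ["gpt-4o-mini", "gpt-4o"])
```

This keeps the cheap model on the happy path and only pays for the stronger model when retries are exhausted.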

Temperature too high on retries: Set temperature to 0.0 or 0.1 on correction attempts. High temperature introduces randomness that works against targeted fixes. You want the model to focus on the specific error, not generate a completely new creative response.
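A small schedule function makes this policy explicit. The attempt number would need to be threaded through to the call, for example via kwargs the same way as the correction hint; that plumbing is assumed here:

```python
def retry_temperature(attempt: int, first_call: float = 0.7) -> float:
    # Normal sampling on the first call, near-greedy on correction attempts
    return first_call if attempt == 0 else 0.1


temps = [retry_temperature(a) for a in range(3)]
```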

When to Use This Pattern

Self-correcting chains work best when the output has a clear schema or validation rules. JSON extraction, classification, entity extraction, and structured data generation are all good fits. They work less well for open-ended creative tasks where “correct” is subjective.

The overhead is minimal – you only pay for extra API calls when the first attempt fails. With gpt-4o and clear prompts, first-attempt success rates typically run above 95% for well-defined schemas. The retry logic is insurance for the other 5%.