Ask an LLM to “think step by step and return JSON,” and you’ll get valid JSON maybe 80% of the time. The other 20%? Trailing commas, missing closing braces, extra markdown fences, or the model deciding to narrate its thoughts in prose before the JSON blob. Grammar-based constrained decoding fixes this by restricting the model’s token sampling to only tokens that produce valid output at every generation step. The model literally cannot emit a token that would break your schema.

This matters most for reasoning chains. You want structured thought/action/result triples – not a blob of free text you have to regex apart. Here’s how to enforce that with outlines for local models, Pydantic schemas for type safety, and OpenAI’s structured outputs for API-based workflows.

# Install the key dependencies
# pip install outlines transformers torch pydantic openai

Define Reasoning Chain Schemas with Pydantic

Before generating anything, define what a reasoning chain looks like. Pydantic models give you typed schemas that both outlines and OpenAI can consume directly.

from pydantic import BaseModel, Field
from enum import Enum
from typing import Optional


class ActionType(str, Enum):
    SEARCH = "search"
    CALCULATE = "calculate"
    LOOKUP = "lookup"
    CONCLUDE = "conclude"


class ReasoningStep(BaseModel):
    thought: str = Field(description="What the model is considering at this step")
    action: ActionType = Field(description="The type of action to take")
    action_input: str = Field(description="Input for the action")
    result: Optional[str] = Field(
        default=None, description="Observation or result from the action"
    )


class ReasoningChain(BaseModel):
    question: str = Field(description="The original question being answered")
    steps: list[ReasoningStep] = Field(
        description="Ordered reasoning steps", min_length=1, max_length=8
    )
    final_answer: str = Field(description="The conclusive answer")
    confidence: float = Field(ge=0.0, le=1.0, description="Confidence score 0-1")

This schema forces the model to produce discrete reasoning steps with typed actions, not a wall of unstructured text. The ActionType enum constrains what actions are valid. The min_length=1 on steps means the model must reason at least once – no skipping straight to an answer.
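
To make the target shape concrete, here is a hand-written document that conforms to the schema (illustrative only – not actual model output):

```python
import json

# A hand-written chain showing the shape the schema enforces (not model output)
doc = """{
  "question": "What is 15% of 340?",
  "steps": [
    {"thought": "15% of 340 is 0.15 * 340",
     "action": "calculate",
     "action_input": "0.15 * 340",
     "result": "51.0"}
  ],
  "final_answer": "51.0",
  "confidence": 0.97
}"""

chain = json.loads(doc)
print(chain["steps"][0]["action"])  # calculate
```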

Constrained Generation with Outlines and Local Models

The outlines library compiles your Pydantic schema into a finite-state machine that masks invalid tokens at each generation step. This gives you O(1) token validation overhead – the constraint checking doesn’t slow down generation meaningfully.
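
The masking idea can be sketched in a few lines of plain Python. This is a conceptual illustration of how an enum constrains continuations, not outlines' actual implementation (which precomputes an index over the real tokenizer vocabulary):

```python
# Toy vocabulary and the action literals the schema allows (illustrative only)
ALLOWED = ["search", "calculate", "lookup", "conclude"]
VOCAB = ["se", "arch", "calc", "ulate", "look", "up", "con", "clude", "yell", "loudly"]

def valid_tokens(prefix: str) -> list[str]:
    """Mask: keep only tokens that extend the prefix toward an allowed literal."""
    return [t for t in VOCAB if any(lit.startswith(prefix + t) for lit in ALLOWED)]

print(valid_tokens(""))      # ['se', 'calc', 'look', 'con']
print(valid_tokens("se"))    # ['arch'] -- only the completion of "search" survives
print(valid_tokens("yell"))  # [] -- "yell_loudly" is unreachable
```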

import outlines

model = outlines.models.transformers(
    "microsoft/Phi-3-mini-4k-instruct",
    device="auto",
)

generator = outlines.generate.json(model, ReasoningChain)

prompt = """You are a reasoning agent. Answer the question by working through it step by step.
Return a JSON object with your reasoning chain.

Question: What is the population of France divided by the number of US states?"""

chain = generator(prompt)

# chain is already a ReasoningChain instance, not a string
print(f"Question: {chain.question}")
for i, step in enumerate(chain.steps):
    print(f"Step {i + 1}: [{step.action.value}] {step.thought}")
    print(f"  Input: {step.action_input}")
    if step.result:
        print(f"  Result: {step.result}")
print(f"Answer: {chain.final_answer} (confidence: {chain.confidence})")

The key thing happening here: outlines.generate.json(model, ReasoningChain) compiles the Pydantic schema into a token mask. At each decoding step, only tokens that keep the output on a valid path through the schema are allowed. The model can’t produce "action": "yell_loudly" because yell_loudly isn’t in the ActionType enum. It can’t skip the steps field or emit a confidence of 2.5.

You can also control whitespace formatting. By default, outlines produces compact JSON. If you want readable output:

generator = outlines.generate.json(
    model,
    ReasoningChain,
    whitespace_pattern=r"[\n\t ]*",  # allow newlines, tabs, spaces
)

OpenAI Structured Outputs for Reasoning Chains

If you’re using the OpenAI API rather than local models, structured outputs give you the same schema enforcement server-side. OpenAI constrains token generation against your JSON schema during decoding – same principle as outlines, but running on their infrastructure.

The client.beta.chat.completions.parse() method accepts a Pydantic model directly and returns a parsed object:

from openai import OpenAI
from pydantic import BaseModel, Field
from typing import Optional


class ThoughtStep(BaseModel):
    thought: str = Field(description="Current reasoning")
    action: str = Field(description="Action: search, calculate, or conclude")
    action_input: str = Field(description="Input for the action")
    observation: Optional[str] = Field(default=None, description="Result of action")


class StructuredReasoning(BaseModel):
    steps: list[ThoughtStep]
    final_answer: str
    confidence: float


client = OpenAI()

completion = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=[
        {
            "role": "system",
            "content": (
                "You are a step-by-step reasoning agent. Break every problem into "
                "discrete thought/action/observation steps. Be precise and methodical."
            ),
        },
        {
            "role": "user",
            "content": "How many piano tuners are in Chicago? Use Fermi estimation.",
        },
    ],
    response_format=StructuredReasoning,
)

reasoning = completion.choices[0].message.parsed
print(f"Final answer: {reasoning.final_answer}")
print(f"Confidence: {reasoning.confidence}")
print(f"Steps taken: {len(reasoning.steps)}")
for step in reasoning.steps:
    print(f"  Thought: {step.thought}")
    print(f"  Action: {step.action} -> {step.action_input}")
    if step.observation:
        print(f"  Observation: {step.observation}")

A few important constraints with the OpenAI approach. You cannot use arbitrary field constraints (numeric bounds like ge and le, or regex patterns) – OpenAI supports only a subset of JSON Schema. Stick to basic types, Optional, list, Literal, and Enum. If you need tight numeric bounds, validate after parsing.
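
Since the bounds can't live in the schema, a small client-side check after parsing covers the gap. A minimal sketch over a plain dict (the field names match the StructuredReasoning model above; the bounds mirror the earlier ReasoningChain schema):

```python
def validate_reasoning(parsed: dict) -> dict:
    """Enforce constraints OpenAI's schema subset can't express."""
    if not 0.0 <= parsed["confidence"] <= 1.0:
        raise ValueError(f"confidence {parsed['confidence']} outside [0, 1]")
    if not 1 <= len(parsed["steps"]) <= 8:
        raise ValueError(f"expected 1-8 steps, got {len(parsed['steps'])}")
    return parsed

# e.g. validate_reasoning(completion.choices[0].message.parsed.model_dump())
```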

For the newer Responses API, structured outputs move to the text.format parameter:

response = client.responses.create(
    model="gpt-4o",
    input="How many piano tuners are in Chicago? Use Fermi estimation.",
    instructions="Break the problem into discrete reasoning steps.",
    text={
        "format": {
            "type": "json_schema",
            "name": "reasoning_chain",
            "strict": True,
            "schema": StructuredReasoning.model_json_schema(),
        }
    },
)

How Grammar Constraints Fix Common Failure Modes

Without constrained decoding, LLMs fail at structured reasoning in predictable ways. Here’s what goes wrong and why grammars fix each one.

Partial JSON – The model hits a token limit mid-object and returns something like {"steps": [{"thought": "First I need to. Grammar constraints track nesting depth, so the end-of-sequence token is only valid once every open bracket has been closed – the model cannot stop mid-object of its own accord. (A hard max_tokens cutoff can still truncate output externally, so budget token limits generously.)
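
The failure is easy to reproduce with the standard library – a truncated document is unrecoverable for a strict parser:

```python
import json

# The kind of fragment an unconstrained model returns when it hits a token limit
truncated = '{"steps": [{"thought": "First I need to'
try:
    json.loads(truncated)
except json.JSONDecodeError as exc:
    print(f"Unparseable: {exc.msg} at char {exc.pos}")
```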

Schema drift – The model invents new fields like "thinking_process" or renames "action" to "step_type". Grammar constraints only allow property names that exist in your schema. The token mask blocks any character sequence that doesn’t match a declared field name.

Type coercion errors – The model writes "confidence": "high" when you need a float. The grammar encodes type information, so after generating "confidence":, only digit tokens (and . for floats) are valid continuations.

Reasoning shortcuts – Without structure, models often skip steps and jump to an answer. A schema with min_length=1 on the steps array means the model must produce at least one reasoning step. You can increase this to force more thorough reasoning.

Enum violations – You define four valid actions, the model writes "action": "think_harder". Grammar constraints compiled from an enum only permit the literal string values you defined.

Common Errors and Fixes

outlines runs out of GPU memory on schema compilation

Large schemas with many nested objects create big finite-state machines. Simplify by flattening nested models or splitting into smaller generation calls:

# Instead of one massive schema, generate steps one at a time
single_step_generator = outlines.generate.json(model, ReasoningStep)
steps = []
context = "Question: What is 15% of 340?\n"
for i in range(5):
    prompt = f"{context}Generate reasoning step {i + 1}:"
    step = single_step_generator(prompt)
    steps.append(step)
    context += f"Step {i + 1}: {step.thought} -> {step.result}\n"
    if step.action == ActionType.CONCLUDE:
        break

OpenAI rejects your Pydantic schema

OpenAI’s structured outputs don’t support all JSON Schema features. Common culprits:

  • min_length / max_length on arrays – remove these, validate client-side
  • pattern regex constraints on strings – not supported, use post-validation
  • Union types with more than a few variants – simplify the union or use Optional
  • Missing default values on Optional fields – always set default=None explicitly

Model produces degenerate reasoning (same step repeated)

This isn’t a grammar problem – it’s a prompting problem. The grammar guarantees structure, not quality. Fix with better system prompts:

system_prompt = """You are a precise reasoning agent. Rules:
- Each step must contain NEW information not present in previous steps.
- Use 'calculate' for any math, 'search' for factual lookups, 'conclude' for final step.
- The final step MUST have action 'conclude'.
- Do not repeat observations from prior steps."""
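
Prompting reduces repetition but doesn't eliminate it, so it's worth pairing with a cheap post-check and retrying the generation when it fires. A minimal sketch:

```python
def has_repeated_steps(thoughts: list[str]) -> bool:
    """Detect chains where a thought repeats, ignoring case and extra whitespace."""
    seen: set[str] = set()
    for thought in thoughts:
        key = " ".join(thought.lower().split())
        if key in seen:
            return True
        seen.add(key)
    return False

print(has_repeated_steps(["Add the numbers", "add  the  numbers"]))  # True
print(has_repeated_steps(["Find population", "Divide by 50"]))       # False
```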

outlines generation is slow on first call

The first call compiles the schema into a finite-state machine and builds a token index. This is a one-time cost. Reuse the generator object across calls – don’t recreate it:

# Do this once at startup
generator = outlines.generate.json(model, ReasoningChain)

# Reuse for every request
result_1 = generator("Question 1: ...")
result_2 = generator("Question 2: ...")

Structured output looks correct but parsing fails

Even with the schema enforced, OpenAI can return a refusal instead of parsed content – always check for it before reading .parsed:

message = completion.choices[0].message
if message.refusal:
    print(f"Model refused: {message.refusal}")
else:
    reasoning = message.parsed