Ask an LLM to “think step by step and return JSON,” and you’ll get valid JSON maybe 80% of the time. The other 20%? Trailing commas, missing closing braces, extra markdown fences, or the model deciding to narrate its thoughts in prose before the JSON blob. Grammar-based constrained decoding fixes this by restricting the model’s token sampling to only tokens that produce valid output at every generation step. The model literally cannot emit a token that would break your schema.
This matters most for reasoning chains. You want structured thought/action/result triples – not a blob of free text you have to regex apart. Here’s how to enforce that with outlines for local models, Pydantic schemas for type safety, and OpenAI’s structured outputs for API-based workflows.
Define Reasoning Chain Schemas with Pydantic
Before generating anything, define what a reasoning chain looks like. Pydantic models give you typed schemas that both outlines and OpenAI can consume directly.
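A minimal sketch of such a schema. The four action names are illustrative; swap in whatever actions your agent actually supports:

```python
from enum import Enum
from pydantic import BaseModel, Field

class ActionType(str, Enum):
    # Illustrative action set; adapt to your agent's tools
    SEARCH = "search"
    CALCULATE = "calculate"
    LOOKUP = "lookup"
    CONCLUDE = "conclude"

class ReasoningStep(BaseModel):
    thought: str        # why the model is taking this action
    action: ActionType  # must be one of the enum values above
    result: str         # what the action produced

class ReasoningChain(BaseModel):
    steps: list[ReasoningStep] = Field(min_length=1)  # at least one step
    final_answer: str
    confidence: float = Field(ge=0.0, le=1.0)
```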
This schema forces the model to produce discrete reasoning steps with typed actions, not a wall of unstructured text. The ActionType enum constrains what actions are valid. The min_length=1 on steps means the model must reason at least once – no skipping straight to an answer.
Constrained Generation with Outlines and Local Models
The outlines library compiles your Pydantic schema into a finite-state machine that masks invalid tokens at each generation step. This gives you O(1) token validation overhead – the constraint checking doesn’t slow down generation meaningfully.
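A sketch of the local-model flow, assuming the outlines 0.x API and the ReasoningChain model from the previous section in scope; the model name is illustrative:

```python
import outlines

# Any transformers-compatible model works; this name is illustrative
model = outlines.models.transformers("microsoft/Phi-3-mini-4k-instruct")

# Compiles the Pydantic schema into a finite-state machine
# over the tokenizer's vocabulary (ReasoningChain defined above)
generator = outlines.generate.json(model, ReasoningChain)

chain = generator("Is 1013 prime? Reason step by step, then answer.")
print(type(chain))       # a ReasoningChain instance, not a raw string
print(chain.confidence)  # guaranteed to be a float in [0.0, 1.0]
```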
The key thing happening here: outlines.generate.json(model, ReasoningChain) compiles the Pydantic schema into a token mask. At each decoding step, only tokens that keep the output on a valid path through the schema are allowed. The model can’t produce "action": "yell_loudly" because yell_loudly isn’t in the ActionType enum. It can’t skip the steps field or emit a confidence of 2.5.
You can also control whitespace formatting. By default, outlines produces compact JSON. If you want readable output:
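Two options, sketched assuming an outlines 0.x install (check that your version exposes the whitespace_pattern parameter) and the model and schema from above:

```python
# Option 1: let the grammar permit newlines and indentation during decoding
generator = outlines.generate.json(
    model, ReasoningChain, whitespace_pattern=r"[\n\t ]*"
)

# Option 2 (simpler): generate compact JSON, pretty-print after parsing
chain = generator("Is 1013 prime? Reason step by step.")
print(chain.model_dump_json(indent=2))  # Pydantic v2 re-serialization
```

Option 2 keeps the FSM small, since the grammar never has to model optional whitespace at every position.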
OpenAI Structured Outputs for Reasoning Chains
If you’re using the OpenAI API rather than local models, structured outputs give you the same schema enforcement server-side. OpenAI constrains token generation against your JSON schema during decoding – same principle as outlines, but running on their infrastructure.
The client.beta.chat.completions.parse() method accepts a Pydantic model directly and returns a parsed object:
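A sketch of the parse flow. Note the schema variant here drops the ge/le bounds on confidence, since OpenAI's JSON Schema subset rejects them; the model name is illustrative:

```python
from enum import Enum
from openai import OpenAI
from pydantic import BaseModel

class ActionType(str, Enum):
    SEARCH = "search"
    CALCULATE = "calculate"
    LOOKUP = "lookup"
    CONCLUDE = "conclude"

class ReasoningStep(BaseModel):
    thought: str
    action: ActionType
    result: str

class ReasoningChain(BaseModel):
    steps: list[ReasoningStep]
    final_answer: str
    confidence: float  # no ge/le bounds: OpenAI rejects them; check client-side

client = OpenAI()

completion = client.beta.chat.completions.parse(
    model="gpt-4o-2024-08-06",  # any model with structured-output support
    messages=[
        {"role": "system", "content": "Reason step by step, then answer."},
        {"role": "user", "content": "Is 1013 prime?"},
    ],
    response_format=ReasoningChain,  # the Pydantic model goes in directly
)

chain = completion.choices[0].message.parsed  # a ReasoningChain instance
if not 0.0 <= chain.confidence <= 1.0:
    raise ValueError("confidence out of range")  # bounds validated client-side
```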
A few important constraints with the OpenAI approach. You cannot use arbitrary Pydantic validators (like ge, le, or regex constraints) – OpenAI only supports a subset of JSON Schema. Stick to basic types, Optional, list, Literal, and Enum. If you need tight numeric bounds, validate after parsing.
For the newer Responses API (used by GPT-5.2+), structured outputs move to the text.format parameter:
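A sketch using the Python SDK's responses.parse helper, which maps a Pydantic model onto text.format; assumes the ReasoningChain model from above is in scope, and the model name follows the section's GPT-5.2 example:

```python
from openai import OpenAI

client = OpenAI()

response = client.responses.parse(
    model="gpt-5.2",             # adjust to whatever model you target
    input="Is 1013 prime? Reason step by step.",
    text_format=ReasoningChain,  # the SDK translates this to text.format
)

chain = response.output_parsed   # parsed ReasoningChain, or None on refusal
```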
How Grammar Constraints Fix Common Failure Modes
Without constrained decoding, LLMs fail at structured reasoning in predictable ways. Here’s what goes wrong and why grammars fix each one.
Partial JSON – The model hits a token limit mid-object and returns {"steps": [{"thought": "First I need to. Grammar constraints track nesting depth, so the model is forced to close all open brackets before the sequence ends. With outlines, the finite-state machine ensures every generated sequence is a complete, valid document.
Schema drift – The model invents new fields like "thinking_process" or renames "action" to "step_type". Grammar constraints only allow property names that exist in your schema. The token mask blocks any character sequence that doesn’t match a declared field name.
Type coercion errors – The model writes "confidence": "high" when you need a float. The grammar encodes type information, so after generating "confidence":, only digit tokens (and . for floats) are valid continuations.
Reasoning shortcuts – Without structure, models often skip steps and jump to an answer. A schema with min_length=1 on the steps array means the model must produce at least one reasoning step. You can increase this to force more thorough reasoning.
Enum violations – You define four valid actions, the model writes "action": "think_harder". Grammar constraints compiled from an enum only permit the literal string values you defined.
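The same enum guarantee is easy to see at the parsing layer: plain Pydantic validation rejects the violation after the fact, while constrained decoding prevents the invalid tokens from ever being sampled. A quick runnable check, reusing the illustrative action set from earlier:

```python
from enum import Enum
from pydantic import BaseModel, ValidationError

class ActionType(str, Enum):
    SEARCH = "search"
    CALCULATE = "calculate"
    LOOKUP = "lookup"
    CONCLUDE = "conclude"

class Step(BaseModel):
    thought: str
    action: ActionType

# Post-hoc validation catches the enum violation; a grammar-constrained
# decoder would never have emitted "think_harder" in the first place.
try:
    Step(thought="hmm", action="think_harder")
except ValidationError as err:
    print("rejected:", err.error_count(), "error(s)")
```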
Common Errors and Fixes
outlines runs out of GPU memory on schema compilation
Large schemas with many nested objects create big finite-state machines. Simplify by flattening nested models or splitting into smaller generation calls:
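A sketch of both tactics with hypothetical models; the nested Evidence type is just an example of the kind of structure that inflates the compiled FSM:

```python
from pydantic import BaseModel

# Before: a nested object array inflates the compiled FSM
class Evidence(BaseModel):
    source: str
    quote: str

class DeepStep(BaseModel):
    thought: str
    evidence: list[Evidence]

# After: flattened fields, smaller FSM
class FlatStep(BaseModel):
    thought: str
    evidence_source: str
    evidence_quote: str

# Alternative: split one big chain into per-step generation calls
# step_generator = outlines.generate.json(model, FlatStep)
# steps = [step_generator(prompt_for_step(i)) for i in range(max_steps)]
```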
OpenAI rejects your Pydantic schema
OpenAI’s structured outputs don’t support all JSON Schema features. Common culprits:
- min_length/max_length on arrays – remove these, validate client-side
- pattern regex constraints on strings – not supported, use post-validation
- Union types with more than a few variants – simplify the union or use Optional
- Missing default values on Optional fields – always set default=None explicitly
Model produces degenerate reasoning (same step repeated)
This isn’t a grammar problem – it’s a prompting problem. The grammar guarantees structure, not quality. Fix with better system prompts:
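One way to target repetition directly in the system prompt; the wording here is illustrative, not a canonical fix:

```python
# Tell the model explicitly what a good `steps` array looks like
SYSTEM_PROMPT = """You are a careful step-by-step reasoner.
Rules for the `steps` array:
- Each `thought` must build on the previous step, never repeat it.
- Change `action` between steps unless new information requires repeating one.
- Stop adding steps as soon as you can justify a final answer."""

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": "Is 1013 prime?"},
]
```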
outlines generation is slow on first call
The first call compiles the schema into a finite-state machine and builds a token index. This is a one-time cost. Reuse the generator object across calls – don’t recreate it:
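A sketch of the reuse pattern, again assuming the outlines 0.x API and the ReasoningChain model from earlier; the model name is illustrative:

```python
import outlines

model = outlines.models.transformers("microsoft/Phi-3-mini-4k-instruct")
questions = ["Is 1013 prime?", "Is 1014 prime?"]

# Slow: recompiles the FSM and rebuilds the token index on every call
# for question in questions:
#     chain = outlines.generate.json(model, ReasoningChain)(question)

# Fast: pay the compilation cost once, then reuse the generator
generator = outlines.generate.json(model, ReasoningChain)

for question in questions:
    chain = generator(question)  # no recompilation, just constrained decoding
```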
Structured output looks correct but parsing fails
If you’re using the raw JSON schema approach (not Pydantic .parse()), remember to handle the refusal case with OpenAI:
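A sketch of the refusal check; the inline JSON schema is deliberately abbreviated for illustration, and the model name is one example of a structured-output-capable model:

```python
import json
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-2024-08-06",
    messages=[
        {"role": "user", "content": "Is 1013 prime? Reason step by step."}
    ],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "reasoning_chain",
            "strict": True,
            "schema": {  # abbreviated schema for illustration
                "type": "object",
                "properties": {"final_answer": {"type": "string"}},
                "required": ["final_answer"],
                "additionalProperties": False,
            },
        },
    },
)

message = response.choices[0].message
if message.refusal:
    # The model declined; message.content is not JSON, so don't parse it
    print("Refused:", message.refusal)
else:
    chain = json.loads(message.content)
```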
Related Guides
- How to Build LLM Output Validators with Instructor and Pydantic
- How to Build Structured Output Parsers with Pydantic and LLMs
- How to Use GPT-5.2 Structured Outputs for Reliable JSON
- How to Build Prompt Guardrails with Structured Output Schemas
- How to Build Prompt Chains with Tool Results and Structured Outputs
- How to Build Multi-Step Prompt Chains with Structured Outputs
- How to Build Dynamic Prompt Routers with LLM Cascading
- How to Build Chain-of-Thought Prompts That Actually Work
- How to Build Few-Shot Prompt Templates with Dynamic Examples
- How to Build Prompt Caching Strategies for Multi-Turn LLM Sessions