Chain-of-thought prompting is the single most effective technique for getting LLMs to reason through hard problems instead of guessing. The core idea: force the model to show intermediate steps before producing a final answer. Wei et al. showed in 2022 that adding a few reasoning exemplars to a prompt boosted GSM8K math accuracy from 18% to 57% on PaLM 540B. Kojima et al. found that just appending “Let’s think step by step” improves zero-shot arithmetic accuracy from the teens to 70-80%.
But here’s what most guides skip: CoT doesn’t always help, newer models have diminishing returns, and bad CoT prompts can actually hurt performance. This post covers what works, what doesn’t, and how to measure the difference with real code.
Zero-Shot CoT: The Simplest Win
Zero-shot CoT requires zero examples. You just tell the model to reason before answering. This is your baseline – try it first before building anything more complex.
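A minimal sketch of the idea, using the pricing question described below (the message format follows OpenAI-style chat APIs; the model name and client call are illustrative):

```python
def zero_shot_cot_messages(question: str) -> list[dict]:
    # Append the trigger phrase so the model lays out intermediate
    # steps before committing to a final answer.
    return [{
        "role": "user",
        "content": f"{question}\n\nLet's think step by step.",
    }]

messages = zero_shot_cot_messages(
    "A jacket costs $20. A 15% discount is applied, then 8% sales tax "
    "is added. What is the final price?"
)

# Send with your client of choice, e.g. the OpenAI SDK:
#   client.chat.completions.create(model="gpt-4o-mini", messages=messages)
```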
Without CoT, models frequently skip steps and land on the wrong number. With the step-by-step instruction, you’ll see the model lay out: base price ($20), discount ($3), subtotal ($17), tax ($1.36), final ($18.36). The reasoning chain acts as a self-check.
The exact trigger phrase matters less than you’d think. “Think step by step”, “Walk through your reasoning”, and “Show your work” all perform similarly. What matters is that you explicitly ask for intermediate steps before the final answer.
Few-Shot CoT: Exemplars That Teach Reasoning
Zero-shot works for straightforward problems. For domain-specific or tricky multi-step tasks, few-shot CoT – where you provide worked examples – is significantly more reliable. The key insight from the original Wei et al. paper: the exemplars teach the model the format of reasoning, not just the answer pattern.
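One way to structure this, as a sketch (the exemplar contents here are made up for illustration; write your own for your domain):

```python
# Hand-written exemplars. Each shows explicit intermediate arithmetic
# and ends with a consistent answer line for easy extraction.
EXEMPLARS = [
    {
        "question": "A store has 23 apples. It sells 9, then receives a "
                    "delivery of 15. How many apples does it have now?",
        "reasoning": "The store starts with 23 apples. After selling 9, "
                     "23 - 9 = 14 remain. The delivery adds 15, so "
                     "14 + 15 = 29.",
        "answer": "29",
    },
    {
        "question": "A train travels 60 miles in the first hour and 45 "
                    "miles in the second. How far does it travel in total?",
        "reasoning": "First hour: 60 miles. Second hour: 45 miles. "
                     "Total: 60 + 45 = 105 miles.",
        "answer": "105",
    },
]

def few_shot_cot_prompt(question: str) -> str:
    # Worked examples first, then the target question with an open "A:".
    parts = [
        f"Q: {ex['question']}\nA: {ex['reasoning']} "
        f"The answer is {ex['answer']}."
        for ex in EXEMPLARS
    ]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)
```

The consistent "The answer is X." suffix is what makes answer extraction trivial downstream.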
A few rules for building effective exemplars:
- Match the complexity. If your target problem has 4 steps, your exemplars should have 3-5 steps. One-step exemplars teach the model to take shortcuts.
- Show the math explicitly. Write out `2 * 3 = 6` rather than jumping to the result. The model mirrors this explicitness.
- Keep 2-4 exemplars. More isn’t better – past 5-6 examples, you’re burning context window for minimal accuracy gain.
- End each exemplar with a consistent format like “The answer is X.” This makes extraction trivial.
Self-Consistency: Vote Across Multiple Reasoning Paths
Wang et al. (2023) showed that sampling multiple CoT responses and taking a majority vote boosts accuracy by 12-18% on math benchmarks. The intuition: a correct answer can be reached through different valid reasoning paths, but wrong answers tend to be wrong in different ways.
This is my top recommendation for any production system where accuracy matters more than latency.
The temperature=0.7 is deliberate. You need diverse reasoning paths for the vote to work. At temperature=0, you’ll get the same answer 5 times, which defeats the purpose. But don’t go above 0.9 – the reasoning gets too noisy.
Cost-wise, self-consistency multiplies your API spend by n_samples. For most problems, 5 samples is the sweet spot. Going to 10+ rarely changes the outcome but doubles the cost.
Structured CoT for Code Generation
When you need an LLM to write code, freeform CoT tends to produce rambling explanations. Structured CoT (SCoT) constrains the reasoning to follow program structure: identify inputs, plan control flow (sequence, branches, loops), then generate code. Research by Li et al. showed this improves Pass@1 by up to 16% on HumanEval.
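A sketch of an SCoT-style prompt template. The wording is mine, not lifted from the paper – the point is forcing the plan (inputs, control flow, edge cases) before any code:

```python
SCOT_TEMPLATE = """\
You are writing a Python function.

Task: {task}

Before writing any code, plan using this structure:
1. Inputs and outputs: name each parameter and the return type.
2. Control flow: describe the sequences, branches, and loops you will use.
3. Edge cases: what happens with empty input, a single element, or
   invalid values?

Then write the function in a single fenced code block.
"""

def scot_prompt(task: str) -> str:
    return SCOT_TEMPLATE.format(task=task)

prompt = scot_prompt("Return the second-largest value in a list of integers.")
```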
This structured approach forces the model to think about edge cases before writing a single line of code. Without it, you get solutions that work on happy paths but break on empty lists or single-element inputs.
When CoT Hurts Performance
CoT is not universally beneficial. Knowing when to skip it saves you tokens and avoids actively degrading your results.
Simple factual lookups. “What’s the capital of France?” doesn’t need reasoning steps. CoT adds latency and can introduce hallucinated reasoning that leads to a wrong answer.
Pattern recognition tasks. Research from Princeton (2024) showed that CoT reduces accuracy by up to 36% on implicit statistical learning tasks – problems where humans perform better with gut instinct than deliberate analysis. If the task is more about pattern matching than logic, skip CoT.
Already-reasoning models. The Wharton Generative AI Labs published findings in 2025 showing that models like o3-mini and o4-mini gain marginal accuracy from explicit CoT prompts (they already reason internally) but take 20-80% longer to respond. If you’re using a reasoning model, you’re paying the CoT cost already – don’t double it.
Short, single-step problems. If the answer requires one arithmetic operation or one lookup, CoT just adds noise. The Wharton study found that for simpler tasks, improvements were “small or even negative.”
My rule of thumb: use CoT when the problem requires 3+ reasoning steps. Below that threshold, direct prompting is faster and equally accurate.
Measuring CoT Effectiveness
Don’t guess whether CoT is helping. Measure it. Here’s a lightweight benchmark you can adapt for any domain:
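A minimal paired harness, as a sketch. The model call is injected via `ask_fn` so it works with any client, and substring matching on the expected answer is a deliberate simplification – swap in stricter extraction for production:

```python
def evaluate_cot(dataset, ask_fn):
    """Compare direct prompting vs. zero-shot CoT.

    dataset: list of (question, expected_answer) pairs.
    ask_fn(prompt) -> the model's response string (inject your client).
    """
    hits = {"direct": 0, "cot": 0}
    for question, expected in dataset:
        direct_response = ask_fn(question)
        cot_response = ask_fn(
            f"{question}\n\nThink step by step, then end with "
            f"'The answer is X.'"
        )
        # Substring match is crude but fine for a first pass.
        hits["direct"] += int(expected in direct_response)
        hits["cot"] += int(expected in cot_response)
    n = len(dataset)
    return {mode: count / n for mode, count in hits.items()}
```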
Run this on your actual use case, not generic benchmarks. CoT might boost your medical question-answering pipeline by 30% but do nothing for your classification task. The only way to know is to test with your data.
Common Errors and Fixes
“The model ignores my CoT instruction and gives a direct answer”
This happens most often with shorter prompts. The fix: put the reasoning instruction in the system message, not just the user message. Models weight system instructions more heavily.
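For example (the exact wording of the system instruction is illustrative):

```python
messages = [
    {
        "role": "system",
        "content": (
            "Always reason through problems step by step before answering. "
            "End every response with 'The answer is X.'"
        ),
    },
    {
        "role": "user",
        "content": "A jacket costs $20 with a 15% discount and 8% tax. "
                   "What is the final price?",
    },
]
# Pass `messages` to your chat-completions client as usual.
```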
“CoT reasoning is correct but the final answer is wrong”
The model reasons through all the steps correctly, then botches the last line. This is frustratingly common. Fix it by adding a verification step to your prompt: “After reaching your answer, verify it by checking your arithmetic.”
“Self-consistency returns different answer formats”
When voting across multiple samples, you’ll get answers like “42”, “$42.00”, “42 dollars”, and “forty-two”. Normalize before counting. Strip currency symbols, convert words to numbers, and round floats consistently. The regex extraction in the self-consistency example above handles the most common case, but production code needs more aggressive normalization.
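A sketch of that heavier normalization (the number-word map is an illustrative subset – extend it, or use a parsing library, for real coverage):

```python
import re

_NUMBER_WORDS = {  # illustrative subset -- extend for your domain
    "zero": "0", "one": "1", "two": "2", "three": "3",
    "forty-two": "42",
}

def normalize_answer(raw: str) -> str:
    text = raw.strip().lower()
    # Strip currency symbols and thousands separators.
    text = text.replace("$", "").replace(",", "")
    text = re.sub(r"\s*(dollars|usd)\b", "", text).strip()
    if text in _NUMBER_WORDS:
        return _NUMBER_WORDS[text]
    try:
        value = float(text)
        # Round consistently so "42" and "42.00" vote together.
        return str(int(value)) if value == int(value) else f"{value:.2f}"
    except ValueError:
        return text
```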
“API returns RateLimitError during self-consistency sampling”
Using n=5 in a single API call is fine, but if you’re running self-consistency across many questions in a loop, you’ll hit rate limits fast. Add exponential backoff:
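A generic retry wrapper, as a sketch – in real code, narrow the `except` clause to your SDK's rate-limit exception (e.g. `openai.RateLimitError`) rather than catching everything:

```python
import random
import time

def with_backoff(call, max_retries=5, base_delay=1.0):
    # Retry `call` with exponential backoff plus jitter.
    for attempt in range(max_retries):
        try:
            return call()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries; surface the error
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            time.sleep(delay)

# Usage (the wrapped call is illustrative):
#   answer = with_backoff(lambda: sample_fn(question))
```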
“CoT makes my responses too long for the context window”
Few-shot CoT exemplars eat tokens fast. If you’re running into context limits, switch from few-shot to zero-shot CoT (you’ll lose some accuracy but gain headroom), or move your exemplars into the system message with "role": "system", which some models handle more efficiently.
Picking the Right CoT Strategy
Here’s my opinionated decision tree:
- Simple question, fast response needed – Skip CoT entirely. Direct prompting.
- Multi-step reasoning, moderate accuracy – Zero-shot CoT. Just add “Think step by step.”
- Domain-specific reasoning, high accuracy – Few-shot CoT with 2-4 hand-crafted exemplars.
- Mission-critical accuracy, latency flexible – Self-consistency with 5 samples and majority vote.
- Code generation – Structured CoT with explicit planning steps before code output.
- Using o3-mini, o4-mini, or similar reasoning models – Skip explicit CoT. These models already do it internally.
Start with zero-shot CoT. If accuracy isn’t where you need it, add exemplars. If you need production-grade reliability, add self-consistency. Don’t over-engineer from the start.
Related Guides
- How to Build LLM Output Validators with Instructor and Pydantic
- How to Build Structured Output Parsers with Pydantic and LLMs
- How to Build Few-Shot Prompt Templates with Dynamic Examples
- How to Build Multi-Language Prompts with Automatic Translation
- How to Use Claude Sonnet 4.6’s 1M Token Context Window for Long-Document Reasoning
- How to Build Automatic Prompt Optimization with DSPy
- How to Build Structured Reasoning Chains with LLM Grammars
- How to Build Dynamic Prompt Routers with LLM Cascading
- How to Build Prompt Guardrails with Structured Output Schemas
- How to Write Effective System Prompts for LLMs