Chain-of-thought prompting is the single most effective technique for getting LLMs to reason through hard problems instead of guessing. The core idea: force the model to show intermediate steps before producing a final answer. Wei et al. showed in 2022 that adding a few reasoning exemplars to a prompt boosted GSM8K math accuracy from 18% to 57% on PaLM 540B. Kojima et al. found that just appending “Let’s think step by step” improves zero-shot arithmetic accuracy from the teens to 70-80%.

But here’s what most guides skip: CoT doesn’t always help, newer models have diminishing returns, and bad CoT prompts can actually hurt performance. This post covers what works, what doesn’t, and how to measure the difference with real code.

Zero-Shot CoT: The Simplest Win

Zero-shot CoT requires zero examples. You just tell the model to reason before answering. This is your baseline – try it first before building anything more complex.

from openai import OpenAI

client = OpenAI()

def ask_with_cot(question: str, use_cot: bool = True) -> str:
    """Compare responses with and without chain-of-thought."""
    if use_cot:
        prompt = f"{question}\n\nThink through this step by step before giving your final answer."
    else:
        prompt = question

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content

# Test on a multi-step math problem
question = """A store sells notebooks for $4 each. If you buy 3 or more,
you get 15% off the total. Tax is 8%. How much do you pay for 5 notebooks?"""

print("=== Without CoT ===")
print(ask_with_cot(question, use_cot=False))

print("\n=== With CoT ===")
print(ask_with_cot(question, use_cot=True))

Without CoT, models frequently skip steps and land on the wrong number. With the step-by-step instruction, you’ll see the model lay out: base price ($20), discount ($3), subtotal ($17), tax ($1.36), final ($18.36). The reasoning chain acts as a self-check.
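The chain is easy to verify by hand – a few lines of plain Python reproduce the same numbers:

```python
# Recompute the worked example outside the model to sanity-check the chain.
unit_price = 4.00
quantity = 5

subtotal = unit_price * quantity          # base price: $20.00
if quantity >= 3:
    subtotal *= 1 - 0.15                  # 15% discount: -$3.00 -> $17.00
total = round(subtotal * 1.08, 2)         # 8% tax: +$1.36 -> $18.36

print(f"${total}")  # $18.36
```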

The exact trigger phrase matters less than you’d think. “Think step by step”, “Walk through your reasoning”, and “Show your work” all perform similarly. What matters is that you explicitly ask for intermediate steps before the final answer.

Few-Shot CoT: Exemplars That Teach Reasoning

Zero-shot works for straightforward problems. For domain-specific or tricky multi-step tasks, few-shot CoT – where you provide worked examples – is significantly more reliable. The key insight from the original Wei et al. paper: the exemplars teach the model the format of reasoning, not just the answer pattern.

from openai import OpenAI

client = OpenAI()

FEW_SHOT_EXEMPLARS = """Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
Each can has 3 tennis balls. How many does he have now?

A: Roger started with 5 balls. He bought 2 cans, each with 3 balls,
so he got 2 * 3 = 6 new balls. Total: 5 + 6 = 11 tennis balls.
The answer is 11.

Q: A restaurant had 23 guests at lunch. 5 left, then 12 more arrived
for dinner. 3 of the dinner guests left early. How many guests remain?

A: Started with 23 guests. After 5 left: 23 - 5 = 18. After 12 arrived:
18 + 12 = 30. After 3 left early: 30 - 3 = 27 guests remain.
The answer is 27."""


def few_shot_cot(question: str) -> str:
    prompt = f"""{FEW_SHOT_EXEMPLARS}

Q: {question}

A:"""

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content


result = few_shot_cot(
    "A farmer has 3 fields. Each field has 12 rows of corn with 8 plants per row. "
    "He loses 15% of all plants to drought. How many plants survive?"
)
print(result)

A few rules for building effective exemplars:

  • Match the complexity. If your target problem has 4 steps, your exemplars should have 3-5 steps. One-step exemplars teach the model to take shortcuts.
  • Show the math explicitly. Write out 2 * 3 = 6 rather than jumping to the result. The model mirrors this explicitness.
  • Keep 2-4 exemplars. More isn’t better – past 5-6 examples, you’re burning context window for minimal accuracy gain.
  • End each exemplar with a consistent format like “The answer is X.” This makes extraction trivial.

Self-Consistency: Vote Across Multiple Reasoning Paths

Wang et al. (2023) showed that sampling multiple CoT responses and taking a majority vote boosts accuracy by 12-18% on math benchmarks. The intuition: a correct answer can be reached through different valid reasoning paths, but wrong answers tend to be wrong in different ways.

This is my top recommendation for any production system where accuracy matters more than latency.

import re
from collections import Counter
from openai import OpenAI

client = OpenAI()


def self_consistency_cot(question: str, n_samples: int = 5) -> dict:
    """Sample multiple CoT reasoning paths and take a majority vote."""
    prompt = f"""{question}

Think through this step by step. Show your reasoning, then end with
'The answer is <number>.' on its own line."""

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,  # Higher temp = more diverse reasoning paths
        n=n_samples,
    )

    answers = []
    reasoning_paths = []

    for choice in response.choices:
        text = choice.message.content
        reasoning_paths.append(text)

        # Extract the final numeric answer
        match = re.search(r"The answer is [\$]?([\d,]+\.?\d*)", text)
        if match:
            answers.append(match.group(1).replace(",", ""))

    vote_counts = Counter(answers)
    winner = vote_counts.most_common(1)[0] if vote_counts else ("No answer", 0)

    return {
        "final_answer": winner[0],
        "vote_count": winner[1],
        "total_samples": n_samples,
        "confidence": winner[1] / n_samples if answers else 0,
        "all_answers": dict(vote_counts),
    }


result = self_consistency_cot(
    "A bookstore sells hardcovers for $24 and paperbacks for $9. "
    "Maria buys 3 hardcovers and 7 paperbacks. Members get 20% off "
    "hardcovers only. Maria is a member. How much does she spend?"
)
print(f"Answer: ${result['final_answer']}")
print(f"Confidence: {result['confidence']:.0%} ({result['vote_count']}/{result['total_samples']})")
print(f"Vote distribution: {result['all_answers']}")

The temperature=0.7 is deliberate. You need diverse reasoning paths for the vote to work. At temperature=0, you’ll get the same answer 5 times, which defeats the purpose. But don’t go above 0.9 – the reasoning gets too noisy.

Cost-wise, self-consistency multiplies your API spend by n_samples. For most problems, 5 samples is the sweet spot. Going to 10+ rarely changes the outcome but doubles the cost.

Structured CoT for Code Generation

When you need an LLM to write code, freeform CoT tends to produce rambling explanations. Structured CoT (SCoT) constrains the reasoning to follow program structure: identify inputs, plan control flow (sequence, branches, loops), then generate code. Research by Li et al. showed this improves Pass@1 by up to 16% on HumanEval.

from openai import OpenAI

client = OpenAI()

SCOT_SYSTEM_PROMPT = """You are a Python developer. When given a coding task,
reason through it using this exact structure before writing code:

1. INPUTS: What data comes in? What are the types and constraints?
2. PLAN: Break the algorithm into steps using only:
   - Sequential steps (do A, then B)
   - Branches (if condition, do X, else do Y)
   - Loops (for each item in collection, do Z)
3. EDGE CASES: What can go wrong? Empty inputs, negative numbers, type errors?
4. CODE: Write the final Python function with type hints.
5. TEST: Show 2-3 test cases with expected output."""

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": SCOT_SYSTEM_PROMPT},
        {
            "role": "user",
            "content": "Write a function that finds the longest consecutive "
            "sequence in an unsorted list of integers.",
        },
    ],
    temperature=0,
)
print(response.choices[0].message.content)

This structured approach forces the model to think about edge cases before writing a single line of code. Without it, you get solutions that work on happy paths but break on empty lists or single-element inputs.

When CoT Hurts Performance

CoT is not universally beneficial. Knowing when to skip it saves you tokens and avoids actively degrading your results.

Simple factual lookups. “What’s the capital of France?” doesn’t need reasoning steps. CoT adds latency and can introduce hallucinated reasoning that leads to a wrong answer.

Pattern recognition tasks. Research from Princeton (2024) showed that CoT reduces accuracy by up to 36% on implicit statistical learning tasks – problems where humans perform better with gut instinct than deliberate analysis. If the task is more about pattern matching than logic, skip CoT.

Already-reasoning models. The Wharton Generative AI Labs published findings in 2025 showing that models like o3-mini and o4-mini gain marginal accuracy from explicit CoT prompts (they already reason internally) but take 20-80% longer to respond. If you’re using a reasoning model, you’re paying the CoT cost already – don’t double it.

Short, single-step problems. If the answer requires one arithmetic operation or one lookup, CoT just adds noise. The Wharton study found that for simpler tasks, improvements were “small or even negative.”

My rule of thumb: use CoT when the problem requires 3+ reasoning steps. Below that threshold, direct prompting is faster and equally accurate.
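That threshold is easy to encode. A toy sketch – `estimated_steps` is whatever heuristic you have for the problem class (hand-labeled, or a cheap classifier), and the cutoff of 3 is the rule of thumb above, not a measured constant:

```python
COT_TRIGGER = "\n\nThink through this step by step before giving your final answer."


def build_prompt(question: str, estimated_steps: int) -> str:
    """Only pay the CoT cost when the problem likely needs multi-step reasoning."""
    if estimated_steps >= 3:
        return question + COT_TRIGGER
    return question  # direct prompting: faster, equally accurate on short problems


print(build_prompt("What's the capital of France?", estimated_steps=1))
```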

Measuring CoT Effectiveness

Don’t guess whether CoT is helping. Measure it. Here’s a lightweight benchmark you can adapt for any domain:

from openai import OpenAI

client = OpenAI()

TEST_CASES = [
    {
        "question": "A train travels 120 miles in 2 hours, then 90 miles in 1.5 hours. What is the average speed for the entire trip?",
        "answer": "60",
    },
    {
        "question": "You have 3 red marbles and 5 blue marbles. You draw 2 without replacement. What's the probability both are red?",
        "answer": "3/28",
    },
    {
        "question": "A rectangle's length is 3 times its width. The perimeter is 48cm. What is the area?",
        "answer": "108",
    },
]


def evaluate(use_cot: bool, test_cases: list[dict]) -> float:
    correct = 0
    for tc in test_cases:
        prompt = tc["question"]
        if use_cot:
            prompt += "\n\nThink step by step, then give your final answer after 'ANSWER:'."

        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        result = response.choices[0].message.content

        # Naive substring match – fine for a demo; use stricter answer
        # extraction (like the regex above) in production.
        if tc["answer"] in result:
            correct += 1

    return correct / len(test_cases)


baseline = evaluate(use_cot=False, test_cases=TEST_CASES)
with_cot = evaluate(use_cot=True, test_cases=TEST_CASES)

print(f"Baseline accuracy: {baseline:.0%}")
print(f"CoT accuracy:      {with_cot:.0%}")
print(f"Improvement:       {with_cot - baseline:+.0%}")

Run this on your actual use case, not generic benchmarks. CoT might boost your medical question-answering pipeline by 30% but do nothing for your classification task. The only way to know is to test with your data.

Common Errors and Fixes

“The model ignores my CoT instruction and gives a direct answer”

This happens most often with shorter prompts. The fix: put the reasoning instruction in the system message, not just the user message. Models weight system instructions more heavily.

# Bad: CoT instruction gets ignored
messages = [{"role": "user", "content": "What is 15% of 340? Think step by step."}]

# Good: CoT instruction in system message
messages = [
    {"role": "system", "content": "Always show your reasoning step by step before giving a final answer."},
    {"role": "user", "content": "What is 15% of 340?"},
]

“CoT reasoning is correct but the final answer is wrong”

The model reasons through all the steps correctly, then botches the last line. This is frustratingly common. Fix it by adding a verification step to your prompt: “After reaching your answer, verify it by checking your arithmetic.”

“Self-consistency returns different answer formats”

When voting across multiple samples, you’ll get answers like “42”, “$42.00”, “42 dollars”, and “forty-two”. Normalize before counting. Strip currency symbols, convert words to numbers, and round floats consistently. The regex extraction in the self-consistency example above handles the most common case, but production code needs more aggressive normalization.
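A sketch of that normalization – the word-to-number table here is deliberately tiny and illustrative; extend it (or use a library) for real coverage:

```python
import re

# Illustrative word map – extend for real coverage.
_WORDS = {"forty-two": "42", "zero": "0", "ten": "10"}


def normalize_answer(raw: str) -> str:
    """Map '42', '$42.00', '42 dollars', 'forty-two' to a single vote key."""
    text = _WORDS.get(raw.strip().lower().rstrip("."), raw.strip().lower())
    # Strip currency symbols and unit words; keep digits and the decimal point.
    text = re.sub(r"[^\d.]", "", text) or text
    # Drop trailing zeros so "42.00" and "42" vote together.
    if "." in text:
        text = text.rstrip("0").rstrip(".")
    return text


for raw in ["42", "$42.00", "42 dollars", "forty-two"]:
    print(normalize_answer(raw))  # each prints "42"
```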

“API returns RateLimitError during self-consistency sampling”

Using n=5 in a single API call is fine, but if you’re running self-consistency across many questions in a loop, you’ll hit rate limits fast. Add exponential backoff:

import time
from openai import RateLimitError

def retry_with_backoff(func, max_retries=3):
    for attempt in range(max_retries):
        try:
            return func()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise  # out of retries – surface the original error
            wait = 2 ** attempt
            print(f"Rate limited. Waiting {wait}s...")
            time.sleep(wait)

“CoT makes my responses too long for the context window”

Few-shot CoT exemplars eat tokens fast. If you’re running into context limits, switch from few-shot to zero-shot CoT (you’ll lose some accuracy but gain headroom), or trim down to your two strongest exemplars. Moving exemplars into the system message keeps the user turn clean, but it doesn’t save tokens – the full conversation still counts against the context window.
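Before cutting exemplars, check whether you’re actually near the limit. A rough budget check, assuming the common ~4-characters-per-token approximation for English text (swap in a real tokenizer such as tiktoken for exact counts):

```python
def rough_token_count(text: str) -> int:
    # ~4 chars/token is a crude approximation for English; fine for budgeting.
    return len(text) // 4


def fits_few_shot(
    exemplars: str, question: str, context_limit: int, reserve_for_output: int = 1000
) -> bool:
    """Leave headroom for the model's reasoning and answer tokens."""
    used = rough_token_count(exemplars + "\n\n" + question)
    return used + reserve_for_output <= context_limit


print(fits_few_shot("Q: ...\nA: ..." * 50, "How many plants survive?", context_limit=8000))  # True
```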

Picking the Right CoT Strategy

Here’s my opinionated decision tree:

  • Simple question, fast response needed – Skip CoT entirely. Direct prompting.
  • Multi-step reasoning, moderate accuracy – Zero-shot CoT. Just add “Think step by step.”
  • Domain-specific reasoning, high accuracy – Few-shot CoT with 2-4 hand-crafted exemplars.
  • Mission-critical accuracy, latency flexible – Self-consistency with 5 samples and majority vote.
  • Code generation – Structured CoT with explicit planning steps before code output.
  • Using o3-mini, o4-mini, or similar reasoning models – Skip explicit CoT. These models already do it internally.

Start with zero-shot CoT. If accuracy isn’t where you need it, add exemplars. If you need production-grade reliability, add self-consistency. Don’t over-engineer from the start.