Single-shot prompting has a ceiling. You ask one model one question, you get one perspective. Multi-agent debate breaks through that ceiling by forcing multiple LLM agents to argue different sides of a question, then having a judge pick the strongest answer. Research from MIT and Google has shown this approach improves factual accuracy and reasoning quality, especially on math, logic, and open-ended questions where a single model tends to lock into its first guess.

Here is the minimal version. Two agents debate, a judge decides:

from openai import OpenAI

client = OpenAI()

def quick_debate(question: str) -> str:
    pro = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You argue IN FAVOR of the most common answer to the user's question. Be concise and logical."},
            {"role": "user", "content": question},
        ],
    ).choices[0].message.content

    con = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You argue AGAINST the most common answer to the user's question. Find flaws and present an alternative. Be concise and logical."},
            {"role": "user", "content": question},
        ],
    ).choices[0].message.content

    verdict = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are an impartial judge. Read both arguments and pick the one with stronger logic and evidence. State your verdict clearly."},
            {"role": "user", "content": f"Question: {question}\n\nArgument A (Pro):\n{pro}\n\nArgument B (Con):\n{con}"},
        ],
    ).choices[0].message.content

    return verdict

print(quick_debate("Is it better to microservice or monolith a new startup backend?"))

That works, but it is one round with no back-and-forth. The real power comes when agents can see and respond to each other’s arguments across multiple rounds. Let’s build that properly.

Defining Agent Roles

Each debate agent needs a stance and the ability to track conversation history. Wrapping this in a class keeps things clean:

from openai import OpenAI

client = OpenAI()


class DebateAgent:
    def __init__(self, name: str, stance: str, model: str = "gpt-4o"):
        self.name = name
        self.model = model
        self.system_prompt = (
            f"You are {name}, a debate participant. Your assigned stance: {stance}. "
            f"Argue your position with clear logic and specific evidence. "
            f"When responding to opponents, address their points directly before making your own. "
            f"Keep responses under 200 words."
        )
        self.history: list[dict[str, str]] = [
            {"role": "system", "content": self.system_prompt}
        ]

    def argue(self, prompt: str) -> str:
        self.history.append({"role": "user", "content": prompt})
        response = client.chat.completions.create(
            model=self.model,
            messages=self.history,
            temperature=0.7,
            max_tokens=500,
        )
        reply = response.choices[0].message.content
        self.history.append({"role": "assistant", "content": reply})
        return reply

    def reset(self):
        self.history = [{"role": "system", "content": self.system_prompt}]

The history list is key. Each agent accumulates context from the full debate, so later rounds reference earlier arguments. The temperature=0.7 gives enough creativity to find novel angles without going off the rails.

Running a Debate Round

A debate round sends each agent the opponent’s latest argument and collects their response. Multiple rounds let agents refine their positions:

def debate_round(
    agents: list[DebateAgent],
    question: str,
    num_rounds: int = 3,
) -> list[dict]:
    transcript: list[dict] = []

    # Opening statements
    for agent in agents:
        opening = agent.argue(f"Question: {question}\n\nMake your opening argument.")
        transcript.append({
            "round": 0,
            "agent": agent.name,
            "argument": opening,
        })
        print(f"[Round 0] {agent.name}:\n{opening}\n{'─' * 40}")

    # Back-and-forth rounds
    for round_num in range(1, num_rounds + 1):
        for agent in agents:
            opponent_args = [
                t for t in transcript
                if t["agent"] != agent.name and t["round"] == round_num - 1
            ]
            opponent_text = "\n\n".join(
                f"{t['agent']} said: {t['argument']}" for t in opponent_args
            )

            prompt = (
                f"Round {round_num}. Your opponents argued:\n\n{opponent_text}\n\n"
                f"Respond to their points and strengthen your position."
            )
            reply = agent.argue(prompt)
            transcript.append({
                "round": round_num,
                "agent": agent.name,
                "argument": reply,
            })
            print(f"[Round {round_num}] {agent.name}:\n{reply}\n{'─' * 40}")

    return transcript

Each agent sees what the others said in the previous round. This creates genuine back-and-forth where agents counter specific claims rather than just restating their position. Two or three rounds is the sweet spot; beyond that, arguments start going in circles.
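The opponent-filtering step can be checked in isolation with a hand-built transcript (the entries below are illustrative, no API calls needed):

```python
# Illustrative transcript entries, same shape as debate_round produces
transcript = [
    {"round": 0, "agent": "Advocate", "argument": "Monoliths ship faster."},
    {"round": 0, "agent": "Critic", "argument": "Microservices scale teams."},
    {"round": 1, "agent": "Advocate", "argument": "Scaling is a later problem."},
]

def opponents_last_round(transcript: list[dict], agent_name: str, round_num: int) -> list[dict]:
    # Same filter as debate_round: everyone else's previous-round entries
    return [
        t for t in transcript
        if t["agent"] != agent_name and t["round"] == round_num - 1
    ]

# Critic entering round 1 sees only Advocate's round-0 opening
args = opponents_last_round(transcript, "Critic", 1)
print([t["agent"] for t in args])  # ['Advocate']
```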

Adding a Judge Agent

The judge reads the full transcript and scores each agent on specific criteria. Structured output makes the verdict easy to parse:

import json


def judge_debate(question: str, transcript: list[dict]) -> dict:
    formatted_transcript = ""
    for entry in transcript:
        formatted_transcript += (
            f"[Round {entry['round']}] {entry['agent']}:\n"
            f"{entry['argument']}\n\n"
        )

    judge_prompt = f"""You are an impartial debate judge. Evaluate the following debate.

Question: {question}

Transcript:
{formatted_transcript}

Score each participant on these criteria (1-10 each):
- Logic: How sound is their reasoning?
- Evidence: Do they cite specific facts or examples?
- Rebuttal: How well do they address opponent arguments?
- Persuasiveness: Overall strength of their position?

Return your evaluation as JSON with this exact structure:
{{
  "scores": {{
    "<agent_name>": {{"logic": N, "evidence": N, "rebuttal": N, "persuasiveness": N, "total": N}},
    ...
  }},
  "winner": "<agent_name>",
  "reasoning": "<2-3 sentence explanation>"
}}

Return ONLY valid JSON, no other text."""

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a fair and analytical debate judge. Always respond with valid JSON."},
            {"role": "user", "content": judge_prompt},
        ],
        temperature=0.3,
        max_tokens=1000,
    )

    raw = response.choices[0].message.content.strip()
    # Handle markdown code fences the model sometimes adds
    if raw.startswith("```"):
        raw = raw.split("\n", 1)[1].rsplit("```", 1)[0].strip()

    return json.loads(raw)

The judge runs at lower temperature (0.3) for more consistent scoring. Asking for JSON directly works well with GPT-4o, and the code fence stripping handles cases where the model wraps the output in markdown. You can also pass response_format={"type": "json_object"} to the API to enforce JSON mode.
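The fence-stripping logic can be exercised on a canned reply (the string below is an illustrative example of what the model sometimes returns):

```python
import json

# A canned judge reply wrapped in a markdown code fence (illustrative)
raw = '```json\n{"winner": "Advocate", "reasoning": "Stronger rebuttals."}\n```'

# Same stripping as judge_debate: drop the opening fence line and trailing fence
if raw.startswith("```"):
    raw = raw.split("\n", 1)[1].rsplit("```", 1)[0].strip()

verdict = json.loads(raw)
print(verdict["winner"])  # Advocate
```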

Full Debate Pipeline

Here is everything wired together into a single function you can call:

def run_debate(question: str, rounds: int = 3) -> dict:
    agents = [
        DebateAgent(name="Advocate", stance=f"Argue IN FAVOR of the most common answer to: {question}"),
        DebateAgent(name="Critic", stance=f"Argue AGAINST the most common answer to: {question}. Present alternatives."),
        DebateAgent(name="Pragmatist", stance=f"Argue for the most PRACTICAL middle-ground answer to: {question}"),
    ]

    print(f"Debate: {question}")
    print(f"Agents: {', '.join(a.name for a in agents)}")
    print(f"Rounds: {rounds}")
    print("=" * 50)

    transcript = debate_round(agents, question, num_rounds=rounds)
    verdict = judge_debate(question, transcript)

    print("\n" + "=" * 50)
    print("VERDICT")
    print("=" * 50)
    for agent_name, scores in verdict["scores"].items():
        print(f"{agent_name}: {scores}")
    print(f"\nWinner: {verdict['winner']}")
    print(f"Reasoning: {verdict['reasoning']}")

    return {
        "question": question,
        "transcript": transcript,
        "verdict": verdict,
    }


result = run_debate(
    "Should teams adopt trunk-based development over GitFlow?",
    rounds=2,
)

Three agents work better than two. The Pragmatist prevents the debate from becoming a binary shouting match and often wins by synthesizing the strongest points from both sides. That is the whole point: you get a more nuanced answer than any single prompt could produce.

You can extend this further. Add domain-specific agents (a security expert, a performance engineer) for technical questions. Swap in different models per agent – use GPT-4o for the judge but GPT-4o-mini for debaters to cut costs. Run multiple debates in parallel with asyncio and aggregate the results.
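If you run several debates on the same question, a simple way to aggregate is a majority vote over the judges' verdicts. A minimal sketch, using hypothetical verdict dicts in the same shape judge_debate returns:

```python
from collections import Counter

# Hypothetical verdicts from three independent debates on the same question
verdicts = [
    {"winner": "Pragmatist"},
    {"winner": "Advocate"},
    {"winner": "Pragmatist"},
]

# Majority vote: the agent named winner most often across debates
winner, votes = Counter(v["winner"] for v in verdicts).most_common(1)[0]
print(winner, votes)  # Pragmatist 2
```

Ties are possible with an even number of debates, so in practice you would run an odd number or fall back to total judge scores as a tiebreaker.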

Common Errors and Fixes

openai.RateLimitError: Rate limit reached for gpt-4o

Multi-agent debates hit the API hard. Three agents over three rounds is twelve API calls (three opening statements plus nine rebuttals), plus the judge. Add retry logic with exponential backoff:

import time
from openai import RateLimitError

def argue_with_retry(agent: DebateAgent, prompt: str, max_retries: int = 3) -> str:
    for attempt in range(max_retries):
        try:
            return agent.argue(prompt)
        except RateLimitError:
            wait = 2 ** attempt
            print(f"Rate limited, waiting {wait}s...")
            time.sleep(wait)
    raise RuntimeError(f"Failed after {max_retries} retries")

json.decoder.JSONDecodeError: Expecting value: line 1 column 1

The judge sometimes returns preamble text before the JSON. The code fence stripping above handles most cases, but if you still hit this, use a more aggressive extraction:

import re

def extract_json(text: str) -> dict:
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if match:
        return json.loads(match.group())
    raise ValueError(f"No JSON found in response: {text[:200]}")
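Here is the extractor exercised on a preamble-laden reply (the helper is repeated so the snippet runs standalone; the reply string is illustrative):

```python
import json
import re

def extract_json(text: str) -> dict:
    # Greedy match from the first "{" to the last "}" in the reply
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if match:
        return json.loads(match.group())
    raise ValueError(f"No JSON found in response: {text[:200]}")

# A reply with chatty preamble before the JSON (illustrative)
messy = 'Sure, here is my evaluation:\n{"winner": "Critic", "reasoning": "Sharper rebuttals."}'
print(extract_json(messy)["winner"])  # Critic
```

Because the match is greedy, it spans from the first brace to the last, so it also tolerates trailing chatter as long as that chatter contains no stray braces.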

openai.BadRequestError: This model's maximum context length is 128000 tokens

Long debates can blow past context limits. Each agent carries its full history, so three rounds with three agents accumulates fast. The fix is to trim older rounds from each agent's history, keeping the system prompt and only the most recent exchanges. (Switching debaters to GPT-4o-mini cuts cost but not the context ceiling, since both models top out at 128k tokens.)

def trim_history(agent: DebateAgent, keep_rounds: int = 2):
    """Keep only the system prompt and the last N exchanges."""
    system = agent.history[0]
    # Slice after the system prompt so it is never duplicated when history is short
    recent = agent.history[1:][-(keep_rounds * 2):]  # Each exchange is 2 messages
    agent.history = [system] + recent
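A quick sanity check of the trimming behavior, using a stub agent so no API key is needed (SimpleNamespace stands in for DebateAgent):

```python
from types import SimpleNamespace

# Stub agent: a system prompt plus three user/assistant exchanges (7 messages)
agent = SimpleNamespace(history=(
    [{"role": "system", "content": "system prompt"}]
    + [{"role": r, "content": f"round {n}"} for n in (1, 2, 3) for r in ("user", "assistant")]
))

def trim_history(agent, keep_rounds: int = 2):
    """Keep the system prompt plus the last N exchanges; slice after
    index 0 so the system message is never duplicated."""
    system = agent.history[0]
    recent = agent.history[1:][-(keep_rounds * 2):]
    agent.history = [system] + recent

trim_history(agent)
print(len(agent.history))  # 5: system prompt + last two exchanges
```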