A stack trace tells you where something broke. An LLM can tell you why. Wire the two together with tool calling, and you get an agent that reads your traceback, pulls up the relevant source files, and hands you a fix.
Here’s how to build one from scratch.
The Agent Architecture#
The debugging agent follows a simple loop:
- Receive a Python stack trace as input
- Parse filenames and line numbers from the traceback
- Use tools to read those source files
- Optionally search the codebase for related definitions
- Return a diagnosis with a concrete fix
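The parsing step can be sketched with a regex over CPython's standard frame format. The helper name `parse_traceback` is mine, not part of the agent's tool set — in practice you can also just hand the raw trace to the model and let it do the parsing:

```python
import re

# Matches CPython frame lines:   File "/app/auth.py", line 12, in authenticate
FRAME_RE = re.compile(r'File "([^"]+)", line (\d+), in (\S+)')

def parse_traceback(trace: str) -> list[tuple[str, int, str]]:
    """Return (filepath, line_number, function_name) for each frame."""
    return [(path, int(lineno), fn) for path, lineno, fn in FRAME_RE.findall(trace)]

frames = parse_traceback('''Traceback (most recent call last):
  File "/app/server.py", line 45, in handle_request
    user = authenticate(request.headers["Authorization"])
  File "/app/auth.py", line 12, in authenticate
    payload = jwt.decode(token, SECRET_KEY, algorithms=["HS256"])
''')
# frames[-1] is the innermost frame, usually the first place worth reading
```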
The LLM drives the loop. You give it tools — read_file, search_code, run_test — and it decides which ones to call based on what it sees in the stack trace.
Each tool is a plain Python function. The agent calls them through OpenAI’s tools parameter, which replaced the old functions API.
```python
import os
import subprocess
import json

from openai import OpenAI

client = OpenAI()

# Tool implementations

def read_file(filepath: str, start_line: int = 1, end_line: int = 50) -> str:
    """Read lines from a file. Returns numbered lines."""
    try:
        abs_path = os.path.abspath(filepath)
        with open(abs_path, "r") as f:
            lines = f.readlines()
        selected = lines[max(0, start_line - 1):end_line]
        numbered = [
            f"{i + start_line}: {line.rstrip()}"
            for i, line in enumerate(selected)
        ]
        return "\n".join(numbered)
    except FileNotFoundError:
        return f"Error: File not found: {filepath}"
    except Exception as e:
        return f"Error reading file: {e}"

def search_code(pattern: str, directory: str = ".") -> str:
    """Search for a pattern in the codebase using grep."""
    try:
        result = subprocess.run(
            ["grep", "-rn", "--include=*.py", pattern, directory],
            capture_output=True,
            text=True,
            timeout=10,
        )
        output = result.stdout.strip()
        if not output:
            return f"No matches found for pattern: {pattern}"
        # Limit output to first 30 lines to stay within token budget
        lines = output.split("\n")[:30]
        return "\n".join(lines)
    except subprocess.TimeoutExpired:
        return "Error: Search timed out"

def run_test(command: str) -> str:
    """Run a test command and return the output."""
    allowed_prefixes = ["python -m pytest", "python -m unittest", "python -c"]
    if not any(command.startswith(prefix) for prefix in allowed_prefixes):
        return "Error: Only pytest, unittest, and python -c commands are allowed"
    try:
        result = subprocess.run(
            command.split(),
            capture_output=True,
            text=True,
            timeout=30,
        )
        output = result.stdout + result.stderr
        # Truncate long output
        if len(output) > 3000:
            output = output[:3000] + "\n... (truncated)"
        return output
    except subprocess.TimeoutExpired:
        return "Error: Test command timed out after 30 seconds"

# Map function names to implementations
TOOL_FUNCTIONS = {
    "read_file": read_file,
    "search_code": search_code,
    "run_test": run_test,
}
```
Each function returns a string. Errors come back as strings too — the LLM handles them gracefully and can retry or adjust its approach.
OpenAI needs JSON schemas describing each tool. These go in the tools parameter on every API call.
```python
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "read_file",
            "description": "Read lines from a source file. Use this to inspect code at locations mentioned in stack traces.",
            "parameters": {
                "type": "object",
                "properties": {
                    "filepath": {
                        "type": "string",
                        "description": "Path to the file to read",
                    },
                    "start_line": {
                        "type": "integer",
                        "description": "First line to read (1-indexed). Default 1.",
                    },
                    "end_line": {
                        "type": "integer",
                        "description": "Last line to read (inclusive). Default 50.",
                    },
                },
                "required": ["filepath"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "search_code",
            "description": "Search the codebase for a pattern using grep. Returns matching lines with file paths and line numbers.",
            "parameters": {
                "type": "object",
                "properties": {
                    "pattern": {
                        "type": "string",
                        "description": "The text or regex pattern to search for",
                    },
                    "directory": {
                        "type": "string",
                        "description": "Directory to search in. Default is current directory.",
                    },
                },
                "required": ["pattern"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "run_test",
            "description": "Run a test command to verify a fix. Only pytest, unittest, and python -c commands are allowed.",
            "parameters": {
                "type": "object",
                "properties": {
                    "command": {
                        "type": "string",
                        "description": "The test command to run, e.g. 'python -m pytest tests/test_auth.py::test_login -x'",
                    },
                },
                "required": ["command"],
            },
        },
    },
]
```
Notice the descriptions are specific. The read_file description mentions stack traces on purpose — it nudges the model toward the right usage pattern.
The Agent Loop#
This is where everything connects. The loop sends messages to the API, checks for tool calls, executes them, feeds the results back, and repeats until the model gives a final text response.
````python
def run_debugging_agent(stack_trace: str, max_iterations: int = 10) -> str:
    """Run the debugging agent on a stack trace. Returns the diagnosis."""
    system_prompt = """You are a debugging agent. You analyze Python stack traces,
read the relevant source code, and diagnose the root cause.

Your workflow:
1. Parse the stack trace to identify files, line numbers, and the exception type
2. Use read_file to examine the code at each frame in the traceback
3. Use search_code if you need to find class definitions, imports, or related code
4. Use run_test if you want to verify your hypothesis
5. Provide a clear diagnosis with:
   - Root cause explanation
   - The exact fix (show the corrected code)
   - Why the fix works

Be precise. Point to specific lines. Show the corrected code in a diff-like format."""

    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"Debug this stack trace:\n\n```\n{stack_trace}\n```"},
    ]

    for iteration in range(max_iterations):
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=messages,
            tools=TOOLS,
            tool_choice="auto",
        )
        message = response.choices[0].message
        messages.append(message)

        # If no tool calls, the agent is done — return the final answer
        if not message.tool_calls:
            return message.content

        # Execute each tool call and feed results back
        for tool_call in message.tool_calls:
            fn_name = tool_call.function.name
            fn_args = json.loads(tool_call.function.arguments)
            print(f"  [{iteration + 1}] Calling {fn_name}({fn_args})")

            fn = TOOL_FUNCTIONS.get(fn_name)
            if fn is None:
                result = f"Error: Unknown tool '{fn_name}'"
            else:
                result = fn(**fn_args)

            messages.append({
                "role": "tool",
                "tool_call_id": tool_call.id,
                "content": result,
            })

    return "Agent reached maximum iterations without a final answer."
````
Key details here:
- tool_choice="auto" lets the model decide when to call tools and when to respond with text. You don't need to force it.
- Each tool result uses role: "tool" with the matching tool_call_id. This is required — the API rejects mismatched IDs.
- The loop exits when the model sends a message with no tool_calls. That message contains the diagnosis.
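Concretely, after one tool round the message history has this shape (the ID and contents here are illustrative, not real API output):

```python
messages = [
    {"role": "system", "content": "You are a debugging agent..."},
    {"role": "user", "content": "Debug this stack trace: ..."},
    # Assistant turn: no text content, one tool call
    {
        "role": "assistant",
        "content": None,
        "tool_calls": [{
            "id": "call_abc123",
            "type": "function",
            "function": {
                "name": "read_file",
                "arguments": '{"filepath": "/app/auth.py", "end_line": 30}',
            },
        }],
    },
    # Tool result: tool_call_id must match the id above exactly
    {"role": "tool", "tool_call_id": "call_abc123", "content": "1: import jwt\n..."},
]
```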
Running It#
Feed the agent a real stack trace and watch it work:
```python
sample_trace = """Traceback (most recent call last):
  File "/app/server.py", line 45, in handle_request
    user = authenticate(request.headers["Authorization"])
  File "/app/auth.py", line 12, in authenticate
    payload = jwt.decode(token, SECRET_KEY, algorithms=["HS256"])
  File "/usr/lib/python3.11/site-packages/jwt/api_jwt.py", line 210, in decode
    decoded = self.decode_complete(jwt, key, algorithms, options, audience,
  File "/usr/lib/python3.11/site-packages/jwt/api_jwt.py", line 151, in decode_complete
    payload = self._validate_claims(payload, merged_options, audience=audience,
jwt.exceptions.ExpiredSignatureError: Signature has expired"""

diagnosis = run_debugging_agent(sample_trace)
print(diagnosis)
```
The agent will typically:
- Call read_file("/app/auth.py", 1, 30) to see the authenticate function
- Call read_file("/app/server.py", 40, 55) to see the calling code
- Maybe search_code("SECRET_KEY") to find how the key is configured
- Return a diagnosis explaining that the JWT token is expired and suggesting you either refresh the token or add options={"verify_exp": False} for debugging
GPT-4o sometimes issues multiple tool calls in a single response — for example, reading two files at once. The loop already handles this because it iterates over message.tool_calls. But you can speed things up with concurrent execution:
```python
from concurrent.futures import ThreadPoolExecutor

def execute_tool_calls(tool_calls):
    """Execute multiple tool calls in parallel."""
    results = []

    def run_one(tc):
        fn_name = tc.function.name
        fn_args = json.loads(tc.function.arguments)
        fn = TOOL_FUNCTIONS.get(fn_name)
        if fn is None:
            return tc.id, f"Error: Unknown tool '{fn_name}'"
        return tc.id, fn(**fn_args)

    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(run_one, tc) for tc in tool_calls]
        for future in futures:
            call_id, result = future.result()
            results.append({
                "role": "tool",
                "tool_call_id": call_id,
                "content": result,
            })
    return results
```
Replace the inner for tool_call in message.tool_calls block with a call to execute_tool_calls(message.tool_calls) and extend the messages list with the returned results.
Adding Safety Guardrails#
The run_test tool executes commands on your system. Lock it down:
- Allowlist prefixes — only pytest, unittest, and python -c are permitted (already done above)
- Timeouts — every subprocess gets a hard timeout
- Sandboxing — for production use, run the agent inside a Docker container or a firejail sandbox
- Token limits — truncate tool outputs so you don't blow up context windows
For read_file, validate that the path stays within your project root:
```python
import os

PROJECT_ROOT = os.path.abspath("/app")

def safe_read_file(filepath: str, start_line: int = 1, end_line: int = 50) -> str:
    abs_path = os.path.abspath(filepath)
    # Compare path components, not raw string prefixes, so that a path
    # like /app-evil doesn't slip past a check for /app
    if os.path.commonpath([abs_path, PROJECT_ROOT]) != PROJECT_ROOT:
        return f"Error: Access denied. Path must be under {PROJECT_ROOT}"
    return read_file(filepath, start_line, end_line)
```
Common Errors and Fixes#
openai.BadRequestError: Missing tool_call_id — Every tool result message must include the tool_call_id from the corresponding tool call. If you’re appending tool results manually, make sure the IDs match exactly.
json.JSONDecodeError when parsing tool arguments — The model occasionally returns malformed JSON in tool_call.function.arguments. Wrap json.loads() in a try/except and return an error string to the model so it can retry.
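A defensive wrapper for that parse step might look like this — the helper name parse_tool_arguments is mine, a sketch of the try/except pattern described above:

```python
import json

def parse_tool_arguments(raw: str):
    """Parse tool-call arguments defensively. Returns (args, error):
    exactly one of the two is None."""
    try:
        return json.loads(raw), None
    except json.JSONDecodeError as e:
        # Send this string back as the tool result so the model can retry
        return None, f"Error: arguments were not valid JSON ({e}). Please retry."

args, err = parse_tool_arguments('{"filepath": "/app/auth.py"}')
# args == {"filepath": "/app/auth.py"}, err is None
```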
Agent loops forever without answering — Set max_iterations and check for it. Also make sure your system prompt explicitly tells the model to provide a final text answer after gathering enough information.
TypeError: read_file() got an unexpected keyword argument — Your function signature must match the JSON schema exactly. If the schema says filepath, your function parameter must be filepath, not file_path or path.
Tool results too large, hitting token limits — Truncate tool output before appending it to messages. A 3000-character cap per tool result keeps context usage reasonable while still giving the model enough to work with.
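If you want that cap in one place rather than inside each tool, a small helper (the name truncate_output is mine) can wrap every result before it's appended to messages:

```python
MAX_TOOL_OUTPUT = 3000

def truncate_output(text: str, limit: int = MAX_TOOL_OUTPUT) -> str:
    """Cap a tool result so large files or test logs don't blow the context window."""
    if len(text) <= limit:
        return text
    return text[:limit] + "\n... (truncated)"
```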
openai.RateLimitError during the agent loop — Add exponential backoff. The simplest approach is wrapping the API call with tenacity:
```python
from tenacity import retry, wait_exponential, stop_after_attempt

@retry(wait=wait_exponential(min=1, max=30), stop=stop_after_attempt(5))
def call_api(messages):
    return client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        tools=TOOLS,
        tool_choice="auto",
    )
```
Where to Go from Here#
Once you have the basic loop working, there are a few high-value extensions:
- Persistent context — cache file contents between sessions so the agent doesn't re-read the same files
- Git integration — add a git_diff tool so the agent can see what changed recently, which is often the cause of the bug
- Auto-patching — add a write_file tool and let the agent apply fixes directly, then verify with run_test
- Multi-language support — the same architecture works for JavaScript, Go, or Rust stack traces. Just adjust the system prompt and file reading tools
The core pattern stays the same: give the LLM eyes into your codebase through tools, and let it reason about the error.