A stack trace tells you where something broke. An LLM can tell you why. Wire the two together with tool calling, and you get an agent that reads your traceback, pulls up the relevant source files, and hands you a fix.
Here’s how to build one from scratch.
The Agent Architecture#
The debugging agent follows a simple loop:
- Receive a Python stack trace as input
- Parse filenames and line numbers from the traceback
- Use tools to read those source files
- Optionally search the codebase for related definitions
- Return a diagnosis with a concrete fix
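The parsing step can be sketched with a regex over CPython's standard frame format. The helper name `parse_traceback` is mine, not part of the agent's tool set — in practice you can also just hand the raw trace to the model and let it do the parsing:

```python
import re

# Matches CPython frame lines:   File "/app/auth.py", line 12, in authenticate
FRAME_RE = re.compile(r'File "([^"]+)", line (\d+), in (\S+)')

def parse_traceback(trace: str) -> list[tuple[str, int, str]]:
    """Return (filepath, line_number, function_name) for each frame."""
    return [(path, int(lineno), fn) for path, lineno, fn in FRAME_RE.findall(trace)]

frames = parse_traceback('''Traceback (most recent call last):
  File "/app/server.py", line 45, in handle_request
    user = authenticate(request.headers["Authorization"])
  File "/app/auth.py", line 12, in authenticate
    payload = jwt.decode(token, SECRET_KEY, algorithms=["HS256"])
''')
# frames[-1] is the innermost frame, usually the first place worth reading
```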
The LLM drives the loop. You give it tools — read_file, search_code, run_test — and it decides which ones to call based on what it sees in the stack trace.
Each tool is a plain Python function. The agent calls them through OpenAI’s tools parameter, which replaced the old functions API.
```python
import os
import subprocess
import json

from openai import OpenAI

client = OpenAI()

# Tool implementations

def read_file(filepath: str, start_line: int = 1, end_line: int = 50) -> str:
    """Read lines from a file. Returns numbered lines."""
    try:
        abs_path = os.path.abspath(filepath)
        with open(abs_path, "r") as f:
            lines = f.readlines()
        selected = lines[max(0, start_line - 1):end_line]
        numbered = [
            f"{i + start_line}: {line.rstrip()}"
            for i, line in enumerate(selected)
        ]
        return "\n".join(numbered)
    except FileNotFoundError:
        return f"Error: File not found: {filepath}"
    except Exception as e:
        return f"Error reading file: {e}"

def search_code(pattern: str, directory: str = ".") -> str:
    """Search for a pattern in the codebase using grep."""
    try:
        result = subprocess.run(
            ["grep", "-rn", "--include=*.py", pattern, directory],
            capture_output=True,
            text=True,
            timeout=10,
        )
        output = result.stdout.strip()
        if not output:
            return f"No matches found for pattern: {pattern}"
        # Limit output to first 30 lines to stay within token budget
        lines = output.split("\n")[:30]
        return "\n".join(lines)
    except subprocess.TimeoutExpired:
        return "Error: Search timed out"

def run_test(command: str) -> str:
    """Run a test command and return the output."""
    allowed_prefixes = ["python -m pytest", "python -m unittest", "python -c"]
    if not any(command.startswith(prefix) for prefix in allowed_prefixes):
        return "Error: Only pytest, unittest, and python -c commands are allowed"
    try:
        result = subprocess.run(
            command.split(),
            capture_output=True,
            text=True,
            timeout=30,
        )
        output = result.stdout + result.stderr
        # Truncate long output
        if len(output) > 3000:
            output = output[:3000] + "\n... (truncated)"
        return output
    except subprocess.TimeoutExpired:
        return "Error: Test command timed out after 30 seconds"

# Map function names to implementations
TOOL_FUNCTIONS = {
    "read_file": read_file,
    "search_code": search_code,
    "run_test": run_test,
}
```
Each function returns a string. Errors come back as strings too — the LLM handles them gracefully and can retry or adjust its approach.
OpenAI needs JSON schemas describing each tool. These go in the tools parameter on every API call.
```python
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "read_file",
            "description": "Read lines from a source file. Use this to inspect code at locations mentioned in stack traces.",
            "parameters": {
                "type": "object",
                "properties": {
                    "filepath": {
                        "type": "string",
                        "description": "Path to the file to read",
                    },
                    "start_line": {
                        "type": "integer",
                        "description": "First line to read (1-indexed). Default 1.",
                    },
                    "end_line": {
                        "type": "integer",
                        "description": "Last line to read (inclusive). Default 50.",
                    },
                },
                "required": ["filepath"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "search_code",
            "description": "Search the codebase for a pattern using grep. Returns matching lines with file paths and line numbers.",
            "parameters": {
                "type": "object",
                "properties": {
                    "pattern": {
                        "type": "string",
                        "description": "The text or regex pattern to search for",
                    },
                    "directory": {
                        "type": "string",
                        "description": "Directory to search in. Default is current directory.",
                    },
                },
                "required": ["pattern"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "run_test",
            "description": "Run a test command to verify a fix. Only pytest, unittest, and python -c commands are allowed.",
            "parameters": {
                "type": "object",
                "properties": {
                    "command": {
                        "type": "string",
                        "description": "The test command to run, e.g. 'python -m pytest tests/test_auth.py::test_login -x'",
                    },
                },
                "required": ["command"],
            },
        },
    },
]
```
Notice the descriptions are specific. The read_file description mentions stack traces on purpose — it nudges the model toward the right usage pattern.
The Agent Loop#
This is where everything connects. The loop sends messages to the API, checks for tool calls, executes them, feeds the results back, and repeats until the model gives a final text response.
````python
def run_debugging_agent(stack_trace: str, max_iterations: int = 10) -> str:
    """Run the debugging agent on a stack trace. Returns the diagnosis."""
    system_prompt = """You are a debugging agent. You analyze Python stack traces,
read the relevant source code, and diagnose the root cause.

Your workflow:
1. Parse the stack trace to identify files, line numbers, and the exception type
2. Use read_file to examine the code at each frame in the traceback
3. Use search_code if you need to find class definitions, imports, or related code
4. Use run_test if you want to verify your hypothesis
5. Provide a clear diagnosis with:
   - Root cause explanation
   - The exact fix (show the corrected code)
   - Why the fix works

Be precise. Point to specific lines. Show the corrected code in a diff-like format."""

    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"Debug this stack trace:\n\n```\n{stack_trace}\n```"},
    ]

    for iteration in range(max_iterations):
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=messages,
            tools=TOOLS,
            tool_choice="auto",
        )
        message = response.choices[0].message
        messages.append(message)

        # If no tool calls, the agent is done — return the final answer
        if not message.tool_calls:
            return message.content

        # Execute each tool call and feed results back
        for tool_call in message.tool_calls:
            fn_name = tool_call.function.name
            fn_args = json.loads(tool_call.function.arguments)
            print(f"  [{iteration + 1}] Calling {fn_name}({fn_args})")

            fn = TOOL_FUNCTIONS.get(fn_name)
            if fn is None:
                result = f"Error: Unknown tool '{fn_name}'"
            else:
                result = fn(**fn_args)

            messages.append({
                "role": "tool",
                "tool_call_id": tool_call.id,
                "content": result,
            })

    return "Agent reached maximum iterations without a final answer."
````
Key details here:
- tool_choice="auto" lets the model decide when to call tools and when to respond with text. You don't need to force it.
- Each tool result uses role: "tool" with the matching tool_call_id. This is required — the API rejects mismatched IDs.
- The loop exits when the model sends a message with no tool_calls. That message contains the diagnosis.
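Concretely, after one tool round the message history has this shape (the ID and contents here are illustrative, not real API output):

```python
messages = [
    {"role": "system", "content": "You are a debugging agent..."},
    {"role": "user", "content": "Debug this stack trace: ..."},
    # Assistant turn: no text content, one tool call
    {
        "role": "assistant",
        "content": None,
        "tool_calls": [{
            "id": "call_abc123",
            "type": "function",
            "function": {
                "name": "read_file",
                "arguments": '{"filepath": "/app/auth.py", "end_line": 30}',
            },
        }],
    },
    # Tool result: tool_call_id must match the id above exactly
    {"role": "tool", "tool_call_id": "call_abc123", "content": "1: import jwt\n..."},
]
```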
Running It#
Feed the agent a real stack trace and watch it work:
```python
sample_trace = """Traceback (most recent call last):
  File "/app/server.py", line 45, in handle_request
    user = authenticate(request.headers["Authorization"])
  File "/app/auth.py", line 12, in authenticate
    payload = jwt.decode(token, SECRET_KEY, algorithms=["HS256"])
  File "/usr/lib/python3.11/site-packages/jwt/api_jwt.py", line 210, in decode
    decoded = self.decode_complete(jwt, key, algorithms, options, audience,
  File "/usr/lib/python3.11/site-packages/jwt/api_jwt.py", line 151, in decode_complete
    payload = self._validate_claims(payload, merged_options, audience=audience,
jwt.exceptions.ExpiredSignatureError: Signature has expired"""

diagnosis = run_debugging_agent(sample_trace)
print(diagnosis)
```
The agent will typically:
- Call read_file("/app/auth.py", 1, 30) to see the authenticate function
- Call read_file("/app/server.py", 40, 55) to see the calling code
- Maybe search_code("SECRET_KEY") to find how the key is configured
- Return a diagnosis explaining that the JWT token is expired and suggesting you either refresh the token or add options={"verify_exp": False} for debugging
GPT-4o sometimes issues multiple tool calls in a single response — for example, reading two files at once. The loop already handles this because it iterates over message.tool_calls. But you can speed things up with concurrent execution:
```python
from concurrent.futures import ThreadPoolExecutor

def execute_tool_calls(tool_calls):
    """Execute multiple tool calls in parallel."""
    results = []

    def run_one(tc):
        fn_name = tc.function.name
        fn_args = json.loads(tc.function.arguments)
        fn = TOOL_FUNCTIONS.get(fn_name)
        if fn is None:
            return tc.id, f"Error: Unknown tool '{fn_name}'"
        return tc.id, fn(**fn_args)

    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(run_one, tc) for tc in tool_calls]
        for future in futures:
            call_id, result = future.result()
            results.append({
                "role": "tool",
                "tool_call_id": call_id,
                "content": result,
            })
    return results
```
Replace the inner for tool_call in message.tool_calls block with a call to execute_tool_calls(message.tool_calls) and extend the messages list with the returned results.
Adding Safety Guardrails#
The run_test tool executes commands on your system. Lock it down:
- Allowlist prefixes — only pytest, unittest, and python -c are permitted (already done above)
- Timeouts — every subprocess gets a hard timeout
- Sandboxing — for production use, run the agent inside a Docker container or a firejail sandbox
- Token limits — truncate tool outputs so you don't blow up context windows
For read_file, validate that the path stays within your project root:
```python
import os

PROJECT_ROOT = os.path.abspath("/app")

def safe_read_file(filepath: str, start_line: int = 1, end_line: int = 50) -> str:
    abs_path = os.path.abspath(filepath)
    # Compare path components, not raw string prefixes, so that a path
    # like /app-evil doesn't slip past a check for /app
    if os.path.commonpath([abs_path, PROJECT_ROOT]) != PROJECT_ROOT:
        return f"Error: Access denied. Path must be under {PROJECT_ROOT}"
    return read_file(filepath, start_line, end_line)
```
Common Errors and Fixes#
openai.BadRequestError: Missing tool_call_id — Every tool result message must include the tool_call_id from the corresponding tool call. If you’re appending tool results manually, make sure the IDs match exactly.
json.JSONDecodeError when parsing tool arguments — The model occasionally returns malformed JSON in tool_call.function.arguments. Wrap json.loads() in a try/except and return an error string to the model so it can retry.
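A defensive wrapper for that parse step might look like this — the helper name parse_tool_arguments is mine, a sketch of the try/except pattern described above:

```python
import json

def parse_tool_arguments(raw: str):
    """Parse tool-call arguments defensively. Returns (args, error):
    exactly one of the two is None."""
    try:
        return json.loads(raw), None
    except json.JSONDecodeError as e:
        # Send this string back as the tool result so the model can retry
        return None, f"Error: arguments were not valid JSON ({e}). Please retry."

args, err = parse_tool_arguments('{"filepath": "/app/auth.py"}')
# args == {"filepath": "/app/auth.py"}, err is None
```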
Agent loops forever without answering — Set max_iterations and check for it. Also make sure your system prompt explicitly tells the model to provide a final text answer after gathering enough information.
TypeError: read_file() got an unexpected keyword argument — Your function signature must match the JSON schema exactly. If the schema says filepath, your function parameter must be filepath, not file_path or path.
Tool results too large, hitting token limits — Truncate tool output before appending it to messages. A 3000-character cap per tool result keeps context usage reasonable while still giving the model enough to work with.
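If you want that cap in one place rather than inside each tool, a small helper (the name truncate_output is mine) can wrap every result before it's appended to messages:

```python
MAX_TOOL_OUTPUT = 3000

def truncate_output(text: str, limit: int = MAX_TOOL_OUTPUT) -> str:
    """Cap a tool result so large files or test logs don't blow the context window."""
    if len(text) <= limit:
        return text
    return text[:limit] + "\n... (truncated)"
```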
openai.RateLimitError during the agent loop — Add exponential backoff. The simplest approach is wrapping the API call with tenacity:
```python
from tenacity import retry, wait_exponential, stop_after_attempt

@retry(wait=wait_exponential(min=1, max=30), stop=stop_after_attempt(5))
def call_api(messages):
    return client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        tools=TOOLS,
        tool_choice="auto",
    )
```
Where to Go from Here#
Once you have the basic loop working, there are a few high-value extensions:
- Persistent context — cache file contents between sessions so the agent doesn't re-read the same files
- Git integration — add a git_diff tool so the agent can see what changed recently, which is often the cause of the bug
- Auto-patching — add a write_file tool and let the agent apply fixes directly, then verify with run_test
- Multi-language support — the same architecture works for JavaScript, Go, or Rust stack traces. Just adjust the system prompt and file reading tools
The core pattern stays the same: give the LLM eyes into your codebase through tools, and let it reason about the error.