Writing tests is tedious. Fixing broken tests is worse. An LLM can do both – read your source code, generate pytest tests, run them, read the failures, and fix the tests in a loop until they pass. You end up with a testing agent that takes a Python file as input and hands back a working test suite.
Here’s what the usage looks like:
```shell
pip install openai pytest
```
```python
from testing_agent import TestingAgent

agent = TestingAgent()
test_code = agent.run("src/calculator.py")
print(test_code)
# Writes and runs tests, fixes failures automatically, returns passing test code
```
The agent follows a tight loop: read source code, generate tests, run pytest, check results, fix failures, repeat. Let’s build it.
## The Agent Loop Architecture
The core class manages the full cycle. It reads your source file, asks the LLM to write tests, runs them with pytest, and enters a fix loop if anything fails.
```python
import os
import subprocess
import tempfile

from openai import OpenAI

client = OpenAI()


class TestingAgent:
    def __init__(self, model: str = "gpt-4o", max_retries: int = 5):
        self.model = model
        self.max_retries = max_retries

    def run(self, source_path: str) -> str:
        """Generate and validate tests for a source file."""
        source_code = self._read_file(source_path)
        module_name = os.path.splitext(os.path.basename(source_path))[0]

        # Step 1: Generate initial tests
        test_code = self._generate_tests(source_code, module_name)

        # Step 2: Run and fix in a loop
        for attempt in range(self.max_retries):
            result = self._run_tests(test_code, source_path)
            if result["passed"]:
                print(f"All tests passed on attempt {attempt + 1}")
                return test_code
            print(f"Attempt {attempt + 1}: {result['summary']}")
            test_code = self._fix_tests(
                source_code, test_code, result["output"], module_name
            )

        print("Max retries reached. Returning best effort.")
        return test_code

    def _read_file(self, path: str) -> str:
        with open(path, "r") as f:
            return f.read()

    def _run_tests(self, test_code: str, source_path: str) -> dict:
        """Write test code to a temp file and run pytest."""
        source_dir = os.path.dirname(os.path.abspath(source_path))
        with tempfile.NamedTemporaryFile(
            mode="w", suffix="_test.py", dir=source_dir, delete=False
        ) as f:
            f.write(test_code)
            test_file = f.name
        try:
            proc = subprocess.run(
                ["python", "-m", "pytest", test_file, "-v", "--tb=short"],
                capture_output=True,
                text=True,
                timeout=30,
                cwd=source_dir,
            )
            output = proc.stdout + proc.stderr
            passed = proc.returncode == 0
            return {
                "passed": passed,
                "output": output,
                "summary": f"{'PASSED' if passed else 'FAILED'} (exit code {proc.returncode})",
            }
        except subprocess.TimeoutExpired:
            return {
                "passed": False,
                "output": "Tests timed out after 30 seconds",
                "summary": "TIMEOUT",
            }
        finally:
            os.unlink(test_file)

    def _generate_tests(self, source_code: str, module_name: str) -> str:
        """Implemented in the next section."""
        raise NotImplementedError

    def _fix_tests(self, source_code, test_code, error_output, module_name) -> str:
        """Implemented in the self-healing section."""
        raise NotImplementedError
```
The _run_tests method writes the generated test code to a temporary file in the same directory as the source (so imports resolve), runs pytest against it, and cleans up the temp file afterward. The only artifact is that short-lived temp file – nothing permanent lands in your project until the tests actually pass.
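The write-run-clean-up pattern is easy to verify in isolation. A minimal sketch, with a hypothetical run_snippet helper that mirrors what _run_tests does:

```python
import os
import subprocess
import sys
import tempfile


def run_snippet(code: str, directory: str = ".") -> int:
    """Write code to a temp .py file, run it, and always remove the file."""
    with tempfile.NamedTemporaryFile(
        mode="w", suffix="_test.py", dir=directory, delete=False
    ) as f:
        f.write(code)
        path = f.name
    try:
        proc = subprocess.run(
            [sys.executable, path], capture_output=True, text=True, timeout=30
        )
        return proc.returncode
    finally:
        os.unlink(path)  # cleanup happens even if the run raises
```

A snippet that runs cleanly returns exit code 0; one that raises returns nonzero – exactly the signal the agent keys on.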
## Generating Tests with LLMs
The prompt matters more than you’d think. A vague “write tests for this code” produces garbage. You need to tell the model exactly what you want: real pytest functions, correct imports, edge cases, and the module import path.
````python
def _generate_tests(self, source_code: str, module_name: str) -> str:
    """Ask the LLM to generate pytest tests for the source code."""
    prompt = f"""Write pytest tests for the following Python module.

Module name: {module_name}
Import the module with: from {module_name} import *

Requirements:
- Use pytest, not unittest
- Test every public function and class
- Include edge cases: empty inputs, None values, type errors where appropriate
- Each test function name should describe what it tests
- Use assert statements, not self.assertEqual
- Do not use any mocking unless the code makes network or filesystem calls
- Return ONLY the Python test code, no markdown fences or explanations

Source code:
{source_code}"""

    response = client.chat.completions.create(
        model=self.model,
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a senior test engineer. Write thorough, runnable "
                    "pytest tests. Return only valid Python code."
                ),
            },
            {"role": "user", "content": prompt},
        ],
        temperature=0,
    )
    raw = response.choices[0].message.content
    return self._clean_code_response(raw)

def _clean_code_response(self, raw: str) -> str:
    """Strip markdown fences if the model wraps the code."""
    lines = raw.strip().splitlines()
    if lines and lines[0].startswith("```"):
        lines = lines[1:]
    if lines and lines[-1].startswith("```"):
        lines = lines[:-1]
    return "\n".join(lines)
````
The _clean_code_response method handles the common case where the LLM wraps its output in markdown code fences despite being told not to. Models do this about half the time regardless of your instructions.
The key details in the prompt: specifying the exact import path (from {module_name} import *), requiring pytest over unittest, and banning unnecessary mocks. Without these, you get tests that import wrong modules or mock everything into meaninglessness.
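The stripping logic is simple enough to exercise standalone. A sketch (the function name is illustrative; the body matches the method above):

````python
def strip_markdown_fences(raw: str) -> str:
    """Drop a leading ```lang fence line and a trailing ``` line if present."""
    lines = raw.strip().splitlines()
    if lines and lines[0].startswith("```"):
        lines = lines[1:]
    if lines and lines[-1].startswith("```"):
        lines = lines[:-1]
    return "\n".join(lines)


fenced = "```python\nassert 1 + 1 == 2\n```"
print(strip_markdown_fences(fenced))  # assert 1 + 1 == 2
````

Unfenced input passes through unchanged, so it is safe to apply unconditionally.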
## Running and Analyzing Test Results
The _run_tests method already handles execution. But raw pytest output is noisy. For the fix loop to work well, you want structured data about what failed and why.
Here’s a more detailed version that parses pytest’s output into something the LLM can act on:
```python
def _run_tests_with_details(self, test_code: str, source_path: str) -> dict:
    """Run tests and extract structured failure information."""
    source_dir = os.path.dirname(os.path.abspath(source_path))
    with tempfile.NamedTemporaryFile(
        mode="w", suffix="_test.py", dir=source_dir, delete=False
    ) as f:
        f.write(test_code)
        test_file = f.name
    try:
        # Use --tb=long for detailed tracebacks on failure
        proc = subprocess.run(
            [
                "python", "-m", "pytest", test_file,
                "-v", "--tb=long", "--no-header",
            ],
            capture_output=True,
            text=True,
            timeout=30,
            cwd=source_dir,
        )
        output = proc.stdout + proc.stderr
        passed = proc.returncode == 0
        # Count pass/fail markers in pytest's verbose output
        pass_count = output.count(" PASSED")
        fail_count = output.count(" FAILED")
        error_count = output.count(" ERROR")
        return {
            "passed": passed,
            "output": output,
            "pass_count": pass_count,
            "fail_count": fail_count,
            "error_count": error_count,
            "summary": (
                f"{pass_count} passed, {fail_count} failed, "
                f"{error_count} errors"
            ),
        }
    except subprocess.TimeoutExpired:
        return {
            "passed": False,
            "output": "Tests timed out after 30 seconds",
            "pass_count": 0,
            "fail_count": 0,
            "error_count": 1,
            "summary": "TIMEOUT",
        }
    finally:
        os.unlink(test_file)
```
The pass/fail counts aren’t strictly necessary, but they give you a quick metric to decide whether the fix loop is making progress. If the fail count goes up between retries, something is going sideways.
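A sketch of how that progress check might look, tracking fail_count across attempts (the history list and function name are illustrative, not part of the agent above):

```python
def is_regressing(fail_history: list[int]) -> bool:
    """True when the latest attempt produced more failures than the previous one."""
    return len(fail_history) >= 2 and fail_history[-1] > fail_history[-2]


# Collect result["fail_count"] after each attempt, then bail out early:
history = [4, 2, 3]
print(is_regressing(history))  # True -- the third attempt went backwards
```

Breaking out of the loop when this returns True saves API calls on runs that are diverging rather than converging.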
## The Self-Healing Loop
This is the interesting part. When tests fail, you send the LLM three things: the original source code, the broken test code, and the pytest error output. It returns a corrected version of the tests.
```python
def _fix_tests(
    self,
    source_code: str,
    test_code: str,
    error_output: str,
    module_name: str,
) -> str:
    """Ask the LLM to fix failing tests based on error output."""
    prompt = f"""The following pytest tests are failing. Fix them.

Original source module ({module_name}):
{source_code}

Current test code:
{test_code}

Pytest output (failures and errors):
{error_output}

Rules:
- Fix ONLY the test code. Do not suggest changes to the source module.
- If a test has a wrong expected value, read the source to determine the correct one.
- If an import is wrong, fix the import.
- If a test is fundamentally testing the wrong behavior, rewrite it.
- Keep all passing tests unchanged.
- Return ONLY the complete corrected test file as Python code, no markdown fences.
"""

    response = client.chat.completions.create(
        model=self.model,
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a test debugging expert. Fix the failing tests "
                    "so they pass against the given source code. "
                    "Return only valid Python code for the complete test file."
                ),
            },
            {"role": "user", "content": prompt},
        ],
        temperature=0,
    )
    raw = response.choices[0].message.content
    return self._clean_code_response(raw)
```
The fix prompt is explicit about keeping passing tests intact. Without that instruction, the LLM sometimes rewrites the entire file and introduces new failures while fixing old ones.
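One way to reinforce that instruction is to name the passing tests explicitly in the fix prompt. A sketch that pulls them out of pytest's verbose output with a regex (the pattern assumes pytest's default `-v` line format):

```python
import re


def passing_test_names(pytest_output: str) -> list[str]:
    """Pull out the names of tests marked PASSED in pytest -v output."""
    # Verbose lines look like: tmp_test.py::test_add PASSED [ 50%]
    return re.findall(r"::(\w+)(?:\[[^\]]*\])? PASSED", pytest_output)


output = (
    "tmp_test.py::test_add PASSED       [ 50%]\n"
    "tmp_test.py::test_div FAILED       [100%]\n"
)
print(passing_test_names(output))  # ['test_add']
```

Listing these names under a "do not modify these tests" rule gives the model a concrete contract instead of a vague instruction.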
Here’s the complete run method with progress tracking:
```python
def run(self, source_path: str) -> str:
    """Generate and validate tests for a source file."""
    source_code = self._read_file(source_path)
    module_name = os.path.splitext(os.path.basename(source_path))[0]

    print(f"Generating tests for {source_path}...")
    test_code = self._generate_tests(source_code, module_name)

    for attempt in range(self.max_retries):
        print(f"\nAttempt {attempt + 1}/{self.max_retries}: running tests...")
        result = self._run_tests(test_code, source_path)
        if result["passed"]:
            print(f"All tests passed on attempt {attempt + 1}.")
            return test_code
        print(f"  Result: {result['summary']}")
        print("  Asking LLM to fix failures...")
        test_code = self._fix_tests(
            source_code, test_code, result["output"], module_name
        )

    print(f"\nMax retries ({self.max_retries}) reached.")
    return test_code
```
A max retry limit of 5 works well in practice. Most tests that can be fixed are fixed within 2-3 attempts. If the agent can’t get them passing in 5 tries, there’s usually a deeper problem – a missing dependency, a side effect the LLM can’t reason about, or a genuinely buggy source module.
## Saving the Output
Once the agent produces passing tests, write them to disk:
```python
import os

from testing_agent import TestingAgent

agent = TestingAgent(max_retries=5)
test_code = agent.run("src/calculator.py")

output_path = "tests/test_calculator.py"
os.makedirs(os.path.dirname(output_path), exist_ok=True)  # make sure tests/ exists
with open(output_path, "w") as f:
    f.write(test_code)
print(f"Tests written to {output_path}")
```
## Common Errors and Fixes
### ModuleNotFoundError: No module named 'calculator'

```text
ModuleNotFoundError: No module named 'calculator'
```

The temp test file can't find your source module. This happens when pytest runs from a different directory than expected. Fix it by putting the source directory on sys.path, prepended to the generated test code:

```python
# Prepend this to the test code before writing it to the temp file
import os
import sys

sys.path.insert(0, os.path.dirname(os.path.abspath(source_path)))
```
Or pass cwd to the subprocess call so pytest runs in the source directory, which is what the agent code above already does.
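If you go the sys.path route, the bootstrap can be prepended mechanically to whatever the model generated (the helper name is illustrative):

```python
import os


def with_path_bootstrap(test_code: str, source_path: str) -> str:
    """Prepend sys.path setup so the temp test file can import the source module."""
    source_dir = os.path.dirname(os.path.abspath(source_path))
    header = f"import sys\nsys.path.insert(0, {source_dir!r})\n\n"
    return header + test_code


bootstrapped = with_path_bootstrap("from calculator import add", "src/calculator.py")
print(bootstrapped.splitlines()[0])  # import sys
```

Using `{source_dir!r}` keeps the path valid Python even on Windows, where backslashes would otherwise need escaping.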
### SyntaxError: invalid syntax in generated test code

The LLM returned code wrapped in markdown fences, and the _clean_code_response method didn't strip them. Check the raw output:

````text
  File "/tmp/tmpXXXXXX_test.py", line 1
    ```python
    ^
SyntaxError: invalid syntax
````
The _clean_code_response handles this, but sometimes the model nests fences or uses indented fences. Make the stripping more aggressive:
````python
def _clean_code_response(self, raw: str) -> str:
    """Strip markdown fences from LLM output."""
    raw = raw.strip()
    # Remove opening fence with optional language tag
    if raw.startswith("```") and "\n" in raw:
        first_newline = raw.index("\n")
        raw = raw[first_newline + 1:]
    # Remove closing fence
    if raw.rstrip().endswith("```"):
        raw = raw.rstrip()[:-3].rstrip()
    return raw
````
### subprocess.TimeoutExpired on every test run

```text
Tests timed out after 30 seconds
```

One of the generated tests has an infinite loop or is waiting on user input. The 30-second timeout catches this, but the agent will burn all its retries trying to fix a test that hangs. Handle it by telling the LLM about the timeout in the error output it sees:

```python
if "timed out" in result["output"]:
    error_output = (
        result["output"]
        + "\n\nIMPORTANT: Tests timed out. One of the tests likely has "
        "an infinite loop, blocking I/O call, or input() call. "
        "Remove or fix the offending test."
    )
```
This gives the LLM enough context to identify and remove the problematic test on the next attempt.
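Because every hang costs a full 30-second timeout, it can also pay to scan the generated code for the usual suspects before running it at all. A rough heuristic sketch (the pattern list is illustrative and will miss plenty):

```python
HANG_PATTERNS = ("input(", "while True", "time.sleep(")


def likely_hangs(test_code: str) -> list[str]:
    """Return any suspicious patterns found in the generated test code."""
    return [p for p in HANG_PATTERNS if p in test_code]


print(likely_hangs("def test_x():\n    answer = input('y/n? ')"))  # ['input(']
```

A nonempty result can be appended to the fix prompt the same way the timeout message is, letting the agent remove the offending test without ever paying for the timeout.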