Most API tests start the same way: you read a spec, write a request, check the status code, eyeball the response body, repeat fifty times. An LLM can do that loop for you. Give it tool access to make HTTP requests, describe what your API should do, and let it generate and execute tests autonomously.
Here’s the core idea in code – an agent that has tools for GET, POST, PUT, and DELETE, wired up through OpenAI’s function calling:
```python
import json

import requests
from openai import OpenAI

client = OpenAI()

# Tool schemas for HTTP methods
tools = [
    {
        "type": "function",
        "function": {
            "name": "http_get",
            "description": "Send an HTTP GET request to a URL and return the status code and response body.",
            "parameters": {
                "type": "object",
                "properties": {
                    "url": {"type": "string", "description": "Full URL to send the GET request to"},
                    "headers": {
                        "type": "object",
                        "description": "Optional HTTP headers as key-value pairs",
                        "additionalProperties": {"type": "string"},
                    },
                },
                "required": ["url"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "http_post",
            "description": "Send an HTTP POST request with a JSON body and return the status code and response.",
            "parameters": {
                "type": "object",
                "properties": {
                    "url": {"type": "string", "description": "Full URL to send the POST request to"},
                    "body": {"type": "object", "description": "JSON body to send with the request"},
                    "headers": {
                        "type": "object",
                        "description": "Optional HTTP headers as key-value pairs",
                        "additionalProperties": {"type": "string"},
                    },
                },
                "required": ["url", "body"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "http_put",
            "description": "Send an HTTP PUT request with a JSON body and return the status code and response.",
            "parameters": {
                "type": "object",
                "properties": {
                    "url": {"type": "string", "description": "Full URL to send the PUT request to"},
                    "body": {"type": "object", "description": "JSON body to send with the request"},
                    "headers": {
                        "type": "object",
                        "description": "Optional HTTP headers as key-value pairs",
                        "additionalProperties": {"type": "string"},
                    },
                },
                "required": ["url", "body"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "http_delete",
            "description": "Send an HTTP DELETE request and return the status code and response body.",
            "parameters": {
                "type": "object",
                "properties": {
                    "url": {"type": "string", "description": "Full URL to send the DELETE request to"},
                    "headers": {
                        "type": "object",
                        "description": "Optional HTTP headers as key-value pairs",
                        "additionalProperties": {"type": "string"},
                    },
                },
                "required": ["url"],
            },
        },
    },
]
```
Each tool maps to an HTTP method. The agent decides which method to call, what URL to hit, and what body to send – all based on the API description you give it.
Each tool name maps to a Python function that wraps `requests` and returns a structured JSON string the LLM can reason about. Keep the timeout short so a broken endpoint doesn't hang your agent.
```python
def http_get(url: str, headers: dict | None = None) -> str:
    try:
        resp = requests.get(url, headers=headers or {}, timeout=10)
        return json.dumps({
            "status_code": resp.status_code,
            "headers": dict(resp.headers),
            "body": resp.text[:2000],  # truncate large responses
        })
    except requests.RequestException as e:
        return json.dumps({"error": str(e)})


def http_post(url: str, body: dict, headers: dict | None = None) -> str:
    try:
        resp = requests.post(url, json=body, headers=headers or {}, timeout=10)
        return json.dumps({
            "status_code": resp.status_code,
            "headers": dict(resp.headers),
            "body": resp.text[:2000],
        })
    except requests.RequestException as e:
        return json.dumps({"error": str(e)})


def http_put(url: str, body: dict, headers: dict | None = None) -> str:
    try:
        resp = requests.put(url, json=body, headers=headers or {}, timeout=10)
        return json.dumps({
            "status_code": resp.status_code,
            "headers": dict(resp.headers),
            "body": resp.text[:2000],
        })
    except requests.RequestException as e:
        return json.dumps({"error": str(e)})


def http_delete(url: str, headers: dict | None = None) -> str:
    try:
        resp = requests.delete(url, headers=headers or {}, timeout=10)
        return json.dumps({
            "status_code": resp.status_code,
            "headers": dict(resp.headers),
            "body": resp.text[:2000],
        })
    except requests.RequestException as e:
        return json.dumps({"error": str(e)})


TOOL_MAP = {
    "http_get": http_get,
    "http_post": http_post,
    "http_put": http_put,
    "http_delete": http_delete,
}
```
Response bodies are truncated to 2000 characters. LLMs choke on massive payloads, and you rarely need the full body to validate a test case. If you’re testing endpoints that return large responses, bump that limit or extract just the fields you care about.
## The Agent Loop
This is the same tool-calling loop used in any function-calling agent, adapted for API testing. The system prompt tells the model to act as a QA engineer – it generates test cases, executes them, and reports pass/fail results.
```python
SYSTEM_PROMPT = """You are an API testing agent. Given an API description, you:

1. Generate test cases covering happy paths, edge cases, and error conditions.
2. Execute each test by calling the appropriate HTTP tool.
3. Validate the response status code and body against expected behavior.
4. Report results as PASS or FAIL with a short explanation.

Rules:
- Test one endpoint per tool call.
- Always check status codes. A 200 for a POST that should return 201 is a FAIL.
- Test invalid inputs too: missing fields, wrong types, non-existent IDs.
- After all tests, provide a summary with pass/fail counts.
"""


def run_api_test_agent(api_description: str, max_iterations: int = 20) -> str:
    """Run the API testing agent on an API description."""
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": api_description},
    ]

    for i in range(max_iterations):
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=messages,
            tools=tools,
            temperature=0,
        )
        message = response.choices[0].message

        # Append the assistant message to conversation history
        messages.append(message)

        # If no tool calls, the agent is done
        if not message.tool_calls:
            return message.content

        # Execute each tool call
        for tool_call in message.tool_calls:
            func_name = tool_call.function.name
            func_args = json.loads(tool_call.function.arguments)
            print(f"  [{tool_call.id}] {func_name}({json.dumps(func_args, indent=2)[:200]})")

            if func_name in TOOL_MAP:
                result = TOOL_MAP[func_name](**func_args)
            else:
                result = json.dumps({"error": f"Unknown tool: {func_name}"})

            print(f"  -> {result[:150]}")
            messages.append({
                "role": "tool",
                "tool_call_id": tool_call.id,
                "content": result,
            })

    return "Agent reached maximum iterations."
```
A few things to note. `temperature=0` makes test runs more repeatable (sampling is never strictly deterministic, but zero temperature removes most of the variance). Tool results go back as `role: "tool"` messages with the matching `tool_call_id` – this is how OpenAI's API knows which result corresponds to which call. The agent can make multiple tool calls in a single turn if the model requests parallel function calls.
## Running Tests Against a Real API
Point the agent at JSONPlaceholder (a free fake REST API) to see it work end-to-end:
```python
result = run_api_test_agent("""
Test the JSONPlaceholder API at https://jsonplaceholder.typicode.com

Endpoints to test:
- GET /posts - returns a list of posts (expect 100 items)
- GET /posts/1 - returns a single post with id, title, body, userId
- POST /posts - creates a new post (send title, body, userId; expect 201)
- PUT /posts/1 - updates post 1 (send title, body, userId; expect 200)
- DELETE /posts/1 - deletes post 1 (expect 200)

Also test error cases:
- GET /posts/99999 - non-existent post (expect 404)
- POST /posts with empty body (expect 201 -- this API accepts anything)
""")
print(result)
```
The agent generates and runs each test, then produces a summary like:
```
## Test Results

1. PASS - GET /posts returned 200 with 100 items
2. PASS - GET /posts/1 returned 200 with correct fields (id, title, body, userId)
3. PASS - POST /posts returned 201 with generated id 101
4. PASS - PUT /posts/1 returned 200 with updated fields
5. PASS - DELETE /posts/1 returned 200
6. FAIL - GET /posts/99999 returned 200 instead of expected 404 (API returns empty object)
7. PASS - POST /posts with empty body returned 201

Summary: 6/7 passed, 1/7 failed
```
That failing test is actually correct behavior from the agent – JSONPlaceholder returns 200 with an empty object for non-existent resources instead of 404. The agent caught a real API quirk.
## Parsing OpenAPI Specs for Auto-Generated Tests
You can go further by feeding the agent an OpenAPI spec and letting it generate tests automatically. Here’s a helper that extracts endpoint info from an OpenAPI JSON:
```python
def extract_endpoints_from_openapi(spec_path: str) -> str:
    """Parse an OpenAPI spec and format it as a test prompt for the agent."""
    with open(spec_path, "r") as f:
        spec = json.load(f)

    base_url = spec.get("servers", [{}])[0].get("url", "http://localhost:8000")
    endpoints = []

    for path, methods in spec.get("paths", {}).items():
        for method, details in methods.items():
            if method in ("get", "post", "put", "delete"):
                summary = details.get("summary", "No description")
                request_body = details.get("requestBody", {})
                responses = details.get("responses", {})

                endpoint_info = f"- {method.upper()} {path}: {summary}"

                # Extract expected response codes
                expected_codes = list(responses.keys())
                endpoint_info += f" (expected status codes: {', '.join(expected_codes)})"

                # Extract request body schema if present
                if request_body:
                    content = request_body.get("content", {})
                    json_schema = content.get("application/json", {}).get("schema", {})
                    if json_schema:
                        endpoint_info += f"\n  Request body schema: {json.dumps(json_schema)}"

                endpoints.append(endpoint_info)

    prompt = f"""Test the API at {base_url}

Endpoints:
{chr(10).join(endpoints)}

Generate and execute tests for each endpoint. Include happy path tests,
validation error tests, and edge cases (empty bodies, wrong types, missing fields).
"""
    return prompt


# Usage:
# prompt = extract_endpoints_from_openapi("openapi.json")
# result = run_api_test_agent(prompt)
```
This takes a standard OpenAPI 3.x JSON file, pulls out the endpoints with their expected status codes and request schemas, and formats everything into a prompt the agent can act on. The agent figures out what payloads to send and what assertions to make.
## Common Errors and Fixes
**`requests.exceptions.ConnectionError` flooding your results.** Your target API is down or the URL is wrong. Add a pre-flight check before running the agent:
```python
def check_api_health(base_url: str) -> bool:
    """Verify the API is reachable before running tests."""
    try:
        resp = requests.get(base_url, timeout=5)
        return resp.status_code < 500
    except requests.RequestException:
        return False


if not check_api_health("https://jsonplaceholder.typicode.com"):
    print("API is unreachable. Skipping tests.")
```
**Agent tries to call tools that don't exist.** This happens when the model hallucinates tool names like `http_patch` or `validate_response`. The `TOOL_MAP` lookup in the agent loop already handles this by returning an error message, but you can also restrict the model by adding to the system prompt: "You can ONLY use these tools: http_get, http_post, http_put, http_delete."
**`json.JSONDecodeError` when parsing `function.arguments`.** OpenAI occasionally returns malformed JSON in tool call arguments, especially with complex nested bodies. Wrap the parse in a try/except:
```python
try:
    func_args = json.loads(tool_call.function.arguments)
except json.JSONDecodeError:
    messages.append({
        "role": "tool",
        "tool_call_id": tool_call.id,
        "content": json.dumps({"error": "Failed to parse tool arguments"}),
    })
    continue
```
**Timeout errors on slow endpoints.** The 10-second timeout in the tool functions is aggressive for some APIs. Bump it to 30 seconds for integration test suites, or make it configurable:
```python
def http_get(url: str, headers: dict | None = None, timeout: int = 10) -> str:
    try:
        resp = requests.get(url, headers=headers or {}, timeout=timeout)
        return json.dumps({"status_code": resp.status_code, "body": resp.text[:2000]})
    except requests.RequestException as e:
        return json.dumps({"error": str(e)})
```
**Agent runs forever without producing a summary.** Set `max_iterations` to something reasonable (20 is a good default for a typical API with 5-10 endpoints) and add a fallback message. If your API has dozens of endpoints, break them into groups and run the agent once per group.
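That grouping can be a thin wrapper around the agent. A sketch – `run_in_groups` is a hypothetical helper, and `runner` would be `run_api_test_agent` from above:

```python
def run_in_groups(base_prompt: str, endpoint_lines: list[str],
                  runner, group_size: int = 5) -> list[str]:
    """Run the agent once per group of endpoints and collect one report per group."""
    reports = []
    for i in range(0, len(endpoint_lines), group_size):
        group = endpoint_lines[i:i + group_size]
        prompt = f"{base_prompt}\nEndpoints:\n" + "\n".join(group)
        reports.append(runner(prompt))
    return reports
```

Each group stays within a comfortable iteration budget, and a failure in one group doesn't starve the rest of the suite of iterations.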
**Rate limiting from the OpenAI API.** Each tool-call round trip costs an API call. For a thorough test run, you might hit 15-30 iterations. Use `gpt-4o-mini` for cheaper test runs during development, then switch to `gpt-4o` for final validation. The tool-calling accuracy is comparable for straightforward HTTP tests.
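One lightweight way to handle the switch is a small model picker. This is a sketch: the `pick_model` helper and the `API_TEST_MODEL` environment variable are assumptions of this example, not an OpenAI convention.

```python
import os


def pick_model(final: bool = False) -> str:
    """Choose a model per run: cheap during development, stronger for final validation."""
    # Allow an explicit override via a (hypothetical) environment variable.
    override = os.environ.get("API_TEST_MODEL")
    if override:
        return override
    return "gpt-4o" if final else "gpt-4o-mini"
```

You would then pass `model=pick_model()` into the `chat.completions.create` call in the agent loop instead of hard-coding `"gpt-4o"`.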