The Quick Version

A code generation agent does three things in a loop: generate code from a prompt, execute it in a sandbox, and fix errors based on the output. The LLM writes the code, a sandboxed environment runs it safely, and the agent feeds errors back until the code works.

pip install openai docker
import docker
from openai import OpenAI

client = OpenAI()
docker_client = docker.from_env()

def run_code_in_sandbox(code: str, timeout: int = 30) -> dict:
    """Execute Python code in a Docker container and return output."""
    # containers.run() has no timeout kwarg, so start the container detached
    # and enforce the deadline with container.wait(timeout=...)
    container = docker_client.containers.run(
        "python:3.12-slim",
        command=["python", "-c", code],
        detach=True,
        mem_limit="256m",
        network_disabled=True,
    )
    try:
        exit_info = container.wait(timeout=timeout)
        if exit_info["StatusCode"] == 0:
            output = container.logs(stdout=True, stderr=False)
            return {"success": True, "output": output.decode("utf-8")}
        stderr = container.logs(stdout=False, stderr=True)
        return {"success": False, "error": stderr.decode("utf-8")}
    except Exception as e:
        return {"success": False, "error": str(e)}
    finally:
        container.remove(force=True)

result = run_code_in_sandbox("print('Hello from sandbox!')")
print(result)
# {'success': True, 'output': 'Hello from sandbox!\n'}

The network_disabled=True and mem_limit flags prevent the generated code from making network calls or consuming excessive memory. This is the bare minimum for safe execution.

The Agent Loop

The core pattern is generate-execute-fix. The agent calls the LLM to write code, runs it, and if it fails, sends the error back to the LLM for correction. Set a max retry count to avoid infinite loops.

def code_agent(task: str, max_retries: int = 3) -> dict:
    messages = [
        {
            "role": "system",
            "content": (
                "You are a Python code generator. When asked to solve a task, "
                "respond with ONLY executable Python code. No markdown, no explanations. "
                "The code should print its results to stdout."
            ),
        },
        {"role": "user", "content": task},
    ]

    for attempt in range(max_retries + 1):
        # Generate code
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=messages,
            temperature=0,
        )
        code = response.choices[0].message.content.strip()

        # Strip markdown fences if the model adds them
        if code.startswith("```"):
            code = "\n".join(code.split("\n")[1:-1])

        print(f"\n--- Attempt {attempt + 1} ---")
        print(code[:200] + "..." if len(code) > 200 else code)

        # Execute in sandbox
        result = run_code_in_sandbox(code)

        if result["success"]:
            return {"code": code, "output": result["output"], "attempts": attempt + 1}

        # Feed error back for correction
        messages.append({"role": "assistant", "content": code})
        messages.append({
            "role": "user",
            "content": f"That code produced this error:\n{result['error']}\nFix the code and try again.",
        })

    return {"code": code, "error": "Max retries exceeded", "last_error": result["error"]}

# Run it
result = code_agent("Find all prime numbers up to 1000 and print the count")
if "output" in result:
    print(f"\nOutput: {result['output']}")
    print(f"Solved in {result['attempts']} attempt(s)")
else:
    print(f"Failed after retries: {result['last_error']}")

Most tasks solve on the first attempt. The retry loop catches edge cases like missing imports, syntax errors, or wrong assumptions about data formats. In practice, 3 retries is enough — if the model can’t fix it in 3 tries, the prompt needs to be clearer.
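Long tracebacks also eat context on each retry. One option is to feed back only the tail of the traceback, where the actual exception lives — sketched here with a hypothetical `summarize_error` helper that is not part of the loop above:

```python
def summarize_error(stderr: str, max_lines: int = 5) -> str:
    """Keep only the last few lines of a traceback for the correction prompt."""
    lines = stderr.strip().splitlines()
    return "\n".join(lines[-max_lines:])
```

The exception type and message are almost always in the final lines; the intermediate stack frames rarely help the model fix the code.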

Adding Tool Use for Complex Tasks

For agents that need to do more than run code — like read files, query databases, or call APIs — use the LLM’s function calling to let it choose between tools.

import json

tools = [
    {
        "type": "function",
        "function": {
            "name": "execute_python",
            "description": "Execute Python code in a sandboxed environment",
            "parameters": {
                "type": "object",
                "properties": {
                    "code": {"type": "string", "description": "Python code to execute"},
                },
                "required": ["code"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "read_file",
            "description": "Read the contents of a file",
            "parameters": {
                "type": "object",
                "properties": {
                    "path": {"type": "string", "description": "File path to read"},
                },
                "required": ["path"],
            },
        },
    },
]

def handle_tool_call(tool_call) -> str:
    name = tool_call.function.name
    args = json.loads(tool_call.function.arguments)

    if name == "execute_python":
        result = run_code_in_sandbox(args["code"])
        return result.get("output", result.get("error", ""))
    elif name == "read_file":
        try:
            with open(args["path"]) as f:
                return f.read(5000)  # limit file size
        except FileNotFoundError:
            return f"File not found: {args['path']}"
    return "Unknown tool"

def agent_with_tools(task: str) -> str:
    messages = [
        {"role": "system", "content": "You solve coding tasks. Use the tools available to write and test code."},
        {"role": "user", "content": task},
    ]

    for _ in range(10):  # max 10 tool calls
        response = client.chat.completions.create(
            model="gpt-4o", messages=messages, tools=tools
        )
        msg = response.choices[0].message
        messages.append(msg)

        if msg.tool_calls:
            for tc in msg.tool_calls:
                result = handle_tool_call(tc)
                messages.append({
                    "role": "tool",
                    "tool_call_id": tc.id,
                    "content": result,
                })
        else:
            return msg.content

    return "Agent hit tool call limit"

Security: Sandboxing Done Right

Running LLM-generated code is inherently risky. Docker gives you process isolation, but you need to lock it down further.

def secure_sandbox(code: str, timeout: int = 30) -> dict:
    """Production-grade sandboxed execution."""
    # As above: detach + wait(timeout=...) because containers.run() itself
    # has no timeout parameter
    container = docker_client.containers.run(
        "python:3.12-slim",
        command=["python", "-c", code],
        detach=True,
        mem_limit="256m",
        cpu_period=100000,
        cpu_quota=50000,          # 50% of one CPU
        network_disabled=True,    # no network access
        read_only=True,           # read-only filesystem
        tmpfs={"/tmp": "size=64m"},  # writable /tmp with a size limit
        security_opt=["no-new-privileges"],
        pids_limit=50,            # limit process spawning
    )
    try:
        exit_info = container.wait(timeout=timeout)
        if exit_info["StatusCode"] == 0:
            output = container.logs(stdout=True, stderr=False)
            return {"success": True, "output": output.decode("utf-8")[:10000]}
        stderr = container.logs(stdout=False, stderr=True)
        return {"success": False, "error": stderr.decode("utf-8")[:5000]}
    except Exception as e:
        return {"success": False, "error": str(e)[:1000]}
    finally:
        container.remove(force=True)

Key restrictions: no network, read-only filesystem (except /tmp), CPU and memory limits, process count limits, and output truncation. The no-new-privileges flag prevents privilege escalation inside the container.

For production, also consider using gVisor (runsc runtime) instead of the default Docker runtime for an extra layer of kernel isolation.
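Selecting the runtime is a one-kwarg change in docker-py. A sketch that centralizes the hardening flags and optionally switches to runsc — it assumes gVisor is installed and registered with the Docker daemon, and `sandbox_run_kwargs` is a hypothetical helper, not a docker-py API:

```python
def sandbox_run_kwargs(use_gvisor: bool = False) -> dict:
    """Build the shared hardening kwargs for containers.run()."""
    kwargs = {
        "mem_limit": "256m",
        "network_disabled": True,
        "read_only": True,
        "tmpfs": {"/tmp": "size=64m"},
        "security_opt": ["no-new-privileges"],
        "pids_limit": 50,
    }
    if use_gvisor:
        kwargs["runtime"] = "runsc"  # requires gVisor on the host
    return kwargs
```

Centralizing the flags like this also keeps the quick sandbox and the production sandbox from drifting apart.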

Common Errors and Fixes

docker.errors.ImageNotFound: python:3.12-slim

Pull the image first: docker pull python:3.12-slim. The sandbox won’t auto-pull images.

Code runs forever and times out

The timeout parameter kills the container, but you pay the full wait time. Add a Python-level timeout inside the container too:

code_with_timeout = f"import signal\nsignal.alarm({timeout - 5})\n{code}"

Model wraps code in markdown fences

Even with explicit instructions, models sometimes add ```python wrappers. Always strip them:

def clean_code(text: str) -> str:
    lines = text.strip().split("\n")
    if lines and lines[0].startswith("```"):
        lines = lines[1:]
    # Only drop the last line if it actually is a closing fence
    if lines and lines[-1].strip() == "```":
        lines = lines[:-1]
    return "\n".join(lines)

ImportError for third-party packages

The slim Python image only has the standard library. Build a custom image with common packages pre-installed, or install them at runtime (slower but more flexible).
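For the runtime-install route, the container command can be composed so packages install before the code runs. A sketch with a hypothetical `with_runtime_deps` helper — note that the install step needs network access, so you must relax `network_disabled=True` for it, which weakens the sandbox and is one reason a prebuilt image is preferable:

```python
import shlex

def with_runtime_deps(code: str, packages: list[str]) -> list[str]:
    """Build a container command that pip-installs packages, then runs the code."""
    install = "pip install --quiet " + " ".join(shlex.quote(p) for p in packages)
    return ["sh", "-c", f"{install} && python -c {shlex.quote(code)}"]
```

The returned list replaces the `command=["python", "-c", code]` argument in the sandbox call.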

Agent enters a fix loop without converging

If the model makes the same mistake repeatedly, break the loop and return the error. You can also add a “step back” prompt that asks the model to rethink its approach from scratch rather than patching the same code.
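Detecting a stuck loop can be as simple as comparing the final traceback line across attempts. A sketch with a hypothetical `is_stuck` helper:

```python
def is_stuck(error_history: list[str], window: int = 2) -> bool:
    """True if the last `window` errors end in the same exception line."""
    if len(error_history) < window:
        return False
    tails = [e.strip().splitlines()[-1] for e in error_history[-window:]]
    return len(set(tails)) == 1
```

When it returns True, either abort or swap the standard fix instruction for the "step back" prompt.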

When to Use This Pattern

Code generation agents work well for data analysis tasks, one-off scripts, test generation, and exploratory programming. They struggle with large multi-file projects, GUI applications, and anything requiring persistent state across executions.

For production use, pair this pattern with human review — let the agent generate and test, but have a human approve before the code touches real data or systems.
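A minimal human-in-the-loop gate, sketched with an injectable prompt function so it can be tested without a terminal (`approve` is a hypothetical helper):

```python
def approve(code: str, ask=input) -> bool:
    """Show the generated code and require explicit confirmation before running it."""
    print("--- Proposed code ---")
    print(code)
    return ask("Run this code? [y/N] ").strip().lower() == "y"
```

Gate the sandbox call on it — for example, only invoke the sandbox when `approve(code)` returns True — so nothing executes against real systems without a person saying yes.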