The Core Pattern: Vision + Tool Calling in a Loop

A multimodal agent is just a regular tool-calling agent with one key difference: it can see. You feed images into the model alongside text, the model reasons over both, and then it calls tools to act on what it sees. GPT-4o handles this natively through its vision capabilities.

Here’s the minimal agent loop that ties it all together:

import openai
import base64
import json
from pathlib import Path

client = openai.OpenAI()

def encode_image(image_path: str) -> str:
    """Base64-encode a local image file."""
    with open(image_path, "rb") as f:
        return base64.standard_b64encode(f.read()).decode("utf-8")

def build_image_message(image_source: str) -> dict:
    """Build a user message with an image from a URL or local path."""
    if image_source.startswith("http"):
        image_content = {
            "type": "image_url",
            "image_url": {"url": image_source},
        }
    else:
        b64 = encode_image(image_source)
        mime = "image/png" if image_source.endswith(".png") else "image/jpeg"
        image_content = {
            "type": "image_url",
            "image_url": {"url": f"data:{mime};base64,{b64}"},
        }
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": "Analyze this image."},
            image_content,
        ],
    }

This handles both remote URLs and local files. The API accepts base64-encoded images inline, so you don’t need to host anything.

Defining the Tools

The agent needs tools to do something useful with what it sees. Here are three that cover the most common use cases: extracting text from screenshots, describing visual content, and interpreting charts.

tools = [
    {
        "type": "function",
        "function": {
            "name": "extract_text_from_image",
            "description": "Extract all visible text from an image, such as a screenshot or document photo.",
            "parameters": {
                "type": "object",
                "properties": {
                    "image_source": {
                        "type": "string",
                        "description": "URL or local file path of the image",
                    }
                },
                "required": ["image_source"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "describe_image",
            "description": "Generate a detailed description of the visual content in an image.",
            "parameters": {
                "type": "object",
                "properties": {
                    "image_source": {
                        "type": "string",
                        "description": "URL or local file path of the image",
                    },
                    "focus": {
                        "type": "string",
                        "description": "Optional area to focus on, e.g., 'top-left chart' or 'text in the header'",
                    },
                },
                "required": ["image_source"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "interpret_chart",
            "description": "Analyze a chart or graph image and extract key data points, trends, and insights.",
            "parameters": {
                "type": "object",
                "properties": {
                    "image_source": {
                        "type": "string",
                        "description": "URL or local file path of the chart image",
                    },
                    "chart_type": {
                        "type": "string",
                        "description": "Type of chart if known: bar, line, pie, scatter, etc.",
                    },
                },
                "required": ["image_source"],
            },
        },
    },
]

Each tool takes an image source and sends it back to GPT-4o with a specialized prompt. This is the key insight: the same vision model powers both the agent reasoning and the tool implementations. The tools just give structure to what the model extracts.

Implementing the Tool Functions

Each tool function sends a focused vision request to GPT-4o, swapping in a task-specific prompt:

def call_vision(image_source: str, prompt: str) -> str:
    """Send an image to GPT-4o with a task-specific prompt."""
    msg = build_image_message(image_source)
    msg["content"][0]["text"] = prompt

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[msg],
        max_tokens=1024,
    )
    return response.choices[0].message.content

def run_tool(name: str, args: dict) -> str:
    """Dispatch a tool call to the right function."""
    if name == "extract_text_from_image":
        return call_vision(
            args["image_source"],
            "Extract every piece of visible text from this image. "
            "Return it verbatim, preserving layout where possible.",
        )
    elif name == "describe_image":
        focus = args.get("focus", "")
        prompt = "Describe the visual content of this image in detail."
        if focus:
            prompt += f" Focus especially on: {focus}"
        return call_vision(args["image_source"], prompt)
    elif name == "interpret_chart":
        chart_type = args.get("chart_type", "unknown")
        return call_vision(
            args["image_source"],
            f"This image contains a {chart_type} chart. "
            "Extract the key data points, identify trends, and summarize the main takeaway. "
            "If you can read axis labels and values, include them.",
        )
    else:
        return f"Unknown tool: {name}"

All three tools follow the same pattern: take an image, send it to the model with a targeted system prompt, return the result. You can add more tools without changing the agent loop.
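For example, here is how a hypothetical compare_images tool could slot in (the name and schema are illustrative, not part of the code above): define a schema in the same JSON Schema shape as the three tools, then add a dispatch branch.

```python
# Hypothetical fourth tool: compare two images. Same schema shape as above.
compare_images_tool = {
    "type": "function",
    "function": {
        "name": "compare_images",
        "description": "Compare two images and summarize what changed between them.",
        "parameters": {
            "type": "object",
            "properties": {
                "image_a": {"type": "string", "description": "URL or path of the first image"},
                "image_b": {"type": "string", "description": "URL or path of the second image"},
            },
            "required": ["image_a", "image_b"],
        },
    },
}

# The matching run_tool branch would call call_vision once per image
# and stitch the results together (sketch):
#
#     elif name == "compare_images":
#         a = call_vision(args["image_a"], "Describe this image in detail.")
#         b = call_vision(args["image_b"], "Describe this image in detail.")
#         return f"Image A:\n{a}\n\nImage B:\n{b}"
```

Append the schema to the tools list and the loop picks it up automatically; run_agent doesn't change.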

The Agent Loop

This is the standard tool-calling loop, adapted for multimodal input. The agent runs until the model stops requesting tools:

def run_agent(user_text: str, image_source: str | None = None, max_turns: int = 10):
    """Run the multimodal agent loop."""
    system = {
        "role": "system",
        "content": (
            "You are a multimodal AI agent that can analyze images and text. "
            "Use your tools to extract information from images when needed. "
            "Reason step-by-step about what you see before taking action."
        ),
    }

    if image_source:
        user_msg = build_image_message(image_source)
        user_msg["content"][0]["text"] = user_text
    else:
        user_msg = {"role": "user", "content": user_text}

    messages = [system, user_msg]

    for turn in range(max_turns):
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=messages,
            tools=tools,
            tool_choice="auto",
        )
        msg = response.choices[0].message
        messages.append(msg)

        if not msg.tool_calls:
            print(f"Agent: {msg.content}")
            return msg.content

        for tc in msg.tool_calls:
            args = json.loads(tc.function.arguments)
            print(f"Calling tool: {tc.function.name}({args})")
            result = run_tool(tc.function.name, args)
            messages.append({
                "role": "tool",
                "tool_call_id": tc.id,
                "content": result,
            })

    return "Max turns reached."

Try it out:

# Analyze a screenshot
run_agent(
    "What text is visible in this screenshot and what app is this from?",
    image_source="./screenshot.png",
)

# Interpret a chart from a URL
run_agent(
    "What are the key trends in this chart?",
    image_source="https://example.com/quarterly-revenue-chart.png",
)

# Compare multiple images by naming both paths and letting the agent
# call its tools on each one
run_agent(
    "Compare the dashboard screenshots at ./dashboard_v1.png and "
    "./dashboard_v2.png and tell me what changed."
)

How the Vision Message Format Works

The GPT-4o API accepts images in the content array of a message. Each content block has a type field – either text or image_url. You can mix multiple images and text blocks in a single message.
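A multi-image message is just more blocks in the same content array. A sketch, with placeholder URLs:

```python
# One user message carrying a text block and two image blocks.
# The URLs are placeholders; swap in real image locations.
multi_image_message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "What changed between these two screenshots?"},
        {"type": "image_url", "image_url": {"url": "https://example.com/before.png"}},
        {"type": "image_url", "image_url": {"url": "https://example.com/after.png"}},
    ],
}
```

The model sees the blocks in order, so put the question before the images it refers to.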

For local files, you base64-encode the image and use a data URI. For remote files, you pass the URL directly. The model handles both PNG and JPEG natively. Token cost scales with image size: high-detail images are billed per 512-pixel tile, so larger images use more tokens. If you’re processing lots of screenshots, resize them to under 2048px on the longest side to keep costs down.

One thing worth knowing: the detail parameter on image_url controls quality. Set it to "low" for quick classification tasks or "high" when you need to read small text:

image_content = {
    "type": "image_url",
    "image_url": {
        "url": image_url,
        "detail": "high",  # "low", "high", or "auto"
    },
}

The "auto" setting (the default) lets the model decide, which is usually fine.

Common Errors and Fixes

“Invalid image” or 400 error when sending base64 images

Your base64 string probably includes a newline or the MIME type is wrong. Make sure you’re using base64.standard_b64encode (not base64.encodebytes, which inserts newlines, and not urlsafe_b64encode, which swaps out characters) and that the data URI prefix matches the actual file type:

# Wrong: mismatched MIME type
"data:image/png;base64,..."  # but the file is actually a JPEG

# Fix: detect the type from the file extension or use magic bytes
import mimetypes
mime, _ = mimetypes.guess_type(image_path)
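Putting that together, a small helper can build a correctly typed data URI; a sketch that falls back to JPEG when the extension isn't recognized (that default is an assumption, pick whatever suits your inputs):

```python
import base64
import mimetypes

def to_data_uri(image_path: str) -> str:
    """Base64-encode a local image and prefix it with the detected MIME type."""
    mime, _ = mimetypes.guess_type(image_path)
    if mime is None:
        mime = "image/jpeg"  # assumed fallback when the extension is unknown
    with open(image_path, "rb") as f:
        b64 = base64.standard_b64encode(f.read()).decode("utf-8")
    return f"data:{mime};base64,{b64}"
```

Drop this in place of the manual `.endswith(".png")` check in build_image_message and the MIME type stays in sync with the file automatically.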

“Tool call failed” or tool results not appearing in context

Every tool call response must include the exact tool_call_id from the request. If you miss this, the API rejects the message. Double-check that you’re reading tc.id (not tc.function.name) when building the tool response.

Token limit exceeded with large images

High-detail images can consume 1000+ tokens each. If you’re passing multiple images in one conversation, you’ll hit the context window fast. Two fixes: resize images before encoding, or set detail: "low" for images where you don’t need pixel-level accuracy.
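To see why, here is a rough estimator based on OpenAI's published tiling rules for high-detail images (scale to fit within 2048x2048, then shortest side to 768, then 170 tokens per 512-px tile plus an 85-token base); treat the exact numbers as approximate and check current pricing docs:

```python
import math

def estimate_image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Rough token estimate for one image, per OpenAI's published tiling rules."""
    if detail == "low":
        return 85  # low detail is a flat cost regardless of size
    # Scale to fit within a 2048x2048 square.
    scale = min(1.0, 2048 / max(width, height))
    width, height = width * scale, height * scale
    # Scale so the shortest side is at most 768.
    scale = min(1.0, 768 / min(width, height))
    width, height = width * scale, height * scale
    # Each 512-px tile costs 170 tokens, plus an 85-token base.
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return 170 * tiles + 85

print(estimate_image_tokens(1920, 1080))  # → 1105
```

A 1080p screenshot at high detail runs over a thousand tokens, while the same image at low detail is a flat 85, which is why the detail setting matters for multi-image conversations.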

Agent loops forever calling the same tool

This happens when the tool result doesn’t give the model enough information to move forward. Add a max_turns guard (like the one in the loop above) and make your tool responses explicit about what was found or not found. Returning “No text found in this image” is better than returning an empty string.
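One way to enforce that is a small wrapper around the tool result before it goes back into the message list (the function name here is a hypothetical addition, not part of the loop above):

```python
def explicit_result(result: str, tool_name: str) -> str:
    """Never hand the model an empty tool result; say what happened instead."""
    if result is None or not result.strip():
        return f"{tool_name} completed but found nothing in this image."
    return result

# In the agent loop, wrap the dispatch:
#     result = explicit_result(run_tool(tc.function.name, args), tc.function.name)
```

With an explicit "found nothing" message in context, the model can change strategy instead of retrying the same call.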

FileNotFoundError on local image paths

Use Path.resolve() to handle relative paths, and validate the file exists before encoding:

path = Path(image_source).resolve()
if not path.exists():
    return f"Error: file not found at {path}"