The Core Idea

A browser automation agent combines two things: Playwright controls the browser, and an LLM decides what to do next. The agent loop is simple — observe the page, ask the LLM what action to take, execute it, repeat. No hardcoded selectors, no brittle scraping scripts. The LLM reads the page and figures out where to click.

Here’s the full working agent in one shot. We’ll break it down after.

import asyncio
import json
from playwright.async_api import async_playwright
from openai import OpenAI

client = OpenAI()

TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "click",
            "description": "Click an element on the page by its text content or CSS selector.",
            "parameters": {
                "type": "object",
                "properties": {
                    "selector": {
                        "type": "string",
                        "description": "CSS selector or text content to click, e.g. 'a:has-text(\"Login\")' or 'button#submit'"
                    }
                },
                "required": ["selector"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "type_text",
            "description": "Type text into an input field.",
            "parameters": {
                "type": "object",
                "properties": {
                    "selector": {"type": "string", "description": "CSS selector for the input field"},
                    "text": {"type": "string", "description": "Text to type"}
                },
                "required": ["selector", "text"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "navigate",
            "description": "Navigate to a URL.",
            "parameters": {
                "type": "object",
                "properties": {
                    "url": {"type": "string", "description": "Full URL to navigate to"}
                },
                "required": ["url"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "scroll",
            "description": "Scroll the page up or down.",
            "parameters": {
                "type": "object",
                "properties": {
                    "direction": {"type": "string", "enum": ["up", "down"], "description": "Scroll direction"}
                },
                "required": ["direction"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "extract_data",
            "description": "Extract structured data from the current page and return it. Call this when you have found the information the user asked for.",
            "parameters": {
                "type": "object",
                "properties": {
                    "data": {"type": "object", "description": "The structured data extracted from the page"}
                },
                "required": ["data"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "done",
            "description": "Signal that the task is complete.",
            "parameters": {"type": "object", "properties": {}}
        }
    }
]


async def get_page_content(page):
    """Extract readable text from the page for the LLM."""
    return await page.evaluate("""
        () => {
            const elements = document.querySelectorAll(
                'h1, h2, h3, p, a, button, input, select, textarea, li, td, th, span, label'
            );
            const lines = [];
            for (const el of elements) {
                const tag = el.tagName.toLowerCase();
                const text = el.innerText?.trim().slice(0, 200);
                // Inputs usually have no innerText, so don't skip them here
                if (!text && tag !== 'input') continue;

                if (tag === 'a') {
                    lines.push(`[link] "${text}" (href=${el.href})`);
                } else if (tag === 'input') {
                    const inputType = el.type || 'text';
                    const name = el.name || el.id || '';
                    const placeholder = el.placeholder || '';
                    lines.push(`[input:${inputType}] name="${name}" placeholder="${placeholder}" selector="${el.id ? '#' + el.id : 'input[name=' + name + ']'}"`);
                } else if (tag === 'button') {
                    lines.push(`[button] "${text}"`);
                } else {
                    lines.push(`[${tag}] ${text}`);
                }
            }
            return lines.join('\\n');
        }
    """)


async def execute_action(page, action_name, args):
    """Run a browser action and return a status message."""
    if action_name == "click":
        selector = args["selector"]
        try:
            await page.click(selector, timeout=5000)
            await page.wait_for_load_state("domcontentloaded", timeout=10000)
        except Exception:
            # Fall back to text-based click
            await page.get_by_text(selector.strip('"').strip("'")).first.click()
            await page.wait_for_load_state("domcontentloaded", timeout=10000)
        return f"Clicked: {selector}"

    elif action_name == "type_text":
        await page.fill(args["selector"], args["text"])
        return f"Typed '{args['text']}' into {args['selector']}"

    elif action_name == "navigate":
        await page.goto(args["url"], wait_until="domcontentloaded", timeout=15000)
        return f"Navigated to {args['url']}"

    elif action_name == "scroll":
        delta = 500 if args["direction"] == "down" else -500
        await page.evaluate(f"window.scrollBy(0, {delta})")
        return f"Scrolled {args['direction']}"

    elif action_name == "extract_data":
        return json.dumps(args["data"], indent=2)

    elif action_name == "done":
        return "DONE"

    return f"Unknown action: {action_name}"


async def run_agent(task, start_url, max_steps=15):
    """Main agent loop: observe, decide, act, repeat."""
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto(start_url, wait_until="domcontentloaded")

        messages = [
            {
                "role": "system",
                "content": (
                    "You are a browser automation agent. You can see the current page content "
                    "and take actions to complete the user's task. Use the provided tools to "
                    "interact with the page. When you have the answer, call extract_data with "
                    "the structured result, then call done."
                )
            },
            {"role": "user", "content": f"Task: {task}"}
        ]

        extracted = None

        for step in range(max_steps):
            page_text = await get_page_content(page)
            current_url = page.url
            observation = f"Current URL: {current_url}\n\nPage content:\n{page_text[:8000]}"
            messages.append({"role": "user", "content": observation})

            response = client.chat.completions.create(
                model="gpt-4o",
                messages=messages,
                tools=TOOLS,
                tool_choice="auto"
            )

            msg = response.choices[0].message
            messages.append(msg)

            if not msg.tool_calls:
                print(f"Step {step + 1}: Agent responded with text: {msg.content}")
                break

            for tool_call in msg.tool_calls:
                fn_name = tool_call.function.name
                fn_args = json.loads(tool_call.function.arguments)
                print(f"Step {step + 1}: {fn_name}({fn_args})")

                try:
                    result = await execute_action(page, fn_name, fn_args)
                except Exception as e:
                    # Surface failures to the LLM instead of crashing the loop
                    result = f"Action failed: {e}"

                if fn_name == "extract_data":
                    extracted = fn_args["data"]

                messages.append({
                    "role": "tool",
                    "tool_call_id": tool_call.id,
                    "content": result
                })

                if result == "DONE":
                    await browser.close()
                    return extracted

        await browser.close()
        return extracted

That’s the whole agent. Let’s walk through what makes it work.

Setting Up Playwright

Install the dependencies first:

pip install playwright openai
playwright install chromium

Playwright’s async API is the right choice here because we’re already in an async context waiting on LLM responses. The async_playwright context manager handles browser lifecycle cleanly — launch a browser, create a page, and the context manager tears everything down when you’re done.

The key Playwright calls you’ll use most:

  • page.goto(url) — navigate to a URL
  • page.click(selector) — click an element
  • page.fill(selector, text) — type into an input
  • page.evaluate(js) — run JavaScript in the browser context
  • page.wait_for_load_state() — wait for page to settle after navigation

Extracting Page Content for the LLM

The get_page_content function is where the magic happens. You can’t just dump page.content() (raw HTML) into the LLM — it’s too noisy and eats your token budget. Instead, we extract a structured text representation that tells the LLM what’s on the page and how to interact with it.

The JavaScript in page.evaluate() walks through visible elements and formats them as tagged lines:

  • [link] "Sign In" (href=/login) — the LLM knows it can click this
  • [input:text] name="email" placeholder="Enter email" — the LLM knows it can type here
  • [button] "Submit" — clickable action element
  • [h1] Product Catalog — structural context

This format gives the LLM enough information to make decisions without overwhelming it. We also cap each element’s text at 200 characters and the total page content at 8,000 characters to stay within reasonable token limits.
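
The same formatting logic can be sketched in plain Python over a hypothetical list of scraped elements (the element dicts here are made-up stand-ins for illustration; in the real agent this runs as JavaScript inside page.evaluate):

```python
# Sketch of the observation format the agent feeds the LLM.
# The element dicts are hypothetical stand-ins for what the
# in-page JavaScript actually walks over.
def format_element(el: dict) -> str:
    tag = el["tag"]
    text = el.get("text", "").strip()[:200]
    if tag == "a":
        return f'[link] "{text}" (href={el.get("href", "")})'
    if tag == "input":
        name = el.get("name", "")
        return f'[input:{el.get("type", "text")}] name="{name}" placeholder="{el.get("placeholder", "")}"'
    if tag == "button":
        return f'[button] "{text}"'
    return f"[{tag}] {text}"

elements = [
    {"tag": "h1", "text": "Product Catalog"},
    {"tag": "a", "text": "Sign In", "href": "/login"},
    {"tag": "input", "type": "text", "name": "email", "placeholder": "Enter email"},
    {"tag": "button", "text": "Submit"},
]
observation = "\n".join(format_element(el) for el in elements)
print(observation)
```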

The Agent Loop

The agent follows a classic observe-act cycle:

  1. Observe: Extract the current page content and URL
  2. Decide: Send the observation to the LLM with available tools
  3. Act: Execute whatever tool call the LLM returns
  4. Repeat: Feed the action result back and observe the new page state
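
Stripped of the browser and API details, the cycle looks like this. The scripted fake_llm below is a stand-in for the real model call, used only to show the shape of the loop:

```python
# Minimal observe-decide-act loop with a scripted stand-in for the LLM.
def fake_llm(observation: str, step: int) -> dict:
    # A real agent would send the observation to the model here.
    script = [
        {"name": "click", "args": {"selector": 'a:has-text("Books")'}},
        {"name": "extract_data", "args": {"data": {"count": 5}}},
        {"name": "done", "args": {}},
    ]
    return script[step]

def run_loop(max_steps: int = 5):
    history = []
    extracted = None
    for step in range(max_steps):
        observation = f"page state at step {step}"   # observe
        action = fake_llm(observation, step)         # decide
        history.append(action["name"])               # act (stubbed out)
        if action["name"] == "extract_data":
            extracted = action["args"]["data"]
        if action["name"] == "done":
            break
    return history, extracted

history, extracted = run_loop()
print(history, extracted)  # ['click', 'extract_data', 'done'] {'count': 5}
```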

OpenAI’s function calling (the tools parameter) is what makes this structured. Instead of parsing free-text responses like “click the login button,” the LLM returns a structured JSON tool call: {"name": "click", "arguments": {"selector": "a:has-text(\"Login\")"}}. No regex parsing of model output. Note that this does not eliminate prompt injection risk: page content still flows into the LLM’s context, so a hostile page can try to steer the agent’s actions. Treat visited pages as untrusted input.
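
What the structure buys you shows up in the parsing step: the arguments arrive as a JSON string and json.loads hands back a plain dict. A toy example with a hand-built payload shaped like the API’s tool-call arguments:

```python
import json

# Hand-built stand-in for the `tool_call.function.arguments` string
# that the OpenAI API returns on a function call.
arguments = '{"selector": "a:has-text(\\"Login\\")"}'

args = json.loads(arguments)
print(args["selector"])  # a:has-text("Login")
```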

The tool_choice="auto" setting lets the model decide when to call tools and when to respond with text. When the task is done, the agent calls extract_data to return structured results, then done to signal completion.

Practical Example: Scraping Product Data

Here’s how you’d use the agent to extract structured data from a website:

async def main():
    result = await run_agent(
        task=(
            "Go to https://books.toscrape.com and extract the title and price "
            "of the first 5 books on the page. Return them as a list of objects "
            "with 'title' and 'price' fields."
        ),
        start_url="https://books.toscrape.com",
        max_steps=10
    )
    print("Extracted data:")
    print(json.dumps(result, indent=2))

asyncio.run(main())

The agent will load the page, read the book listings, and call extract_data with something like:

{
  "books": [
    {"title": "A Light in the Attic", "price": "51.77"},
    {"title": "Tipping the Velvet", "price": "53.74"},
    {"title": "Soumission", "price": "50.10"},
    {"title": "Sharp Objects", "price": "47.82"},
    {"title": "Sapiens", "price": "54.23"}
  ]
}

No CSS selectors to maintain. No XPath expressions to update when the site changes layout. The LLM reads the page and figures out where the data lives.

Common Errors and Fixes

TimeoutError: page.click: Timeout 5000ms exceeded
The selector doesn’t match any element. Playwright selectors are CSS-based — make sure you’re using the right format. For text-based clicks, use a:has-text("Login") or button:has-text("Submit"). The fallback in execute_action tries page.get_by_text() if the CSS selector fails.

Error: browser.newContext: Browser has been closed
You’re trying to use the browser after the async with block exits. Make sure all your agent logic runs inside the async_playwright() context manager. Don’t return a page object and try to use it later.

openai.RateLimitError: Rate limit reached
Each step in the agent loop makes an API call. With 15 max steps, that’s 15 calls minimum. Add an await asyncio.sleep(1) between steps if you’re hitting rate limits, or use a cheaper model like gpt-4o-mini for simple navigation tasks.
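
A retry-with-backoff wrapper around the API call is another common mitigation. A sketch with a generic async callable (the doubling delays are illustrative, not tuned):

```python
import asyncio

async def with_backoff(call, retries=3, base_delay=1.0):
    """Retry an async callable with exponentially increasing delays."""
    delay = base_delay
    for attempt in range(retries):
        try:
            return await call()
        except Exception:
            if attempt == retries - 1:
                raise  # out of retries, re-raise the last error
            await asyncio.sleep(delay)
            delay *= 2  # 1s, 2s, 4s, ...

# Demo with a callable that fails twice, then succeeds.
attempts = 0

async def flaky():
    global attempts
    attempts += 1
    if attempts < 3:
        raise RuntimeError("rate limited")
    return "ok"

result = asyncio.run(with_backoff(flaky, base_delay=0.01))
print(result, attempts)  # ok 3
```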

playwright._impl._errors.TargetClosedError The page navigated away or a popup opened. Use page.wait_for_load_state("domcontentloaded") after clicks that trigger navigation. The execute_action function already does this.

Agent gets stuck in a loop visiting the same pages
Cap your max_steps and add the conversation history to the LLM context so it can see what it already tried. The agent above does this by appending every observation and action to the messages list. If it’s still looping, add a system prompt instruction like “Do not revisit URLs you have already visited.”
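
A simple loop guard that tracks visited URLs and surfaces repeats to the model can be sketched like this (a hypothetical add-on, not part of the agent above):

```python
# Hypothetical loop guard: count URL visits and warn the model on repeats.
class VisitTracker:
    def __init__(self):
        self.counts = {}

    def record(self, url: str):
        """Return a warning string if this URL was seen before, else None."""
        self.counts[url] = self.counts.get(url, 0) + 1
        if self.counts[url] > 1:
            return f"WARNING: already visited {url} ({self.counts[url]} times). Try something else."
        return None

tracker = VisitTracker()
print(tracker.record("https://example.com/a"))  # None on first visit
warning = tracker.record("https://example.com/a")
print(warning)
```

The returned warning would simply be appended to the observation message each step.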

page.evaluate returns None for dynamic content
Single-page apps load content asynchronously. Add await page.wait_for_selector("selector-for-expected-content", timeout=10000) before extracting content, or use await page.wait_for_load_state("networkidle") to wait for all network requests to finish (though this can be slow).
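
When wait_for_selector isn’t enough — say the readiness signal is only visible via page.evaluate — the same wait pattern can be written generically. A sketch with an arbitrary async predicate (in the agent, the predicate would run page.evaluate to check for the expected content):

```python
import asyncio
import time

async def wait_until(predicate, timeout=10.0, interval=0.25):
    """Poll an async predicate until it returns truthy or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if await predicate():
            return True
        await asyncio.sleep(interval)
    raise TimeoutError("condition not met within timeout")

# Demo: the condition becomes true after a short delay.
start = time.monotonic()

async def content_loaded():
    return time.monotonic() - start > 0.05

ok = asyncio.run(wait_until(content_loaded, timeout=2.0, interval=0.01))
print(ok)  # True
```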