
Give the agent a browser

browser-tool.ts — a thin Playwright CLI that turns an AI agent into a web researcher. No scraping framework, no complex automation. Just launch, screenshot, decide, act.

AI · Automation · Tooling · Agent Design · Systems Thinking

Every scraping project I’ve built follows the same arc. First, you write a script for the happy path. Then you handle the edge case where the page structure changed. Then the CAPTCHA. Then the consent dialog. Then the bot detection. Then the JavaScript-rendered SPA that returns blank HTML to fetch(). Each fix makes the script more brittle and more complex. By the end, you have a thousand-line automation that breaks if someone moves a button.

The problem is architectural. Traditional scraping tries to predict what the page will look like and encode that prediction into code. But the web is adversarial — pages change, anti-bot measures evolve, layouts differ between regions. You can’t predict it all. You need something that can look at the page and react.

I already have something that can look at a page and react. It’s the AI agent.

The pattern: agent-driven browser

The idea is simple. Instead of building a complex scraping framework, build a thin toolkit — a set of primitive actions — and let the agent drive.

Agent sees screenshot → decides action → tool executes action → new screenshot → agent decides again

The agent is the intelligence. The tool is just hands.

This is browser-tool.ts: a single-file Playwright CLI with ten commands. No framework. No page object models. No CSS selector databases. Just primitives.

# Start a persistent browser
bun run browser-tool.ts launch

# Navigate somewhere
bun run browser-tool.ts goto "https://google.com.sg"

# Look at the page (agent reads the screenshot)
bun run browser-tool.ts screenshot

# Type a search query
bun run browser-tool.ts type "textarea[name=q]" "Allen & Gledhill Singapore"

# Press Enter
bun run browser-tool.ts press Enter

# Extract all links as structured data
bun run browser-tool.ts extract-links

# Scroll down to see more results
bun run browser-tool.ts scroll down

# Done
bun run browser-tool.ts close

That’s it. Ten commands: launch, goto, screenshot, type, click, press, extract-text, extract-links, scroll, close. Every action that changes the page auto-captures a screenshot afterward. The agent reads the screenshot, sees what happened, and decides the next action.
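
As a sketch, the auto-capture convention could be as small as a helper like this (autoSnap and the screenshots/ path are illustrative names, not the actual source):

import type { Page } from "playwright";

// Hypothetical helper: after any page-changing command, save a timestamped
// screenshot and print its path so the agent knows which file to read next.
async function autoSnap(page: Page, cmd: string): Promise<string> {
  const file = `screenshots/${Date.now()}-${cmd}.png`;
  await page.screenshot({ path: file });
  console.log(`screenshot: ${file}`);
  return file;
}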

Why a persistent browser matters

The first design decision was the process model. Playwright normally launches a browser, does work, and closes it. Every invocation is a fresh process. But the agent executes commands one at a time — each bun run browser-tool.ts <command> is a separate process. If every command launched a new browser, you’d lose state between actions.

The solution: launch starts Chromium as a background process with a remote debugging port, then saves the PID and WebSocket endpoint to a state file. Every subsequent command connects to the running browser via CDP. close kills the background process and cleans up.

browser-tool.ts launch
  → starts Chromium with --remote-debugging-port=9222
  → saves PID + WS endpoint to .browser-state.json
  → browser keeps running after command exits

browser-tool.ts goto "https://..."
  → reads .browser-state.json
  → connects via CDP
  → navigates, screenshots
  → disconnects (browser stays alive)

browser-tool.ts close
  → reads PID from .browser-state.json
  → kills process
  → deletes state file

The browser persists a profile directory too. Cookies, localStorage, cached sessions — they survive across commands and across agent sessions. Google sees a returning visitor with browsing history, not a fresh bot fingerprint on every request.
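
A minimal sketch of what launch boils down to, assuming Playwright's bundled Chromium and the file names above; the real code may differ in detail:

// launch: start Chromium in the background, record how to reach it, exit.
import { chromium } from "playwright";
import { spawn } from "node:child_process";
import { writeFileSync } from "node:fs";

const proc = spawn(
  chromium.executablePath(),
  [
    "--remote-debugging-port=9222",
    "--user-data-dir=.browser-profile",  // cookies and sessions persist here
    "--no-first-run",
  ],
  { detached: true, stdio: "ignore" },
);
proc.unref(); // let this CLI process exit while Chromium keeps running

// Chromium publishes its CDP WebSocket endpoint over HTTP once it is up.
await new Promise((resolve) => setTimeout(resolve, 1500));
const res = await fetch("http://127.0.0.1:9222/json/version");
const { webSocketDebuggerUrl } = await res.json();

writeFileSync(
  ".browser-state.json",
  JSON.stringify({ pid: proc.pid, wsEndpoint: webSocketDebuggerUrl }, null, 2),
);

From there, any later command can attach with chromium.connectOverCDP(wsEndpoint) and detach again without touching the process.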

CAPTCHAs are a non-problem

This is the part that surprises people. Traditional scraping frameworks treat CAPTCHAs as a hard blocker — you either pay for a solving service or you give up. With an agent-driven browser, CAPTCHAs are just another screenshot.

The agent navigates to Google. Google shows a CAPTCHA. The agent sees the CAPTCHA in the screenshot. It says: “I see a CAPTCHA. Let me try clicking the checkbox.” It clicks. If there’s an image challenge, it describes what it sees. If it can’t solve it, it waits and retries, or notes the failure and moves on.

The same applies to cookie consent banners, age gates, login walls, “Are you a robot?” interstitials. The agent sees them the same way a human does. It doesn’t need special handling code for each one — it just reads the screenshot and adapts.

This eliminates an entire category of scraping infrastructure. No CAPTCHA-solving APIs. No consent banner detection. No anti-bot evasion libraries. The agent’s vision model handles all of it because that’s what vision models are good at — understanding what’s on a screen.

What this is for

I’m building this for sgcaselaw.com — a database of Singapore legal cases, lawyers, and firms. The entity pages (lawyer profiles, firm profiles) need to be the most comprehensive independent profiles on the internet. That means going beyond our case data to include education, career history, awards, practice areas — information that lives on firm websites, legal directories, and competitor pages.

Phase 1 is competitive analysis: searching Google for Singapore lawyer and firm names, observing what ranks, visiting the top results, and cataloging what information types appear on competitor pages. The output is a content gap matrix — what they show that we don’t.

A traditional approach would be: build a Google scraper, build a Chambers & Partners scraper, build a Legal 500 scraper, build a LinkedIn scraper. Each one custom. Each one brittle. Each one breaks differently.

The agent-driven approach: give the agent a browser and a research brief. “Search for these 5 firms on google.com.sg. For each, record the top 10 organic results. Then visit the top competitor pages and document what information types they display.” The agent does the research like a human research assistant would — searching, reading, taking notes — except it runs at 3am and doesn’t get bored.

The same browser-tool.ts then powers Phase 2: scraping firm websites. The agent navigates to each firm’s site, finds the /about, /our-team, /people, /practice pages, extracts the content, and saves it. JavaScript-rendered SPAs? Not a problem — Playwright renders them. Unusual navigation patterns? The agent figures it out from the screenshot.

Composition over complexity

The tool does almost nothing. That’s the point.

Each command is 10-20 lines of Playwright calls. goto navigates and screenshots. click clicks and screenshots. extract-text returns page.innerText('body'). There’s no retry logic, no error recovery, no smart waiting — because the agent handles all of that. If a click doesn’t work, the agent sees the unchanged screenshot and tries something else. If a page is slow to load, the agent waits and takes another screenshot.
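
Stripped down, the command layer might look something like the sketch below, assuming the .browser-state.json written by launch; it is illustrative, not a copy of the real file:

// One process per command: attach, act once, screenshot, detach.
// Note the total absence of retries, waits, and error recovery.
import { chromium } from "playwright";
import { readFileSync } from "node:fs";

const [cmd, ...args] = process.argv.slice(2);

// Attach to the browser that `launch` left running.
const { wsEndpoint } = JSON.parse(readFileSync(".browser-state.json", "utf8"));
const browser = await chromium.connectOverCDP(wsEndpoint);
const page = browser.contexts()[0].pages()[0];

switch (cmd) {
  case "goto":   await page.goto(args[0]); break;
  case "click":  await page.click(args[0]); break;
  case "type":   await page.fill(args[0], args[1]); break;
  case "press":  await page.keyboard.press(args[0]); break;
  case "scroll": await page.mouse.wheel(0, args[0] === "up" ? -800 : 800); break;
  case "extract-text":
    console.log(await page.innerText("body"));
    break;
  case "extract-links": {
    const links = await page.$$eval("a[href]", (anchors) =>
      anchors.map((a) => ({ text: a.textContent?.trim(), href: a.getAttribute("href") })),
    );
    console.log(JSON.stringify(links, null, 2));
    break;
  }
  // launch, screenshot, and close are handled separately
}

// Page-changing commands end with a fresh screenshot for the agent.
if (!["extract-text", "extract-links"].includes(cmd)) {
  await page.screenshot({ path: `screenshots/${Date.now()}-${cmd}.png` });
}

await browser.close(); // disconnects only; the background Chromium keeps running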

This is the same philosophy behind the .plan/ protocol and autopilot: keep the tools simple, let the agent supply the intelligence. The plan files are just markdown. The autopilot wrapper is just a bash loop. The browser tool is just Playwright primitives. The sophistication lives in the agent’s reasoning, not in the infrastructure.

Complex tools create complex failure modes. A scraping framework with built-in retry logic, rate limiting, proxy rotation, and CAPTCHA solving is a system you have to debug. A ten-command CLI is a system you can read in five minutes and trust immediately.

The extraction pipeline

Once browser-tool.ts is proven in SERP research and firm website scraping, the same tool feeds into LLM fact extraction. The agent scrapes a firm’s website, reads the content, and — because the agent is the LLM — extracts structured facts directly:

Scrape (browser-tool.ts) → Raw content → Agent reads content → Structured facts

No separate extraction step. No API call to another model. The same agent that navigated to the page, handled the cookie banner, and extracted the text also understands the text and pulls out “founded in 1902,” “450 lawyers across 10 offices,” “Tier 1 ranking in Dispute Resolution.”

The facts go into a database with provenance tracking — every fact links back to the source URL, the scrape date, and a confidence score. If a firm updates their website, the pipeline re-scrapes, re-extracts, and the facts update. The browser profile is already warm. The agent already knows the site layout from last time.
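
A plausible shape for those records, with field names that are assumptions rather than the actual schema:

// Hypothetical shape of one extracted fact with provenance.
interface ExtractedFact {
  entityId: string;    // the lawyer or firm the fact belongs to
  field: string;       // e.g. "founded", "headcount", "ranking"
  value: string;       // e.g. "450 lawyers across 10 offices"
  sourceUrl: string;   // page the fact was extracted from
  scrapedAt: string;   // ISO date of the scrape
  confidence: number;  // 0 to 1, assigned by the agent at extraction time
}

// Re-scraping a source upserts on (entityId, field, sourceUrl), so updated
// websites overwrite stale values instead of duplicating them.
const fact: ExtractedFact = {
  entityId: "firm:example-llp",
  field: "founded",
  value: "1902",
  sourceUrl: "https://example-firm.sg/about",
  scrapedAt: "2026-01-01",
  confidence: 0.9,
};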

Ten commands, not ten thousand lines

The entire tool is a single TypeScript file. It could be a hundred lines or three hundred — the point is that it fits in one file, one mental model, one purpose. Compare this to a Scrapy project with pipelines, middlewares, item processors, and spider classes. Or a Puppeteer automation with page objects, retry decorators, and proxy managers.

Those frameworks exist because traditional automation needs to encode intelligence in code. Every edge case becomes a conditional. Every failure mode becomes a handler. The codebase grows proportional to the complexity of the web.

Agent-driven tools don’t have this problem. The complexity stays in the agent’s context window — ephemeral, adaptive, never hardcoded. The tool stays simple because it only needs to provide capability, not intelligence.

Traditional: tool = capability + intelligence
Agent-driven: tool = capability, agent = intelligence

That separation is what makes the tool reusable across completely different tasks — SERP research, firm scraping, competitor analysis, content monitoring — without changing a line of code.


The browser was always the universal interface to the web. Now the agent can use it directly. No scraping framework needed. No prediction about what pages will look like. Just launch, look, decide, act. The same loop a human follows, running at machine speed, at 3am, for two hundred firms in a row.